Database Ingestion and Replication targets - preparation and usage
Before you configure database ingestion and replication tasks for initial load, incremental load, or combined initial and incremental load operations, review the following guidelines for your target types to avoid unexpected results:
Amazon Redshift targets
The following list identifies considerations for preparing and using Amazon Redshift targets:
•You can use either Amazon Redshift or Amazon Redshift Serverless targets in database ingestion and replication jobs.
•Before writing data to Amazon Redshift target tables, database ingestion and replication jobs stage the data in an Amazon S3 bucket. You must specify the name of the bucket when you configure the database ingestion and replication task. The database ingestion and replication jobs use the COPY command to load the data from the Amazon S3 bucket to the Amazon Redshift target tables, as illustrated in the example after this list. For more information about the COPY command, see the Amazon Web Services documentation.
•When you define a connection for an Amazon Redshift target, provide the access key and secret access key for the Amazon S3 bucket in which you want the database ingestion and replication jobs to stage the data before loading it to the Amazon Redshift target tables.
•Incremental load jobs and combined initial and incremental load jobs generate a recovery table named INFORMATICA_CDC_RECOVERY on the target to store internal service information. The data in the recovery table prevents jobs that are restarted after a failure from propagating previously processed data again. The recovery table is generated in the same schema as the target tables.
•Database ingestion and replication jobs convert infinity and NaN values in FLOAT columns in source tables to null values in Amazon Redshift target tables.
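For reference, the staged load resembles the following COPY command. This is only an illustrative sketch with hypothetical schema, table, bucket, and credential placeholders, not the exact statement that the jobs issue:
COPY myschema.mytable
FROM 's3://my-staging-bucket/path/'
ACCESS_KEY_ID '<access_key_id>' SECRET_ACCESS_KEY '<secret_access_key>'
FORMAT AS CSV;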
Amazon S3, Flat File, Google Cloud Storage, Microsoft Azure Data Lake Storage, Microsoft Fabric OneLake, and Oracle Cloud Object Storage targets
The following list identifies considerations for using Amazon S3, Flat File, Google Cloud Storage, Microsoft Azure Data Lake Storage, Microsoft Fabric OneLake, and Oracle Cloud Infrastructure (OCI) Object Storage targets:
•When you define a database ingestion and replication task that has an Amazon S3, Google Cloud Storage, Microsoft Azure Data Lake Storage, Microsoft Fabric OneLake, or Oracle Cloud Object Storage target, you can select CSV, Avro, or Parquet as the format for the generated output files that contain the source data to be applied to the target. For flat file targets, you can select either CSV or Avro as the output file format.
•If you select the CSV output format, Database Ingestion and Replication creates the following files on the target for each source table:
- A schema.ini file that describes the schema and includes some settings for the output file on the target.
- One or more output files that contain the source data. Database Ingestion and Replication names these text files based on the name of the source table with an appended date and time.
The schema.ini file lists a sequence of columns for the rows in the corresponding output file. The schema.ini file contains the following entries (an illustrative example appears after this list):
- ColNameHeader. Indicates whether the source data files include column headers.
- Format. Describes the format of the output files. Database Ingestion and Replication uses a comma (,) to delimit column values.
- CharacterSet. Specifies the character set that is used for the output files. Database Ingestion and Replication generates the files in the UTF-8 character set.
- COL<sequence_number>. The name and data type of the column.
Important: You should not edit the schema.ini file.
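For illustration only, a generated schema.ini file might contain entries similar to the following. The file section name, column names, and data types shown here are hypothetical and depend on the source table:
[mytable_20240115_103000.csv]
ColNameHeader=False
Format=Delimited(,)
CharacterSet=UTF-8
COL1=ORDER_ID Integer
COL2=ORDER_DATE DateTime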
•If you select the Avro output format, you can select an Avro format type, a file compression type, an Avro data compression type, and the directory that stores the Avro schema definitions generated for each source table. The schema definition files have the following naming pattern: schemaname_tablename.txt.
•If you select the Parquet output format, you can optionally select a compression type that Parquet supports.
•On Flat File, Microsoft Azure Data Lake Storage, and Microsoft Fabric OneLake targets, Database Ingestion and Replication creates an empty directory for each empty source table. Database Ingestion and Replication does not create empty directories on Amazon S3, Google Cloud Storage, and Oracle Cloud Object Storage targets.
•If you do not specify an access key and secret key in the Amazon S3 connection properties, Database Ingestion and Replication tries to find AWS credentials by using the default credential provider chain that is implemented by the DefaultAWSCredentialsProviderChain class. For more information, see the Amazon Web Services documentation.
•If database ingestion and replication incremental load and combined initial and incremental load jobs replicate Update operations that change primary key values on the source to any of these targets that use the CSV output format, the job processes each Update record as two records on the target: a Delete followed by an Insert. The Delete contains the before image. The Insert contains the after image for the same row.
For Update operations that do not change primary key values, database ingestion and replication jobs process each Update as one operation and write only the after image to the target.
Note: If source tables do not have primary keys, Database Ingestion and Replication treats the tables as if all columns were part of the primary key. In this case, each Update operation is processed as a Delete followed by an Insert.
•Database Ingestion and Replication jobs unload binary data in hexadecimal format when the data is sent to an Amazon S3, Flat File, Microsoft Azure Data Lake Storage, or Microsoft Fabric OneLake target. Each hexadecimal column value has the "0x" prefix. If you want to use output files to load the data to a target, you might need to edit the files to remove the "0x" prefixes.
•If you run a Secure Agent service on Windows and plan to use Flat File connections, ensure that the logon account for the Secure Agent is an Administrator account. Otherwise, an error occurs when you try to configure a Flat File connection.
Databricks targets
To use Databricks targets in database ingestion and replication tasks, first prepare the target and review the usage considerations.
Target preparation:
1. In the Databricks connection properties, set the JDBC Driver Class Name property to com.databricks.client.jdbc.Driver.
2. On Windows, install Visual C++ Redistributable Packages for Visual Studio 2013 on the computer where the Secure Agent runs.
Usage considerations:
•For incremental load jobs, you must enable Change Data Capture (CDC) for all source columns.
• You can access Databricks tables created on top of the following storage types:
- Microsoft Azure Data Lake Storage (ADLS) Gen2
- Amazon Web Services (AWS) S3
The Databricks connection uses a JDBC URL to connect to the Databricks cluster. When you configure the target, specify the JDBC URL and credentials to use for connecting to the cluster. Also define the connection information that the target uses to connect to the staging location in Amazon S3 or ADLS Gen2.
•Before writing data to Databricks target tables, database ingestion and replication jobs stage the data in an Amazon S3 bucket or ADLS directory. You must specify the directory for the data when you configure the database ingestion and replication task.
Note: Database Ingestion and Replication does not use the ADLS Staging Filesystem Name and S3 Staging Bucket properties in the Databricks connection properties to determine the directory.
•Database Ingestion and Replication uses jobs that run once to load data from staging files on Amazon S3 or Microsoft Azure Data Lake Storage (ADLS) Gen2 to external tables.
By default, Database Ingestion and Replication runs jobs on the cluster that is specified in the Databricks connection properties. If you want to run jobs on another cluster, set the dbDeltaUseExistingCluster custom property to false on the Target page in the database ingestion and replication task wizard.
•If the cluster specified in the Databricks connection properties is not up and running, the database ingestion and replication job waits for the cluster to start. By default, the job waits for 10 minutes. If the cluster does not start within 10 minutes, the connection times out and deployment of the job fails.
If you want to increase the timeout value for the connection, set the dbClusterStartWaitingTime custom property to the maximum time in seconds for which the ingestion job must wait for the cluster to be up and running. You can set the custom property on the Target page in the database ingestion and replication task wizard.
•By default, Database Ingestion and Replication uses the Databricks COPY INTO feature to load data from the staging file to Databricks target tables. You can disable it for all load types by setting the writerDatabricksUseSqlLoad custom property to false on the Target page in the database ingestion and replication task wizard.
•If you use an AWS cluster, you must specify the S3 Service Regional Endpoint value in the Databricks connection properties. For example:
s3.us-east-2.amazonaws.com
•To test a Databricks connection using a Secure Agent on Linux, you must specify the JDBC URL in the SQL Endpoint JDBC URL field in the Databricks connection properties. After you test the connection, remove the SQL Endpoint JDBC URL value. Otherwise, when you define a database ingestion and replication task that uses the connection, a design-time error occurs because Database Ingestion and Replication tries to use the JDBC URL as well as the required Databricks Host, Cluster ID, Organization ID, and Databricks Token values to connect to the target, resulting in login failures.
•You cannot test a Databricks connection using a Secure Agent on Windows. The test fails. In this situation, you can perform the test using a Secure Agent on Linux. However, note that you can use a Databricks connection with a Secure Agent on Windows when creating a database ingestion and replication task or running a database ingestion and replication job.
•Processing of Rename Column operations on Databricks target tables, without the need to rewrite the underlying Parquet files, requires the Databricks Column Mapping feature with Databricks Runtime 10.2 or later. If you set the Rename Column option to Replicate on the Schedule and Runtime Options page in the task wizard, you must alter the generated target table to set the following Databricks table properties after task deployment and before you run the job:
ALTER TABLE <target_table> SET TBLPROPERTIES ( 'delta.columnMapping.mode' = 'name', 'delta.minReaderVersion' = '2', 'delta.minWriterVersion' = '5')
These properties enable the Databricks Column Mapping feature with the required reader and writer versions. If you do not set these properties, the database ingestion and replication job will fail.
•Database ingestion and replication jobs that have Databricks targets can get schema information for generating the target tables from Databricks Unity Catalog. To enable access to information in a Unity Catalog, specify the catalog name in the Catalog Name field in the Databricks connection properties. The catalog name is appended to the SQL Warehouse JDBC URL value for a data warehouse.
Note: Catalog use is optional for SQL warehouses and does not apply to job clusters.
If you use Unity Catalog, a personal storage location is automatically provisioned. To use the personal staging location, in the Staging Environment field of the connection properties, select Personal Staging Location. The Parquet data files for ingestion and replication jobs can then be staged to the personal staging location, which has a data retention period of 7 days. By default, the staging location is stage://tmp/<user_name> in the root AWS or Azure location. The <user_name> is taken from the Databricks Token connection property. This user must have read and write access to the personal staging location.
•Database Ingestion and Replication supports special characters in table and column names in the source tables. The special characters are replaced with an underscore (_) in the Databricks target table or column names.
The custom property key-value pair targetReplacementValue=toHex prevents Database Ingestion and Replication from replacing the special characters with an underscore in the generated target schema and converts the special characters to hexadecimal format.
To convert the special characters to their hexadecimal value, perform the following actions before deploying the database ingestion and replication task:
1. Create a properties file to be used at the metadata-manager layer. Add the targetReplacementValue=toHex key-value pair to the properties file.
2. Open the Runtime Environment page in Administrator and edit the Secure Agent. Create a custom property in the Custom Configuration Details area:
▪ Select the Database Ingestion service.
▪ Select the DBMI_AGENT_ENV type.
▪ Enter DBMI_TASK_OVERRIDE_PROPERTIES as the property name.
▪ Enter the properties file location as the property value.
3. Set the targetReplacementValue custom property to toHex on the Target page of the task wizard.
Before running the task, add <jobname>. to the targetReplacementValue key in the properties file:
<jobname>.targetReplacementValue=toHex
If you want the property to apply to all jobs, add "alljobs." to the targetReplacementValue key:
alljobs.targetReplacementValue=toHex
•If you select source decimal or numeric columns to replicate to Databricks target columns, ensure that each source decimal or numeric column has a scale that is less than or equal to its precision. Otherwise, when you run the job, an error is issued that reports an invalid decimal scale. This consideration applies to any source type that is mapped to a Databricks target.
•If you generated Databricks unmanaged tables as the target tables and no longer need a target unmanaged table, use a SQL DROP TABLE statement to delete the table from the target database. You must not manually delete the external directory in Amazon S3 or Azure Data Lake Storage for the unmanaged table. If you do so and try to deploy another job that uses that table, the deployment fails with a Metadata Handler error.
Google BigQuery targets
The following list identifies considerations for preparing and using Google BigQuery targets:
Target preparation
•Download and install the Google BigQuery JDBC driver.
1. Download the Google BigQuery JDBC driver version 1.2.25.1029 from the Google Cloud website.
2. Copy all of the jar files in the installation zip to the following directory:
•Ensure that you have a service account in your Google account to access Google BigQuery and Google Cloud Storage.
•Ensure that you have the client_email, project_id, private_key, and region ID values for the service account. Enter the values in the corresponding Service Account ID, Project ID, Service Account Key, and Region ID connection properties when you create a Google BigQuery connection.
•If you want to configure a timeout interval for a Google BigQuery connection, specify the timeout interval property in the Provide Optional Properties field of the connection properties. Use the following format:
"timeout": "<timeout_interval_in_seconds>"
•Verify that you have read and write access to the following entities:
- Google BigQuery datasets that contain the target tables
- Google Cloud Storage path where Database Ingestion and Replication creates the staging file
•To use a Google BigQuery target, you must configure the required permissions. First, create an IAM & Admin service account in your Google Cloud project and assign a custom role to it. Then add the following permissions to the account's custom role, as shown in the example after this list:
- bigquery.datasets.get - To get metadata about a data set.
- bigquery.jobs.create - To run jobs and queries.
- bigquery.models.create - To create new models.
- bigquery.models.delete - To delete models.
- bigquery.models.export - To export a model.
- bigquery.models.getData - To get model data. The bigquery.models.getMetadata permission must also be specified.
- bigquery.models.getMetadata - To get model metadata. The bigquery.models.getData permission must also be specified.
- bigquery.models.list - To list models and metadata for the models.
- bigquery.models.updateData - To update model data. The bigquery.models.updateMetadata permission must also be specified.
- bigquery.models.updateMetadata - To update model metadata. The bigquery.models.updateData permission must also be specified.
- bigquery.routines.create - To create new routines, including stored procedures.
- bigquery.routines.delete - To delete routines.
- bigquery.routines.get - To get routine definitions and metadata.
- bigquery.routines.list - To list routines and metadata for the routines.
- bigquery.routines.update - To update routine definitions and metadata for the routines.
- bigquery.routines.updateTag - To update tags for routines.
- bigquery.tables.create - To create new tables.
- bigquery.tables.delete - To delete tables.
- bigquery.tables.deleteIndex - To drop search indexes on tables.
- bigquery.tables.deleteSnapshot - To delete table snapshots.
- bigquery.tables.export - To export table data out of BigQuery.
- bigquery.tables.get - To get table metadata. The bigquery.tables.getData permission must also be specified.
- bigquery.tables.getData - To get and query table data. The bigquery.tables.get permission must also be specified.
- bigquery.tables.list - To list tables and metadata for the tables.
- bigquery.tables.update - To update table metadata. The bigquery.tables.updateData permission must also be specified.
- bigquery.tables.updateData - To update table data. The bigquery.tables.update permission must also be specified.
- bigquery.tables.updateTag - To update tags for tables.
- resourcemanager.projects.get - To get the name of the billing account associated with the project.
- storage.objects.create - To allow users to create objects.
- storage.objects.delete - To grant permissions to delete objects.
- storage.objects.get - To read object metadata when listing and reading bucket metadata.
- storage.objects.list - To list objects in a bucket.
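A minimal sketch of creating such a custom role with the gcloud CLI follows. The role ID, project ID, and the shortened permission list are placeholders; replace them with your own values and the full list of permissions above:
gcloud iam roles create dbmi_custom_role \
  --project=my-project-id \
  --title="Database Ingestion and Replication" \
  --permissions=bigquery.datasets.get,bigquery.jobs.create,bigquery.tables.create,bigquery.tables.get,bigquery.tables.getData,bigquery.tables.list,bigquery.tables.update,bigquery.tables.updateData,storage.objects.create,storage.objects.delete,storage.objects.get,storage.objects.list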
Target usage
•Database Ingestion and Replication loads source data in bulk mode to Google BigQuery targets.
•For database ingestion and replication incremental load tasks, make sure that source database Change Data Capture (CDC) is enabled on all source columns.
•Database Ingestion and Replication supports special characters in table and column names in the source tables. The special characters are replaced with an underscore (_) in the Google BigQuery target table or column names.
The custom property key-value pair targetReplacementValue=toHex prevents Database Ingestion and Replication from replacing the special characters with an underscore in the generated target schema and converts the special characters to hexadecimal format.
To convert the special characters to their hexadecimal value, perform the following actions before deploying the database ingestion and replication task:
1. Create a properties file to be used at the metadata-manager layer. Add the targetReplacementValue=toHex key-value pair to the properties file.
2. Open the Runtime Environment page in Administrator and edit the Secure Agent. Create a custom property in the Custom Configuration Details area:
▪ Select the Database Ingestion service.
▪ Select the DBMI_AGENT_ENV type.
▪ Enter DBMI_TASK_OVERRIDE_PROPERTIES as the property name.
▪ Enter the properties file location as the property value.
3. Set the targetReplacementValue custom property to toHex on the Target page of the task wizard.
Before running the task, add <jobname>. to the "targetReplacementValue" key in the properties file:
<jobname>.targetReplacementValue=toHex
If you want the property to apply to all jobs, add "alljobs." to the targetReplacementValue key:
alljobs.targetReplacementValue=toHex
•If you select Audit apply mode for a database ingestion and replication task, you can select the audit metadata columns to include in the target tables on the task wizard's Target page under Advanced. If you specify a value in the Prefix for Metadata Columns field, do not include special characters. Otherwise, task deployment will fail.
•By default, the following types of source columns are mapped to Google BigQuery string columns with no length specification:
- Source columns that have a character data type
- SAP HANA source columns that have the longdate or timestamp data type
•When you deploy database ingestion and replication jobs, Database Ingestion and Replication generates Google BigQuery target tables clustered by primary key or unique key columns, by default. Each key column must have one of the following data types that Google BigQuery supports for clustering:
- STRING
- INT64
- NUMERIC
- BIGNUMERIC
- DATE
- DATETIME
- TIMESTAMP
- BOOL
- GEOGRAPHY
If any column in the primary key or unique key has an unsupported data type, that column is skipped during clustering. For example, if the primary key contains the C1, C2, C3, C4, C5 columns and C2 has an unsupported data type, the target table is created with the C1, C3, C4, and C5 columns in the CLUSTER BY clause.
Kafka targets and Kafka-enabled Azure Event Hubs targets
The following list identifies considerations for using Kafka targets:
•Database Ingestion and Replication supports Apache Kafka, Confluent Kafka, Amazon Managed Streaming for Apache Kafka (MSK), and Kafka-enabled Azure Event Hubs as targets for incremental load jobs. All of these Kafka target types use the Kafka connection type.
To indicate the Kafka target type, you must specify Kafka producer properties in the task definition or Kafka connection properties. To specify these properties for a task, enter a comma-separated list of key:value pairs in the Producer Configuration Properties field on the Target page of the task wizard. To specify the producer properties for all tasks that use a Kafka connection, enter the list of properties in the Additional Connection Properties field in the connection properties. You can override the connection-level properties for specific tasks by also defining producer properties at the task level. For more information about producer properties, see the Apache Kafka, Confluent Kafka, Amazon MSK, or Azure Event Hubs for Kafka documentation.
• If you select AVRO as the output format for a Kafka target, Database Ingestion and Replication generates a schema definition file for each table with a name in the following format:
schemaname_tablename.txt
If a source schema change is expected to alter the target in an incremental load job, Database Ingestion and Replication regenerates the Avro schema definition file with a unique name that includes a timestamp:
schemaname_tablename_YYYYMMDDhhmmss.txt
This unique naming pattern preserves older schema definition files for audit purposes.
•If you have a Confluent Kafka target that uses Confluent Schema Registry to store schemas, you must configure the following settings on the Target page of the task wizard:
- In the Output Format field, select AVRO.
- In the Avro Serialization Format field, select None.
•You can specify Kafka producer properties in either the Producer Configuration Properties field on the Target page of the task wizard or in the Additional Connection Properties field in the Kafka connection properties. Enter property=value pairs that meet your business needs and that are supported by your Kafka vendor.
For example, if you use Confluent Kafka, you can use the following entry in either the Producer Configuration Properties field or Additional Connection Properties field to specify the Schema Registry URL and enable basic authentication:
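For example, entries based on the standard Confluent client properties might look like the following. The URL, user, and password are placeholders, and the exact set of entries depends on your Schema Registry configuration:
schema.registry.url=https://<schema_registry_host>:<port>,basic.auth.credentials.source=USER_INFO,basic.auth.user.info=<user>:<password>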
If you use Amazon MSK, you can use the following Additional Connection Properties entry to enable IAM role authentication for access to Amazon MSK targets:
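For example, an entry based on the standard Amazon MSK IAM client properties might look like the following. Treat it as a sketch and confirm the property names against the Amazon MSK documentation:
security.protocol=SASL_SSL,sasl.mechanism=AWS_MSK_IAM,sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;,sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler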
Ensure that you enable IAM role authentication on the Amazon EC2 instance where the Secure Agent is installed.
For more information about Kafka properties, see the documentation of your Kafka vendor.
•Database ingestion and replication incremental load jobs can replicate change data to Kafka targets that support SASL_SSL secured access, including Confluent Kafka, Amazon MSK, and Azure Event Hubs targets. In Administrator, you must configure a Kafka connection that includes the appropriate properties in the Additional Connection Properties field. For example, for Azure Event Hubs, you could use the following Additional Connection Properties entry to enable SASL_SSL:
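For example, an entry based on the standard Event Hubs for Kafka client settings might look like the following. The namespace and connection string are placeholders:
security.protocol=SASL_SSL,sasl.mechanism=PLAIN,sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="<event_hubs_connection_string>";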
For all Kafka target types that use the Avro format, you can configure rules to generate a custom message key that consists of one or more columns for each source table. Database ingestion and replication incremental load jobs that have a Kafka target can then include the generated message key in the headers of the messages that it sends to the target messaging system. The target messaging system can use the message key to write messages with a specific key value to the same partition in a multi-partitioned topic.
To implement this feature, you must manually create a configuration file that contains rules that identify the key columns for each source table. Then specify the file in a custom configuration property in the task wizard.
Configuration file creation
Create the rule-configuration file in a text editor and save it to a location on the Secure Agent system. The file contains a rule for each source table. Each rule defines the custom key column or columns to use for writing data to topic partitions.
Note: If you change or add a rule or change one of the other parameters after the database ingestion and replication task has been deployed, you must redeploy the task for the rule change to take effect.
Rule syntax:
Define rules in the configuration file by using the parameters described below. An illustrative example follows the parameter descriptions.
To include comments in the file, begin each comment line with the number sign (#). For example:
#This text is for informational purposes only.
Parameters:
•rule. Defines a rule for generating a composite message key for a source table. In each rule, first identify the schema and table name for a source table. If you change the schema or define table renaming rules for the target, use the name of the schema or renamed table on the target. Then specify the names of one or more table columns that comprise the message key. Ensure that the columns are defined in the table. Otherwise, the database ingestion job fails. For SQL Server sources, also include the name of the database in the format: database.schema.tablename.
You can define multiple rules in the same rule-configuration file.
When generating the message key, Data Ingestion and Replication uses the character representation of each column value followed by the delimiter character. Each column value and delimiter are appended to the composite key value in the order in which the columns appear in the rule definition. The composite key is then used as the Kafka message key for a record. The position of any column that has empty or null values in the message key is represented by the delimiter character only.
•delimiter. Optional. Specifies a single character that will be used as the delimiter after each key column value in generated message keys. You can specify this parameter only once in the rule-configuration file.
Default is the semicolon (;).
•tableNotFound. Optional. Set this parameter to ABORT to cause database ingestion and replication jobs to stop processing any data for a source table and then fail when the table does not have a rule definition in the rule-configuration file. Each source table must have a rule definition for the generation of the composite message key to succeed. You can specify this parameter only once in the configuration file.
If you do not specify this parameter and the table is not found in the rule-configuration file, the default rule in the target messaging system parameters determines the key to use for a record.
•trace. Optional. Enables or disables tracing for the generation of message keys based on rule definitions. Valid values are:
- true. Enables tracing for message key generation based on rule definitions.
- false. Disables tracing for message key generation based on rule definitions.
You can specify this parameter only once in the rule-configuration file.
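For illustration only, a rule-configuration file might look like the following. The rule layout shown here (schema.table followed by a colon-separated column list) is an assumption for readability, and the schema, table, and column names are hypothetical:
#Message key rules for the SALES schema
rule=SALES.ORDERS:ORDER_ID,ORDER_DATE
rule=SALES.CUSTOMERS:CUSTOMER_ID
delimiter=;
tableNotFound=ABORT
trace=false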
Database ingestion and replication task configuration
When you create a database ingestion and replication incremental load task that has a Kafka target, you must set the following options to enable the generation of custom message keys:
•On the Target page of the task wizard, make sure that the Use Table Name as Topic Name check box is cleared. Then enter the topic name in the Topic Name field.
•In the Output Format field, select Avro. You can select any Avro format in the Avro Format field.
•Under Custom Properties, specify the captureColumnValuesFile property with a path value that points to the rule-configuration file you created on the Secure Agent system.
Microsoft Azure Synapse Analytics targets
The following list identifies considerations for preparing and using Microsoft Azure Synapse Analytics targets:
•To deploy and run a database ingestion and replication task with a Microsoft Azure Synapse Analytics target, the target connection must specify a database user who has the CONTROL permission on the target database. To grant the CONTROL permission to a user, use the following SQL statements:
USE database_name; GRANT CONTROL TO user_name;
This permission is required for initial load, incremental load, and combined initial and incremental load jobs. This permission allows Database Ingestion and Replication to create target tables and database objects, such as external data source, external file format, and database scoped credential objects, if they do not exist in the database. This permission is specifically required for creating external data source and database scoped credential objects.
Note: You must manually create the master key. To create the master key, you must have the CONTROL permission on the database.
•Database ingestion and replication jobs first send data to a Microsoft Azure Data Lake Storage Gen2 staging file before writing the data to Microsoft Azure Synapse Analytics target tables. The staging file uses the hexadecimal value 0x1D as the field delimiter. After the data is written to the target, the entire contents of the table-specific directory that includes the staging files are deleted.
•If you use Microsoft Azure Data Lake Storage Gen2 with a Synapse Analytics connection, you must enable the Hierarchical namespace option in Microsoft Azure Data Lake Storage. With this setting, use of blob storage is not recommended.
•The number of columns in a source table that a database ingestion and replication job can propagate to a Microsoft Azure Synapse Analytics target must not exceed 508 columns.
•Database ingestion and replication incremental load jobs and combined initial and incremental load jobs generate a recovery table named INFORMATICA_CDC_RECOVERY on the target to store internal service information that prevents jobs restarted after a failure from propagating previously processed data again. This recovery table is generated in the same schema as the target tables.
•After a database ingestion and replication job loads data to a Microsoft Azure Synapse Analytics target by using external tables, the job does not drop the log tables and external tables created on the target, even though these tables might be re-created when the job starts again.
Microsoft SQL Server and Azure SQL Database targets
You can use Microsoft SQL Server or Microsoft Azure SQL Database targets in initial load, incremental load, and combined initial and incremental load jobs. SQL Server targets include on-premises, RDS, and Azure SQL Managed Instance targets.
The following list identifies considerations for preparing and using Microsoft SQL Server targets:
•Database Ingestion and Replication does not support Microsoft SQL Server as a target in jobs that replicate data from Teradata sources.
•The SQL Server JDBC driver is delivered with Database Ingestion and Replication. You do not need to install it separately.
•The Database Ingestion and Replication user requires the following database roles, at minimum, to create target tables and write data to the tables (see the example statements after this list):
- db_datareader
- db_datawriter
- db_ddladmin
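A minimal sketch of granting these roles follows, assuming a hypothetical login and user named dbmi_user and a hypothetical target database name:
USE target_database;
CREATE USER dbmi_user FOR LOGIN dbmi_user;
ALTER ROLE db_datareader ADD MEMBER dbmi_user;
ALTER ROLE db_datawriter ADD MEMBER dbmi_user;
ALTER ROLE db_ddladmin ADD MEMBER dbmi_user;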
•In Administrator, when you define a SQL Server connection for connecting to a SQL Server target, set the following required connection properties:
- SQL Server Version. Select either SQL Server 2017 or SQL Server 2019.
- Authentication Mode. Select SQL Server Authentication or Windows Authentication v2.
- User Name
- Password
- Host
- Port
- Database Name
- Schema
- Code Page
Optionally, to use SSL encryption for the target connection, you can set the Encryption Method property to one of the following options: SSL, Request SSL, or Login SSL. If you enable encryption, also set the following properties:
- Crypto Protocol Version
- Validate Server Certificate. Select True.
- Trust Store
- Trust Store Password
- Host Name in Certificate
Other connection properties are not supported.
•Database ingestion and replication incremental load and initial and incremental load jobs with a SQL Server target generate a LOG table based on the target table schema, with some additional metadata columns. The LOG table is created right before change data is flushed to the target. The incoming DML data is inserted to the LOG table by supplying a local CSV file to the Bulk Copy API of the SQL Server driver. A merge apply statement is generated based on the information in the LOG table, and then the DML operations are applied to the actual target table. After the DML changes are applied, the LOG table is dropped.
The LOG table might cause a temporary spike in additional space or size requirements in the customer database instance if you run multiple jobs or a job with multiple tables. The space and size required by the LOG table depends on the number of rows received as part of a flush cycle.
•The number of columns in a source table that a database ingestion and replication incremental load or an initial and incremental load job can propagate to a SQL Server target must not exceed 508 columns. If a source table contains more than 508 columns, the job fails while creating the LOG table.
•Database ingestion and replication incremental load jobs and combined initial and incremental load jobs generate a recovery table named INFORMATICA_CDC_RECOVERY on the target to store internal service information that prevents jobs restarted after a failure from propagating previously processed data again. This recovery table is generated in the same schema as the target tables.
Oracle targets
The following list identifies considerations for preparing and using Oracle targets.
Target preparation
•Database Ingestion and Replication requires users to have certain privileges to load data to Oracle target databases. For on-premises Oracle targets, grant the following user privileges to the Database Ingestion and Replication user (cmid_user) who connects to the Oracle target:
GRANT CREATE SESSION TO cmid_user;
GRANT SELECT ON "PUBLIC".V$DATABASE TO cmid_user;
GRANT SELECT ON "PUBLIC".V$CONTAINERS TO cmid_user;
GRANT SELECT ON DBA_USERS TO cmid_user;
GRANT SELECT ON DBA_TABLES TO cmid_user;
GRANT SELECT ON DBA_OBJECT_TABLES TO cmid_user;
GRANT SELECT ON DBA_INDEXES TO cmid_user;
GRANT SELECT ON DBA_OBJECTS TO cmid_user;
GRANT SELECT ON DBA_VIEWS TO cmid_user;
GRANT CREATE TABLE <schema.table> TO cmid_user; -- Unless you grant on ANY TABLE
GRANT SELECT ON ALL_CONSTRAINTS TO cmid_user;
GRANT SELECT ON ALL_OBJECTS TO cmid_user;
GRANT SELECT ON SYS.TAB$ TO cmid_user;
GRANT SELECT ON SYS.RECYCLEBIN$ TO cmid_user;
GRANT SELECT ON SYS.COL$ TO cmid_user;
-- Grant the following if cmid_user is the owner of the target schema:
GRANT SELECT ON SYS.CCOL$ TO cmid_user;
GRANT SELECT ON SYS.CDEF$ TO cmid_user;
GRANT SELECT ON SYS.OBJ$ TO cmid_user;
GRANT SELECT ON SYS.COLTYPE$ TO cmid_user;
GRANT SELECT ON SYS.ATTRCOL$ TO cmid_user;
GRANT SELECT ON SYS.IDNSEQ$ TO cmid_user;
GRANT SELECT ON SYS.IND$ TO cmid_user;
-- Grant the following if cmid_user is NOT the owner of the target schema. If you prefer, you
-- can grant to individual target tables and indexes instead of to ANY TABLE or ANY INDEX.
GRANT ALTER SESSION TO cmid_user;
GRANT RESOURCE TO cmid_user;
GRANT SELECT ANY TABLE TO cmid_user;
GRANT SELECT ANY DICTIONARY TO cmid_user;
GRANT ALTER ANY TABLE TO cmid_user;
GRANT CREATE ANY TABLE TO cmid_user;
GRANT DROP ANY TABLE TO cmid_user;
GRANT INSERT ANY TABLE TO cmid_user;
GRANT UPDATE ANY TABLE TO cmid_user;
GRANT DELETE ANY TABLE TO cmid_user;
GRANT CREATE ANY INDEX TO cmid_user;
GRANT ALTER ANY INDEX TO cmid_user;
GRANT DROP ANY INDEX TO cmid_user;
•For Amazon RDS for Oracle targets, log in to RDS as the master user and grant the following user privileges to the Database Ingestion and Replication user (cmid_user) who connects to the Oracle target:
GRANT CREATE SESSION TO cmid_user;
GRANT SELECT ON "PUBLIC".V$DATABASE TO cmid_user;
GRANT SELECT ON DBA_USERS TO cmid_user;
GRANT SELECT ON DBA_TABLES TO cmid_user;
GRANT SELECT ON DBA_INDEXES TO cmid_user;
GRANT SELECT ON DBA_VIEWS TO cmid_user;
GRANT CREATE TABLE <schema.table> TO cmid_user;
GRANT SELECT ON SYS.TAB$ TO cmid_user;
GRANT SELECT ON SYS.COL$ TO cmid_user;
GRANT SELECT ON SYS.OBJ$ TO cmid_user;
GRANT SELECT ON SYS.IND$ TO cmid_user;
-- Grant the following if cmid_user is NOT the owner of the target schema. If you prefer, you
-- can grant to individual target tables and indexes instead of to ANY TABLE or INDEX.
GRANT ALTER SESSION TO cmid_user;
GRANT RESOURCE TO cmid_user;
GRANT SELECT ANY TABLE TO cmid_user;
GRANT SELECT ANY DICTIONARY TO cmid_user;
GRANT ALTER ANY TABLE TO cmid_user;
GRANT CREATE ANY TABLE TO cmid_user;
GRANT DROP ANY TABLE TO cmid_user;
GRANT INSERT ANY TABLE TO cmid_user;
GRANT UPDATE ANY TABLE TO cmid_user;
GRANT DELETE ANY TABLE TO cmid_user;
GRANT CREATE ANY INDEX TO cmid_user;
GRANT ALTER ANY INDEX TO cmid_user;
GRANT DROP ANY INDEX TO cmid_user;
Also, run the following Amazon RDS procedures to grant additional SELECT privileges to cmid_user:
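For example, the rdsadmin.rdsadmin_util.grant_sys_object procedure can be used to grant SELECT on SYS objects to cmid_user. The following sketch grants access to one object; repeat it for each required object and verify the object names against your environment:
begin
    rdsadmin.rdsadmin_util.grant_sys_object(
        p_obj_name  => 'TAB$',
        p_grantee   => 'CMID_USER',
        p_privilege => 'SELECT');
end;
/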
•Database Ingestion and Replication uses the minimum source constraints needed for replication when generating Oracle target objects. If a source has a primary key, the job preferentially uses the primary key and no other constraint when creating the target object. If a source in a database ingestion and replication task does not have a primary key but does have unique indexes, the task chooses the best unique index for replication in the following order of priority:
1. The unique index that has only NOT NULL columns.
2. If multiple unique indexes have only NOT NULL columns, the unique index with the fewest columns.
3. If no source primary key or unique NOT NULL index exists, the unique index with the greatest number of columns.
Only a single primary key or unique index is used.
Source object properties that are not required for replication, such as for storage and partitioning, are not reflected on the target.
•By default, Database Ingestion and Replication disables logging for the Oracle target table to optimize performance. You can enable logging by setting the writerOracleNoLogging custom property to false on the Target page in the database ingestion and replication task wizard.
PostgreSQL targets
You can use Amazon Aurora PostgreSQL and RDS for PostgreSQL as a target in initial load, incremental load, and combined initial and incremental load jobs that have a Db2 for i, Oracle, PostgreSQL, or SQL Server source.
The following list identifies considerations for preparing and using PostgreSQL targets:
• To deploy and run a database ingestion and replication task that includes a PostgreSQL target, the target connection must specify a database user who has the required privileges.
The user must have the CONNECT and TEMPORARY privileges for the database specified in the connection and the USAGE and CREATE privileges for the target schema specified in the target properties.
Use the following SQL statements to grant these privileges to a user role and then assign the role to a user:
CREATE ROLE dbmi_role;
GRANT CONNECT ON DATABASE database TO dbmi_role;
GRANT TEMPORARY ON DATABASE database TO dbmi_role;
GRANT CREATE ON SCHEMA schema TO dbmi_role;
GRANT USAGE ON SCHEMA schema TO dbmi_role;
CREATE USER dbmi_user WITH PASSWORD 'password';
GRANT dbmi_role TO dbmi_user;
Note: Privileges on the target tables that are generated when the job runs are granted to the user who runs the job.
•Database ingestion and replication incremental load jobs with a PostgreSQL target generate a LOG table based on the target table schema, with some additional metadata columns. The LOG table is created right before change data is flushed to the target. The incoming DML data is inserted to the LOG table by supplying a local CSV file to the Bulk Copy API of the PostgreSQL driver. A set of merge apply statements is generated based on the information in the LOG table, and then the DML operations are applied to the actual target table. After the DML changes are applied, the LOG table is dropped.
The LOG table might cause a temporary spike in additional space or size requirements in the customer database instance if you run multiple jobs or a job with multiple tables. The space and size required by the LOG table depends on the number of rows received as part of a flush cycle.
•The number of columns in a source table that a database ingestion and replication incremental load job can propagate to a PostgreSQL target must not exceed 796 columns. If a source table contains more than 796 columns, the job fails while creating the LOG table.
•PostgreSQL supports a maximum length of 63 characters for source object identifiers. If the length of any source table or column name exceeds 63 characters, the deployment of database ingestion and replication tasks with a PostgreSQL target will fail during validation.
•PostgreSQL stores information about the INTERVAL data type in two columns:
- data_type
- interval_type
Database ingestion and replication does not support the interval_type column for the PostgreSQL target. For tasks that have a PostgreSQL source, the INTERVAL columns of any interval_type are replicated as INTERVAL NULL to the PostgreSQL target.
•Database ingestion and replication incremental load jobs generate a recovery table named INFORMATICA_CDC_RECOVERY on the target to store internal service information that prevents jobs restarted after a failure from propagating previously processed data again. This recovery table is generated in the same schema as the target tables.
Snowflake targets
Target preparation
Target preparation varies depending on whether you use the Superpipe feature for high performance streaming of data to Snowflake target tables or write data to intermediate stage files.
With Superpipe
If you use the Superpipe feature, complete the following steps:
1. Create a Data Ingestion and Replication user. Use the following SQL statement:
create user INFACMI_User password='Xxxx@xxx';
2. Create a new user role and grant it to the Data Ingestion and Replication user. Use the following SQL statements:
create role INFACMI_superpipe;
grant role INFACMI_superpipe to user INFACMI_User;
3. Grant usage on the Snowflake virtual warehouse to the new role. Use the following SQL statement:
grant usage on warehouse warehouse_name to role INFACMI_superpipe;
4. Grant usage on the Snowflake database to the new role. Use the following SQL statement:
grant usage on database INFACMI_DB1 to role INFACMI_superpipe;
5. Create a new schema. Use the following SQL statements:
use database INFACMI_DB1;
create schema sh_superpipe;
6. Grant create stream, create view, and create table privileges on the new Snowflake schema to the new role. Use the following SQL statement:
grant create stream, create view, create table, usage on schema INFACMI_DB1.sh_superpipe to role INFACMI_superpipe;
7. Set the default role for the newly created user. Use the following SQL statement:
alter user INFACMI_User set default_role=INFACMI_superpipe;
8. Define a Snowflake Data Cloud connection to the target. You must use the KeyPair option as the authentication method. See Connectors and Connections > Snowflake Data Cloud connection properties.
9. Generate a private key with OpenSSL version 3.x.x and the PBE-SHA1-2DES or PBE-SHA1-3DES cipher. Use the following openssl commands to generate and format the private key:
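For example, a command similar to the following generates an encrypted private key in the rsa_key.p8 file. The key length, cipher choice, and file name are illustrative:
openssl genrsa 2048 | openssl pkcs8 -topk8 -v1 PBE-SHA1-2DES -inform PEM -out rsa_key.p8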
10. Generate the public key. Use the following openssl command, in which the -in option references the file (rsa_key.p8) that contains the encrypted private key:
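For example, a command similar to the following writes the public key to the rsa_key.pub file. The output file name is illustrative:
openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub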
11. In Snowflake, assign the public key to the Snowflake user. Use the following SQL command:
alter user INFACMI_User set rsa_public_key='key_value';
Next step: When you create an ingestion and replication task, select the Superpipe option on the Target page of the task wizard. You can also optionally specify a Merge Frequency value, which controls the frequency at which change data rows are merged and applied to the Snowflake target table.
Without Superpipe
If you do NOT use the Superpipe feature for Snowflake targets, complete the following steps as the ACCOUNTADMIN user:
1. Create a Data Ingestion and Replication user. Use one of the following SQL statements:
create user INFACMI_User password='Xxxx@xxx';
or
create or replace user INFACMI_User password='Xxxx@xxx';
2. Create a new role and grant the role to the Data Ingestion and Replication user. Use the following SQL statements:
create role INFA_CMI_Role;
grant role INFA_CMI_Role to user INFACMI_User;
3. Grant usage on the Snowflake virtual warehouse to the new role. Use the following SQL statement:
grant usage on warehouse CMIWH to role INFA_CMI_Role;
4. Grant usage on the Snowflake database to the new role. Use the following SQL statement:
grant usage, create schema on database CMIDB to role INFA_CMI_Role;
5. Set the default role for the newly created user. Use the following SQL statement:
alter user INFACMI_User set default_role=INFA_CMI_Role;
alter user INFACMI_User set default_role=INFA_CMI_Role;
Also, as the INFACMI_User, create a new schema:
create schema CMISchema;
Note: If the user's default role is used for ingestion and replication tasks and does not have the required privileges, the following error will be issued at runtime:
SQL compilation error: Object does not exist, or operation cannot be performed.
Usage considerations
•Database Ingestion and Replication provides alternative methods of moving data to a Snowflake Data Cloud target:
- If you select the Superpipe option when defining an ingestion task, the ingestion job uses the Snowpipe Streaming API to stream rows of data directly to the target tables with low latency. This method is available for all load types. You must use KeyPair authentication.
- If you do not use Superpipe, ingestion jobs first write the data to data files in an internal stage, which has the name you specify in the task definition.
•If you do not use Superpipe and the internal stage specified in the target properties for an ingestion job does not exist, Database Ingestion and Replication automatically creates the stage by running the following SQL command:
Create stage if not exists "Schema"."Stage_Bucket"
For the command to run successfully, the following privilege must be granted to your user role:
GRANT CREATE STAGE ON SCHEMA "Schema" TO ROLE <your_role>;
•When you define a connection for a Snowflake Data Cloud target, you must set the Additional JDBC URL Parameters field to database=target_database_name. Otherwise, when you try to define the target in the database ingestion and replication task wizard, an error message reports that the list of schemas cannot be retrieved.
•When you define a connection for a Snowflake target that uses the KeyPair authentication method and you generate the private key with OpenSSL version 3.x.x, use the PBE-SHA1-2DES or PBE-SHA1-3DES cipher. Run one of the following commands:
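For example, using illustrative file names, either of the following commands applies one of the supported ciphers:
openssl genrsa 2048 | openssl pkcs8 -topk8 -v1 PBE-SHA1-2DES -inform PEM -out rsa_key.p8
openssl genrsa 2048 | openssl pkcs8 -topk8 -v1 PBE-SHA1-3DES -inform PEM -out rsa_key.p8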
If you use a generic command without the PBE-SHA1-2DES or PBE-SHA1-3DES cipher, an error message about an invalid or unsupported private key might occur while fetching the target schema in the target definition step of the database ingestion and replication task wizard.
The error message does not occur if you use OpenSSL 1.1.1 to generate the private key.
•Database ingestion and replication incremental load jobs and combined initial and incremental load jobs generate a recovery table named INFORMATICA_CDC_RECOVERY on the target to store internal service information that prevents jobs restarted after a failure from propagating previously processed data again. This recovery table is generated in the same schema as the target tables.
•For Snowflake targets, you cannot alter the scale of NUMBER fields or change the data type of an existing field to a different data type because Snowflake does not support these actions.
•If the source tables contain columns that have special characters in the name, you can replace the special characters with a provided choice of string in the Snowflake target.
To replace the special characters with a provided choice of string in the generated target schema, perform the following actions before deploying the database ingestion and replication task:
1. Create a properties file to be used at the metadata-manager layer. Add the targetReplacementValue=<string> key-value pair to the properties file.
2. Open the Runtime Environment page in Administrator and edit the Secure Agent. Create a custom property in the Custom Configuration Details area:
▪ Select the Database Ingestion service.
▪ Select the DBMI_AGENT_ENV type.
▪ Enter DBMI_TASK_OVERRIDE_PROPERTIES as the property name.
▪ Enter the properties file location as the property value.
3. Set the targetReplacementValue=<string> custom property on the Target page of the task wizard.
Before running the task, add <jobname>. to the targetReplacementValue key in the properties file:
<jobname>.targetReplacementValue=<string>
If you want the property to apply to all jobs, add "alljobs." to the targetReplacementValue key:
alljobs.targetReplacementValue=<string>
Configure private connectivity to Snowflake
You can access Snowflake using AWS or Azure Private Link endpoints.
The AWS or Azure Private Link setup ensures that the connection to Snowflake uses the AWS or Azure internal network and does not take place over the public Internet.
Default directory structure for CDC files on Amazon S3, Google Cloud Storage, Microsoft Azure Data Lake Storage Gen2, and Microsoft Fabric OneLake targets
Database ingestion and replication jobs create directories on Amazon S3, Google Cloud Storage, Microsoft Azure Data Lake Storage Gen2, and Microsoft Fabric OneLake targets to store information about change data processing.
The following directory structure is created by default on the targets:
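A sketch of the default structure, reconstructed from the directory descriptions that follow, looks like this (names in angle brackets are placeholders):
<connection_folder>/<job_folder>/cycle/completed/<completed_cycle_folder>/
<connection_folder>/<job_folder>/cycle/contents/<cycle_folder>/
<connection_folder>/<job_folder>/data/<table_name>/schema/V1/
<connection_folder>/<job_folder>/data/<table_name>/data/<cycle_folder>/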
The following list describes the directories in the default structure:
- connection_folder - Contains the Database Ingestion and Replication objects. This folder is specified in the Folder Path field of the Amazon S3 connection properties, in the Directory Path field of the Microsoft Azure Data Lake Storage Gen2 connection properties, or in the Lakehouse Path field of the Microsoft Fabric OneLake connection properties. Note: This folder is not created for Google Cloud Storage targets.
- job_folder - Contains job output files. This folder is specified in the Directory field on the Target page of the database ingestion and replication task wizard.
- cycle/completed - Contains a subfolder for each completed CDC cycle. Each completed cycle subfolder contains a completed cycle file.
- cycle/contents - Contains a subfolder for each CDC cycle. Each cycle subfolder contains a cycle contents file.
- data - Contains output data files and schema files for each table.
- data/table_name/schema/V1 - Contains a schema file. Note: Database Ingestion and Replication does not save a schema file in this folder if the output files use the Parquet format.
- data/table_name/data - Contains a subfolder for each CDC cycle that produces output data files.
Cycle directories
Database Ingestion and Replication uses the following pattern to name cycle directories:
[dt=]yyyy-mm-dd-hh-mm-ss
The "dt=" prefix is added to cycle folder names if you select the Add Directory Tags check box on the Target page of the database ingestion and replication task wizard.
Cycle contents files
Cycle contents files are located in cycle/contents/cycle_folder subdirectories. Cycle contents files contain a record for each table that has had a DML event during the cycle. If no DML operations occurred on a table in the cycle, the table does not appear in the cycle contents file.
Database Ingestion and Replication uses the following pattern to name cycle content files:
Cycle-contents-timestamp.csv
A cycle contents csv file contains the following information:
•Table name
•Cycle name
•Path to the cycle folder for the table
•Start sequence for the table
•End sequence for the table
•Number of Insert operations
•Number of Update operations
•Number of Delete operations
•For combined load jobs only: Number of Truncate operations
•For combined load jobs only: Number of Insert operations encountered during the initial load phase
•For combined load jobs only: Number of Delete operations encountered during the initial load phase
•Schema version
•Path to the schema file for the schema version
Note: If the output data files use the Parquet format, Database Ingestion and Replication does not save a schema file at the path that is specified in the cycle contents file. Instead, you can use the schema file in the folder that is specified in the Avro Schema Directory field on the Target page of the database ingestion and replication task wizard.
Completed cycle files
Completed cycle files are located in cycle/completed/completed_cycle_folder subdirectories. A database ingestion and replication job creates a cycle file in this subdirectory after a cycle completes. If this file is not present, the cycle has not completed yet.
Database Ingestion and Replication uses the following pattern to name completed cycle files:
Cycle-timestamp.csv
A completed cycle csv file contains the following information:
•Cycle name
•Cycle start time
•Cycle end time
•Current sequence number at the time the cycle ended
•Path to the cycle contents file
•Reason for the end of cycle
Valid reason values are:
- NORMAL_COMMIT. A commit operation was encountered after the cycle had reached the DML limit or the end of the cycle interval. A cycle can end only on a commit boundary.
- NORMAL_EXPIRY. The cycle ended because the cycle interval expired. The last operation was a commit.
- For combined load jobs only: BACKLOG_COMPLETED. The cycle ended because CDC backlog processing completed. The CDC backlog consists of events captured during the initial load phase of the combined job. The backlog includes potential DML changes captured at the beginning or end of the initial load phase and during the transition from the initial load phase to the main CDC incremental processing.
- For combined load jobs only: INITIAL_LOAD_COMPLETED. The cycle ended because the initial load completed.
- For combined load jobs only: RESYNC_STARTED. The cycle ended because the table resync initiated.
Output data files
The data files contain records that include the following information:
•Operation type. Valid values are:
- I for Insert operations.
- U for Update operations.
- D for Delete operations.
- For combined load jobs only: T for Truncate operations.
- For combined load jobs only: X for Delete operations encountered during the initial load phase of a combined load job
- For combined load jobs only: Y for Insert operations encountered during the initial load phase of a combined load job
•Sortable sequence number. In combined initial and incremental load jobs, the sortable sequence number contains a 20-digit prefix that can be used to align rows with the resync version and the load job. The prefix is a combination of the following attributes:
1. Incarnation. This nine-digit number is incremented each time the table is resynced. The initial value is 1.
2. Schema version. This nine-digit number is incremented each time a schema drift change is propagated for the table. The initial value is 1.
3. Phase. This two-digit number changes as processing transitions from the unload phase, to the backlog phase, to CDC. Valid values are:
▪ 00 for Truncation, which is the first data record written during initial load or resync
▪ 01 for a normal insert during initial load or resync
▪ 02 for a change detected during the initial load
▪ 03 for a change detected after the initial load or resync is completed but before the transition back to the main CDC phase
▪ 04 for a change detected during the normal CDC phase
•Data columns
Note: Insert and Delete records contain only after images. Update records contain both before and after images.
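The following Python sketch decodes the 20-digit prefix described in the list above (a nine-digit incarnation, a nine-digit schema version, and a two-digit phase) and maps the operation type codes to readable labels. The sample record at the end is hypothetical and only illustrates how the pieces fit together.

```python
# Operation type codes as documented above; T, X, and Y appear only in
# combined initial and incremental load jobs.
OPERATION_TYPES = {
    "I": "insert",
    "U": "update",
    "D": "delete",
    "T": "truncate",
    "X": "delete during initial load phase",
    "Y": "insert during initial load phase",
}

PHASES = {
    "00": "truncation (first record of initial load or resync)",
    "01": "normal insert during initial load or resync",
    "02": "change detected during the initial load",
    "03": "change detected after initial load or resync, before the CDC phase",
    "04": "change detected during the normal CDC phase",
}

def decode_sequence_prefix(sequence: str) -> dict:
    """Split the 20-digit prefix of a sortable sequence number.

    Layout: 9-digit incarnation + 9-digit schema version + 2-digit phase,
    followed by the rest of the sequence value.
    """
    return {
        "incarnation": int(sequence[0:9]),
        "schema_version": int(sequence[9:18]),
        "phase": PHASES.get(sequence[18:20], "unknown"),
        "remainder": sequence[20:],
    }

# Hypothetical record for illustration: incarnation 1, schema version 1,
# phase 01 (normal insert during initial load), then the rest of the sequence.
record = {"op": "I", "sequence": "000000001" + "000000001" + "01" + "0000000042"}
print(OPERATION_TYPES[record["op"]], decode_sequence_prefix(record["sequence"]))
```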
Custom directory structure for output files on Amazon S3, Google Cloud Storage, Flat File, Microsoft Fabric OneLake, and ADLS Gen2 targets
You can configure a custom directory structure for the output files that initial load, incremental load, and combined initial and incremental load jobs write to Amazon S3, Google Cloud Storage, Flat File, Microsoft Azure Data Lake Storage (ADLS) Gen2, or Microsoft Fabric OneLake targets if you do not want to use the default structure.
Initial loads
By default, initial load jobs write output files to tablename_timestamp subdirectories under the parent directory. For Amazon S3, Flat File, and ADLS Gen2 targets, the parent directory is specified in the target connection properties if the Connection Directory as Parent check box is selected on the Target page of the task wizard.
•In an Amazon S3 connection, this parent directory is specified in the Folder Path field.
•In a Flat File connection, the parent directory is specified in the Directory field.
•In an ADLS Gen2 connection, the parent directory is specified in the Directory Path field.
For Google Cloud Storage targets, the parent directory is the bucket container specified in the Bucket field on the Target page of the task wizard.
For Microsoft Fabric OneLake targets, the parent directory is the path specified in the Lakehouse Path field in the Microsoft Fabric OneLake connection properties.
You can customize the directory structure to suit your needs. For example, for initial loads, you can write the output files under a root directory or directory path that is different from the parent directory specified in the connection properties to better organize the files for your environment or to find them more easily. Or you can consolidate all output files for a table directly in a directory with the table name rather than write the files to separate timestamped subdirectories, for example, to facilitate automated processing of all of the files.
To configure a directory structure, you must use the Data Directory field on the Target page of the ingestion task wizard. The default value is {TableName}_{Timestamp}, which causes output files to be written to tablename_timestamp subdirectories under the parent directory. You can configure a custom directory path by creating a directory pattern that consists of any combination of case-insensitive placeholders and directory names. The placeholders are:
•{TableName} for a target table name
•{Timestamp} for the date and time, in the format yyyymmdd_hhmissms, at which the initial load job started to transfer data to the target
•{Schema} for the target schema name
•{YY} for a two-digit year
•{YYYY} for a four-digit year
•{MM} for a two-digit month value
•{DD} for a two-digit day in the month
A pattern can also include the following functions:
•toLower() to use lowercase for the values represented by the placeholder in parentheses
•toUpper() to use uppercase for the values represented by the placeholder in parentheses
By default, the target schema is also written to the data directory. If you want to use a different directory for the schema, you can define a directory pattern in the Schema Directory field.
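If you want to predict the directories that a given pattern will produce before you run the task, a small resolver can help. The following Python sketch expands the placeholders and the toLower()/toUpper() functions described above. It assumes that a function argument is a placeholder in curly brackets and that the documented timestamp format yyyymmdd_hhmissms means hours, minutes, seconds, and milliseconds. It illustrates the pattern semantics only and is not the product's implementation.

```python
import re
from datetime import datetime

def render_pattern(pattern: str, table: str, schema: str, started_at: datetime) -> str:
    """Expand a directory pattern such as demo/toLower({TableName})/{YYYY}/{MM}/{DD}."""
    values = {
        "tablename": table,
        "schema": schema,
        # Documented timestamp format yyyymmdd_hhmissms, interpreted here as
        # hours, minutes, seconds, and milliseconds.
        "timestamp": started_at.strftime("%Y%m%d_%H%M%S") + f"{started_at.microsecond // 1000:03d}",
        "yy": started_at.strftime("%y"),
        "yyyy": started_at.strftime("%Y"),
        "mm": started_at.strftime("%m"),
        "dd": started_at.strftime("%d"),
    }

    def replace_placeholder(match: re.Match) -> str:
        # Placeholders are case-insensitive; unknown placeholders are left as-is.
        return values.get(match.group(1).lower(), match.group(0))

    def replace_function(match: re.Match) -> str:
        inner = re.sub(r"\{(\w+)\}", replace_placeholder, match.group(2))
        return inner.lower() if match.group(1).lower() == "tolower" else inner.upper()

    # Resolve toLower(...)/toUpper(...) first, then bare placeholders.
    resolved = re.sub(r"(?i)(tolower|toupper)\(([^)]*)\)", replace_function, pattern)
    return re.sub(r"\{(\w+)\}", replace_placeholder, resolved)

start = datetime(2024, 5, 1, 9, 30, 15)
# Default initial-load pattern:
print(render_pattern("{TableName}_{Timestamp}", "ORDERS", "SALES", start))
# A custom pattern that groups files by table name and date:
print(render_pattern("demo/toLower({TableName})/{YYYY}/{MM}/{DD}", "ORDERS", "SALES", start))
```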
Example 1
You are using an Amazon S3 target and want to write output files and the target schema to the same directory, which is under the parent directory specified in the Folder Path field of the connection properties. In this case, the parent directory is idr-test/DEMO/. You want to write all of the output files for a table to a directory that has a name matching the table name, without a timestamp. You must complete the Data Directory field and select the Connection Directory as Parent check box. The following image shows this configuration on the Target page of the task wizard:
Based on this configuration, the resulting directory structure is:
Example 2
You are using an Amazon S3 target and want to write output data files to a custom directory path and write the target schema to a separate directory path. To use the directory specified in the Folder Path field in the Amazon S3 connection properties as the parent directory for the data directory and schema directory, select Connection Directory as Parent. In this case, the parent directory is idr-test/DEMO/. In the Data Directory and Schema Directory fields, define directory patterns by using a specific directory name, such as data_dir and schema_dir, followed by the default {TableName}_{Timestamp} placeholder value. The placeholder creates tablename_timestamp destination directories. The following image shows this configuration on the Target page of the task wizard:
Based on this configuration, the resulting data directory structure is:
And the resulting schema directory structure is:
Incremental loads and combined initial and incremental loads
By default, incremental load and combined initial and incremental load jobs write cycle files and data files to subdirectories under the parent directory. However, you can create a custom directory structure to organize the files to best suit your organization's requirements.
This feature applies to database ingestion incremental load and combined initial and incremental load tasks that have Amazon S3, Google Cloud Storage, Microsoft Fabric OneLake, or Microsoft Azure Data Lake Storage (ADLS) Gen2 targets.
For Amazon S3 and ADLS Gen2 targets, the parent directory is set in the target connection properties if the Connection Directory as Parent check box is selected on the Target page of the task wizard.
•In an Amazon S3 connection, the parent directory is specified in the Folder Path field.
•In an ADLS Gen2 connection, the parent directory is specified in the Directory Path field.
For Google Cloud Storage targets, the parent directory is the bucket container specified in the Bucket field on the Target page of the task wizard.
For Microsoft Fabric OneLake targets, the parent directory is the path specified in the Lakehouse Path field in the Microsoft Fabric OneLake connection properties.
You can customize the directory structure to suit your needs. For example, you can write the data and cycle files under a target directory for the task instead of under the parent directory specified in the connection properties. Alternatively, you can 1) consolidate table-specific data and schema files under a subdirectory that includes the table name, 2) partition the data files and summary contents and completed files by CDC cycle, or 3) create a completely customized directory structure by defining a pattern that includes literal values and placeholders. For example, if you want to run SQL-type expressions to process the data based on time, you can write all data files directly to timestamp subdirectories without partitioning them by CDC cycle.
To configure a custom directory structure for an incremental load task, define a pattern for any of the following optional fields on the Target page of the ingestion task wizard:
Field
Description
Default
Task Target Directory
Name of a root directory to use for storing output files for an incremental load task.
If you select the Connection Directory as Parent option, you can still optionally specify a task target directory. It will be appended to the parent directory to form the root for the data, schema, cycle completion, and cycle contents directories.
This field is required if the {TaskTargetDirectory} placeholder is specified in patterns for any of the following directory fields.
None
Connection Directory as Parent
Select this check box to use the parent directory specified in the connection properties.
This field is not available for the Microsoft Fabric OneLake target.
Selected
Data Directory
Path to the subdirectory that contains the data files.
In the directory path, the {TableName} placeholder is required if data and schema files are not partitioned by CDC cycle.
{TaskTargetDirectory}/data/{TableName}/data
Schema Directory
Path to the subdirectory in which to store the schema file if you do not want to store it in the data directory.
In the directory path, the {TableName} placeholder is required if data and schema files are not partitioned by CDC cycle.
{TaskTargetDirectory}/data/{TableName}/schema
Cycle Completion Directory
Path to the directory that contains the cycle completed file.
{TaskTargetDirectory}/cycle/completed
Cycle Contents Directory
Path to the directory that contains the cycle contents files.
{TaskTargetDirectory}/cycle/contents
Use Cycle Partitioning for Data Directory
Causes a timestamp subdirectory to be created for each CDC cycle, under each data directory.
If this option is not selected, individual data files are written to the same directory without a timestamp, unless you define an alternative directory structure.
Selected
Use Cycle Partitioning for Summary Directories
Causes a timestamp subdirectory to be created for each CDC cycle, under the summary contents and completed subdirectories.
Selected
List Individual Files in Contents
Lists individual data files under the contents subdirectory.
If Use Cycle Partitioning for Summary Directories is cleared, this option is selected by default. All of the individual files are listed in the contents subdirectory unless you configure custom subdirectories by using placeholders, such as those for the timestamp or date.
If Use Cycle Partitioning for Data Directory is selected, you can still optionally select this check box to list individual files and group them by CDC cycle.
Not selected if Use Cycle Partitioning for Summary Directories is selected.
Selected if Use Cycle Partitioning for Summary Directories is cleared.
A directory pattern consists of any combination of case-insensitive placeholders, shown in curly brackets { }, and specific directory names. The following placeholders are supported:
•{TaskTargetDirectory} for a task-specific base directory on the target to use instead of the directory specified in the connection properties
•{TableName} for a target table name
•{Timestamp} for the date and time, in the format yyyymmdd_hhmissms
•{Schema} for the target schema name
•{YY} for a two-digit year
•{YYYY} for a four-digit year
•{MM} for a two-digit month value
•{DD} for a two-digit day in the month
Note: The timestamp, year, month, and day placeholders indicate when the CDC cycle started when specified in patterns for data, contents, and completed directories, or indicate when the CDC job started when specified in the schema directory pattern.
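To see how these settings fit together, the following Python sketch composes the paths that the default patterns in the table above would produce for one table and one CDC cycle, adding the timestamp subdirectories that cycle partitioning creates. The parent directory, task target directory, and table name are the sample values from Example 1 below; the function illustrates how the pieces combine and is not the product's path logic.

```python
from datetime import datetime
import posixpath

def default_incremental_paths(parent, task_target_dir, table, cycle_start, cycle_partitioning=True):
    """Compose the default data, schema, and cycle directories for one table and cycle."""
    # {TaskTargetDirectory} appended to the parent directory from the connection.
    root = posixpath.join(parent, task_target_dir)
    # Documented timestamp format yyyymmdd_hhmissms, interpreted here as
    # hours, minutes, seconds, and milliseconds.
    ts = cycle_start.strftime("%Y%m%d_%H%M%S") + f"{cycle_start.microsecond // 1000:03d}"
    paths = {
        "data": posixpath.join(root, "data", table, "data"),
        "schema": posixpath.join(root, "data", table, "schema"),
        "cycle_completed": posixpath.join(root, "cycle", "completed"),
        "cycle_contents": posixpath.join(root, "cycle", "contents"),
    }
    if cycle_partitioning:
        # Cycle partitioning adds a timestamp subdirectory per CDC cycle.
        # Note: per the documentation above, the schema directory timestamp
        # actually reflects when the CDC job started rather than the cycle;
        # a single timestamp is reused here only to keep the sketch short.
        paths = {name: posixpath.join(path, ts) for name, path in paths.items()}
    return paths

for name, path in default_incremental_paths(
        "idr-test/dbmi", "s3_target", "pgs001_src_allint_init",
        datetime(2024, 5, 1, 9, 30, 15)).items():
    print(f"{name}: {path}")
```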
Example 1
You want to accept the default directory settings for incremental load or combined initial and incremental load jobs as displayed in the task wizard. The target type is Amazon S3. Because the Connection Directory as Parent check box is selected by default, the parent directory path that is specified in the Folder Path field of the Amazon S3 connection properties is used. This parent directory is idr-test/dbmi/. You also must specify a task target directory name, in this case, s3_target, because the {TaskTargetDirectory} placeholder is used in the default patterns in the subsequent directory fields. The files in the data directory and schema directory will be grouped by table name because the {TableName} placeholder is included in their default patterns. Also, because cycle partitioning is enabled, the files in the data directory, schema directory, and cycle summary directories will be subdivided by CDC cycle. The following image shows the default configuration settings on the Target page of the task wizard, except for the specified task target directory name:
Based on this configuration, the resulting data directory structure is:
If you drill down on the data folder and then on a table in that folder (pgs001_src_allint_init), you can access the data and schema subdirectories:
If you drill down on the data folder, you can access the timestamp directories for the data files:
If you drill down on cycle, you can access the summary contents and completed subdirectories:
Example 2
You want to create a custom directory structure for incremental load or combined initial and incremental load jobs that adds the subdirectories "demo" and "d1" in all of the directory paths except in the schema directory so that you can easily find the files for your demos. Because the Connection Directory as Parent check box is selected, the parent directory path (idr-test/dbmi/) that is specified in the Folder Path field of the Amazon S3 connection properties is used. You also must specify the task target directory because the {TaskTargetDirectory} placeholder is used in the patterns in the subsequent directory fields. The files in the data directory and schema directory will be grouped by table name. Also, because cycle partitioning is enabled, the files in the data, schema, and cycle summary directories will be subdivided by CDC cycle. The following image shows the custom configuration on the Target page of the task wizard:
Based on this configuration, the resulting data directory structure is:
Supported Avro data types
Database Ingestion and Replication supports some of the primitive and logical data types that Avro schemas provide. These data types pertain to target types that support Avro or Parquet output format.
A primitive data type represents a single data value. A logical data type is an Avro primitive or complex data type that has additional attributes to represent a derived type.
The following table lists the primitive Avro data types that Database Ingestion and Replication supports:
Primitive data type
Description
INT
32-bit signed integer
LONG
64-bit signed integer
FLOAT
Single precision (32-bit) IEEE 754 floating-point number
DOUBLE
Double precision (64-bit) IEEE 754 floating-point number
BYTES
Sequence of 8-bit unsigned bytes
STRING
Unicode character sequence
The following table lists the logical Avro data types that Database Ingestion and Replication supports:
Logical data type
Description
DECIMAL
An arbitrary-precision signed decimal number of the form unscaled × 10^-scale
DATE
A date, without reference to a time or time zone.
TIME
A time of day that has the precision of 1 millisecond or 1 microsecond, without reference to a time zone or date.
TIMESTAMP
A date and time value that has the precision of 1 millisecond or 1 microsecond, without reference to a particular calendar or time zone.
For Databricks targets, Database Ingestion and Replication does not use the following data types in the intermediate Parquet files:
•TIMESTAMP, with millisecond precision
•TIME, with either millisecond or microsecond precision
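As a concrete illustration of the supported types, the following Python sketch defines and validates an Avro record schema that combines the primitive types with the DECIMAL, DATE, TIME, and TIMESTAMP logical types. The record and field names are hypothetical, and the third-party fastavro library is used only as a convenient schema validator; any Avro-aware tool would work.

```python
from fastavro import parse_schema  # third-party: pip install fastavro

# Hypothetical record schema showing the supported primitive and logical types.
schema = {
    "type": "record",
    "name": "orders_row",
    "fields": [
        {"name": "order_id", "type": "long"},     # LONG
        {"name": "line_no", "type": "int"},       # INT
        {"name": "discount", "type": "float"},    # FLOAT
        {"name": "total", "type": "double"},      # DOUBLE
        {"name": "payload", "type": "bytes"},     # BYTES
        {"name": "status", "type": "string"},     # STRING
        {"name": "price",                         # DECIMAL: unscaled x 10^-scale
         "type": {"type": "bytes", "logicalType": "decimal", "precision": 38, "scale": 10}},
        {"name": "order_date",                    # DATE
         "type": {"type": "int", "logicalType": "date"}},
        {"name": "order_time",                    # TIME with millisecond precision
         "type": {"type": "int", "logicalType": "time-millis"}},
        {"name": "updated_at",                    # TIMESTAMP with microsecond precision
         "type": {"type": "long", "logicalType": "timestamp-micros"}},
    ],
}

parsed = parse_schema(schema)  # raises an error if the schema is not valid Avro
print(parsed["name"])
```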