Review the key concepts in this section to understand some configuration options before you configure a database ingestion and replication task.
Schema drift handling
Database Ingestion and Replication can be configured to automatically detect some source schema changes and handle these changes on the target. This process is referred to as schema drift.
Database Ingestion and Replication can detect the following types of source schema changes:
•Add column
•Modify column
•Drop column
•Rename column
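These change types correspond to source DDL operations such as the following. This is only an illustrative sketch: the table and column names are hypothetical, and the exact ALTER TABLE syntax varies by source database.
ALTER TABLE orders ADD COLUMN discount_pct NUMERIC(5,2);            -- Add column
ALTER TABLE orders ALTER COLUMN status SET DATA TYPE VARCHAR(30);   -- Modify column
ALTER TABLE orders DROP COLUMN legacy_flag;                         -- Drop column
ALTER TABLE orders RENAME COLUMN cust_id TO customer_id;            -- Rename column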
When you define a task, on the Let's Go page of the database ingestion and replication task wizard, you can configure how the supported types of schema changes are handled. For example, you can configure schema drift options to ignore the changes, replicate them, or stop the job or subtask when a schema change occurs. For more information, see Finalize the task definition. Note that different types of schema changes might have different default settings, depending on the target type.
Schema drift options are supported for the following source-target combinations and load types:
•Source: Db2 for i
Load types: Incremental; combined initial and incremental
Targets: Amazon Redshift, Amazon S3, Databricks, Google BigQuery, Google Cloud Storage, Kafka (incremental loads only), Microsoft Azure Data Lake Storage, Microsoft Azure Synapse Analytics, Microsoft Fabric OneLake, Oracle, Oracle Cloud Object Storage, PostgreSQL, Snowflake, and SQL Server
•Source: Db2 for LUW
Load types: Incremental; combined initial and incremental
Targets: Snowflake
•Source: Db2 for z/OS (except Db2 11)
Load types: Incremental; combined initial and incremental
Targets: Amazon Redshift, Amazon S3, Databricks, Google BigQuery, Google Cloud Storage, Kafka (incremental loads only), Microsoft Azure Data Lake Storage, Microsoft Azure Synapse Analytics, Microsoft Fabric OneLake, Oracle, Oracle Cloud Object Storage, Snowflake, and SQL Server
•Source: Microsoft SQL Server
Load types: Incremental; combined initial and incremental
Targets: Amazon Redshift, Amazon S3, Databricks, Google BigQuery, Google Cloud Storage, Kafka (incremental loads only), Microsoft Azure Data Lake Storage, Microsoft Azure Synapse Analytics, Microsoft Fabric OneLake, Oracle, Oracle Cloud Object Storage, PostgreSQL, Snowflake, and SQL Server
•Source: Oracle
Load types: Incremental; combined initial and incremental
Targets: Amazon Redshift, Amazon S3, Databricks, Google BigQuery, Google Cloud Storage, Kafka (incremental loads only), Microsoft Azure Data Lake Storage, Microsoft Azure SQL Database, Microsoft Azure Synapse Analytics, Microsoft Fabric OneLake, Oracle, Oracle Cloud Object Storage, PostgreSQL, Snowflake, and SQL Server
•Source: PostgreSQL
Load types: Incremental; combined initial and incremental
Targets (incremental loads): Amazon Redshift, Amazon S3, Databricks, Google BigQuery, Google Cloud Storage, Kafka, Microsoft Azure Data Lake Storage, Microsoft Azure Synapse Analytics, Microsoft Fabric OneLake, Oracle, Oracle Cloud Object Storage, PostgreSQL, and Snowflake
Targets (combined initial and incremental loads): Oracle, PostgreSQL, and Snowflake
Applying an audit history of all source table DML changes to target tables
You can configure database ingestion and replication incremental load and combined initial and incremental load tasks that have Databricks, Google BigQuery, Oracle, PostgreSQL, Snowflake, or SQL Server targets to write an audit trail of every DML change operation made on the source tables to the target. A row for each DML change on a source table is written to the generated target table along with the audit columns you select. The audit columns contain metadata about the change, such as the DML operation type, time, owner, transaction ID, generated ascending sequence number, and before image.
When you define the task, select Audit in the Apply Mode field on the Target page. The Apply Mode field is available for new or undeployed tasks.
To specify the audit metadata columns to add, select one or more of the check boxes under Advanced on the Target page:
•Add Last Replicated Time. Adds a column that records the timestamp at which a record was inserted or last updated in the target table. For initial loads, all loaded records have the same timestamp, except for Snowflake targets that use the Superpipe option where the seconds or minutes might vary slightly. For incremental loads and combined initial and incremental loads, the column records the timestamp of the last DML operation that was applied to the target.
Note: This option is available for Google BigQuery, Microsoft Azure Synapse Analytics, Snowflake, and SQL Server targets only.
•Add Operation <metadata_type>. Adds columns that contain metadata for change operations, such as the DML operation type, time, owner, transaction ID, and generated ascending sequence number. The columns are populated when data is loaded to the target tables.
•Add Before Images. Adds _OLD columns that contain before-image data for Updates. You can compare the old and new column values in the tables.
•Prefix for Metadata Columns. Add a prefix to the names of the added audit columns to differentiate them from other table columns. Default is INFA_.
These fields are optional. Only the Add Operation Type check box is selected by default to add a column that shows the DML operation type: I (insert), D (delete), or U (update).
The first time you run a job associated with the task, the job generates the target tables with the selected audit metadata columns. Ensure that no constraints other than indexes exist on the target tables.
For targets that support Soft Deletes mode, for each delete operation processed as a soft delete, the operation type of "D" appears in the INFA_OPERATION_TYPE column and values are written to any other metadata columns that you selected. However, for update and insert operations, the INFA_OPERATION_TYPE column and all other selected metadata columns are NULL.
Example for Audit mode
For example, assume the following DML change operations occur on a source table in the order shown:
Insert into tableA pkey = 1
Update tableA where pkey = 1
Update tableA where pkey = 1
Delete from tableA where pkey = 1
All of the following rows appear in the target table, providing an audit trail of all of the source DML changes:
In this example, the only audit column selected is opType.
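Assuming the default I, U, and D operation type values, the audit trail in the target table would look similar to the following illustrative sketch, where pkey is the source table's primary key column:
pkey   opType
1      I
1      U
1      U
1      D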
Note that when the task's apply mode is Standard, none of these rows appears in the target table because the last DML operation is a Delete, which supersedes the prior changes.
Note: If a combined initial and incremental load job captures an incremental insert change record during the initial unload phase, the job manufactures a delete for the same row to remove any duplicate that might have been obtained from the initial unload. This manufactured activity will be reflected in audit apply mode.
Applying deletes as soft deletes on the target
For database ingestion and replication incremental load and combined initial and incremental load jobs that have any supported source type and a Databricks, Google BigQuery, PostgreSQL, Snowflake, or SQL Server target, you can configure the task to process delete operations on the source as soft deletes on the target.
A soft delete marks a deleted row as deleted without actually removing it from the database. The row is applied to the target with the value of "D" in the generated INFA_OPERATION_TYPE metadata column.
Important: You can use Soft Deletes only if all of the source objects have primary keys and the source does not allow the primary key values to change in any row after the row is first created. If the primary key values change, duplicate rows might be written to the target, causing the target to become corrupted.
Example scenario: Your organization wants to use soft deletes in a data warehouse to mark the rows that were deleted at the source while still retaining the rows for audit purposes.
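For example, a downstream query can exclude the soft-deleted rows. The following statement is only an illustrative sketch that assumes the default INFA_ metadata column prefix and a hypothetical target table named customer_dim:
SELECT *
FROM customer_dim
WHERE INFA_OPERATION_TYPE IS NULL      -- inserts and updates leave the metadata column NULL
   OR INFA_OPERATION_TYPE <> 'D';      -- exclude rows marked as soft deleted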
To enable soft deletes, set the Apply Mode field to Soft Deletes on the Target page of the task wizard when you configure the database ingestion and replication task.
Note: If a combined initial and incremental load job captures an incremental insert change record during the initial unload phase, the job manufactures a delete for the same row to remove any duplicate that might have been obtained from the initial unload. This manufactured activity will be reflected in the soft delete apply mode.
CDC staging groups and tasks
You can define a CDC staging group that includes multiple database ingestion and replication incremental load or combined initial and incremental load tasks that have a Db2 for i, Db2 for z/OS, Oracle, Microsoft SQL Server, or SAP HANA source. SAP HANA sources must use the Log-based CDC method.
If you enable staging groups, a CDC Staging Task reads data from the source database in a single pass and writes the data to common storage. The apply tasks that you associate with the group can then read data from the common storage and apply the data to different targets.
This feature can improve CDC performance and scalability because data is read from the source logs once by the CDC Staging Task instead of being read multiple times by the apply jobs. Also, if you use a Secure Agent group, the apply jobs can run on different agents in the same Secure Agent group.
Note: To get access to this feature, contact Informatica Global Customer Support. They'll set the cdir.cdc.group.enabled flag to true for your organization.
When you create a new task in the task wizard, you can add it to an existing staging group or create a new staging group. You cannot add a pre-existing task to a staging group.
When determining whether to use a CDC staging group, consider the following points:
•For Db2 for i sources, all of the member apply tasks must use the same journal.
•For log-based sources, including Db2 for z/OS, Oracle, and SQL Server, the tasks in the group can process different tables with different schemas in the same source database.
•Oracle sources must use the Log-based change capture method.
•SQL Server sources must use the CDC Tables or Log-based change capture method.
•SAP HANA sources must use the Log-based change capture method and the incremental load type. If you set the capture type to Log Based in the connection to the source database, each apply task that uses the connection has the Log-based method and CDC staging automatically enabled. After you select a deployed staging group or create a new one for the first task, all subsequent apply tasks that use the same connection are added to the same group. You can't use a CDC Staging Task with SAP HANA sources that use the Trigger-based change capture method.
For Db2 for z/OS sources, you might need to perform some prerequisite tasks before using staging groups: 1) create the INFA_DATA table, which stores DBID, OBID, and PSID information for all of the selected source tables, if this information might be greater than 16 KB, and 2) install the Db2 stored procedure dated 11/01/2024 or later, which can use this table to identify the selected tables for processing. For stored procedure installation instructions, go to the Database Ingestion and Replication documentation on the Doc Portal and see Installing and configuring the stored procedure for Db2 for z/OS CDC. For information about creating the INFA_DATA table, see INFA_DATA table for Db2 for z/OS sources in a staging group.
The CDC Staging Task uses the following types of common storage:
•For Db2 for i, Db2 for z/OS, Oracle, and Microsoft SQL Server sources, the CDC Staging Task writes data to cloud Iceberg storage. The cloud storage can be in Amazon S3, Google Cloud Storage, or Microsoft Azure Data Lake Storage Gen2 (with Shared Key Authentication). The staged data is retained in storage for 14 days by default, but you can adjust this retention period up to 365 days.
•For SAP HANA sources, the CDC Staging Task writes data to an SAP HANA or Oracle cache, as specified in the SAP HANA Database Ingestion connection properties. Each CDC Staging Task must have its own cache database and connection. The Secure Agent stages data to the cache database, and when full records are available, the Secure Agent writes the data to the target. By default, after the data is stored in the cache for 14 days, it's compacted but not deleted. You can adjust this period using the Log Clear connection property.
In the monitoring interfaces, you can filter for the asset type of "CDC Staging Task" and view the task's progress in capturing the source change data to cloud storage. In the Task Summary section of the job details, the Target reflects the SAP HANA source connection instead of the cache database type or an ultimate target type.
The following status interactions between the staging task and dependent tasks can occur:
•If the CDC Staging Task fails or is stopped or aborted, all dependent database ingestion and replication apply jobs stop. You should undeploy the stopped apply jobs. You can then undeploy the CDC Staging Task.
Note: If you do not want the dependent apply jobs to stop, you can set the source custom property stopDependentsWhenGroupJobTerminated to false.
•If all jobs in the staging group fail, the CDC Staging Task continues running.
•If all jobs in a staging group are undeployed, the CDC Staging Task stops. You can then undeploy it.
•You can undeploy a CDC Staging Task that has the status of Aborted, Deployed, Failed, or Stopped from any of the monitoring interfaces. In the task row, select Undeploy from the Actions menu. Do not undeploy a CDC Staging Task if any job in the group is not undeployed.
Undeploying a CDC Staging Task cleans up temporary files in the Secure Agent Database Ingestion service area and in cloud cache. It also cleans up staging tables that are in Apache Iceberg format.
•If you undeploy and redeploy a CDC Staging Task, staging data to cloud storage starts over again, without preserving the previous contents in cloud storage.
INFA_DATA table for Db2 for z/OS sources in a staging group
If you expect the information that identifies selected source tables to the CDC staging task to be greater than 16 KB, you must install or manually create the INFA_DATA table. The table stores DBID, OBID and PSID information for each of the source tables selected in the tasks that are associated with the staging group.
If the INFA_DATA table does not exist and the computed length of these values is greater than 16 KB, the CDC staging task will fail.
You can create the INFA_DATA table by using the #INFATBL installation member. The installation member includes the following SQL statements for creating the table:
CREATE TABLE <schema>.INFA_DATA (
  INFA_CODE CHARACTER(2) NOT NULL,
  INFA_GROUP_ID INTEGER NOT NULL,
  INFA_SEQ INTEGER NOT NULL,
  INFA_TS TIMESTAMP(6) WITHOUT TIME ZONE NOT NULL,
  INFA_DATA VARCHAR(4096),
  PRIMARY KEY (INFA_CODE, INFA_GROUP_ID, INFA_SEQ, INFA_TS)
)
APPEND NO
DATA CAPTURE CHANGES
CCSID EBCDIC;
Note: You can use a table name other than the default name of INFA_DATA if you prefer.
Grant the following privileges on the INFA_DATA table to the user specified in the connection properties for the primary source connection:
GRANT INSERT ON <schema>.INFA_DATA TO <stagingtask_connection_user>;
GRANT DELETE ON <schema>.INFA_DATA TO <stagingtask_connection_user>;
GRANT EXECUTE ON <schema>.INFA_DATA TO <stagingtask_connection_user>;
COMMIT;
Where:
•<schema> is specified by the pwx.cdcreader.ZOS.InfaTableSchema custom property. If this custom property is not specified, <schema> defaults to the schema name of the Db2 stored procedure (see the example after this list).
•INFA_DATA is the default table name. If you specified another name when creating the table, use that name.
•The <schema>.<table> value in the grant statements must match the information in the CREATE statement in the #INFATBL installation member.
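For example, if you created the table under a schema named INFACDC (a hypothetical name used only for illustration), you might specify the source custom property as a name-value pair similar to the following:
pwx.cdcreader.ZOS.InfaTableSchema=INFACDC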
Integrating database ingestion and replication tasks with Data Integration taskflows
To perform post-replication transformation, you can configure database ingestion and replication tasks to trigger Data Integration taskflows that process and transform the ingested data. This feature is available for tasks that have any supported load type, source type, and target type.
When you define a database ingestion and replication task, you can select the Execute in Taskflow option to make the task available to add to taskflows in Data Integration. For incremental load and combined load jobs with an Amazon Redshift, Oracle, Snowflake (without Superpipe), or SQL Server target, you can optionally select the Add Cycle ID option to include cycle ID metadata in the target table. The Cycle ID column identifies the cycle in which the row was updated. It's passed as a parameter to the taskflow, where you can use it to filter the rows on which to execute transformation logic.
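For example, a transformation in the taskflow might use the cycle ID parameter to process only the rows that were written during the triggering CDC cycle. The following query is only an illustrative sketch: the target table name, the cycle ID column name, and the :cycleId parameter are hypothetical placeholders rather than names generated by the product.
-- Illustrative only: select the rows that belong to the triggering CDC cycle.
SELECT *
FROM sales_orders_target      -- hypothetical replicated target table
WHERE cycle_id = :cycleId;    -- hypothetical cycle ID column and taskflow parameter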
When you configure the taskflow in Data Integration, you can select the task as an event source and add any appropriate transformation type to transform the ingested data.
Configuration steps:
1. In the Data Ingestion and Replication task configuration wizard, select the following options when defining a database ingestion and replication task:
- (Optional) Add Cycle ID on the Target page. This option adds the generated cycle ID for each CDC cycle to the target table. Applies to Amazon Redshift, Oracle, and SQL Server targets and to Snowflake targets that don't use the Superpipe option.
- Execute in Taskflow on the Schedule and Runtime Options page. This option makes the task available for selection when defining a taskflow in Data Integration.
2. When you finish defining the task, save it.
3. To define a taskflow in Data Integration, click the Orchestrate panel on the Home page.
4. To add the database ingestion and replication task to the taskflow, perform the following steps:
a. Under Task Properties, click Start.
b. In the Binding field, select Event.
c. In the Event Source Name field, click Select. Then, in the Select Event Source dialog box, select the database ingestion and replication task and click Select.
Note: You can filter the list of tasks by task type.
d. Check that the Event Source Name field and the Input Fields display the task name.
The taskflow is automatically triggered when the initial load task successfully completes or after each CDC cycle in an incremental load operation. If a CDC cycle ends while the previous taskflow run is still running, the data is queued and waits for the previous taskflow run to complete.
About LOB truncation
If you select the Include LOBs option for a Db2 for LUW, Oracle, PostgreSQL, or SQL Server source, LOB data might be truncated, primarily depending on the maximum size that the target allows.
Target-side truncation points:
• BLOB, BYTEA, GEOGRAPHY, GEOMETRY, IMAGE, LONG RAW, LONG VARCHAR FOR BIT, or VARBINARY(MAX) columns are truncated before being written to BINARY columns on the target.
- For Amazon S3, Databricks, Google Cloud Storage, Microsoft Azure Data Lake Storage Gen2, Oracle, Oracle Cloud Object Storage, Microsoft Fabric OneLake, PostgreSQL, and SQL Server targets, the data is truncated to 16777216 bytes.
- For Amazon Redshift targets, the data is truncated to 1024000 bytes.
- For Microsoft Azure Synapse Analytics targets, the data is truncated to 1000000 bytes.
- For Google BigQuery and Snowflake targets, the data is truncated to 8388608 bytes.
• CLOB, DBCLOB, NCLOB, LONG, LONG VARCHAR, LONG VARGRAPHIC, TEXT, NTEXT, NVARCHAR(MAX), RAW, VARCHAR(MAX), or XML columns are truncated before being written to VARCHAR columns on the target.
- For Amazon S3, Databricks, Google Cloud Storage, Microsoft Azure Data Lake Storage Gen2, Oracle, Microsoft Fabric OneLake, Oracle Cloud Object Storage, PostgreSQL, and Snowflake targets, the data is truncated to 16777216 bytes.
- For Amazon Redshift targets, the data is truncated to 65535 bytes.
- For Google BigQuery targets, the data is truncated to 8388608 bytes.
- For Microsoft Azure Synapse Analytics targets, the data is truncated to 500000 bytes.
- For SQL Server targets, CLOB, TEXT and VARCHAR(MAX) data is truncated to 16777216 bytes, NCLOB, NTEXT and NVARCHAR(MAX) data to 33554432 bytes, and XML data to 33554442 bytes.
- For Azure Event Hubs targets, the overall record maximum size is 1 MB. If the record size exceeds 1 MB, Azure Event Hubs generates an error, and the task fails.
Source-side truncation considerations:
•For Db2 for LUW and Oracle sources, you can specify the dbmiSourceBlobTruncationSize and dbmiSourceClobTruncationSize custom properties for the source to control the number of bytes at which truncation occurs for BLOB and CLOB data, respectively, when you want the data truncated at a point less than the maximum size that the target allows. See the example after this list.
•For PostgreSQL sources in incremental loads and combined loads, if large-object columns contain more than 1 MB of data, the data is truncated to 1 MB.
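For example, you might set the source custom properties to truncate LOB data at a point smaller than the target maximum. The following name-value pairs are only an illustrative sketch, and the byte values are arbitrary examples rather than recommended settings:
dbmiSourceBlobTruncationSize=65536
dbmiSourceClobTruncationSize=131072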
Customizing data type mappings
When you configure a target for a database ingestion and replication task, you can optionally define data-type mapping rules to override the default mappings of source data types to target data types.
This feature is supported for tasks that have the following source and target combinations:
•Oracle sources with Databricks, Google BigQuery, Microsoft Azure Synapse Analytics, Oracle, SQL Server, and Snowflake targets
•SQL Server sources with Snowflake targets
For example, you can create a data-type rule that maps Oracle NUMBER columns that have no precision to Snowflake target NUMBER() columns that also have no precision, instead of using the default mapping to the Snowflake VARCHAR(255) data type.
To create a data-type mapping rule:
1. Expand Data Type Rules.
2. In the Create Rule fields, enter a source data type and the target data type that you want to map it to.
In the Source field only, you can include the percent (%) wildcard to represent the data type precision, scale, or size, for example, NUMBER(%,4), NUMBER(8,%), or NUMBER(%). Use the wildcard to cover all source columns that have the same data type but use different precision, scale, or size values, instead of specifying each one individually. For example, enter FLOAT(%) to cover FLOAT(16), FLOAT(32), and FLOAT(84). You cannot enter the % wildcard in the target data type. A source data type that uses the % wildcard must map to a target data type that uses specific precision, scale, or size value. For example, you could map the source data type FLOAT(%) to a target data type specification such as NUMBER(38,10).
3. Click Add Rule.
The rule appears in the list of rules.
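For example, after you add a few rules for an Oracle source and a Snowflake target, the rule list might look similar to the following illustrative sketch, which combines the mappings described above (the precision and scale values are only examples):
Source data type    Target data type
NUMBER              NUMBER()
NUMBER(%,4)         NUMBER(38,4)
FLOAT(%)            NUMBER(38,10)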
To delete a rule, click the Delete icon at the right end of the rule row.
After you deploy a task with custom mapping rules, you cannot edit the rules until the task is undeployed.
Usage notes:
•In general, binary data types cannot be mapped to character data types.
•If you define multiple data-type rules for the same source data type with the same length or same precision and scale values, you will not be able to save the database ingestion and replication task.
•If you define multiple data-type rules for the same source data type but use the % wildcard to represent the length or precision and scale value in one rule and a specific length or precision and scale value in the second rule, the rule that contains the specific value is processed first, before the rule with the % wildcard. For example, if you map the source data types FLOAT(84) and FLOAT(%), the FLOAT(84) rule is processed first and then the FLOAT(%) rule is processed to cover any other FLOAT source columns with different sizes.
•If a source data type requires a length or precision and scale value, make sure that you set the required attribute by using the % wildcard or a specific value, for example, VARCHAR(%) or VARCHAR(10).
•If you define an invalid mapping, an error message is written to the log. You can then correct the mapping error, with assistance from your DBA if necessary.
•For Oracle sources, you must use the data types that are returned by the following query for the source object:
select dbms_metadata.get_ddl('TABLE', 'YOUR_TABLE_NAME','TABLE_OWNER_NAME') from dual;
•Database Ingestion and Replication does not support the BYTE and CHAR semantics in data-type mapping rules.
•If a source data type has a default precision, you must specify it in your rule. For example, you must use TIMESTAMP(6) instead of TIMESTAMP.