Data Engineering Recovery
The Data Integration Service manages jobs that are deployed to run in a cluster environment. When you enable the Data Integration Service for data engineering recovery, the Data Integration Service can recover and continue processing jobs that run on the Spark engine.
To use data engineering recovery, you must configure jobs to run on the Spark engine. Configure the Data Integration Service and log settings, and run the job from infacmd.
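As a hypothetical illustration of running a deployed mapping from infacmd, the sketch below uses the `infacmd ms runMapping` command; the domain, service, application, and mapping names are placeholders for your environment.

```shell
# Illustrative only: run a deployed mapping on the Spark engine from infacmd.
# All names and the password are placeholders, not real values.
infacmd ms runMapping -dn MyDomain -sn MyDataIntegrationService \
    -un Administrator -pd '<password>' \
    -a MyApplication -m MyMapping
```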
The Data Integration Service maintains a queue of jobs to run. The Data Integration Service assigns jobs from the queue to nodes, which prepare the jobs and send them to a compute cluster for processing.
The cluster assigns a YARN ID to each job and to each of the job's child tasks so that it can track the jobs as it runs them. The Data Integration Service gets the YARN IDs from the cluster and stores them in the Model repository database.
If the Data Integration Service runs on a grid or multiple nodes, when a node fails, the Service Manager fails over to another node. The Data Integration Service queries the cluster for the status of tasks as identified by their YARN IDs and compares the response with the status of failed over tasks. Depending on the status, the Data Integration Service takes the following actions:
- If a task has no YARN ID, the Data Integration Service submits the task to the cluster.
- If a task has a YARN ID but has not been sent to the cluster, the Data Integration Service submits the task for processing.
- If all tasks have been sent, the Data Integration Service continues to monitor communications from the cluster until processing completes.
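The per-task decision above can be sketched as a small shell function. This is a simplified illustration of the documented logic, not Informatica's internal implementation; the function name and status values are assumptions.

```shell
# Simplified sketch of the per-task recovery decision described above.
# The function and its arguments are illustrative, not Informatica internals.
recover_action() {
  local yarn_id="$1" sent_to_cluster="$2"
  if [ -z "$yarn_id" ]; then
    echo "submit"     # no YARN ID yet: submit the task to the cluster
  elif [ "$sent_to_cluster" != "yes" ]; then
    echo "submit"     # has a YARN ID but never reached the cluster: submit it
  else
    echo "monitor"    # already on the cluster: keep monitoring until completion
  fi
}

recover_action "" no                               # submit
recover_action application_1700000000000_0001 no   # submit
recover_action application_1700000000000_0001 yes  # monitor
```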
If the Data Integration Service runs on a single node, it attempts job recovery when the node is restored.
When the Data Integration Service restarts and runs a job, the job creates a cluster configuration under the disTemp directory. This process causes the disTemp directory to grow over time. Manage disk space by monitoring and periodically clearing the contents of the disTemp directory.
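One way to manage that growth is a periodic cleanup job. The sketch below is an assumption-laden example: the disTemp path and the 7-day retention period are placeholders, and you should ensure you do not delete files that belong to jobs still running.

```shell
# Housekeeping sketch: remove disTemp contents older than 7 days.
# The path and retention period are assumptions; adjust for your environment,
# and avoid deleting files that belong to running jobs.
DIS_TEMP="${DIS_TEMP:-$HOME/disTemp}"
mkdir -p "$DIS_TEMP"
find "$DIS_TEMP" -mindepth 1 -mtime +7 -delete
```

A cron entry running this nightly is a common way to keep the directory bounded.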
Note: The Data Integration Service begins the recovery process by verifying that the inactive nodes are unavailable, and then assigns the recovered job to an available node. This verification might take several minutes before the job is reassigned to an available node.
Scenarios Where Recovery is Possible
The Data Integration Service can recover queued or running jobs upon node failure in the following scenarios:
- The job has been taken from the queue but is not yet submitted to the cluster.
  Depending on the timing of the node failure, the job status might be Queued or Running. The Data Integration Service fails the job over to the restored node or to another node and continues processing from the point of interruption.
- The job is partially submitted to the cluster.
  The status of tasks sent to the cluster is Running, and the status of tasks still in the queue remains Queued. The Data Integration Service identifies the unsent tasks and assigns them to the restored node or to another node, which processes them and sends them to the cluster for processing.
- The job is fully submitted to the cluster.
  The Data Integration Service has completed its job processing, and node failure at this point has no effect on the queue or on Hadoop processing. The Data Integration Service fails over to the restored node or to another node to continue communication with Hadoop and to monitor job status.
Note: If the Data Integration Service runs on a single node, the recovery process begins when the node is running again.
Scenarios Where Recovery is Not Possible
Job recovery is not possible in the following scenarios:
- The node shuts down before the Mapping Service Module stores the job request in the Model repository.
  The job status depends on the timing of the failure. Either the job does not appear at all, or the status is Unknown.
- You cancel a mapping while it is running.
  The job status depends on the timing of the cancellation request. The status is Canceled or Complete.
- You recycle the Data Integration Service while a job is running.
  The job status is Aborted or Canceled.
- The cluster fails, or communication with the cluster is lost, while a job is running.
  The job status is Failed.
In each case, the job must be manually re-run.
When a job has a status of Unknown, check the Data Integration Service log, and check whether a session log exists, for more information.
Note: Depending on the size of the load on the Data Integration Service grid, logs might not be updated with failed job states for several minutes.
Recovery Job Management
When you manually interrupt the Data Integration Service, the recovery jobs might not be aborted, and they might not be removed from the Model repository queue. Manual interruption includes disabling, recycling, or deleting the service. You can manage jobs associated with manual interruption through infacmd.
Consider the following ways to manage recovery jobs when you recycle, disable, or delete the Data Integration Service:
- Abort jobs in the queue
  If you recycle or disable the Data Integration Service on a grid while jobs are running, the service might attempt recovery when it is enabled. If you do not want to recover the jobs, run infacmd ms abortAllJobs before you recycle or disable the service.
  Note: If you recycle or disable a Data Integration Service on a single node or on primary and backup nodes, the service aborts all jobs and does not attempt recovery when it is enabled.
- Purge jobs in the queue
  When you recycle or delete a Data Integration Service, the queue in the Model repository retains the job information stored for recovery. To clear the queue and purge the jobs, run infacmd ms PurgeDatabaseWorktables.
For more information about the commands, see the Informatica Command Reference.
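The commands named above might be invoked as in the following sketch; the connection options (-dn, -sn, -un, -pd) and their values are placeholders for your own domain and service, so verify the exact options in the Informatica Command Reference.

```shell
# Illustrative only: abort queued and running jobs before recycling or
# disabling the Data Integration Service. All option values are placeholders.
infacmd ms abortAllJobs -dn MyDomain -sn MyDataIntegrationService \
    -un Administrator -pd '<password>'

# Illustrative only: purge recovery job information retained in the
# Model repository queue after recycling or deleting the service.
infacmd ms PurgeDatabaseWorktables -dn MyDomain -sn MyDataIntegrationService \
    -un Administrator -pd '<password>'
```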
Monitoring Recovered Jobs
You can identify recovered jobs on the Monitoring tab.
When a job is recovered, the Monitoring tab shows the same start and end time for the job and an elapsed time of 0. Although this is not the actual elapsed time, it enables you to identify jobs that were recovered. For a more accurate view of the elapsed time, view the Spark job logs on the cluster or the session logs on the Data Integration Service.
Consider the following behavior:
- When the Data Integration Service recovers a job, the Administrator tool might display incomplete job statistics on the Monitoring tab when the job is complete. For example, the job statistics might not correctly display the number of rows processed.
- The Monitoring tab does not display detailed statistics if the Data Integration Service process stops after it submits a job to the compute cluster.
Data Engineering Recovery Configuration
Configure a Data Integration Service for data engineering recovery. If the Data Integration Service runs on a grid, verify that certain directories are configured on a network file system.
Perform the following tasks:
- Enable data engineering recovery.
  For the Data Integration Service to recover mapping jobs, you must enable the option in the Data Integration Service Execution Options.
- Verify that directories are configured on a network file system.
  If the Data Integration Service runs on a grid, some directories on each node must be configured on a network file system to permit the Data Integration Service components to communicate information about mapping jobs during node failures.
Verify that the following directory properties are configured:
  - Directory for temporary files. Default is <home directory>/disTemp.
  - Log directory. Default is <Informatica installation directory>/logs/<node name>/services/DataIntegrationService.
Note: Directories must be mounted on a network file system accessible to all Data Integration Service nodes.
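As a quick check on each node, you can report the file-system type of these directories to confirm they resolve to a network mount (for example, "nfs") rather than local disk. This sketch assumes GNU coreutils `stat`; the paths are the documented defaults and should be replaced with your actual disTemp and log directories.

```shell
# Minimal sketch (GNU coreutils): print the file-system type of a directory
# so you can confirm it is network-mounted (e.g. "nfs") rather than local.
fs_type() { stat -f -c %T "$1"; }

# Placeholder paths based on the documented defaults; substitute your own.
for dir in "$HOME/disTemp" "$HOME/Informatica/logs"; do
  if [ -d "$dir" ]; then
    printf '%s -> %s\n' "$dir" "$(fs_type "$dir")"
  fi
done
```

Run the check on every Data Integration Service node; each node should report the same network file-system type for both directories.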