Resilience

The domain tolerates temporary connection failures between application clients, application services, and nodes.
A temporary connection failure might occur because an application service process fails or because of a network failure. When a temporary connection failure occurs, the Service Manager tries to reestablish connections between the application clients, application services, and nodes.

Application Client Resilience

The application clients try to reconnect to application services when a temporary connection failure occurs.
Based on your license, the following application clients are resilient to the services that they connect to:
PowerCenter Client
The PowerCenter Client tries to reconnect to the PowerCenter Repository Service and the PowerCenter Integration Service when a temporary network failure occurs.
If you perform a PowerCenter Client action that requires connection to the repository while the PowerCenter Client is trying to reestablish the connection, the PowerCenter Client prompts you to try the operation again after the PowerCenter Client reestablishes the connection. If the PowerCenter Client is unable to reestablish the connection during the resilience timeout period, the PowerCenter Client prompts you to reconnect to the repository manually.
Command line programs
Command line programs try to reconnect to the domain or an application service when a temporary network failure occurs while a command line program is running.
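The reconnect behavior described above can be pictured as a retry loop bounded by the resilience timeout. The following Python sketch is illustrative only and is not Informatica code. The function name, the `try_connect` callback, and the `retry_interval` parameter are hypothetical, and the actual retry cadence of the clients is not stated in this guide.

```python
import time
from typing import Callable


def reconnect_with_resilience(
    try_connect: Callable[[], bool],
    resilience_timeout: float,
    retry_interval: float = 1.0,  # hypothetical; the real retry cadence is not documented here
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry try_connect() until it succeeds or the resilience timeout elapses.

    Returns True if the connection was reestablished in time, and False if
    the caller should fall back to a manual reconnect.
    """
    deadline = clock() + resilience_timeout
    while True:
        if try_connect():
            return True
        if clock() >= deadline:
            # Timeout period is over: give up and let the user reconnect manually.
            return False
        # Wait before the next attempt, but never sleep past the deadline.
        sleep(min(retry_interval, deadline - clock()))
```

A `False` return corresponds to the case in which the client prompts you to reconnect manually. The `clock` and `sleep` parameters exist only so that the loop can be exercised without real waiting.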

Example PowerCenter Client Resilience to Application Services

A network connection loss of 120 seconds occurs between the PowerCenter Workflow Monitor and the PowerCenter Repository Service while a developer is monitoring a workflow. The Workflow Monitor, a PowerCenter Client tool, has a 60-second resilience timeout, and the PowerCenter Repository Service has a resilience timeout of 180 seconds.
The developer does not notice the loss of connection and is unaffected by the 120-second connection loss. However, the following messages appear in the Notifications tab of the PowerCenter Workflow Monitor:
Repository Service notifications are enabled.
DATE TIME-[REP_55101] Connection to the Repository Service [Repository_Service_Name] is broken.
DATE TIME-[REP_55114] Reconnecting to the Repository Service [Repository_Service_Name]. The resilience time is 180 seconds.
DATE TIME-Reconnected to Repository Service [Repository_Service_Name] successfully.

Application Service Resilience

Some application services try to reconnect to application services, application clients, and external components when a temporary connection failure occurs.
Based on your license, the following application services are resilient to the temporary connection failure of their clients:
PowerCenter Integration Service
The PowerCenter Integration Service is resilient to temporary connection failures to other services, the PowerCenter Client, and external components such as databases and FTP servers.
PowerCenter Repository Service
The PowerCenter Repository Service is resilient to temporary connection failures to other services, such as the PowerCenter Integration Service. It is also resilient to temporary connection failures to the repository database.

Node Resilience

When a domain contains multiple nodes, the nodes are resilient to temporary failures in communication from other nodes in the domain.
Nodes are resilient to the following temporary connection failures:
A non-master gateway node becomes unavailable.
Every node in the domain sends a communication signal to the master gateway node at periodic intervals of 15 seconds. For nodes with the service role, the communication includes a list of application services running on the node.
All nodes have a resilience timeout of 90 seconds. If a node fails to connect to the master gateway node within the resilience timeout period, the master gateway node marks the node unavailable. If the node that fails to connect has the service role, the master gateway node also reassigns its application services to a back-up node. This ensures that services on a node continue to run despite node failures.
The master gateway node becomes unavailable.
You can configure more than one node to serve as a gateway. If the master gateway node becomes unavailable, the Service Managers on the other gateway nodes elect another master gateway node.
If you configure only one node to serve as the gateway and that node becomes unavailable, all other nodes shut down.
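The heartbeat rule for non-master nodes can be sketched as follows. This is a simplified Python model, not the Service Manager implementation. The class and method names are hypothetical, treating exactly 90 seconds of silence as a failure is an assumption, and the reassignment of application services to a backup node is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List

HEARTBEAT_INTERVAL = 15       # seconds between signals to the master gateway node (from the text)
NODE_RESILIENCE_TIMEOUT = 90  # seconds a node may stay silent (from the text)


@dataclass
class MasterGatewayView:
    """Hypothetical model of what the master gateway node tracks per node."""

    last_seen: Dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        # Remember when each node last contacted the master gateway node.
        self.last_seen[node] = now

    def unavailable_nodes(self, now: float) -> List[str]:
        # A node is marked unavailable once it has been silent for the
        # whole resilience timeout period (boundary handling is assumed).
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen >= NODE_RESILIENCE_TIMEOUT
        )


view = MasterGatewayView()
view.record_heartbeat("node1", 0)
view.record_heartbeat("node2", 0)
# node1 keeps signalling every 15 seconds; node2 goes silent after time 0.
for tick in range(HEARTBEAT_INTERVAL, 91, HEARTBEAT_INTERVAL):
    view.record_heartbeat("node1", tick)
```

In this model, a node that misses a few signals but reconnects before 90 seconds have passed is never reported as unavailable.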

Example Resilience Timeout Configuration

Some resilience timeout values are defaults, and others can be configured or overridden.
If you do not set the resilience timeout or the limit on resilience timeout for a PowerCenter application service, the service uses the values configured for the domain. Command line programs use the service resilience timeout. If the service limit on resilience timeout is smaller than the resilience timeout of the connecting client, the client uses the service limit as its resilience timeout.
The following figure shows some sample connections and resilience configurations in a domain with PowerCenter application services:
The following table describes the resilience timeout and the limits shown in the figure above:

| Connection | Connect From | Connect To | Description |
|---|---|---|---|
| A | PowerCenter Integration Service | PowerCenter Repository Service | The PowerCenter Integration Service can spend up to 30 seconds to connect to the PowerCenter Repository Service, based on the domain resilience timeout. It is not bound by the PowerCenter Repository Service limit on resilience timeout of 60 seconds. |
| B | pmcmd | PowerCenter Integration Service | pmcmd is bound by the PowerCenter Integration Service limit on resilience timeout of 180 seconds, and it cannot use the 200 second resilience timeout configured in INFA_CLIENT_RESILIENCE_TIMEOUT. |
| C | PowerCenter Client | PowerCenter Repository Service | The PowerCenter Client is bound by the PowerCenter Repository Service limit on resilience timeout of 60 seconds. It cannot use the default resilience timeout of 180 seconds. |
| D | Node A | Node B | Node A can spend up to 90 seconds to connect to Node B. The Service Managers on Node A and Node B use the default node resilience timeout of 90 seconds. |
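The arithmetic behind connections A, B, and C can be checked with a short Python sketch. The helper names are hypothetical and the rule is a simplified reading of this section: a value that is not set falls back to the domain value, and the connecting side waits for the smaller of its own resilience timeout and the limit of the service it connects to. Connection D uses the fixed node default and is not modeled.

```python
from typing import Optional


def resolve(setting: Optional[int], domain_value: int) -> int:
    """Use the domain-level value when no service-level value is set."""
    return domain_value if setting is None else setting


def effective_timeout(requested_timeout: int, service_limit: int) -> int:
    """Seconds the connecting side may spend reconnecting."""
    # The service limit caps the timeout that the connecting side asks for.
    return min(requested_timeout, service_limit)


DOMAIN_RESILIENCE_TIMEOUT = 30  # seconds, domain value implied by connection A

# A: the Integration Service has no timeout of its own, so it uses the
#    domain value of 30, which is below the Repository Service limit of 60.
connection_a = effective_timeout(resolve(None, DOMAIN_RESILIENCE_TIMEOUT), 60)

# B: pmcmd requests 200 seconds, capped by the Integration Service limit of 180.
connection_b = effective_timeout(200, 180)

# C: the PowerCenter Client default of 180 is capped by the Repository Service limit of 60.
connection_c = effective_timeout(180, 60)

print(connection_a, connection_b, connection_c)  # prints: 30 180 60
```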