Troubleshooting an advanced cluster on AWS

Why did the advanced cluster fail to start?
To find out why the advanced cluster failed to start, use the ccs-operation.log file in the following directory on the Secure Agent machine:
<Secure Agent installation directory>/apps/At_Scale_Server/<version>/ccs_home/
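For example, a minimal Python sketch like the following scans the log for SEVERE entries. The installation path and version shown are hypothetical placeholders; substitute your own:

    # Scan ccs-operation.log for SEVERE entries to find the failure reason.
    # The path below is a hypothetical example; substitute your Secure Agent
    # installation directory and version.
    from pathlib import Path

    log = Path("/opt/infa/apps/At_Scale_Server/66.0/ccs_home/ccs-operation.log")
    for line in log.read_text(errors="replace").splitlines():
        if "SEVERE" in line:
            print(line)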
The following list describes some reasons why a cluster might fail to start, along with a possible cause for each:

Reason: The cluster operator failed to update the cluster.
Possible cause: The VPC limit was reached on your AWS account.

Reason: The master node failed to start.
Possible cause: The master instance type isn't supported in the specified region or availability zone, or in your AWS account.

Reason: All worker nodes failed to start.
Possible cause: The worker instance type isn't supported in the specified region or availability zone, or in your AWS account.

Reason: The Kubernetes API server failed to start.
Possible cause: The user-defined master role encountered an error.
When a cluster fails to start for one of these reasons, the ccs-operation.log file records a BadClusterConfigException.
For example, you might see the following error:
2019-06-27 00:50:02.012 [T:000060] SEVERE : [CCS_10500] [Operation of <cluster instance ID>: start_cluster-<cluster instance ID>]: com.informatica.cloud.service.ccs.exception.BadClusterConfigException: [[CCS_10207] The cluster configuration for cluster [<cluster instance ID>] is incorrect due to the following error: [No [Master] node has been created on the cluster. Verify that the instance type is supported.]. The Cluster Computing System will stop the cluster soon.]
If the cluster encounters a BadClusterConfigException, the agent immediately stops the cluster to avoid additional resource costs and potential resource leaks. The agent does not attempt to recover the cluster until the configuration error is resolved.
I ran a job to start the advanced cluster, but the VPC limit was reached.
When you do not specify a VPC in the advanced configuration for a cluster, the Secure Agent creates a new VPC on your AWS account. Because the number of VPCs on your AWS account is limited for each region, you might reach the VPC limit.
If you reach the VPC limit, edit the advanced configuration and perform one of the following tasks:
  - Specify a different region that has not reached the VPC limit.
  - Specify an existing VPC for the cluster to use.
Any cloud resources that were provisioned for the cluster are reused when the cluster starts in the new region or the existing VPC. For example, the Secure Agent might have provisioned Amazon EBS volumes before it received the VPC limit error. The EBS volumes are not deleted, and they are reused during the next startup attempt.
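To check how close your account is to the limit in a region, a quick boto3 sketch like the following can compare the VPC count against the account quota. The region is a placeholder, and the quota code shown is the published code for the "VPCs per Region" quota; verify it against your account:

    # Compare the number of VPCs in a region with the account's VPC quota.
    import boto3

    region = "us-east-1"  # placeholder; use the cluster's region
    ec2 = boto3.client("ec2", region_name=region)
    quotas = boto3.client("service-quotas", region_name=region)

    vpc_count = len(ec2.describe_vpcs()["Vpcs"])
    # "L-F678F1CE" is the "VPCs per Region" quota code; verify for your account.
    quota = quotas.get_service_quota(ServiceCode="vpc", QuotaCode="L-F678F1CE")
    limit = int(quota["Quota"]["Value"])
    print(f"{vpc_count} of {limit} VPCs used in {region}")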
I ran a job to start the advanced cluster, but the cluster failed to be created with the following error:
Failed to create cluster [<cluster instance ID>] due to the following error: [[CCS_10302] Failed to invoke AWS SDK API due to the following error: [Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: <request ID>; S3 Extended Request ID: <S3 extended request ID>)].].]
The Secure Agent failed to create the advanced cluster because Amazon S3 rejected the agent's request.
Make sure that the S3 bucket policies do not require clients to send requests that contain an encryption header.
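One way to check for such a statement is a short boto3 sketch like the following, which looks for Deny statements keyed on the server-side encryption condition. The bucket name is a placeholder:

    # Check the staging bucket's policy for statements that deny requests
    # unless they carry a server-side encryption header.
    import json
    import boto3
    from botocore.exceptions import ClientError

    bucket = "my-staging-bucket"  # placeholder; use your staging bucket
    s3 = boto3.client("s3")
    try:
        policy = json.loads(s3.get_bucket_policy(Bucket=bucket)["Policy"])
    except ClientError:
        policy = {"Statement": []}  # no policy attached, or no permission to read it

    for stmt in policy["Statement"]:
        condition = json.dumps(stmt.get("Condition", {}))
        if stmt.get("Effect") == "Deny" and "s3:x-amz-server-side-encryption" in condition:
            print("Deny statement keyed on an encryption header:", stmt.get("Sid", "<no Sid>"))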
How do I troubleshoot a Kubernetes API Server that failed to start?
If the Kubernetes API Server fails to start, the advanced cluster fails to start. To troubleshoot the failure, use the Kubernetes API Server logs on the master node instead of the ccs-operation.log file.
To find the Kubernetes API Server logs, complete the following tasks:
  1. Connect to the master node from the Secure Agent machine.
  2. On the master node, locate the Kubernetes API Server log files in the /var/log/ directory (see the sketch below).
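For example, once you're connected to the master node, a short sketch like this lists candidate log files. The kube-apiserver* name pattern is an assumption; file names vary by Kubernetes distribution:

    # List Kubernetes API Server log files under /var/log.
    # The name pattern is an assumption; it varies by distribution.
    from pathlib import Path

    for log in sorted(Path("/var/log").glob("kube-apiserver*")):
        print(log)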
I updated the staging location for the advanced cluster. Now mappings fail with the following error:
Error while executing mapping. ExecutionId '<execution ID>'. Cause: [Failed to start cluster for [01000D25000000000005]. Error reported while starting cluster [Cannot apply cluster operation START because the cluster is in an error state.].].
Mappings fail with this error when you change the permissions to the staging location before you change the S3 staging location in the advanced configuration.
If you plan to update the staging location, you must first change the S3 staging location in the advanced configuration and then change the permissions to the staging location on AWS. If you use role-based security, you must also change the permissions to the staging location on the Secure Agent machine.
To fix the error, perform the following tasks:
  1. Revert the changes to the permissions for the staging location.
  2. Edit the advanced configuration to revert the S3 staging location.
  3. Stop the cluster when you save the configuration.
  4. Update the S3 staging location in the configuration, and then change the permissions to the staging location on AWS (see the sketch below).
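After you update the staging location, a minimal boto3 sketch like the following can confirm that the new location is listable with the agent's credentials. The bucket and prefix are placeholders:

    # Confirm that the staging location is accessible before running jobs.
    import boto3
    from botocore.exceptions import ClientError

    bucket, prefix = "my-staging-bucket", "staging/"  # placeholders
    s3 = boto3.client("s3")
    try:
        resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=5)
        print("List OK;", resp.get("KeyCount", 0), "objects visible")
    except ClientError as e:
        print("Staging location not accessible:", e.response["Error"]["Code"])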
I updated the staging location for the advanced cluster. Now the following error message appears in the agent job log:
Could not find or load main class com.informatica.compiler.InfaSparkMain
The error message appears when cluster nodes fail to download Spark binaries from the staging location due to access permissions.
Verify access permissions for the staging location based on the type of connectors that the job uses:
Connectors with direct access to Amazon data sources
If you use credential-based security for advanced jobs, make sure that the credentials in the Amazon S3 V2 and Amazon Redshift V2 connections can be used to access the staging location.
If you use role-based security for advanced jobs, make sure that the advanced cluster and the staging location exist under the same AWS account.
Connectors without direct access to Amazon data sources
If you use a user-defined worker role, make sure that the worker role can access both the staging location and the data sources in the advanced job.
If you use the default worker role, make sure that the Secure Agent role can access both the staging location and the data sources in the advanced job.
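For the role-based cases above, a minimal sketch like the following assumes the worker role and tries to list the staging location. The role ARN and bucket are hypothetical, and running the check requires permission to assume the role:

    # Assume the worker role and verify that it can list the staging location.
    import boto3

    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/my-worker-role",  # hypothetical ARN
        RoleSessionName="staging-access-check",
    )["Credentials"]

    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    s3.list_objects_v2(Bucket="my-staging-bucket", Prefix="staging/", MaxKeys=1)  # placeholders
    print("Worker role can list the staging location")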
I restarted the Secure Agent machine and now the status of the advanced cluster is Error.
Make sure that the Secure Agent machine and the Secure Agent are running. Then, stop the advanced cluster in Monitor. In an AWS environment, the cluster might take 3 to 4 minutes to stop. After the cluster stops, you can run an advanced job to start the cluster again.
Is there anything I should do before I use a custom AMI to create cluster nodes?
If you use a custom Amazon Machine Image (AMI) to create cluster nodes, make sure that the AMI contains an installation of the AWS CLI.
The Secure Agent uses the AWS CLI to propagate tags to Amazon resources and to aggregate logs. The cluster nodes also use the AWS CLI to run initialization scripts.
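A quick way to confirm that the CLI is present, for example on a test instance launched from the AMI, is a sketch like this:

    # Verify that the AWS CLI is installed and on PATH.
    import subprocess

    result = subprocess.run(["aws", "--version"], capture_output=True, text=True)
    # AWS CLI v1 prints the version to stderr; v2 prints it to stdout.
    print(result.stdout or result.stderr)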
For information about how to use a custom AMI, contact Informatica Global Customer Support.