Create Cluster Task

The Create Cluster task contains all the settings that Amazon EMR or Azure HDInsight require to create a cluster with a master node and worker nodes. It also contains a reference to a Hadoop connection and the cloud provisioning configuration.
When you create a cluster workflow, you drag a Create Cluster task into the workflow editor, then configure task properties.
A cluster workflow has only one Create Cluster task.
Configure the advanced properties based on the cloud platform type.

Create Cluster Task General Properties

The following table describes the General properties that you configure for the Create Cluster task:
Property
Description
Name
Name of the task.
Description
Optional. Description of the task.
Connection Name
Name of the cloud provisioning configuration to use with the workflow.
Connection Type
Choose one of the following Hadoop distributions:
  • Amazon EMR
  • HDInsight
Default is Amazon EMR.

Create Cluster Task Output

Enter output properties for the Create Cluster task.
Verify that the Cluster Identifier property is set to the default value AutoDeployCluster.

Amazon EMR Advanced Properties

Set the advanced properties for an Amazon EMR cluster.

General Options

The following table describes general options that you can set for an EMR cluster:
Property
Description
Cluster Name
Name of the cluster to create.
Release Version
EMR version to run on the cluster.
Enter the AWS version tag string to designate the version. For example: emr-5.8.0
Default is the latest supported version.
Connection Name
Name of the Hadoop connection that you configured for use with the cluster workflow.
S3 Log URI
Optional. S3 location of logs for cluster creation. Format:
s3://<bucket name>/<folder name>
If you do not supply a location, no cluster logs will be stored.
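
The Release Version and S3 Log URI values are free-form strings, so it can help to check them before you paste them into the task. The following is a minimal sketch, not part of the product: the helper names and both regular expressions are assumptions derived only from the example emr-5.8.0 and the s3://<bucket name>/<folder name> format shown above.

```python
import re
from typing import Tuple

# Assumed patterns, based only on the formats shown in the table above.
RELEASE_LABEL = re.compile(r"^emr-\d+\.\d+\.\d+$")
S3_URI = re.compile(r"^s3://(?P<bucket>[a-z0-9][a-z0-9.-]{1,61}[a-z0-9])/(?P<folder>.+)$")


def is_release_label(value: str) -> bool:
    """Return True if value looks like an EMR version tag such as emr-5.8.0."""
    return RELEASE_LABEL.match(value) is not None


def split_s3_log_uri(value: str) -> Tuple[str, str]:
    """Split s3://<bucket name>/<folder name> into (bucket, folder)."""
    match = S3_URI.match(value)
    if match is None:
        raise ValueError(f"not in s3://<bucket name>/<folder name> form: {value!r}")
    return match.group("bucket"), match.group("folder")


if __name__ == "__main__":
    print(is_release_label("emr-5.8.0"))
    print(split_s3_log_uri("s3://my-bucket/logs/emr"))
```

The bucket and folder names in the usage lines are placeholders. The check is deliberately strict about the version tag, so adjust the pattern if your supported releases use a different label form.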

Master Instance Group Options

The following table describes master instance group options that you can set for an EMR cluster:
Property
Description
Master Instance Type
Master node EC2 instance type.
You can specify any available EC2 instance type.
Default is m4.4xlarge.
Master Instance Maximum Spot Price
Maximum spot price for the master node. Setting this property changes the purchasing option of the master instance group to Spot instead of On-demand.

Core Instance Group Options

The following table describes core instance group options that you can set for an EMR cluster:
Property
Description
Core Instance Type
Core node EC2 instance type.
You can specify any available EC2 instance type.
Default is m4.4xlarge.
Core Instance Count
Number of core EC2 instances to create in the cluster.
Default is 2.
Core Instance Maximum Spot Price
Maximum spot price for core nodes. Setting this property changes the purchasing option of the core instance group to Spot instead of On-demand.
Core Auto-Scaling Policy
Optional. Auto-scaling policy for core instances. Type the policy JSON statement here, or provide a path to a file that contains a JSON statement. Format: file:\\<path_to_policy_config_file>
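
An auto-scaling policy is a JSON document with capacity constraints and one or more scaling rules. The sketch below builds one in the shape that Amazon EMR uses for automatic scaling policies and prints it so that you can paste it into the property or save it to a file. The function name, rule name, and threshold values are illustrative assumptions, and this guide does not confirm that the task accepts exactly this shape, so treat it as a starting point to verify against your environment.

```python
import json


def build_scaling_policy(min_capacity: int, max_capacity: int) -> str:
    """Return an EMR-style automatic scaling policy as a JSON string."""
    if min_capacity < 0 or min_capacity > max_capacity:
        raise ValueError("expected 0 <= min_capacity <= max_capacity")
    policy = {
        "Constraints": {"MinCapacity": min_capacity, "MaxCapacity": max_capacity},
        "Rules": [
            {
                "Name": "ScaleOutOnLowMemory",  # illustrative rule name
                "Description": "Add one instance when available YARN memory is low",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 1,  # instances added per trigger
                        "CoolDown": 300,  # seconds to wait between scaling activities
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": 15.0,  # illustrative threshold, in percent
                        "Unit": "PERCENT",
                    }
                },
            }
        ],
    }
    return json.dumps(policy, indent=2)


if __name__ == "__main__":
    print(build_scaling_policy(min_capacity=2, max_capacity=10))
```

Generating the JSON from a dictionary rather than typing it by hand avoids quoting and comma mistakes, which are easy to make in a single-line property field.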

Task Instance Group Options

The following table describes task instance group options that you can set for an EMR cluster:
Property
Description
Task Instance Type
Task node EC2 instance type.
You can specify any available EC2 instance type.
Default is m4.4xlarge.
Task Instance Count
Number of task EC2 instances to create in the cluster.
Default is 2.
Task Instance Maximum Spot Price
Maximum spot price for task nodes. Setting this property changes the purchasing option of the task instance group to Spot instead of On-demand.
Task Auto-Scaling Policy
Optional. Auto-scaling policy for task instances. Type the policy JSON statement here, or provide a path to a file that contains a JSON statement. Format: file:\\<path_to_policy_config_file>

Additional Options

The following table describes additional options that you can set for an EMR cluster:
Property
Description
Applications
Optional. Applications to add to the default applications that AWS installs.
AWS installs certain applications when it creates an EMR cluster. To install more, select them from the drop-down list.
This field is equivalent to the Software Configuration list in the AWS EMR cluster creation wizard.
Tags
Optional. Tags to propagate to cluster EC2 instances.
Tags assist in identifying EC2 instances.
Format: TagName1=TagValue1,TagName2=TagValue2
Software Settings
Optional. Custom configurations to apply to the applications installed on the cluster.
This field is equivalent to the Edit Software Settings field in the AWS cluster creation wizard. You can use this as a method to modify the software configuration on the cluster.
Type the configuration JSON statement here, or provide a path to a file that contains a JSON statement. Format: file:\\<path_to_custom_config_file>
Steps
Optional. Commands to run after cluster creation. For example, you can use this to run Linux commands or HDFS or Hive Hadoop commands.
This field is equivalent to the Add Steps field in the AWS cluster creation wizard.
Type the command statement here, or provide a path to a file that contains a JSON statement. Format: file:\\<path_to_command_file>
Bootstrap Actions
Optional. Actions to perform after EC2 instances are running, and before applications are installed.
Type the JSON statement here, or provide a path to a file that contains a JSON statement. Format: file:\\<path_to_policy_config_file>
Task Recovery Strategy
Choose from the following options:
  • Restart task
  • Skip task
Default is Restart task.
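
The Tags and Software Settings values in the table above are both strings with a fixed shape, so they can be generated rather than typed. The following sketch is an illustration under stated assumptions: the helper name format_tags and the sample tag names are hypothetical, and the Software Settings example assumes the classification-and-properties list that the AWS cluster creation wizard accepts in its software settings field.

```python
import json
from typing import Dict


def format_tags(tags: Dict[str, str]) -> str:
    """Render tags as TagName1=TagValue1,TagName2=TagValue2."""
    for name, value in tags.items():
        # The separators cannot appear inside a name or value without
        # making the string ambiguous, so reject them up front.
        if "=" in name or "," in name or "," in value:
            raise ValueError(f"tag {name!r} contains a separator character")
    return ",".join(f"{name}={value}" for name, value in tags.items())


# Assumed shape: a list of objects with a classification and its properties.
software_settings = json.dumps(
    [
        {
            "Classification": "core-site",
            "Properties": {"hadoop.security.groups.cache.secs": "250"},
        }
    ],
    indent=2,
)

if __name__ == "__main__":
    print(format_tags({"Team": "analytics", "Environment": "dev"}))
    print(software_settings)
```

The property inside core-site is only a sample. Replace it with the settings that your applications need.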

Azure HDInsight Advanced Properties for the Create Cluster Task

The following table describes the Advanced properties for a Microsoft Azure HDInsight cluster:
Property
Description
Cluster Name
Name of the cluster to create.
Azure Cluster Type
Type of the cluster to be created.
Choose one of the options in the drop-down list.
Default is Hadoop.
HDInsight version
HDInsight version to run on the cluster. Enter the HDInsight version tag string to designate the version.
Default is the latest version supported.
Azure Cluster Location
Use the drop-down list to choose the location in which to create the cluster.
Head Node VM Size
Size of the head node instance to create.
Default is Standard_D12_v2.
Number of Worker Node Instances
Number of worker node instances to create in the cluster.
Default is 2.
Worker Node VM Size
Size of the worker node instance to create.
Default is Standard_D13_v2.
Default Storage Type
Primary storage type to be used for the cluster.
Choose one of the following options:
  • Azure Data Lake Store
  • Azure BLOB storage account
Default is BLOB storage.
Default Storage Container or Root Mount Path
Default container for data. Type one of the following paths:
  • For ADLS storage, type the path to the storage. For example, you can type storage-name or storage-name/folder-name.
  • For blob storage, type the path to the container. Format: /path/
Log Location
Optional. Path to the directory to store workflow event logs.
Default is /app-logs.
Attach External Hive Metastore
If you select this option and an external Hive metastore is configured in the cloud provisioning configuration, the workflow attaches that metastore to the cluster.
Bootstrap JSON String
JSON statement to run during cluster creation. You can use this statement to configure cluster details. For example, you could designate a Hadoop connection for the cluster, add tags to cluster resources, or run script actions.
Choose one of the following methods to populate the property:
  • Type the JSON statement. Use the following format:
    {
      "core-site": {
        "<sample_property_key1>": "<sample_property_val1>",
        "<sample_property_key2>": "<sample_property_val2>"
      },
      "tags": {
        "<tag_key>": "<tag_val>"
      },
      "scriptActions": [
        {
          "name": "setenvironmentvariable",
          "uri": "scriptActionUri",
          "parameters": "headnode"
        }
      ]
    }
  • Provide a path to a file that contains a JSON statement. Format:
    file://<path_to_bootstrap_file>
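
As a rough illustration of the format above, the Bootstrap JSON String can be assembled from ordinary dictionaries and serialized, which guarantees well-formed JSON. The function name and every sample value here are hypothetical; only the three top-level keys (core-site, tags, and scriptActions) and the name, uri, and parameters fields come from the format shown in this section.

```python
import json
from typing import Dict, List


def build_bootstrap_json(
    core_site: Dict[str, str],
    tags: Dict[str, str],
    script_actions: List[Dict[str, str]],
) -> str:
    """Serialize the three documented sections into one JSON statement."""
    for action in script_actions:
        # Each script action in the documented format carries these three keys.
        missing = {"name", "uri", "parameters"} - action.keys()
        if missing:
            raise ValueError(f"script action is missing keys: {sorted(missing)}")
    document = {
        "core-site": core_site,
        "tags": tags,
        "scriptActions": script_actions,
    }
    return json.dumps(document, indent=2)


if __name__ == "__main__":
    statement = build_bootstrap_json(
        core_site={"<sample_property_key1>": "<sample_property_val1>"},
        tags={"<tag_key>": "<tag_val>"},
        script_actions=[
            {
                "name": "setenvironmentvariable",
                "uri": "scriptActionUri",
                "parameters": "headnode",
            }
        ],
    )
    print(statement)
```

You can paste the printed statement into the property, or write it to a file and reference that file with the file:// format.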