
Create the Cluster Workflow

Create the cluster workflow and add and configure workflow elements.
The cluster workflow must have a Create Cluster task and at least one Mapping task. In addition, you can add Command and other workflow tasks. You can also add a Delete Cluster task, so that the cluster becomes an ephemeral cluster.

Workflow Task Run-Time Behavior

Set mapping and Mapping task properties to specify where the workflow runs Mapping tasks.
You can create a cluster workflow to run some mappings on the cluster that the workflow creates, and other mappings on another cluster.
To specify where each mapping runs, configure options in the mapping and the Mapping task.
Run the mapping on the cluster that the workflow creates.
The following combinations of the mapping Hadoop Connection property and the Mapping task Cluster Identifier property run the mapping on the cluster that the workflow creates:
- Hadoop Connection: Auto Deploy. Cluster Identifier: Auto Deploy.
  The Data Integration Service generates temporary Hadoop connections based on the values in the Hadoop connection associated with the workflow, and uses the temporary connections to run mappings on the cluster.
- Hadoop Connection: <Hadoop connection name>. Cluster Identifier: Auto Deploy.
  The Mapping task Cluster Identifier property overrides the mapping Hadoop connection property. Use this combination if you want to preserve the value of the mapping Hadoop connection property.
Run the mapping on another cluster.
The following combination of the mapping Hadoop Connection property and the Mapping task Cluster Identifier property runs the mapping on another cluster:
- Hadoop Connection: <Hadoop connection name>. Cluster Identifier: Blank.
  The Mapping task Cluster Identifier property takes input from the Hadoop connection, and the mapping runs on the cluster identified in the cloud configuration property of the Hadoop connection.

Configure the Cluster Workflow

A cluster workflow must have one Create Cluster task.
    1. In the Developer tool, create a workflow.
    2. From the palette of tasks, drag a Create Cluster task to the workflow editor.
    3. Complete the Create Cluster task general properties:
       - Name. Task name.
       - Description. Optional description.
       - Connection Name. Name of the cloud provisioning configuration to use with the workflow.
       - Connection Type. Choose from the following options:
         - Amazon EMR. Create an Amazon EMR cluster.
         - HDInsight. Create an Azure HDInsight cluster.
    4. Configure task input and output properties:
       - Input properties. The Create Cluster task does not require any unique values for task input properties.
       - Output properties. Set the Cluster Identifier property to the default value, AutoDeployCluster.
    Note: The Cluster Identifier property of the Create Cluster task overrides the Cluster Identifier property of the Mapping task.
    5. Set the advanced properties that correspond to your cloud platform.
    6. Configure the Software Settings property in the advanced properties if you want to perform optional tasks such as running mappings on the Blaze engine or using an external relational database as the Hive metastore database. These tasks are described in the topics that follow.
    7. Connect the workflow Start_Event to the Create Cluster task.

Amazon EMR Advanced Properties

Set the advanced properties for an Amazon EMR cluster.

General Options

The following table describes general options that you can set for an EMR cluster:
Cluster Name
    Name of the cluster to create.
Release Version
    EMR version to run on the cluster. Enter the AWS version tag string to designate the version. For example: emr-5.8.0
    Default is the latest version supported.
Connection Name
    Name of the Hadoop connection that you configured for use with the cluster workflow.
S3 Log URI
    Optional. S3 location of logs for cluster creation. Format: s3://<bucket name>/<folder name>
    If you do not supply a location, no cluster logs are stored.

Master Instance Group Options

The following table describes master instance group options that you can set for an EMR cluster:
Master Instance Type
    Master node EC2 instance type. You can specify any available EC2 instance type.
    Default is m4.4xlarge.
Master Instance Maximum Spot Price
    Maximum spot price for the master node. Setting this property changes the purchasing option of the master instance group to Spot instead of On-demand.

Core Instance Group Options

The following table describes core instance group options that you can set for an EMR cluster:
Core Instance Type
    Core node EC2 instance type. You can specify any available EC2 instance type.
    Default is m4.4xlarge.
Core Instance Count
    Number of core EC2 instances to create in the cluster.
    Default is 2.
Core Instance Maximum Spot Price
    Maximum spot price for core nodes. Setting this property changes the purchasing option of the core instance group to Spot instead of On-demand.
Core Auto-Scaling Policy
    Optional. Auto-scaling policy for core instances. Type the policy JSON statement here, or provide a path to a file that contains a JSON statement. For a sample policy statement, see the example that follows this table.
    Format: file:\\<path_to_policy_config_file>
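The following text shows what a core auto-scaling policy statement might look like. The sample is a sketch based on the general format of EMR automatic scaling policies; the rule name, capacity limits, threshold, and cooldown values are placeholder values, not recommendations from this guide:
{
    "Constraints": {
        "MinCapacity": 2,
        "MaxCapacity": 10
    },
    "Rules": [
        {
            "Name": "ScaleOutOnLowMemory",
            "Description": "Add one core node when available YARN memory drops below 15 percent.",
            "Action": {
                "SimpleScalingPolicyConfiguration": {
                    "AdjustmentType": "CHANGE_IN_CAPACITY",
                    "ScalingAdjustment": 1,
                    "CoolDown": 300
                }
            },
            "Trigger": {
                "CloudWatchAlarmDefinition": {
                    "ComparisonOperator": "LESS_THAN",
                    "EvaluationPeriods": 1,
                    "MetricName": "YARNMemoryAvailablePercentage",
                    "Namespace": "AWS/ElasticMapReduce",
                    "Period": 300,
                    "Statistic": "AVERAGE",
                    "Threshold": 15,
                    "Unit": "PERCENT"
                }
            }
        }
    ]
}
You can also save the same statement to a file and reference the file with the file:\\ format shown above.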

Task Instance Group Options

The following table describes task instance group options that you can set for an EMR cluster:
Task Instance Type
    Task node EC2 instance type. You can specify any available EC2 instance type.
    Default is m4.4xlarge.
Task Instance Count
    Number of task EC2 instances to create in the cluster.
    Default is 2.
Task Instance Maximum Spot Price
    Maximum spot price for task nodes. Setting this property changes the purchasing option of the task instance group to Spot instead of On-demand.
Task Auto-Scaling Policy
    Optional. Auto-scaling policy for task instances. Type the policy JSON statement here, or provide a path to a file that contains a JSON statement.
    Format: file:\\<path_to_policy_config_file>

Additional Options

The following table describes additional options that you can set for an EMR cluster:
Applications
    Optional. Applications to add to the default applications that AWS installs.
    AWS installs certain applications when it creates an EMR cluster. In addition, you can select additional applications from the drop-down list.
    This field is equivalent to the Software Configuration list in the AWS EMR cluster creation wizard.
Tags
    Optional. Tags to propagate to cluster EC2 instances. Tags assist in identifying EC2 instances.
    Format: TagName1=TagValue1,TagName2=TagValue2
Software Settings
    Optional. Custom configurations to apply to the applications installed on the cluster.
    This field is equivalent to the Edit Software Settings field in the AWS cluster creation wizard. You can use this as a method to modify the software configuration on the cluster.
    Type the configuration JSON statement here, or provide a path to a file that contains a JSON statement. Format: file:\\<path_to_custom_config_file>
Steps
    Optional. Commands to run after cluster creation. For example, you can use this to run Linux commands or HDFS or Hive Hadoop commands.
    This field is equivalent to the Add Steps field in the AWS cluster creation wizard.
    Type the command statement here, or provide a path to a file that contains a JSON statement. Format: file:\\<path_to_command_file>
    For a sample step statement, see the example that follows this table.
Bootstrap Actions
    Optional. Actions to perform after EC2 instances are running, and before applications are installed.
    Type the JSON statement here, or provide a path to a file that contains a JSON statement. Format: file:\\<path_to_policy_config_file>
Task Recovery Strategy
    Choose from the following options:
    - Restart task
    - Skip task
    Default is Restart task.
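The following text shows what a step statement might look like. The sample assumes the step syntax that AWS accepts when you add steps to an EMR cluster with a JSON file; the step name, the command-runner.jar entry point, and the HDFS command arguments are illustrative values only:
[
    {
        "Name": "CreateStagingDirectory",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hdfs", "dfs", "-mkdir", "-p", "/user/staging"]
        }
    }
]
Verify the step format against the AWS EMR documentation for the release version that you use.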

Azure HDInsight Advanced Properties for the Create Cluster Task

The following table describes the advanced properties for a Microsoft Azure HDInsight cluster:
Cluster Name
    Name of the cluster to create.
Azure Cluster Type
    Type of the cluster to be created. Choose one of the options in the drop-down list.
    Default is Hadoop.
HDInsight Version
    HDInsight version to run on the cluster. Enter the HDInsight version tag string to designate the version.
    Default is the latest version supported.
Azure Cluster Location
    Use the drop-down list to choose the location in which to create the cluster.
Head Node VM Size
    Size of the head node instance to create.
    Default is Standard_D12_v2.
Number of Worker Node Instances
    Number of worker node instances to create in the cluster.
    Default is 2.
Worker Node VM Size
    Size of the worker node instance to create.
    Default is Standard_D13_v2.
Default Storage Type
    Primary storage type to be used for the cluster. Choose one of the following options:
    - Azure Data Lake Store
    - Azure BLOB storage account
    Default is BLOB storage.
Default Storage Container or Root Mount Path
    Default container for data. Type one of the following paths:
    - For ADLS storage, type the path to the storage. For example, you can type storage-name or storage-name/folder-name.
    - For blob storage, type the path to the container. Format: /path/
Log Location
    Optional. Path to the directory to store workflow event logs.
    Default is /app-logs.
Attach External Hive Metastore
    If you select this option, the workflow attaches an external Hive metastore to the cluster if you configured an external Hive metastore in the cloud provisioning configuration.
Bootstrap JSON String
    JSON statement to run during cluster creation. You can use this statement to configure cluster details. For example, you can configure Hadoop properties on the cluster, add tags to cluster resources, or run script actions.
    Choose one of the following methods to populate the property:
    - Type the JSON statement. Use the following format:
      {
          "core-site": {
              "<sample_property_key1>": "<sample_property_val1>",
              "<sample_property_key2>": "<sample_property_val2>"
          },
          "tags": {
              "<tag_key>": "<tag_val>"
          },
          "scriptActions": [
              {
                  "name": "setenvironmentvariable",
                  "uri": "scriptActionUri",
                  "parameters": "headnode"
              }
          ]
      }
    - Provide a path to a file that contains a JSON statement. Format:
      file://<path_to_bootstrap_file>

Configure the Create Cluster Task to Run Mappings on the Blaze Engine

If you want to use the Blaze engine to run mappings on the cloud platform cluster, you must set cluster configuration properties in the Software Settings property of the Create Cluster task.
Configure the Create Cluster task to set configuration properties in *-site.xml files on the cluster. Hadoop clusters run based on these settings.
The following text shows sample configuration of the Software Settings property:
[
    {
        "Classification":"yarn-site",
        "Properties":{
            "yarn.scheduler.minimum-allocation-mb":"250",
            "yarn.scheduler.maximum-allocation-mb":"8192",
            "yarn.nodemanager.resource.memory-mb":"16000",
            "yarn.nodemanager.resource.cpu-vcores":"12"
        }
    },
    {
        "Classification":"core-site",
        "Properties":{
            "hadoop.proxyuser.<DIS/OSPUSER>.groups":"<group names>",
            "hadoop.proxyuser.<DIS/OSPUSER>.hosts":"*"
        }
    }
]

yarn-site

yarn.scheduler.minimum-allocation-mb
The minimum RAM available for each container. Required for Blaze engine resource allocation.
yarn.scheduler.maximum-allocation-mb
The maximum RAM available for each container. Required for Blaze engine resource allocation.
yarn.nodemanager.resource.memory-mb
The amount of RAM available on each node for containers. Set the maximum memory on the cluster to increase resource memory available to the Blaze engine.
yarn.nodemanager.resource.cpu-vcores
The number of virtual cores available on each node for containers. Required for Blaze engine resource allocation.

core-site

hadoop.proxyuser.<proxy user>.groups
Defines the groups that the proxy user account can impersonate. On a secure cluster, the <proxy user> is the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the <proxy user> is the system user that runs the Informatica daemon.
Set the property to the group names of impersonation users separated by commas. If less security is preferred, use the wildcard " * " to allow impersonation from any group.
hadoop.proxyuser.<proxy user>.hosts
Defines the host machines from which the proxy user account can impersonate users. On a secure cluster, the <proxy user> is the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the <proxy user> is the system user that runs the Informatica daemon.
Set the property to " * " to allow impersonation from any host. This is required to run a Blaze mapping on a cloud platform cluster.
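For example, on a non-secure cluster where the Data Integration Service runs as a hypothetical system user named infauser, the core-site classification in the Software Settings property might look like the following sketch. The user name and group names are placeholder values:
{
    "Classification":"core-site",
    "Properties":{
        "hadoop.proxyuser.infauser.groups":"hadoop,hive",
        "hadoop.proxyuser.infauser.hosts":"*"
    }
}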

Configure the Cluster to Use an External RDS as the Hive Metastore Database

If you want to use a relational database on the cloud platform as the Hive metastore database for the cluster, you must set cluster configuration properties in the Software Settings property of the Create Cluster task.
Configure the Create Cluster task to set configuration properties in the hive-site.xml configuration file on the cluster. Use a text file to specify hive-site settings, and specify the path to the file in the Software Settings property.
The following text shows sample configuration of the Software Settings property:
{
    "Classification":"hive-site",
    "Properties":{
        "javax.jdo.option.ConnectionURL":"jdbc:mysql:\/\/<RDS_HOST>:<PORT>\/<USER_SCHEMA>?createDatabaseIfNotExist=true",
        "javax.jdo.option.ConnectionDriverName":"<JDBC driver name>",
        "javax.jdo.option.ConnectionUserName":"<USER>",
        "javax.jdo.option.ConnectionPassword":"<PASSWORD>"
    }
}
Example:
{
    "Classification":"hive-site",
    "Properties":{
        "javax.jdo.option.ConnectionURL":"jdbc:mysql:\/\/<host name>:<port number>\/hive?createDatabaseIfNotExist=true",
        "javax.jdo.option.ConnectionDriverName":"org.mariadb.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName":"hive",
        "javax.jdo.option.ConnectionPassword":"hive"
    }
}

hive-site

javax.jdo.option.ConnectionURL
JDBC connection string for the data store.
javax.jdo.option.ConnectionDriverName
JDBC driver class name for the data store. Specify a JDBC driver that is compatible with the cloud platform.
javax.jdo.option.ConnectionUserName
User name to use to connect to the database.
javax.jdo.option.ConnectionPassword
Password for the database user account.

Create Other Workflow Tasks

Populate the cluster workflow with at least one Mapping task. You can add Command or other workflow tasks and events.
    1. Drag a Mapping task from the task list to the workflow editor.
    The Mapping Task dialog box opens.
    2. Name the Mapping task.
    3. Select a mapping to run with the Mapping task. Click Browse next to the Mapping property, select a mapping, and click Finish.
    4. Optionally select a parameter set to associate with the Mapping task. Click Browse next to the Parameter Set property, select a parameter set, and click Finish.
    For more information on how to use parameter sets with mappings, see the Informatica Developer Mapping Guide.
    5. Optionally complete Input and Output properties.
    The Mapping task does not require any unique values for input or output properties.
    6. Configure the Cluster Identifier property in Advanced properties.
    The Cluster Identifier property designates the cluster to use to run the Mapping task.
    The Cluster Identifier property can have the following values:
       - Blank (no value). Run the mapping on the cluster configured in the Hadoop connection associated with the mapping.
       - AutoDeploy. Run the mapping on the cluster that the workflow creates. When you choose this option, the Cluster Identifier property of the Create Cluster task is also populated with the value AutoDeployCluster. Default is AutoDeploy.
       - (Assign to task input). Select this option to accept input from a source other than the Create Cluster task. If you choose this option, enter a parameter value in the Cluster Identifier property on the Mapping task Input properties tab.
    7. Click Finish to complete the Mapping task.
    8. Optionally add more Mapping and other tasks to the workflow.
    You can include any other workflow tasks in a cluster workflow. For example, you might want to add a Command task to perform tasks after a Mapping task runs.

Add a Delete Cluster Task

To create an ephemeral cluster, add a Delete Cluster task.
The Delete Cluster task terminates the cluster and deletes the cluster and other resources that the cluster workflow creates.
If you do not add a Delete Cluster task, the cluster that the workflow creates remains running when the workflow ends. You can delete the cluster at any time.
    1. Drag a Delete Cluster task to the workflow editor.
    2. In the General properties, optionally rename the Delete Cluster task.
    3. Connect the final task in the workflow to the Delete Cluster task, and connect the Delete Cluster task to the workflow End_Event.
You can also use infacmd ccps deleteCluster to delete a cloud cluster.