Tuning for Data Engineering Job Processing

Tune the application services and run-time engines that process data engineering jobs so that each component is allocated enough resources to perform its work.
For example, the Model Repository Service and the Data Integration Service require resources to store run-time data. When you run mappings, you might deploy the mappings to the Data Integration Service and run them on the Blaze engine, which also requires resources. To optimize mapping performance, you must allocate enough resources among the Model Repository Service, the Data Integration Service, and the Blaze engine.
You can tune the application services and run-time engines based on deployment type. A deployment type represents job processing requirements based on concurrency and volume. The deployment type defines the amount of resources that application services and run-time engines require to function efficiently, and how resources should be allocated between the application services and run-time engines.
To tune the application services and run-time engines, assess the deployment type that best describes the environment that you use for processing data engineering jobs. Then select the application services and run-time engines that you want to tune, and tune them with the infacmd autotune Autotune command.

Deployment Types

A deployment type represents big data processing requirements based on concurrency and volume.
The deployment type defines the amount of resources that application services and run-time engines require to function efficiently, and how resources should be allocated between the application services and run-time engines. The deployment types are Sandbox, Basic, Standard, and Advanced.
The following table describes the deployment types:
| Deployment Type | Description |
| --- | --- |
| Sandbox | Used for proofs of concept or as a sandbox environment with minimal users. |
| Basic | Used for low-volume processing with low levels of concurrency. |
| Standard | Used for high-volume processing with low levels of concurrency. |
| Advanced | Used for high-volume processing with high levels of concurrency. |
Each deployment type is described using deployment criteria, a set of characteristics that are common to each deployment type. Use the deployment criteria to identify the deployment type that best fits the environment that you use for big data processing.
The following table defines the deployment criteria:
| Deployment Criteria | Sandbox | Basic | Standard | Advanced |
| --- | --- | --- | --- | --- |
| Total data volume | 2 GB - 10 GB | 10 GB - 2 TB | 2 TB - 50 TB | 50 TB+ |
| Number of nodes in the Hadoop environment | 2 - 5 | 5 - 25 | 25 - 100 | 100+ |
| Number of developers | 2 | 5 | 10 | 50 |
| Number of concurrent jobs in the Hadoop environment | < 10 | < 250 | < 500 | 2000+ |
| Number of Model repository objects | < 1000 | < 5000 | 5000 - 20000 | 20000+ |
| Number of deployed applications | < 10 | < 25 | < 100 | < 500 |
| Number of objects per deployed application | < 10 | < 50 | < 100 | < 100 |
For example, you estimate that your environment handles an average of 400 concurrent jobs and a data volume of 35 TB. According to the deployment criteria, the deployment type that best describes your environment is Standard.
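The selection logic above can be sketched in code. This is an illustrative helper only, not part of the product; the function name is hypothetical, and it considers only the concurrency and data-volume criteria from the table (the other criteria would be weighed the same way):

```python
def select_deployment_type(concurrent_jobs, data_volume_gb):
    """Pick the smallest deployment type whose concurrency and
    volume thresholds (from the deployment criteria table)
    accommodate the estimated workload."""
    if concurrent_jobs < 10 and data_volume_gb <= 10:
        return "Sandbox"
    if concurrent_jobs < 250 and data_volume_gb <= 2_000:   # 2 TB
        return "Basic"
    if concurrent_jobs < 500 and data_volume_gb <= 50_000:  # 50 TB
        return "Standard"
    return "Advanced"

# 400 concurrent jobs and 35 TB of data fall within the
# Standard thresholds (< 500 jobs, 2 TB - 50 TB).
print(select_deployment_type(400, 35_000))
```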

Tuning the Application Services

Tune the application services for big data processing.
You tune the application services according to the deployment type that best describes the big data processing requirements in your environment. For each application service, the heap memory is tuned based on the deployment type.
The following table describes how the heap memory is tuned for each application service based on the deployment type:
| Service | Sandbox | Basic | Standard | Advanced |
| --- | --- | --- | --- | --- |
| Analyst Service | 768 MB | 1 GB | 2 GB | 4 GB |
| Content Management Service | 1 GB | 2 GB | 4 GB | 4 GB |
| Data Integration Service | 640 MB | 2 GB | 4 GB | 6 GB |
| Model Repository Service | 1 GB | 1 GB | 2 GB | 4 GB |
| Resource Manager Service | 512 MB | 512 MB | 2 GB | 4 GB |
| Search Service | 768 MB | 1 GB | 2 GB | 4 GB |

Data Integration Service

When you tune the Data Integration Service, the deployment type additionally defines the execution pool size for jobs that run in the native and Hadoop environments.
The following table lists the execution pool size that is tuned in the native and Hadoop environments based on the deployment type:
| Run-time Environment | Sandbox | Basic | Standard | Advanced |
| --- | --- | --- | --- | --- |
| Native | 10 | 10 | 15 | 30 |
| Hadoop | 10 | 500 | 1000 | 2000 |
Note: If the deployment type is Advanced, the Data Integration Service is tuned to run on a grid.

Tuning the Hadoop Run-time Engines

Tune the Blaze and Spark engines based on the deployment type so that each engine meets the big data processing requirements in your environment.

Tuning the Spark Engine

Tune the Spark engine according to a deployment type that defines the big data processing requirements on the Spark engine. When you tune the Spark engine, the autotune command configures the Spark advanced properties in the Hadoop connection.
The following table describes the advanced properties that are tuned:
| Property | Description |
| --- | --- |
| spark.driver.memory | The driver process memory that the Spark engine uses to run mapping jobs. |
| spark.executor.memory | The amount of memory that each executor process uses to run tasklets on the Spark engine. |
| spark.executor.cores | The number of cores that each executor process uses to run tasklets on the Spark engine. |
| spark.sql.shuffle.partitions | The number of partitions that the Spark engine uses to shuffle data to process joins or aggregations in a mapping job. |
The following table lists the tuned value for each advanced property based on the deployment type:
| Property | Sandbox | Basic | Standard | Advanced |
| --- | --- | --- | --- | --- |
| spark.driver.memory | 1 GB | 2 GB | 4 GB | 4 GB |
| spark.executor.memory | 2 GB | 4 GB | 6 GB | 6 GB |
| spark.executor.cores | 2 | 2 | 2 | 2 |
| spark.sql.shuffle.partitions | 100 | 400 | 1500 | 3000 |
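For reference, the tuned values for the Standard deployment type correspond to advanced properties in the Hadoop connection along these lines. This is an illustrative fragment; confirm the exact value format (for example, 4G versus 4096) that your Hadoop connection expects:

```
spark.driver.memory=4G
spark.executor.memory=6G
spark.executor.cores=2
spark.sql.shuffle.partitions=1500
```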

Tuning the Blaze Engine

Tune the Blaze engine to adhere to big data processing requirements on the Blaze engine. When you tune the Blaze engine, the autotune command configures the Blaze advanced properties in the Hadoop connection.
The following table describes the Blaze properties that are tuned:
| Property | Description | Value |
| --- | --- | --- |
| infagrid.orch.scheduler.oop.container.pref.memory | The amount of memory that the Blaze engine uses to run tasklets. | 5120 |
| infagrid.orch.scheduler.oop.container.pref.vcore | The number of DTM instances that run on the Blaze engine. | 4 |
| infagrid.tasklet.dtm.buffer.block.size | The amount of buffer memory that a DTM instance uses to move a block of data in a tasklet. | 6553600 |
Note: The tuned Blaze property values do not depend on the deployment type.

Autotune

Configures services and connections with recommended settings based on the deployment type. Changes take effect after you recycle the services.
For each specified service, the changes to the service take effect on all nodes that are currently configured to run the service, and the changes affect all service processes.
The infacmd autotune Autotune command uses the following syntax:
Autotune

<-DomainName|-dn> domain_name
<-UserName|-un> user_name
<-Password|-pd> password
[<-SecurityDomain|-sdn> security_domain]
[<-ResilienceTimeout|-re> timeout_period_in_seconds]
<-Size|-s> tuning_size_name
[<-ServiceNames|-sn> service_names]
[<-BlazeConnectionNames|-bcn> connection_names]
[<-SparkConnectionNames|-scn> connection_names]
[<-All|-a> yes_or_no]
The infacmd program uses the following common options to connect to the domain: domain name, user name, password, security domain, and resilience timeout. The following table gives brief descriptions of these options. For more information about connecting to the domain, see the Command Reference.
The following table describes infacmd autotune Autotune options and arguments:
| Option | Description |
| --- | --- |
| -DomainName, -dn | Name of the Informatica domain. |
| -UserName, -un | User name to connect to the domain. |
| -Password, -pd | Password for the user name. |
| -SecurityDomain, -sdn | Name of the security domain to which the domain user belongs. |
| -ResilienceTimeout, -re | Amount of time in seconds that infacmd attempts to establish or re-establish a connection to the domain. |
| -Size, -s | Required. The deployment type that represents big data processing requirements based on concurrency and volume. You can enter Sandbox, Basic, Standard, or Advanced. |
| -ServiceNames, -sn | Optional. List of services configured in the Informatica domain, separated by commas. You can tune the following services: Analyst Service, Content Management Service, Data Integration Service, Model Repository Service, Resource Manager Service, and Search Service. Default is none. |
| -BlazeConnectionNames, -bcn | Optional. List of Hadoop connections configured in the Informatica domain, separated by commas. For each Hadoop connection, the command tunes the Blaze configuration properties in the Hadoop connection. Default is none. |
| -SparkConnectionNames, -scn | Optional. List of Hadoop connections configured in the Informatica domain, separated by commas. For each Hadoop connection, the command tunes the Spark configuration properties in the Hadoop connection. Default is none. |
| -All, -a | Optional. Enter yes to apply recommended settings to all Analyst Services, Content Management Services, Data Integration Services, Model Repository Services, Resource Manager Services, Search Services, and Hadoop connections in the Informatica domain. Enter no to apply the recommended settings only to the services and Hadoop connections that you specify. Default is no. |
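For example, a command like the following applies Standard-size tuning to one Data Integration Service and to the Spark properties of one Hadoop connection. The domain, user, service, and connection names shown here are placeholders for your own values:

```
infacmd autotune Autotune -dn MyDomain -un Administrator -pd MyPassword
    -s Standard
    -sn MyDataIntegrationService
    -scn MyHadoopConnection
```

After the command completes, recycle the tuned services so that the changes take effect.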