Tuning for Data Engineering Job Processing
Tune the application services and run-time engines for processing data engineering jobs.
Tune the application services and run-time engines so that they are allocated enough resources to process data engineering jobs.
For example, the Model Repository Service and the Data Integration Service require resources to store run-time data. When you run mappings, you might deploy them to the Data Integration Service and run them on the Blaze engine, which also requires resources to run the mappings. To optimize mapping performance, you must allocate enough resources among the Model Repository Service, the Data Integration Service, and the Blaze engine.
You can tune the application services and run-time engines based on deployment type. A deployment type represents job processing requirements based on concurrency and volume. The deployment type defines the amount of resources that application services and run-time engines require to function efficiently, and how resources should be allocated between the application services and run-time engines.
To tune the application services and run-time engines, assess the deployment type that best describes the environment that you use for processing data engineering jobs. Then select the application services and the run-time engines that you want to tune, and tune them with the infacmd autotune Autotune command.
Deployment Types
A deployment type represents big data processing requirements based on concurrency and volume.
The deployment type defines the amount of resources that application services and run-time engines require to function efficiently, and how resources should be allocated between the application services and run-time engines. The deployment types are Sandbox, Basic, Standard, and Advanced.
The following table describes the deployment types:
| Deployment Type | Description |
|---|---|
| Sandbox | Used for proof of concepts or as a sandbox with minimal users. |
| Basic | Used for low volume processing with low levels of concurrency. |
| Standard | Used for high volume processing with low levels of concurrency. |
| Advanced | Used for high volume processing with high levels of concurrency. |
Each deployment type is described using deployment criteria, a set of characteristics that are common to each deployment type. Use the deployment criteria to identify the deployment type that best fits the environment that you use for big data processing.
The following table defines the deployment criteria:
| Deployment Criteria | Sandbox | Basic | Standard | Advanced |
|---|---|---|---|---|
| Total data volume | 2 - 10 GB | 10 GB - 2 TB | 2 TB - 50 TB | 50 TB+ |
| Number of nodes in the Hadoop environment | 2 - 5 | 5 - 25 | 25 - 100 | 100+ |
| Number of developers | 2 | 5 | 10 | 50 |
| Number of concurrent jobs in the Hadoop environment | < 10 | < 250 | < 500 | 2000+ |
| Number of Model repository objects | < 1000 | < 5000 | 5000 - 20000 | 20000+ |
| Number of deployed applications | < 10 | < 25 | < 100 | < 500 |
| Number of objects per deployed application | < 10 | < 50 | < 100 | < 100 |
For example, you estimate that your environment handles an average of 400 concurrent jobs and a data volume of 35 TB. According to the deployment criteria, the deployment type that best describes your environment is Standard.
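The selection logic in this example can be sketched as a small helper. This is an illustrative sketch, not part of the product: it uses only the data volume and concurrency criteria from the table above, and picks the smallest deployment type whose upper bounds cover both estimates.

```python
# Illustrative sketch: map workload estimates to a deployment type using
# the upper bounds from the deployment criteria table (data volume and
# concurrent jobs only). Not a product API.
DEPLOYMENT_LIMITS = [
    # (deployment type, max data volume in TB, max concurrent jobs)
    ("Sandbox", 0.01, 10),        # 10 GB = 0.01 TB
    ("Basic", 2, 250),
    ("Standard", 50, 500),
    ("Advanced", float("inf"), float("inf")),
]

def suggest_deployment_type(data_volume_tb, concurrent_jobs):
    """Return the smallest deployment type that covers both estimates."""
    for name, max_tb, max_jobs in DEPLOYMENT_LIMITS:
        if data_volume_tb <= max_tb and concurrent_jobs <= max_jobs:
            return name
    return "Advanced"

# The example from the text: 35 TB and 400 concurrent jobs.
print(suggest_deployment_type(35, 400))  # -> Standard
```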
Tuning the Application Services
Tune the application services for big data processing.
You tune the application services according to the deployment type that best describes the big data processing requirements in your environment. For each application service, the heap memory is tuned based on the deployment type.
The following table describes how the heap memory is tuned for each application service based on the deployment type:
| Service | Sandbox | Basic | Standard | Advanced |
|---|---|---|---|---|
| Analyst Service | 768 MB | 1 GB | 2 GB | 4 GB |
| Content Management Service | 1 GB | 2 GB | 4 GB | 4 GB |
| Data Integration Service | 640 MB | 2 GB | 4 GB | 6 GB |
| Model Repository Service | 1 GB | 1 GB | 2 GB | 4 GB |
| Resource Manager Service | 512 MB | 512 MB | 2 GB | 4 GB |
| Search Service | 768 MB | 1 GB | 2 GB | 4 GB |
Data Integration Service
When you tune the Data Integration Service, the deployment type additionally defines the execution pool size for jobs that run in the native and Hadoop environments.
The following table lists the execution pool size that is tuned in the native and Hadoop environments based on the deployment type:
| Run-time Environment | Sandbox | Basic | Standard | Advanced |
|---|---|---|---|---|
| Native | 10 | 10 | 15 | 30 |
| Hadoop | 10 | 500 | 1000 | 2000 |
Note: If the deployment type is Advanced, the Data Integration Service is tuned to run on a grid.
Tuning the Hadoop Run-time Engines
Tune the Blaze and Spark engines based on the deployment type so that they meet the big data processing requirements in your environment.
Tuning the Spark Engine
Tune the Spark engine according to the deployment type that defines the big data processing requirements on the Spark engine. When you tune the Spark engine, the Autotune command configures the Spark advanced properties in the Hadoop connection.
The following table describes the advanced properties that are tuned:
| Property | Description |
|---|---|
| spark.driver.memory | The driver process memory that the Spark engine uses to run mapping jobs. |
| spark.executor.memory | The amount of memory that each executor process uses to run tasklets on the Spark engine. |
| spark.executor.cores | The number of cores that each executor process uses to run tasklets on the Spark engine. |
| spark.sql.shuffle.partitions | The number of partitions that the Spark engine uses to shuffle data to process joins or aggregations in a mapping job. |
The following table lists the tuned value for each advanced property based on the deployment type:
| Property | Sandbox | Basic | Standard | Advanced |
|---|---|---|---|---|
| spark.driver.memory | 1 GB | 2 GB | 4 GB | 4 GB |
| spark.executor.memory | 2 GB | 4 GB | 6 GB | 6 GB |
| spark.executor.cores | 2 | 2 | 2 | 2 |
| spark.sql.shuffle.partitions | 100 | 400 | 1500 | 3000 |
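For example, for a Standard deployment type, the tuned values above correspond to Spark advanced properties of the following form. The property names are standard Spark configuration properties; the exact notation in the Hadoop connection's advanced properties field may vary:

```
spark.driver.memory=4G
spark.executor.memory=6G
spark.executor.cores=2
spark.sql.shuffle.partitions=1500
```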
Tuning the Blaze Engine
Tune the Blaze engine to meet the big data processing requirements in your environment. When you tune the Blaze engine, the Autotune command configures the Blaze advanced properties in the Hadoop connection.
The following table describes the Blaze properties that are tuned:
| Property | Description | Value* |
|---|---|---|
| infagrid.orch.scheduler.oop.container.pref.memory | The amount of memory that the Blaze engine uses to run tasklets. | 5120 |
| infagrid.orch.scheduler.oop.container.pref.vcore | The number of DTM instances that run on the Blaze engine. | 4 |
| infagrid.tasklet.dtm.buffer.block.size | The amount of buffer memory that a DTM instance uses to move a block of data in a tasklet. | 6553600 |

*The tuned values do not depend on the deployment type.
Autotune
Configures services and connections with recommended settings based on the deployment type. Changes take effect after you recycle the services.
For each specified service, the changes to the service take effect on all nodes that are currently configured to run the service, and the changes affect all service processes.
The infacmd autotune Autotune command uses the following syntax:
```
Autotune
<-DomainName|-dn> domain_name
<-UserName|-un> user_name
<-Password|-pd> password
[<-SecurityDomain|-sdn> security_domain]
[<-ResilienceTimeout|-re> timeout_period_in_seconds]
<-Size|-s> tuning_size_name
[<-ServiceNames|-sn> service_names]
[<-BlazeConnectionNames|-bcn> connection_names]
[<-SparkConnectionNames|-scn> connection_names]
[<-All|-a> yes_or_no]
```
The infacmd program uses the following common options to connect to the domain: domain name, user name, password, security domain, and resilience timeout. The following table provides brief descriptions of these options. For more information about connecting to the domain, see the Command Reference.
The following table describes infacmd autotune Autotune options and arguments:
| Option | Description |
|---|---|
| -DomainName -dn | Name of the Informatica domain. |
| -UserName -un | User name to connect to the domain. |
| -Password -pd | Password for the user name. |
| -SecurityDomain -sdn | Name of the security domain to which the domain user belongs. |
| -ResilienceTimeout -re | Amount of time in seconds that infacmd attempts to establish or re-establish a connection to the domain. |
| -Size -s | Required. The deployment type that represents big data processing requirements based on concurrency and volume. You can enter Sandbox, Basic, Standard, or Advanced. |
| -ServiceNames -sn | Optional. List of services configured in the Informatica domain. Separate each service name with a comma. You can tune the following services: Analyst Service, Content Management Service, Data Integration Service, Model Repository Service, Resource Manager Service, and Search Service. Default is none. |
| -BlazeConnectionNames -bcn | Optional. List of Hadoop connections configured in the Informatica domain. For each Hadoop connection, the command tunes Blaze configuration properties in the Hadoop connection. Separate each Hadoop connection name with a comma. Default is none. |
| -SparkConnectionNames -scn | Optional. List of Hadoop connections configured in the Informatica domain. For each Hadoop connection, the command tunes Spark configuration properties in the Hadoop connection. Separate each Hadoop connection name with a comma. Default is none. |
| -All -a | Optional. Enter yes to apply recommended settings to all Analyst Services, Content Management Services, Data Integration Services, Model Repository Services, Resource Manager Services, Search Services, and Hadoop connections in the Informatica domain. Enter no to apply the recommended settings only to the services and Hadoop connections that you specify. Default is no. |
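For example, the following command tunes two services and the Spark properties of one Hadoop connection for a Standard deployment type. The domain, user, service, and connection names are placeholders for your own environment; recycle the tuned services afterward so that the changes take effect:

```shell
infacmd autotune Autotune -dn MyDomain -un Administrator -pd MyPassword \
    -s Standard \
    -sn "DIS_1,MRS_1" \
    -scn "My_Hadoop_Connection"
```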