
Tuning for Data Engineering Job Processing

Tune the application services and run-time engines for processing data engineering jobs.
Tuning ensures that the application services and the run-time engines are allocated enough resources to perform data engineering jobs efficiently.
For example, the Model Repository Service and the Data Integration Service require resources to store run-time data. You might deploy mappings to the Data Integration Service and run them on the Blaze engine, which also requires resources. Allocate enough resources among the Model Repository Service, the Data Integration Service, and the Blaze engine to optimize mapping performance.
You can tune the application services and run-time engines based on deployment type. A deployment type represents job processing requirements based on concurrency and volume. The deployment type defines the amount of resources that application services and run-time engines require to function efficiently, and how resources should be allocated between the application services and run-time engines.
To tune the application services and run-time engines, first assess the deployment type that best describes the environment that you use for processing data engineering jobs. Then select the application services and the run-time engines that you want to tune, and tune them with the infacmd autotune Autotune command.

Deployment Types

A deployment type represents big data processing requirements based on concurrency and volume.
The deployment type defines the amount of resources that application services and run-time engines require to function efficiently, and how resources should be allocated between the application services and run-time engines. The deployment types are Sandbox, Basic, Standard, and Advanced.
The following table describes the deployment types:

| Deployment Type | Description |
| --- | --- |
| Sandbox | Used for proofs of concept or as a sandbox with minimal users. |
| Basic | Used for low-volume processing with low levels of concurrency. |
| Standard | Used for high-volume processing with low levels of concurrency. |
| Advanced | Used for high-volume processing with high levels of concurrency. |
Each deployment type is described using deployment criteria, a set of characteristics common to all environments. Use the deployment criteria to identify the deployment type that best fits the environment that you use for big data processing.
The following table defines the deployment criteria:

| Deployment Criteria | Sandbox | Basic | Standard | Advanced |
| --- | --- | --- | --- | --- |
| Total data volume | 2 - 10 GB | 10 GB - 2 TB | 2 TB - 50 TB | 50 TB+ |
| Number of nodes in the Hadoop environment | 2 - 5 | 5 - 25 | 25 - 100 | 100+ |
| Number of developers | 2 | 5 | 10 | 50 |
| Number of concurrent jobs in the Hadoop environment | < 10 | < 250 | < 500 | 2000+ |
| Number of Model repository objects | < 1000 | < 5000 | 5000 - 20000 | 20000+ |
| Number of deployed applications | < 10 | < 25 | < 100 | < 500 |
| Number of objects per deployed application | < 10 | < 50 | < 100 | < 100 |
For example, you estimate that your environment handles an average of 400 concurrent jobs and a data volume of 35 TB. According to the deployment criteria, the deployment type that best describes your environment is Standard.
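As a rough sketch, the assessment above can be scripted as a lookup. The function name and the boundary handling are assumptions; the thresholds come from the concurrent-jobs and data-volume rows of the criteria table, with the data volume expressed in whole terabytes.

```shell
# Sketch only: classify an environment by concurrent jobs and data volume (TB).
# Thresholds approximate the deployment criteria table; exact boundary
# behavior is an assumption, not defined by the guide.
pick_deployment_type() {
  jobs=$1
  volume_tb=$2
  if [ "$jobs" -ge 500 ] || [ "$volume_tb" -ge 50 ]; then
    echo Advanced
  elif [ "$jobs" -ge 250 ] || [ "$volume_tb" -ge 2 ]; then
    echo Standard
  elif [ "$jobs" -ge 10 ] || [ "$volume_tb" -ge 1 ]; then
    echo Basic
  else
    echo Sandbox
  fi
}

# The example from the text: 400 concurrent jobs, 35 TB.
pick_deployment_type 400 35   # prints: Standard
```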

Tuning the Application Services

Tune the application services for big data processing.
You tune the application services according to the deployment type that best describes the big data processing requirements in your environment. For each application service, the heap memory is tuned based on the deployment type.
The following table describes how the heap memory is tuned for each application service based on the deployment type:
| Service | Sandbox | Basic | Standard | Advanced |
| --- | --- | --- | --- | --- |
| Analyst Service | 768 MB | 1 GB | 2 GB | 4 GB |
| Content Management Service | 1 GB | 2 GB | 4 GB | 4 GB |
| Data Integration Service | 640 MB | 2 GB | 4 GB | 6 GB |
| Model Repository Service | 1 GB | 1 GB | 2 GB | 4 GB |
| Resource Manager Service | 512 MB | 512 MB | 2 GB | 4 GB |
| Search Service | 768 MB | 1 GB | 2 GB | 4 GB |

Data Integration Service

When you tune the Data Integration Service, the deployment type additionally defines the execution pool size for jobs that run in the native and Hadoop environments.
The following table lists the execution pool size that is tuned in the native and Hadoop environments based on the deployment type:
| Run-time Environment | Sandbox | Basic | Standard | Advanced |
| --- | --- | --- | --- | --- |
| Native | 10 | 10 | 15 | 30 |
| Hadoop | 10 | 500 | 1000 | 2000 |
Note: If the deployment type is Advanced, the Data Integration Service is tuned to run on a grid.
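The execution pool sizes above can be expressed as a quick lookup, which can be handy in capacity-planning scripts. This is a sketch; the function name is hypothetical, and the values come directly from the table.

```shell
# Sketch: execution pool sizes per deployment type, from the table above.
# Output format "native=N hadoop=M" is an illustrative choice.
pool_sizes() {
  case "$1" in
    Sandbox)  echo "native=10 hadoop=10" ;;
    Basic)    echo "native=10 hadoop=500" ;;
    Standard) echo "native=15 hadoop=1000" ;;
    Advanced) echo "native=30 hadoop=2000" ;;
    *)        echo "unknown deployment type" >&2; return 1 ;;
  esac
}

pool_sizes Standard   # prints: native=15 hadoop=1000
```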

Tuning the Hadoop Run-time Engines

Tune the Blaze and Spark engines based on the deployment type so that they adhere to big data processing requirements.

Tuning the Spark Engine

Tune the Spark engine according to the deployment type that defines the big data processing requirements. When you tune the Spark engine, the autotune command configures the Spark advanced properties in the Hadoop connection.
The following table describes the advanced properties that are tuned:
| Property | Description |
| --- | --- |
| spark.driver.memory | The driver process memory that the Spark engine uses to run mapping jobs. |
| spark.executor.memory | The amount of memory that each executor process uses to run tasklets on the Spark engine. |
| spark.executor.cores | The number of cores that each executor process uses to run tasklets on the Spark engine. |
| spark.sql.shuffle.partitions | The number of partitions that the Spark engine uses to shuffle data to process joins or aggregations in a mapping job. |
The following table lists the tuned value for each advanced property based on the deployment type:
| Property | Sandbox | Basic | Standard | Advanced |
| --- | --- | --- | --- | --- |
| spark.driver.memory | 1 GB | 2 GB | 4 GB | 4 GB |
| spark.executor.memory | 2 GB | 4 GB | 6 GB | 6 GB |
| spark.executor.cores | 2 | 2 | 2 | 2 |
| spark.sql.shuffle.partitions | 100 | 400 | 1500 | 3000 |
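The four properties above are standard Apache Spark settings, so for comparison the Standard-deployment column can be written as generic Spark configuration flags. The sketch below only prints the flags; it does not submit a job, and the mapping of the table values to flag syntax is illustrative.

```shell
# Illustrative only: the Standard-column values expressed as generic
# spark-submit configuration flags. The script prints the flags for review.
SPARK_OPTS="--conf spark.driver.memory=4g \
--conf spark.executor.memory=6g \
--conf spark.executor.cores=2 \
--conf spark.sql.shuffle.partitions=1500"

echo "$SPARK_OPTS"
```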

Tuning the Blaze Engine

Tune the Blaze engine to adhere to big data processing requirements. When you tune the Blaze engine, the autotune command configures the Blaze advanced properties in the Hadoop connection.
The following table describes the Blaze properties that are tuned:
| Property | Description | Value |
| --- | --- | --- |
| infagrid.orch.scheduler.oop.container.pref.memory | The amount of memory that the Blaze engine uses to run tasklets. | 5120 |
| infagrid.orch.scheduler.oop.container.pref.vcore | The number of DTM instances that run on the Blaze engine. | 4 |
| infagrid.tasklet.dtm.buffer.block.size | The amount of buffer memory that a DTM instance uses to move a block of data in a tasklet. | 6553600 |

The tuned values do not depend on the deployment type.

Autotune

Configures services and connections with recommended settings based on the deployment type. The changes take effect after you recycle the services.
For each service that you specify, the changes take effect on all nodes that are currently configured to run the service, and the changes affect all service processes.
The infacmd autotune Autotune command uses the following syntax:
Autotune

<-DomainName|-dn> domain_name
<-UserName|-un> user_name
<-Password|-pd> password
[<-SecurityDomain|-sdn> security_domain]
[<-ResilienceTimeout|-re> timeout_period_in_seconds]
<-Size|-s> tuning_size_name
[<-ServiceNames|-sn> service_names]
[<-BlazeConnectionNames|-bcn> connection_names]
[<-SparkConnectionNames|-scn> connection_names]
[<-All|-a> yes_or_no]
The infacmd program connects to the domain through the following common options: domain name, user name, password, security domain, and resilience timeout. The options table provides short descriptions. For more information about connecting to the domain, see the Command Reference.
The following table describes infacmd Autotune options and arguments:

| Option | Description |
| --- | --- |
| -DomainName / -dn | Name of the Informatica domain. |
| -UserName / -un | User name to connect to the domain. |
| -Password / -pd | Password for the user name. |
| -SecurityDomain / -sdn | Name of the security domain that the domain user belongs to. |
| -ResilienceTimeout / -re | Amount of time, in seconds, that infacmd attempts to establish or reestablish a connection to the domain. |
| -Size / -s | Required. The deployment type that represents big data processing requirements based on concurrency and volume. You can enter Sandbox, Basic, Standard, or Advanced. |
| -ServiceNames / -sn | Optional. List of services that are configured in the Informatica domain. Separate each service name with a comma. You can tune the following services: Analyst Service, Content Management Service, Data Integration Service, Model Repository Service, Resource Manager Service, and Search Service. Default is None. |
| -BlazeConnectionNames / -bcn | Optional. List of Hadoop connections that are configured in the Informatica domain. For each Hadoop connection, the command tunes the Blaze configuration properties in the Hadoop connection. Separate each Hadoop connection name with a comma. Default is None. |
| -SparkConnectionNames / -scn | Optional. List of Hadoop connections that are configured in the Informatica domain. For each Hadoop connection, the command tunes the Spark configuration properties in the Hadoop connection. Separate each Hadoop connection name with a comma. Default is None. |
| -All / -a | Optional. Enter yes to apply the recommended settings to all Analyst Services, Content Management Services, Data Integration Services, Model Repository Services, Resource Manager Services, Search Services, and Hadoop connections in the Informatica domain. Enter no to apply the recommended settings only to the services and Hadoop connections that you specify. Default is no. |
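Putting the options together: the sketch below builds an invocation that applies Advanced-deployment settings to every supported service and Hadoop connection in the domain. It only prints the command for review; the domain name and credentials are placeholders, and remember that the changes take effect after you recycle the services.

```shell
# Sketch: apply recommended Advanced settings domain-wide with -a yes.
# MyDomain, Administrator, and MyPassword are placeholders; -a yes makes
# -sn, -bcn, and -scn unnecessary because everything is tuned.
CMD='infacmd.sh autotune Autotune -dn MyDomain -un Administrator -pd MyPassword -s Advanced -a yes'

# Echo first to review; remove the echo to run it against a real domain.
echo "$CMD"
```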