Tuning for Data Engineering Job Processing
Tune the application services and run-time engines for processing data engineering jobs.
Tune the application services and run-time engines to ensure that they are allocated enough resources to process data engineering jobs.
For example, the Model Repository Service and the Data Integration Service require resources to store run-time data. When you run mappings, you might deploy them to the Data Integration Service and run them on the Blaze engine. The Blaze engine also requires resources to run the mappings. You must allocate enough resources among the Model Repository Service, the Data Integration Service, and the Blaze engine to optimize mapping performance.
You can tune the application services and run-time engines based on deployment type. A deployment type represents job processing requirements based on concurrency and volume. The deployment type defines the amount of resources that application services and run-time engines require to function efficiently, and how resources should be allocated between the application services and run-time engines.
To tune the application services and run-time engines, assess the deployment type that best describes the environment that you use to process data engineering jobs. Then select the application services and run-time engines that you want to tune, and tune them with the infacmd autotune Autotune command.
Deployment Types
A deployment type represents big data processing requirements based on concurrency and volume.
The deployment type defines the amount of resources that application services and run-time engines require to function efficiently, and how resources should be allocated between the application services and run-time engines. The deployment types are Sandbox, Basic, Standard, and Advanced.
The following table describes the deployment types:
| Deployment Type | Description |
|---|---|
| Sandbox | Used for proofs of concept or as a sandbox with minimal users. |
| Basic | Used for low volume processing with low levels of concurrency. |
| Standard | Used for high volume processing with low levels of concurrency. |
| Advanced | Used for high volume processing with high levels of concurrency. |
Each deployment type is described by deployment criteria, a set of characteristics common to environments of that type. Use the deployment criteria to identify the deployment type that best fits the environment that you use for big data processing.
The following table defines the deployment criteria:
| Deployment Criteria | Sandbox | Basic | Standard | Advanced |
|---|---|---|---|---|
| Total data volume | 2-10 GB | 10 GB - 2 TB | 2 TB - 50 TB | 50 TB+ |
| Number of nodes in the Hadoop environment | 2 - 5 | 5 - 25 | 25 - 100 | 100+ |
| Number of developers | 2 | 5 | 10 | 50 |
| Number of concurrent jobs in the Hadoop environment | < 10 | < 250 | < 500 | 2000+ |
| Number of Model repository objects | < 1000 | < 5000 | 5000 - 20000 | 20000+ |
| Number of deployed applications | < 10 | < 25 | < 100 | < 500 |
| Number of objects per deployed application | < 10 | < 50 | < 100 | < 100 |
For example, you estimate that your environment handles an average of 400 concurrent jobs and a data volume of 35 TB. According to the deployment criteria, the deployment type that best describes your environment is Standard.
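The assessment in this example can be sketched as a simple lookup. This is an illustrative helper, not part of the product: the thresholds come from the deployment criteria table above, the function name is hypothetical, and only the two criteria from the example (concurrency and volume) are considered.

```python
# Illustrative sketch: pick the smallest deployment type whose criteria
# cover the estimated workload. Thresholds come from the deployment
# criteria table; the helper itself is hypothetical.

# Upper bounds per deployment type: (max concurrent jobs, max data volume in TB)
DEPLOYMENT_CRITERIA = [
    ("Sandbox", 10, 0.01),                     # < 10 jobs, 2-10 GB
    ("Basic", 250, 2),                         # < 250 jobs, 10 GB - 2 TB
    ("Standard", 500, 50),                     # < 500 jobs, 2 TB - 50 TB
    ("Advanced", float("inf"), float("inf")),  # 2000+ jobs, 50 TB+
]

def assess_deployment_type(concurrent_jobs, volume_tb):
    """Return the smallest deployment type that covers the workload."""
    for name, max_jobs, max_volume in DEPLOYMENT_CRITERIA:
        if concurrent_jobs < max_jobs and volume_tb <= max_volume:
            return name
    return "Advanced"

# 400 concurrent jobs and 35 TB fall within the Standard criteria.
print(assess_deployment_type(400, 35))  # Standard
```

In practice you would also weigh the remaining criteria, such as the number of nodes, developers, and repository objects, and choose the deployment type that matches most of them.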
Tuning the Application Services
Tune the application services for big data processing.
You tune the application services according to the deployment type that best describes the big data processing requirements in your environment. For each application service, the heap memory is tuned based on the deployment type.
The following table describes how the heap memory is tuned for each application service based on the deployment type:
| Service | Sandbox | Basic | Standard | Advanced |
|---|---|---|---|---|
| Analyst Service | 768 MB | 1 GB | 2 GB | 4 GB |
| Content Management Service | 1 GB | 2 GB | 4 GB | 4 GB |
| Data Integration Service | 640 MB | 2 GB | 4 GB | 6 GB |
| Model Repository Service | 1 GB | 1 GB | 2 GB | 4 GB |
| Resource Manager Service | 512 MB | 512 MB | 2 GB | 4 GB |
| Search Service | 768 MB | 1 GB | 2 GB | 4 GB |
Data Integration Service
When you tune the Data Integration Service, the deployment type additionally defines the execution pool size for jobs that run in the native and Hadoop environments.
The following table lists the execution pool size that is tuned in the native and Hadoop environments based on the deployment type:
| Run-time Environment | Sandbox | Basic | Standard | Advanced |
|---|---|---|---|---|
| Native | 10 | 10 | 15 | 30 |
| Hadoop | 10 | 500 | 1000 | 2000 |
Note: If the deployment type is Advanced, the Data Integration Service is tuned to run on a grid.
Tuning the Hadoop Run-time Engines
Tune the Blaze and Spark engines based on the deployment type so that they meet big data processing requirements.
Tuning the Spark Engine
Tune the Spark engine according to the deployment type that defines the big data processing requirements on the Spark engine. When you tune the Spark engine, the Autotune command configures the Spark advanced properties in the Hadoop connection.
The following table describes the advanced properties that are tuned:
| Property | Description |
|---|---|
| spark.driver.memory | The driver process memory that the Spark engine uses to run mapping jobs. |
| spark.executor.memory | The amount of memory that each executor process uses to run tasklets on the Spark engine. |
| spark.executor.cores | The number of cores that each executor process uses to run tasklets on the Spark engine. |
| spark.sql.shuffle.partitions | The number of partitions that the Spark engine uses to shuffle data to process joins or aggregations in a mapping job. |
The following table lists the tuned value for each advanced property based on the deployment type:
| Property | Sandbox | Basic | Standard | Advanced |
|---|---|---|---|---|
| spark.driver.memory | 1 GB | 2 GB | 4 GB | 4 GB |
| spark.executor.memory | 2 GB | 4 GB | 6 GB | 6 GB |
| spark.executor.cores | 2 | 2 | 2 | 2 |
| spark.sql.shuffle.partitions | 100 | 400 | 1500 | 3000 |
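The tuned values can be expressed as a plain configuration map, which is convenient for review or comparison. This is an illustrative sketch only: the dictionary and helper are hypothetical, the values are copied from the table above, and autotune itself writes the properties into the Hadoop connection rather than passing them on a command line.

```python
# Illustrative sketch: the Spark advanced properties per deployment type,
# rendered as spark-submit style --conf arguments for inspection.
# Values come from the tuning table; the helper itself is hypothetical.

SPARK_TUNED_PROPERTIES = {
    "Sandbox":  {"spark.driver.memory": "1g", "spark.executor.memory": "2g",
                 "spark.executor.cores": "2", "spark.sql.shuffle.partitions": "100"},
    "Basic":    {"spark.driver.memory": "2g", "spark.executor.memory": "4g",
                 "spark.executor.cores": "2", "spark.sql.shuffle.partitions": "400"},
    "Standard": {"spark.driver.memory": "4g", "spark.executor.memory": "6g",
                 "spark.executor.cores": "2", "spark.sql.shuffle.partitions": "1500"},
    "Advanced": {"spark.driver.memory": "4g", "spark.executor.memory": "6g",
                 "spark.executor.cores": "2", "spark.sql.shuffle.partitions": "3000"},
}

def spark_conf_args(deployment_type):
    """Render the tuned properties as sorted --conf arguments."""
    props = SPARK_TUNED_PROPERTIES[deployment_type]
    return [f"--conf {key}={value}" for key, value in sorted(props.items())]

print("\n".join(spark_conf_args("Standard")))
```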
Tuning the Blaze Engine
Tune the Blaze engine to meet big data processing requirements on the Blaze engine. When you tune the Blaze engine, the Autotune command configures the Blaze advanced properties in the Hadoop connection.
The following table describes the Blaze properties that are tuned:
| Property | Description | Value |
|---|---|---|
| infagrid.orch.scheduler.oop.container.pref.memory | The amount of memory that the Blaze engine uses to run tasklets. | 5120 |
| infagrid.orch.scheduler.oop.container.pref.vcore | The number of DTM instances that run on the Blaze engine. | 4 |
| infagrid.tasklet.dtm.buffer.block.size | The amount of buffer memory that a DTM instance uses to move a block of data in a tasklet. | 6553600 |

Note: The tuned values do not depend on the deployment type.
Autotune
Configures services and connections with recommended settings based on the deployment type. The changes take effect after you restart the services.
For each service that you specify, the changes take effect on all nodes that are currently configured to run the service, and the changes affect all service processes.
The infacmd autotune Autotune command uses the following syntax:
Autotune
<-DomainName|-dn> domain_name
<-UserName|-un> user_name
<-Password|-pd> password
[<-SecurityDomain|-sdn> security_domain]
[<-ResilienceTimeout|-re> timeout_period_in_seconds]
<-Size|-s> tuning_size_name
[<-ServiceNames|-sn> service_names]
[<-BlazeConnectionNames|-bcn> connection_names]
[<-SparkConnectionNames|-scn> connection_names]
[<-All|-a> yes_or_no]
The infacmd program connects to the domain through the following common options: domain name, user name, password, security domain, and resilience timeout. The options table provides short descriptions. For more information about connecting to the domain, see the Command Reference.
The following table describes infacmd autotune Autotune options and arguments:
| Option | Description |
|---|---|
| -DomainName, -dn | Name of the Informatica domain. |
| -UserName, -un | User name to connect to the domain. |
| -Password, -pd | Password for the user name. |
| -SecurityDomain, -sdn | Name of the security domain to which the domain user belongs. |
| -ResilienceTimeout, -re | Amount of time in seconds that infacmd attempts to establish or re-establish a connection to the domain. |
| -Size, -s | Required. The deployment type that represents big data processing requirements based on concurrency and volume. You can enter Sandbox, Basic, Standard, or Advanced. |
| -ServiceNames, -sn | Optional. List of services that are configured in the Informatica domain. Separate each service name with a comma. You can tune the following services: Analyst Service, Content Management Service, Data Integration Service, Model Repository Service, Resource Manager Service, and Search Service. Default is None. |
| -BlazeConnectionNames, -bcn | Optional. List of Hadoop connections that are configured in the Informatica domain. For each Hadoop connection, the command tunes the Blaze configuration properties in the Hadoop connection. Separate each Hadoop connection name with a comma. Default is None. |
| -SparkConnectionNames, -scn | Optional. List of Hadoop connections that are configured in the Informatica domain. For each Hadoop connection, the command tunes the Spark configuration properties in the Hadoop connection. Separate each Hadoop connection name with a comma. Default is None. |
| -All, -a | Optional. Enter yes to apply the recommended settings to all Analyst Services, Content Management Services, Data Integration Services, Model Repository Services, Resource Manager Services, Search Services, and Hadoop connections in the Informatica domain. Enter no to apply the recommended settings only to the services and Hadoop connections that you specify. Default is no. |
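For example, the following command tunes one Data Integration Service and the Spark properties of one Hadoop connection for a Standard deployment. The domain, user, password, service, and connection names shown here are placeholders; replace them with the names used in your environment.

```shell
infacmd autotune Autotune \
    -dn MyDomain \
    -un Administrator \
    -pd MyPassword \
    -s Standard \
    -sn DataIntegrationService1 \
    -scn HadoopConnection1
```

Restart the tuned services for the changes to take effect.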