Tuning for Data Engineering Job Processing
Tune the application services and run-time engines for processing data engineering jobs.
Tune the application services and run-time engines to ensure that they are allocated enough resources to process data engineering jobs.
For example, the Model Repository Service and the Data Integration Service require resources to store run-time data. When you run mappings, you might deploy them to the Data Integration Service and run them on the Blaze engine. The Blaze engine also requires resources to run the mappings. You must allocate enough resources among the Model Repository Service, the Data Integration Service, and the Blaze engine to optimize mapping performance.
You can tune the application services and run-time engines based on deployment type. A deployment type represents job processing requirements based on concurrency and volume. The deployment type defines the amount of resources that application services and run-time engines require to function efficiently, and how resources should be allocated between the application services and run-time engines.
To tune the application services and run-time engines, assess the deployment type that best describes the environment that you use to process data engineering jobs. Then select the application services and run-time engines that you want to tune, and tune them with the infacmd autotune Autotune command.
Deployment Types
A deployment type represents big data processing requirements based on concurrency and volume.
The deployment type defines the amount of resources that application services and run-time engines require to function efficiently, and how resources should be allocated between the application services and run-time engines. The deployment types are Sandbox, Basic, Standard, and Advanced.
The following table describes the deployment types:
| Deployment Type | Description |
|---|---|
| Sandbox | Used for proofs of concept or as a sandbox with minimal users. |
| Basic | Used for low volume processing with low levels of concurrency. |
| Standard | Used for high volume processing with low levels of concurrency. |
| Advanced | Used for high volume processing with high levels of concurrency. |
Each deployment type is described by deployment criteria, a set of characteristics common to environments of that type. Use the deployment criteria to identify the deployment type that best fits the environment that you use for big data processing.
The following table defines the deployment criteria:
| Deployment Criteria | Sandbox | Basic | Standard | Advanced |
|---|---|---|---|---|
| Total data volume | 2-10 GB | 10 GB - 2 TB | 2 TB - 50 TB | 50 TB+ |
| Number of nodes in the Hadoop environment | 2 - 5 | 5 - 25 | 25 - 100 | 100+ |
| Number of developers | 2 | 5 | 10 | 50 |
| Number of concurrent jobs in the Hadoop environment | < 10 | < 250 | < 500 | 2000+ |
| Number of Model repository objects | < 1000 | < 5000 | 5000 - 20000 | 20000+ |
| Number of deployed applications | < 10 | < 25 | < 100 | < 500 |
| Number of objects per deployed application | < 10 | < 50 | < 100 | < 100 |
For example, you estimate that your environment handles an average of 400 concurrent jobs and a data volume of 35 TB. According to the deployment criteria, the deployment type that best describes your environment is Standard.
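The assessment in this example can be sketched as a simple lookup. This is an illustrative helper, not part of the product: the thresholds come from the deployment criteria table above, the function name is hypothetical, and only the two criteria from the example (concurrency and volume) are considered.

```python
# Illustrative sketch: pick the smallest deployment type whose criteria
# cover the estimated workload. Thresholds come from the deployment
# criteria table; the helper itself is hypothetical.

# Upper bounds per deployment type: (max concurrent jobs, max data volume in TB)
DEPLOYMENT_CRITERIA = [
    ("Sandbox", 10, 0.01),                     # < 10 jobs, 2-10 GB
    ("Basic", 250, 2),                         # < 250 jobs, 10 GB - 2 TB
    ("Standard", 500, 50),                     # < 500 jobs, 2 TB - 50 TB
    ("Advanced", float("inf"), float("inf")),  # 2000+ jobs, 50 TB+
]

def assess_deployment_type(concurrent_jobs, volume_tb):
    """Return the smallest deployment type that covers the workload."""
    for name, max_jobs, max_volume in DEPLOYMENT_CRITERIA:
        if concurrent_jobs < max_jobs and volume_tb <= max_volume:
            return name
    return "Advanced"

# 400 concurrent jobs and 35 TB fall within the Standard criteria.
print(assess_deployment_type(400, 35))  # Standard
```

In practice you would also weigh the remaining criteria, such as the number of nodes, developers, and repository objects, and choose the deployment type that matches most of them.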
Tuning the Application Services
Tune the application services for big data processing.
You tune the application services according to the deployment type that best describes the big data processing requirements in your environment. For each application service, the heap memory is tuned based on the deployment type.
The following table describes how the heap memory is tuned for each application service based on the deployment type:
| Service | Sandbox | Basic | Standard | Advanced |
|---|---|---|---|---|
| Analyst Service | 768 MB | 1 GB | 2 GB | 4 GB |
| Content Management Service | 1 GB | 2 GB | 4 GB | 4 GB |
| Data Integration Service | 640 MB | 2 GB | 4 GB | 6 GB |
| Model Repository Service | 1 GB | 1 GB | 2 GB | 4 GB |
| Resource Manager Service | 512 MB | 512 MB | 2 GB | 4 GB |
| Search Service | 768 MB | 1 GB | 2 GB | 4 GB |
Data Integration Service
When you tune the Data Integration Service, the deployment type additionally defines the execution pool size for jobs that run in the native and Hadoop environments.
The following table lists the execution pool size that is tuned in the native and Hadoop environments based on the deployment type:
| Run-time Environment | Sandbox | Basic | Standard | Advanced |
|---|---|---|---|---|
| Native | 10 | 10 | 15 | 30 |
| Hadoop | 10 | 500 | 1000 | 2000 |
Note: If the deployment type is Advanced, the Data Integration Service is tuned to run on a grid.
Tuning the Hadoop Run-time Engines
Tune the Blaze and Spark engines based on the deployment type so that they meet big data processing requirements.
Tuning the Spark Engine
Tune the Spark engine according to the deployment type that defines the big data processing requirements on the Spark engine. When you tune the Spark engine, the Autotune command configures the Spark advanced properties in the Hadoop connection.
The following table describes the advanced properties that are tuned:
| Property | Description |
|---|---|
| spark.driver.memory | The driver process memory that the Spark engine uses to run mapping jobs. |
| spark.executor.memory | The amount of memory that each executor process uses to run tasklets on the Spark engine. |
| spark.executor.cores | The number of cores that each executor process uses to run tasklets on the Spark engine. |
| spark.sql.shuffle.partitions | The number of partitions that the Spark engine uses to shuffle data to process joins or aggregations in a mapping job. |
The following table lists the tuned value for each advanced property based on the deployment type:
| Property | Sandbox | Basic | Standard | Advanced |
|---|---|---|---|---|
| spark.driver.memory | 1 GB | 2 GB | 4 GB | 4 GB |
| spark.executor.memory | 2 GB | 4 GB | 6 GB | 6 GB |
| spark.executor.cores | 2 | 2 | 2 | 2 |
| spark.sql.shuffle.partitions | 100 | 400 | 1500 | 3000 |
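The tuned values can be expressed as a plain configuration map, which is convenient for review or comparison. This is an illustrative sketch only: the dictionary and helper are hypothetical, the values are copied from the table above, and autotune itself writes the properties into the Hadoop connection rather than passing them on a command line.

```python
# Illustrative sketch: the Spark advanced properties per deployment type,
# rendered as spark-submit style --conf arguments for inspection.
# Values come from the tuning table; the helper itself is hypothetical.

SPARK_TUNED_PROPERTIES = {
    "Sandbox":  {"spark.driver.memory": "1g", "spark.executor.memory": "2g",
                 "spark.executor.cores": "2", "spark.sql.shuffle.partitions": "100"},
    "Basic":    {"spark.driver.memory": "2g", "spark.executor.memory": "4g",
                 "spark.executor.cores": "2", "spark.sql.shuffle.partitions": "400"},
    "Standard": {"spark.driver.memory": "4g", "spark.executor.memory": "6g",
                 "spark.executor.cores": "2", "spark.sql.shuffle.partitions": "1500"},
    "Advanced": {"spark.driver.memory": "4g", "spark.executor.memory": "6g",
                 "spark.executor.cores": "2", "spark.sql.shuffle.partitions": "3000"},
}

def spark_conf_args(deployment_type):
    """Render the tuned properties as sorted --conf arguments."""
    props = SPARK_TUNED_PROPERTIES[deployment_type]
    return [f"--conf {key}={value}" for key, value in sorted(props.items())]

print("\n".join(spark_conf_args("Standard")))
```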
Tuning the Blaze Engine
Tune the Blaze engine to meet big data processing requirements on the Blaze engine. When you tune the Blaze engine, the Autotune command configures the Blaze advanced properties in the Hadoop connection.
The following table describes the Blaze properties that are tuned:
| Property | Description | Value |
|---|---|---|
| infagrid.orch.scheduler.oop.container.pref.memory | The amount of memory that the Blaze engine uses to run tasklets. | 5120 |
| infagrid.orch.scheduler.oop.container.pref.vcore | The number of DTM instances that run on the Blaze engine. | 4 |
| infagrid.tasklet.dtm.buffer.block.size | The amount of buffer memory that a DTM instance uses to move a block of data in a tasklet. | 6553600 |

Note: The tuned values do not depend on the deployment type.
Autotune
Configures services and connections with recommended settings based on the deployment type. The changes take effect after you restart the services.
For each service that you specify, the changes take effect on all nodes that are currently configured to run the service, and the changes affect all service processes.
The infacmd autotune Autotune command uses the following syntax:
Autotune
<-DomainName|-dn> domain_name
<-UserName|-un> user_name
<-Password|-pd> password
[<-SecurityDomain|-sdn> security_domain]
[<-ResilienceTimeout|-re> timeout_period_in_seconds]
<-Size|-s> tuning_size_name
[<-ServiceNames|-sn> service_names]
[<-BlazeConnectionNames|-bcn> connection_names]
[<-SparkConnectionNames|-scn> connection_names]
[<-All|-a> yes_or_no]
The infacmd program connects to the domain through the following common options: domain name, user name, password, security domain, and resilience timeout. The options table provides short descriptions. For more information about connecting to the domain, see the Command Reference.
The following table describes infacmd autotune Autotune options and arguments:
| Option | Description |
|---|---|
| -DomainName, -dn | Name of the Informatica domain. |
| -UserName, -un | User name to connect to the domain. |
| -Password, -pd | Password for the user name. |
| -SecurityDomain, -sdn | Name of the security domain to which the domain user belongs. |
| -ResilienceTimeout, -re | Amount of time in seconds that infacmd attempts to establish or re-establish a connection to the domain. |
| -Size, -s | Required. The deployment type that represents big data processing requirements based on concurrency and volume. You can enter Sandbox, Basic, Standard, or Advanced. |
| -ServiceNames, -sn | Optional. List of services that are configured in the Informatica domain. Separate each service name with a comma. You can tune the following services: Analyst Service, Content Management Service, Data Integration Service, Model Repository Service, Resource Manager Service, and Search Service. Default is None. |
| -BlazeConnectionNames, -bcn | Optional. List of Hadoop connections that are configured in the Informatica domain. For each Hadoop connection, the command tunes the Blaze configuration properties in the Hadoop connection. Separate each Hadoop connection name with a comma. Default is None. |
| -SparkConnectionNames, -scn | Optional. List of Hadoop connections that are configured in the Informatica domain. For each Hadoop connection, the command tunes the Spark configuration properties in the Hadoop connection. Separate each Hadoop connection name with a comma. Default is None. |
| -All, -a | Optional. Enter yes to apply the recommended settings to all Analyst Services, Content Management Services, Data Integration Services, Model Repository Services, Resource Manager Services, Search Services, and Hadoop connections in the Informatica domain. Enter no to apply the recommended settings only to the services and Hadoop connections that you specify. Default is no. |
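For example, the following command tunes one Data Integration Service and the Spark properties of one Hadoop connection for a Standard deployment. The domain, user, password, service, and connection names shown here are placeholders; replace them with the names used in your environment.

```shell
infacmd autotune Autotune \
    -dn MyDomain \
    -un Administrator \
    -pd MyPassword \
    -s Standard \
    -sn DataIntegrationService1 \
    -scn HadoopConnection1
```

Restart the tuned services for the changes to take effect.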