Spark session property | Description |
---|---|
infaspark.sql.forcePersist | Indicates whether data persists in memory to avoid repeated read operations. For example, the Router transformation can avoid repeated read operations on output groups. Default is false. |
spark.driver.extraJavaOptions | Additional JVM options for the Spark driver process. Default is -Djava.security.egd=file:/dev/./urandom -XX:MaxMetaspaceSize=256M -XX:+UseG1GC -XX:MaxGCPauseMillis=500. |
spark.driver.maxResultSize | Maximum total size of serialized results of all partitions for each Spark action. Default is 4G. |
spark.driver.memory | Amount of memory for the Spark driver process. Default is 4G. |
spark.dynamicAllocation.maxExecutors | Maximum number of Spark executors when dynamic allocation is enabled. The value is calculated automatically. Default is 1000. |
spark.executor.cores | Number of cores that each Spark executor uses. Default is 2. |
spark.executor.extraJavaOptions | Additional JVM options for Spark executors. Default is -Djava.security.egd=file:/dev/./urandom -XX:MaxMetaspaceSize=256M -XX:+UseG1GC -XX:MaxGCPauseMillis=500. |
spark.executor.memory | Amount of memory for each Spark executor. Default is 6G. |
spark.memory.fraction | Fraction of the heap that is allocated to the Spark engine. When set to 1, the Spark engine uses the full heap space except for 300 MB of reserved memory. Default is 0.6. |
spark.memory.storageFraction | Fraction of the memory set aside by spark.memory.fraction that the Spark engine reserves for storage rather than for processing data. Default is 0.5. |
spark.rdd.compress | Indicates whether to compress serialized RDD partitions. Default is false. |
spark.reducer.maxSizeInFlight | Maximum size of the data that each reduce task can fetch from a map task while shuffling data. This size acts as a network buffer that ensures the reduce task has enough memory for the shuffled data. Default is 48M. |
spark.shuffle.file.buffer | Size of the in-memory buffer that each map task uses to write the intermediate shuffle output. Default is 32K. |
spark.sql.autoBroadcastJoinThreshold | Threshold in bytes below which the Spark engine uses a broadcast join. When the Spark engine uses a broadcast join, the Spark driver sends the data to the Spark executors that run on the advanced cluster, which avoids shuffling and improves performance. In some situations, such as when a mapping task processes columnar formats or delimited files, broadcast joins can cause memory issues on the Spark driver. To resolve the issues, try reducing the broadcast join threshold to 10 MB, increasing the Spark driver memory, or disabling broadcast joins by setting the value to -1, as illustrated in the sketch after this table. Default is 256000000. |
spark.sql.broadcastTimeout | Timeout in seconds for broadcast joins. Default is 300. |
spark.sql.shuffle.partitions | Number of partitions that Spark uses to shuffle data to process joins or aggregations. Default is 100. |
spark.custom.property | Custom Spark session properties that do not appear in this table. Use &: to separate multiple properties, for example: spark.rdd.compress=true&:spark.broadcast.compress=true |
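
As a minimal sketch of how these properties map onto a plain Spark session, the following PySpark snippet applies a few of them to a local session and reads them back. The property names come from the table above; the values are illustrative assumptions, not tuning recommendations, and on an advanced cluster you set these values as Spark session properties in the task rather than in application code.

```python
from pyspark.sql import SparkSession

# A local sketch of the session properties above. The values are placeholders
# for experimentation, not recommended settings.
spark = (
    SparkSession.builder
    .appName("session-property-sketch")
    # Driver and executor sizing (spark.driver.memory, spark.executor.memory,
    # spark.executor.cores). These take effect only when the JVM is launched,
    # so they must be set before the first session is created.
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "6g")
    .config("spark.executor.cores", "2")
    # Unified memory region and its storage share
    # (spark.memory.fraction, spark.memory.storageFraction).
    .config("spark.memory.fraction", "0.6")
    .config("spark.memory.storageFraction", "0.5")
    # Join and shuffle tuning. 10 MB mirrors the tuning suggestion in the
    # table; -1 would disable broadcast joins entirely.
    .config("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))
    .config("spark.sql.shuffle.partitions", "100")
    .getOrCreate()
)

# Confirm the values that the session actually picked up.
for key in ("spark.sql.autoBroadcastJoinThreshold", "spark.sql.shuffle.partitions"):
    print(key, "=", spark.conf.get(key))

spark.stop()
```

Setting spark.sql.autoBroadcastJoinThreshold to 10 * 1024 * 1024 mirrors the 10 MB suggestion in the table for resolving driver memory issues, while lowering spark.sql.shuffle.partitions reduces the number of shuffle tasks for joins and aggregations.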