Spark tuning and routine maintenance

References:

See the Spark PDF

# Official Spark configuration reference:

http://spark.apache.org/docs/2.4.3/configuration.html

EMR configuration:

spark.executor.memory 6G
spark.driver.memory 8G
spark.driver.maxResultSize 8G

Reference settings at run time:
--num-executors 10
--driver-memory 10G
--executor-cores 2
--executor-memory 10G

spark.default.parallelism=200
spark.storage.memoryFraction=0.8
spark.network.timeout=600
spark.driver.maxResultSize=10G
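
As a rough illustration of how the options and properties above fit together, the following Python sketch assembles a spark-submit command line. The helper name build_spark_submit_args and the application file my_job.py are hypothetical; only the flag and property names and their values come from the notes above.

```python
def build_spark_submit_args(app, options, conf):
    """Assemble a spark-submit argument list (hypothetical helper)."""
    args = ["spark-submit"]
    # Command-line options such as --num-executors 10
    for flag, value in options.items():
        args += [f"--{flag}", str(value)]
    # Spark properties are passed as repeated --conf key=value pairs
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    # The application file goes last
    args.append(app)
    return args


options = {
    "num-executors": 10,
    "driver-memory": "10G",
    "executor-cores": 2,
    "executor-memory": "10G",
}
conf = {
    "spark.default.parallelism": 200,
    "spark.storage.memoryFraction": 0.8,
    "spark.network.timeout": 600,
    "spark.driver.maxResultSize": "10G",
}

# my_job.py is a placeholder application name, not part of the original notes
command = build_spark_submit_args("my_job.py", options, conf)
print(" ".join(command))
```

Passing the properties with --conf keeps them per job, whereas the EMR configuration above applies cluster-wide.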

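Enabled by default on EMR (upstream Spark defaults to false); leaving dynamic resource allocation on is generally the better choice.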
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true

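# Default is 1s: once tasks have been backlogged for more than 1 second, additional executors are requested. That is rather aggressive; a value such as 15s may be more reasonable.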
spark.dynamicAllocation.schedulerBacklogTimeout 1s
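
In Spark 2.4, dynamic allocation depends on the external shuffle service, which is why the two properties above are set together. A minimal sketch of a sanity check over a dict of properties might look like this; the helper check_dynamic_allocation is hypothetical, and it assumes upstream Spark defaults (false) for missing keys, which differs from EMR's defaults.

```python
def check_dynamic_allocation(conf):
    """Return a list of problems found in a dict of Spark properties.

    Hypothetical helper; missing keys are treated as "false",
    matching upstream Spark defaults rather than EMR defaults.
    """
    def is_true(key):
        return str(conf.get(key, "false")).lower() == "true"

    problems = []
    if is_true("spark.dynamicAllocation.enabled") and not is_true(
        "spark.shuffle.service.enabled"
    ):
        problems.append(
            "spark.dynamicAllocation.enabled=true needs "
            "spark.shuffle.service.enabled=true"
        )
    return problems


settings = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",
}
print(check_dynamic_allocation(settings))
```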

spark.sql.shuffle.partitions: the parallelism of shuffle read tasks in Spark SQL. The default is 200, which is somewhat too small for many workloads.

spark.default.parallelism: the default number of tasks per stage. Setting it to 2~3 times num-executors * executor-cores is a reasonable choice.
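
The 2~3x rule of thumb can be worked through with the run-time settings listed earlier (10 executors with 2 cores each). The function name suggested_parallelism is hypothetical, and the sketch only does the arithmetic.

```python
def suggested_parallelism(num_executors, executor_cores, factors=(2, 3)):
    """Return the (low, high) range for spark.default.parallelism.

    Hypothetical helper applying the 2~3x rule of thumb to the
    total number of executor cores.
    """
    total_cores = num_executors * executor_cores
    low_factor, high_factor = factors
    return total_cores * low_factor, total_cores * high_factor


# 10 executors * 2 cores = 20 cores in total
low, high = suggested_parallelism(num_executors=10, executor_cores=2)
print(low, high)  # 40 60
```

The value of 200 used in the run-time settings sits above this range, so the rule is best read as a starting point rather than a fixed bound.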

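spark.storage.memoryFraction: the fraction of executor memory that persisted (cached) RDD data may occupy; the default is 0.6. Since Spark 1.6 this is a legacy setting that is read only when spark.memory.useLegacyMode is true.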

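Shuffle tuning:
spark.shuffle.file.buffer 64k  If the job has plenty of memory available, this value can be increased moderately (for example to 64k; the default is 32k).
spark.reducer.maxSizeInFlight 96m  If the job has plenty of memory available, this value can be increased moderately (for example to 96m; the default is 48m).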

spark.shuffle.memoryFraction 0.3
spark.shuffle.io.maxRetries 5
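
The Spark documentation describes the maximum delay caused by shuffle fetch retries as maxRetries * retryWait. The sketch below works that out for the value above, assuming spark.shuffle.io.retryWait is left at its documented default of 5s (that property is not set in these notes); the function name is hypothetical.

```python
def max_retry_delay_seconds(max_retries, retry_wait_seconds=5):
    """Upper bound on the delay added by shuffle fetch retries.

    Hypothetical helper; retry_wait_seconds defaults to 5, the
    documented default of spark.shuffle.io.retryWait (an assumption,
    since the notes do not set it).
    """
    return max_retries * retry_wait_seconds


print(max_retry_delay_seconds(3))  # Spark default of 3 retries: 15
print(max_retry_delay_seconds(5))  # value used in these notes: 25
```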