Fixing slow job submission with Spark on YARN


Versions: Spark 2.0.0, Hadoop 2.7.2.

When submitting jobs in Spark on YARN mode, submission turned out to be very slow, with waits of several minutes.

Submitting a job in cluster mode:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 2g \
  --executor-cores 1 \
  --queue thequeue \
  examples/jars/spark-examples*.jar \
  10

The following warning appears:

17/02/08 18:26:23 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/02/08 18:26:29 INFO yarn.Client: Uploading resource file:/tmp/spark-91508860-fdda-4203-b733-e19625ef23a0/__spark_libs__4918922933506017904.zip -> hdfs://dbmtimehadoop/user/fuxin.zhao/.sparkStaging/application_1486451708427_0392/__spark_libs__4918922933506017904.zip

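After this log line, Spark uploads the jars the application depends on, which takes about 30 s and makes job submission extremely slow. The official documentation describes a fix: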

To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. 
For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, 
Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.

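In short: for the YARN side (the YARN nodes) to access Spark's runtime jars, either spark.yarn.archive or spark.yarn.jars must be set. If neither is set, Spark zips all the jars under $SPARK_HOME/jars/ and uploads the archive to the distributed cache. That is why job submission was so slow.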
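To get a feel for how much this fallback has to package and upload, the sketch below counts the jars in a directory. The function name count_jars is hypothetical and introduced only for illustration; it is not part of Spark.

```shell
# Hypothetical helper (not part of Spark): count the *.jar files directly
# inside a directory, e.g. "$SPARK_HOME/jars".
count_jars() {
  n=0
  for f in "$1"/*.jar; do
    # If nothing matches, the pattern stays literal and -f fails, so n stays 0.
    [ -f "$f" ] && n=$((n + 1))
  done
  echo "$n"
}

# Usage (assumes SPARK_HOME is set):
#   count_jars "$SPARK_HOME/jars"
#   du -sh "$SPARK_HOME/jars"     # total size of the jars that get zipped
```

The larger these numbers, the more work the fallback does before the application can start.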

The fix:
Upload the jars under $SPARK_HOME/jars/ that Spark needs at runtime to HDFS (-p creates any missing parent directories).

hadoop fs -mkdir -p hdfs://dbmtimehadoop/tmp/spark/lib_jars/
hadoop fs -put $SPARK_HOME/jars/* hdfs://dbmtimehadoop/tmp/spark/lib_jars/

vi $SPARK_HOME/conf/spark-defaults.conf
Add the following line:
spark.yarn.jars hdfs://dbmtimehadoop/tmp/spark/lib_jars/

Submitting the job again now fails with the following exception:

Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
	at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:500)

The logs reached from the ResourceManager show the following error: http://db-namenode01.host-mtime.com:19888/jobhistory/logs/db-datanode03.host-mtime.com:34545/container_e08_1486451708427_0346_02_000001/

Log Length: 191

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=256m; support was removed in 8.0
Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

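This means the earlier configuration was wrong: the value pointed at the directory itself rather than at the jar files in it, so the Spark jars were not loaded. After some experimenting, the following settings all turned out to work: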

# Works (comments are kept on their own lines; a properties file has no inline comments):
spark.yarn.jars                  hdfs://dbmtimehadoop/tmp/spark/lib_jars/*.jar
# Also works:
#spark.yarn.jars                  hdfs://dbmtimehadoop/tmp/spark/lib_jars/*
# Listing individual jars, separated by commas, also works:
#spark.yarn.jars                 hdfs://dbmtimehadoop/tmp/spark/lib_jars/activation-1.1.1.jar,hdfs://dbmtimehadoop/tmp/spark/lib_jars/antlr-2.7.7.jar,hdfs://dbmtimehadoop/tmp/spark/lib_jars/antlr4-runtime-4.5.3.jar,hdfs://dbmtimehadoop/tmp/spark/lib_jars/antlr-runtime-3.4.jar
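The pattern in these working values (every comma-separated entry is either a jar file or a glob, never a bare directory) can be checked before submitting. This is only a heuristic sketch based on the observations in this post, not Spark's own validation; the function name is_valid_yarn_jars is hypothetical.

```shell
# Hypothetical helper (not part of Spark): succeed only if every
# comma-separated entry ends in ".jar" (including "*.jar") or in "/*".
is_valid_yarn_jars() {
  value="$1"
  [ -n "$value" ] || return 1
  old_ifs="$IFS"
  IFS=','
  set -f                      # keep '*' literal while splitting on commas
  for entry in $value; do
    case "$entry" in
      *.jar|*/\*) ;;          # a jar file or a glob: accepted
      *) IFS="$old_ifs"; set +f; return 1 ;;   # e.g. a bare directory
    esac
  done
  IFS="$old_ifs"
  set +f
  return 0
}

# Usage:
#   is_valid_yarn_jars 'hdfs://dbmtimehadoop/tmp/spark/lib_jars/' || echo "bare directory"
```

The check deliberately ignores whether the files actually exist on HDFS; it only catches the bare-directory mistake described above.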
                                                               

Resubmitting the job now succeeds.
Log lines like the following show that the jars were picked up from HDFS:

17/02/08 19:28:21 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://dbmtimehadoop/tmp/spark/lib_jars/spark-mllib-local_2.11-2.0.0.jar
17/02/08 19:28:21 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://dbmtimehadoop/tmp/spark/lib_jars/spark-mllib_2.11-2.0.0.jar
17/02/08 19:28:21 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://dbmtimehadoop/tmp/spark/lib_jars/spark-network-common_2.11-2.0.0.jar
17/02/08 19:28:21 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://dbmtimehadoop/tmp/spark/lib_jars/spark-network-shuffle_2.11-2.0.0.jar
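To check this without reading the log by eye, one could capture the spark-submit output and search it for the fallback warning quoted earlier. The sketch below assumes the output was saved to a file; the function name used_fallback_upload and the file name submit.log are hypothetical.

```shell
# Hypothetical helper (not part of Spark): succeed if a captured spark-submit
# log contains the fallback warning, i.e. the jars were uploaded again.
used_fallback_upload() {
  grep -qF 'Neither spark.yarn.jars nor spark.yarn.archive is set' "$1"
}

# Usage, with submit.log as an assumed file name:
#   ./bin/spark-submit ... 2>&1 | tee submit.log
#   if used_fallback_upload submit.log; then echo "still uploading jars"; fi
```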

Original article: https://www.cnblogs.com/honeybee/p/6379599.html