Spark -- Running on the YARN Architecture [3]

Version Support

Support for running Spark on YARN was introduced in version 0.6 and further improved in the 0.7 and 0.8 releases.

Building a YARN-Enabled Assembly JAR

A consolidated Spark assembly JAR is needed to run Spark jobs on a YARN cluster. Build it by setting the following environment variables when invoking the assembly command:

SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

The assembled JAR will be generated at: ./assembly/target/scala-2.10/spark-assembly_0.9.0-incubating-hadoop2.0.5-alpha.jar.

The build process also supports newer YARN versions (2.2.x); see the notes on building for Hadoop/YARN 2.2.x below.

Preparations

  • Build the YARN-enabled assembly JAR (as described above).
  • Install the assembly JAR into HDFS (see the sketch after this list).
  • Package your application code into a separate JAR.
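
A minimal sketch of the HDFS installation step, where the target directory /user/spark/jars is an illustrative assumption and <SPARK_ASSEMBLY_JAR_FILE> stands for the assembly JAR built above:

# Assumption: the HDFS target directory /user/spark/jars is illustrative only
hadoop fs -mkdir /user/spark/jars
hadoop fs -put <SPARK_ASSEMBLY_JAR_FILE> /user/spark/jars/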

配置

The following settings are specific to running Spark on YARN.

Environment variables:

  • SPARK_YARN_USER_ENV, to add environment variables to the Spark processes launched on YARN. This can be a comma-separated list of environment variables, e.g. SPARK_YARN_USER_ENV="JAVA_HOME=/jdk64,FOO=bar".

System properties (one way to set them is sketched after this list):

  • spark.yarn.applicationMaster.waitTries, the number of times the ApplicationMaster waits for the Spark master (and then for the SparkContext to be initialized). Default is 10.
  • spark.yarn.submit.file.replication, the HDFS replication level for the files uploaded for the application, including the Spark JAR, the app JAR, and any distributed cache files.
  • spark.yarn.preserve.staging.files, set to true to preserve the staged files (Spark JAR, app JAR, distributed cache files) at the end of the job rather than delete them.
  • spark.yarn.scheduler.heartbeat.interval-ms, the interval in milliseconds at which the Spark ApplicationMaster heartbeats into the YARN ResourceManager. Default is 5 seconds.
  • spark.yarn.max.worker.failures, the maximum number of worker failures before failing the application. Default is the number of workers requested times 2, with a minimum of 3.
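
One common way to pass such properties in this generation of Spark is as JVM -D options through SPARK_JAVA_OPTS; a minimal sketch, where the chosen values are illustrative assumptions rather than recommendations:

# Assumption: the property values below are illustrative only
export SPARK_JAVA_OPTS="-Dspark.yarn.max.worker.failures=6 -Dspark.yarn.preserve.staging.files=true"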

运行spark于YARN

Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to connect to the cluster, write to the DFS, and submit jobs to the ResourceManager.
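
For example (the directory is an assumption; point it at wherever your cluster's client-side configs actually live):

# Assumption: /etc/hadoop/conf is a common but not universal location
export HADOOP_CONF_DIR=/etc/hadoop/conf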

There are two scheduler modes that can be used to launch a Spark application on YARN.

Launch a Spark application via the YARN Client in yarn-standalone mode.

The command to launch the YARN Client is as follows:

SPARK_JAR=<SPARK_ASSEMBLY_JAR_FILE> ./bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar <YOUR_APP_JAR_FILE> \
  --class <APP_MAIN_CLASS> \
  --args <APP_MAIN_ARGUMENTS> \
  --num-workers <NUMBER_OF_WORKER_MACHINES> \
  --master-class <ApplicationMaster_CLASS> \
  --master-memory <MEMORY_FOR_MASTER> \
  --worker-memory <MEMORY_PER_WORKER> \
  --worker-cores <CORES_PER_WORKER> \
  --name <application_name> \
  --queue <queue_name> \
  --addJars <any_local_files_used_in_SparkContext.addJar> \
  --files <files_for_distributed_cache> \
  --archives <archives_for_distributed_cache>

For example:

# Build the Spark assembly JAR and the Spark examples JAR
$ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

# Configure logging
$ cp conf/log4j.properties.template conf/log4j.properties

# Submit Spark's ApplicationMaster to YARN's ResourceManager, and instruct Spark to run the SparkPi example
$ SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.5-alpha.jar \
    ./bin/spark-class org.apache.spark.deploy.yarn.Client \
      --jar examples/target/scala-2.10/spark-examples-assembly-0.9.0-incubating.jar \
      --class org.apache.spark.examples.SparkPi \
      --args yarn-standalone \
      --num-workers 3 \
      --master-memory 4g \
      --worker-memory 2g \
      --worker-cores 1

# Examine the output (replace $YARN_APP_ID in the following with the "application identifier" output by the previous command)
# (Note: YARN_APP_LOGS_DIR is usually /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version.)
$ cat $YARN_APP_LOGS_DIR/$YARN_APP_ID/container*_000001/stdout
Pi is roughly 3.13794

The command above starts a YARN Client program, which launches the default ApplicationMaster; SparkPi then runs as a child thread of the ApplicationMaster. The YARN Client periodically polls the ApplicationMaster for status updates and displays them in the console, and it exits once the application has finished running. In this mode your application actually runs on a remote machine, so applications that involve local interaction, such as spark-shell, will not work well.
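
While the client is polling, the application can also be inspected from the YARN side with the standard YARN CLI; a sketch, where the application ID is a placeholder and yarn logs requires log aggregation to be enabled:

# Assumption: <YARN_APP_ID> is the application identifier printed by the client
yarn application -status <YARN_APP_ID>
yarn logs -applicationId <YARN_APP_ID>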

Launch a Spark application in yarn-client mode.

In yarn-client mode, the application runs locally, just like running spark-shell in Local / Mesos / Standalone mode. The launch method is also the same as for those modes; simply use "yarn-client" as the master URL. You also need to export the SPARK_JAR and SPARK_YARN_APP_JAR environment variables.

Configuration in yarn-client mode:

In order to tune worker cores, worker count, memory, etc., you need to export environment variables or add them to the Spark configuration file (./conf/spark-env.sh; a config-file sketch follows the examples below). The following options are available:

  • SPARK_YARN_APP_JAR, Path to your application’s JAR file (required)
  • SPARK_WORKER_INSTANCES, Number of workers to start (Default: 2)
  • SPARK_WORKER_CORES, Number of cores for the workers (Default: 1).
  • SPARK_WORKER_MEMORY, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
  • SPARK_MASTER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 512 MB)
  • SPARK_YARN_APP_NAME, The name of your application (Default: Spark)
  • SPARK_YARN_QUEUE, The hadoop queue to use for allocation requests (Default: ‘default’)
  • SPARK_YARN_DIST_FILES, Comma separated list of files to be distributed with the job.
  • SPARK_YARN_DIST_ARCHIVES, Comma separated list of archives to be distributed with the job.

For example:

SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.5-alpha.jar \
SPARK_YARN_APP_JAR=examples/target/scala-2.10/spark-examples-assembly-0.9.0-incubating.jar \
./bin/run-example org.apache.spark.examples.SparkPi yarn-client


SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.0-incubating-hadoop2.0.5-alpha.jar \
SPARK_YARN_APP_JAR=examples/target/scala-2.10/spark-examples-assembly-0.9.0-incubating.jar \
MASTER=yarn-client ./bin/spark-shell
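
Alternatively, these variables can be placed in conf/spark-env.sh so they apply to every launch; a sketch, where all paths and resource sizes are placeholders:

# Assumption: paths and resource sizes below are placeholders for illustration
export SPARK_JAR=/path/to/spark-assembly-0.9.0-incubating-hadoop2.0.5-alpha.jar
export SPARK_YARN_APP_JAR=/path/to/your-app.jar
export SPARK_WORKER_INSTANCES=3
export SPARK_WORKER_MEMORY=2g
export SPARK_WORKER_CORES=1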

Building Spark for Hadoop/YARN 2.2.x

See Building Spark with Maven for instructions on how to build Spark using the Maven process.
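
For reference, a YARN 2.2.x Maven build looks roughly like the following; the profile and version flags follow the "Building Spark with Maven" guide and may need adjusting for your environment:

# Assumption: flags taken from the Maven build guide for Hadoop/YARN 2.2.x
mvn -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests clean package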

Important Notes

  • YARN versions before Hadoop 2.2 do not support requesting container resources based on the number of cores. Thus, when running against an earlier version, the number of cores given on the command line cannot be passed on to YARN. Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured.
  • The local directories used by Spark on YARN are the local directories configured for YARN (the Hadoop YARN setting yarn.nodemanager.local-dirs). If the user specifies spark.local.dir, it will be ignored.
  • The --files and --archives options support specifying file names with # similar to Hadoop. For example, --files localtest.txt#appSees.txt uploads the file locally named localtest.txt into HDFS, but it is linked to by the name appSees.txt, and your application should use the name appSees.txt when running on YARN (see the example after this list).
  • The --addJars option allows the SparkContext.addJar function to work with local files. It is not needed when using HDFS, HTTP, HTTPS, or FTP files.
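
A combined sketch of the last two points, reusing the Client invocation shown earlier; localtest.txt and mylib.jar are placeholder names:

# Assumption: localtest.txt and mylib.jar are placeholders for your own files
SPARK_JAR=<SPARK_ASSEMBLY_JAR_FILE> ./bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar <YOUR_APP_JAR_FILE> \
  --class <APP_MAIN_CLASS> \
  --args yarn-standalone \
  --num-workers 3 \
  --master-memory 4g \
  --worker-memory 2g \
  --files localtest.txt#appSees.txt \
  --addJars mylib.jar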
Source: https://www.cnblogs.com/AI001/p/3996852.html