Spark Notes: Learning spark-submit

# View help: ./bin/spark-submit --help or ./bin/spark-shell --help

Usage 1: spark-submit [options] <app jar | python file> [app arguments]
Usage 2: spark-submit --kill [submission ID] --master [spark://...]
Usage 3: spark-submit --status [submission ID] --master [spark://...]

Options:

--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.

--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client).

--class CLASS_NAME Your application's main class (for Java / Scala apps).

--name NAME A name of your application.

--jars JARS Comma-separated list of local jars to include on the driver and executor classpaths.

--packages Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while resolving the dependencies provided in --packages to avoid dependency conflicts.
--repositories Comma-separated list of additional remote repositories to search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working directory of each executor.

--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf.

--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath.

--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).

--proxy-user NAME User to impersonate when submitting the application.

--help, -h Show this help message and exit
--verbose, -v Print additional debug output
--version Print the version of current Spark

Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).

Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.

Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.

Spark standalone and YARN only:
--executor-cores NUM Number of cores per executor. (Default: 1 in YARN mode, or all available cores on the worker in standalone mode)

YARN-only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode (Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
--archives ARCHIVES Comma-separated list of archives to be extracted into the working directory of each executor.
--principal PRINCIPAL Principal to be used to login to KDC, while running on secure HDFS.
--keytab KEYTAB The full path to the file that contains the keytab for the principal specified above. This keytab will be copied to the node running the Application Master via the Secure Distributed Cache, for renewing the login tickets and the delegation tokens periodically.

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

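Some commonly used options are:

--class: The entry point of your application (e.g. org.apache.spark.examples.SparkPi)
--master: The master URL of the cluster (e.g. spark://23.195.26.187:7077)
--deploy-mode: Whether to deploy the driver on a worker node (cluster) or locally as an external client (client) (default: client)
--conf: An arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes.
application-jar: Path to a bundled jar that includes your application and all of its dependencies. The URL must be globally visible inside the cluster, for example an hdfs:// path or a file:// path that exists on all nodes.
application-arguments: Arguments passed to the main method of your main class, if any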
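The ordering of these pieces can be sketched as a small Python helper that assembles the argument list. This is only an illustration: `build_spark_submit_argv` and its parameter names are hypothetical and not part of Spark, and the default path `./bin/spark-submit` simply mirrors the commands in these notes.

```python
import shlex
from typing import Dict, List, Optional, Sequence


def build_spark_submit_argv(
    app_jar: str,
    master: str,
    main_class: Optional[str] = None,
    deploy_mode: str = "client",  # matches the documented default
    conf: Optional[Dict[str, str]] = None,
    app_args: Sequence[str] = (),
    spark_submit: str = "./bin/spark-submit",  # assumed launcher path
) -> List[str]:
    """Assemble options first, then the application jar, then its arguments."""
    argv = [spark_submit]
    if main_class:
        # Only Java / Scala applications need --class.
        argv += ["--class", main_class]
    argv += ["--master", master, "--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        # Each property is passed as a single "key=value" argument.
        argv += ["--conf", f"{key}={value}"]
    argv.append(app_jar)
    argv += [str(arg) for arg in app_args]
    return argv


argv = build_spark_submit_argv(
    app_jar="/path/to/examples.jar",
    master="local[8]",
    main_class="org.apache.spark.examples.SparkPi",
    app_args=["100"],
)
# shlex.join quotes arguments where needed so the line can be pasted into a shell.
print(shlex.join(argv))
```

Passing the list form to `subprocess.run(argv)` avoids shell quoting issues altogether, since each element reaches spark-submit as exactly one argument.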

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster in cluster deploy mode (use --deploy-mode client for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000

Master URLs

The master URL passed to Spark can be in one of the following formats:

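Master URL	Meaning
`local`	Run Spark locally with one worker thread (i.e. no parallelism at all).
`local[K]`	Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
`local[K,F]`	Run Spark locally with K worker threads and F maxFailures (see spark.task.maxFailures for an explanation of this variable).
`local[*]`	Run Spark locally with as many worker threads as there are logical cores on your machine.
`local[*,F]`	Run Spark locally with as many worker threads as there are logical cores on your machine, and F maxFailures.
`spark://HOST:PORT`	Connect to the given Spark standalone cluster master. The port must be the one your master is configured to use, which is 7077 by default.
`spark://HOST1:PORT1,HOST2:PORT2`	Connect to the given Spark standalone cluster with standby masters managed by ZooKeeper. The list must contain all the master hosts in the high-availability cluster set up with ZooKeeper. The port must be the one each master is configured to use, which is 7077 by default.
`mesos://HOST:PORT`	Connect to the given Mesos cluster. The port must be the one you have configured, which is 5050 by default. Alternatively, for a Mesos cluster using ZooKeeper, use `mesos://zk://...`. To submit with `--deploy-mode cluster`, HOST:PORT should be configured to connect to the MesosClusterDispatcher.
`yarn`	Connect to a YARN cluster in `client` or `cluster` mode, depending on the value of `--deploy-mode`. The cluster location is found from the `HADOOP_CONF_DIR` or `YARN_CONF_DIR` variable.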
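As a rough illustration of how the formats in the table differ, the following Python sketch classifies a master URL and, for the local variants, extracts the thread count and maxFailures. The function names `classify_master` and `parse_local` are made up for this example, and the regular expression only approximates the table; it is not Spark's own parser.

```python
import re
from typing import Optional, Tuple

# Matches local[K], local[K,F], local[*] and local[*,F] from the table above.
_LOCAL_PATTERN = re.compile(r"local\[(\*|\d+)(?:,(\d+))?\]")


def classify_master(master: str) -> str:
    """Return which kind of cluster manager a master URL refers to."""
    if master == "yarn":
        return "yarn"
    if master.startswith("spark://"):
        return "standalone"
    if master.startswith("mesos://"):
        return "mesos"
    if master == "local" or _LOCAL_PATTERN.fullmatch(master):
        return "local"
    raise ValueError(f"unrecognized master URL: {master!r}")


def parse_local(master: str, logical_cores: int) -> Tuple[int, Optional[int]]:
    """Return (worker_threads, max_failures) for a local master URL.

    max_failures is None when the URL does not specify F.
    """
    if master == "local":
        return 1, None  # a single worker thread, no parallelism
    match = _LOCAL_PATTERN.fullmatch(master)
    if match is None:
        raise ValueError(f"not a local master URL: {master!r}")
    threads_text, failures_text = match.groups()
    threads = logical_cores if threads_text == "*" else int(threads_text)
    failures = int(failures_text) if failures_text is not None else None
    return threads, failures


print(classify_master("spark://HOST1:7077,HOST2:7077"))
print(parse_local("local[*,3]", logical_cores=8))
```

In real use, `logical_cores` could come from `os.cpu_count()`; it is a parameter here so the behavior is easy to check.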

./bin/spark-submit --name "My app" --master local[4] --conf spark.eventLog.enabled=false --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
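The quotes in the command above keep each value containing spaces together as one argument. A minimal Python sketch using the standard `shlex` module, which splits a string by POSIX shell quoting rules, shows the difference; the helper `conf_pairs` is hypothetical and exists only for this illustration.

```python
import shlex
from typing import Dict, List

command = (
    './bin/spark-submit --name "My app" --master local[4] '
    "--conf spark.eventLog.enabled=false "
    '--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" '
    "myApp.jar"
)


def conf_pairs(argv: List[str]) -> Dict[str, str]:
    """Collect the key=value argument that follows each --conf flag."""
    pairs = {}
    for flag, value in zip(argv, argv[1:]):
        if flag == "--conf":
            # Split on the first "=" only; the value may contain more of them.
            key, _, val = value.partition("=")
            pairs[key] = val
    return pairs


quoted = shlex.split(command)
# Dropping the quotes shows what would go wrong without them.
unquoted = shlex.split(command.replace('"', ""))

print(conf_pairs(quoted)["spark.executor.extraJavaOptions"])
print(conf_pairs(unquoted)["spark.executor.extraJavaOptions"])
```

With the quotes, both JVM flags belong to the one property value. Without them, the second flag becomes a separate argument and is no longer part of the --conf value.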

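When submitting a Spark application with spark-submit, keep the following points in mind:

When a client machine outside the cluster deploys a Spark application to Spark Standalone, set up passwordless SSH login between that client and the Spark Standalone cluster beforehand.
When deploying a Spark application to YARN, pay attention to the size of --executor-memory (default 1G): that amount plus the additional memory the container itself needs must not exceed the memory available to the NodeManager (NM). Otherwise no container can be allocated to run the executor.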
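The YARN memory check can be sketched as simple arithmetic. The sketch below assumes the per-container extra memory follows Spark's documented default for the executor memory overhead setting, the larger of 384 MiB and 10% of the executor memory; your cluster may override this. The NodeManager figures used in the example are hypothetical, the function names are made up for illustration, and any rounding YARN applies to container sizes is ignored.

```python
def parse_mem_mib(text: str) -> int:
    """Convert strings such as '1000M' or '2G' to MiB."""
    units = {"M": 1, "G": 1024}
    text = text.strip().upper()
    return int(text[:-1]) * units[text[-1]]


def container_request_mib(
    executor_memory: str,
    overhead_percent: int = 10,   # assumed default overhead factor
    min_overhead_mib: int = 384,  # assumed default minimum overhead
) -> int:
    """Executor memory plus overhead: the size requested for one container."""
    heap = parse_mem_mib(executor_memory)
    overhead = max(min_overhead_mib, heap * overhead_percent // 100)
    return heap + overhead


def fits_on_node(executor_memory: str, node_manager_mib: int) -> bool:
    """True if one executor container fits in the NodeManager's memory."""
    return container_request_mib(executor_memory) <= node_manager_mib


# Hypothetical NodeManager with 8 GiB available for containers.
print(container_request_mib("1G"), fits_on_node("1G", 8192))
print(container_request_mib("20G"), fits_on_node("20G", 8192))
```

Under these assumptions, `--executor-memory 20G` as used in the examples above would need a NodeManager offering well over 20 GiB to containers before YARN could place even one executor.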