Spark-Standalone

  1. Security: disabled by default
  2. Starting a cluster manually:
    • Run ./sbin/start-master.sh; once started, the master prints a spark://HOST:PORT URL that workers use to connect to it
    • Monitoring: http://localhost:8080/ by default
    • Start a worker and connect it to the master with ./sbin/start-slave.sh <master-spark-URL> (a combined example follows the argument table below)
    • Launch-script arguments (argument: meaning):
      -h HOST, --host HOST: Hostname to listen on
      -i HOST, --ip HOST: Hostname to listen on (deprecated, use -h or --host)
      -p PORT, --port PORT: Port for service to listen on (default: 7077 for master, random for worker)
      --webui-port PORT: Port for web UI (default: 8080 for master, 8081 for worker)
      -c CORES, --cores CORES: Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker
      -m MEM, --memory MEM: Total amount of memory to allow Spark applications to use on the machine, in a format like 1000M or 2G (default: your machine's total RAM minus 1 GB); only on worker
      -d DIR, --work-dir DIR: Directory to use for scratch space and job output logs (default: SPARK_HOME/work); only on worker
      --properties-file FILE: Path to a custom Spark properties file to load (default: conf/spark-defaults.conf)
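    • Putting the scripts and flags together, a minimal sketch of starting a one-master, one-worker cluster by hand; the hostname master-host and the resource caps are placeholder assumptions:
      # on the master machine: start the master bound to an explicit hostname
      ./sbin/start-master.sh --host master-host --port 7077 --webui-port 8080

      # on each worker machine: connect to the URL printed by the master,
      # capping the cores and memory Spark may use on this box
      ./sbin/start-slave.sh spark://master-host:7077 --cores 4 --memory 8G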
    • Set up the conf/slaves file (one worker hostname per line)
    • Set up conf/spark-env.sh (a sample file follows the variable list below)
      • Environment variables (variable: meaning):
        SPARK_MASTER_HOST: Bind the master to a specific hostname or IP address, for example a public one.
        SPARK_MASTER_PORT: Start the master on a different port (default: 7077).
        SPARK_MASTER_WEBUI_PORT: Port for the master web UI (default: 8080).
        SPARK_MASTER_OPTS: Configuration properties that apply only to the master, in the form "-Dx=y" (default: none). See below for a list of possible options.
        SPARK_LOCAL_DIRS: Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.
        SPARK_WORKER_CORES: Total number of cores to allow Spark applications to use on the machine (default: all available cores).
        SPARK_WORKER_MEMORY: Total amount of memory to allow Spark applications to use on the machine, e.g. 1000m, 2g (default: total memory minus 1 GB); note that each application's individual memory is configured using its spark.executor.memory property.
        SPARK_WORKER_PORT: Start the Spark worker on a specific port (default: random).
        SPARK_WORKER_WEBUI_PORT: Port for the worker web UI (default: 8081).
        SPARK_WORKER_DIR: Directory to run applications in, which will include both logs and scratch space (default: SPARK_HOME/work).
        SPARK_WORKER_OPTS: Configuration properties that apply only to the worker, in the form "-Dx=y" (default: none). See below for a list of possible options.
        SPARK_DAEMON_MEMORY: Memory to allocate to the Spark master and worker daemons themselves (default: 1g).
        SPARK_DAEMON_JAVA_OPTS: JVM options for the Spark master and worker daemons themselves, in the form "-Dx=y" (default: none).
        SPARK_DAEMON_CLASSPATH: Classpath for the Spark master and worker daemons themselves (default: none).
        SPARK_PUBLIC_DNS: The public DNS name of the Spark master and workers (default: none).
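      • A minimal conf/spark-env.sh sketch using a few of the variables above; the hostname, directory, and resource values are illustrative assumptions, not recommendations:
        # conf/spark-env.sh -- sourced by the launch scripts on every node
        export SPARK_MASTER_HOST=master-host      # bind the master to this hostname
        export SPARK_MASTER_PORT=7077             # master RPC port (default)
        export SPARK_MASTER_WEBUI_PORT=8080       # master web UI port (default)
        export SPARK_WORKER_CORES=4               # cores Spark may use on each worker
        export SPARK_WORKER_MEMORY=8g             # memory Spark may use on each worker
        export SPARK_WORKER_DIR=/data/spark/work  # scratch space and application logs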
    • SPARK_MASTER_OPTS
      • Supported properties (property name, default, meaning):
        spark.deploy.retainedApplications (default: 200): The maximum number of completed applications to display. Older applications will be dropped from the UI to maintain this limit.
        spark.deploy.retainedDrivers (default: 200): The maximum number of completed drivers to display. Older drivers will be dropped from the UI to maintain this limit.
        spark.deploy.spreadOut (default: true): Whether the standalone cluster manager should spread applications out across nodes or try to consolidate them onto as few nodes as possible. Spreading out is usually better for data locality in HDFS, but consolidating is more efficient for compute-intensive workloads.
        spark.deploy.defaultCores (default: infinite): Default number of cores to give to applications in Spark's standalone mode if they don't set spark.cores.max. If not set, applications always get all available cores unless they configure spark.cores.max themselves. Set this lower on a shared cluster to prevent users from grabbing the whole cluster by default.
        spark.deploy.maxExecutorRetries (default: 10): Limit on the maximum number of back-to-back executor failures that can occur before the standalone cluster manager removes a faulty application. An application will never be removed if it has any running executors. If an application experiences more than spark.deploy.maxExecutorRetries failures in a row, no executors successfully start running in between those failures, and the application has no running executors, then the standalone cluster manager will remove the application and mark it as failed. To disable this automatic removal, set spark.deploy.maxExecutorRetries to -1.
        spark.worker.timeout (default: 60): Number of seconds after which the standalone deploy master considers a worker lost if it receives no heartbeats.
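      • For example, a sketch applying a couple of these master-only properties via SPARK_MASTER_OPTS in conf/spark-env.sh; the values shown are assumptions for illustration:
        # master-only tuning, passed to the master JVM as system properties
        export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4 \
          -Dspark.deploy.retainedApplications=100"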
    • SPARK_WORKER_OPTS
      • Supported properties (property name, default, meaning):
        spark.worker.cleanup.enabled (default: false): Enable periodic cleanup of worker/application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
        spark.worker.cleanup.interval (default: 1800, i.e. 30 minutes): Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
        spark.worker.cleanup.appDataTtl (default: 604800, i.e. 7 days): The number of seconds to retain application work directories on each worker. This is a time to live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
        spark.storage.cleanupFilesAfterExecutorExit (default: true): Enable cleanup of non-shuffle files (such as temporary shuffle blocks, cached RDD/broadcast blocks, spill files, etc.) in worker directories after executors exit. Note that this doesn't overlap with spark.worker.cleanup.enabled, as this enables cleanup of non-shuffle files in the local directories of a dead executor, while spark.worker.cleanup.enabled enables cleanup of all files/subdirectories of a stopped and timed-out application. This only affects standalone mode; support for other cluster managers can be added in the future.
        spark.worker.ui.compressedLogFileLengthCacheSize (default: 100): For compressed log files, the uncompressed file size can only be computed by uncompressing the files. Spark caches the uncompressed file size of compressed log files. This property controls the cache size.
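      • Similarly, a sketch enabling the periodic cleanup described above via SPARK_WORKER_OPTS in conf/spark-env.sh; the interval and TTL values simply repeat the defaults and are shown for illustration:
        # worker-only settings: clean up old application work dirs automatically
        export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
          -Dspark.worker.cleanup.interval=1800 \
          -Dspark.worker.cleanup.appDataTtl=604800"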
  3. Connecting to the cluster:
    ./bin/spark-shell --master spark://IP:PORT
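    Besides spark-shell, an application can be submitted against the same master URL with spark-submit; a minimal sketch in which the main class and jar path are placeholders:
    # submit an application to the standalone master (client deploy mode by default)
    ./bin/spark-submit \
      --master spark://IP:PORT \
      --class com.example.MyApp \
      path/to/my-app.jar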
  4. Restarting on failure: in standalone mode, pass --supervise to spark-submit so the master restarts the driver if it exits with a non-zero exit code; to kill an application that keeps failing, use the following command:
    ./bin/spark-class org.apache.spark.deploy.Client kill <master url> <driver ID>
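    A sketch of the whole flow; the hostname, class, jar path, and driver ID are placeholders:
    # submit in cluster deploy mode with driver supervision
    ./bin/spark-submit \
      --master spark://master-host:7077 \
      --deploy-mode cluster \
      --supervise \
      --class com.example.MyApp \
      path/to/my-app.jar

    # if the driver keeps failing, kill it by the driver ID shown on the master web UI
    ./bin/spark-class org.apache.spark.deploy.Client kill spark://master-host:7077 driver-20190801123456-0001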
  5. Resource configuration
    1. Application (default cores per application):
      export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=<value>"
    2. Executor: spark.executor.cores
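    An application's resources can also be capped at submit time; a sketch with illustrative values (the class, jar path, and hostname are placeholders):
      # limit this application to 8 cores in total, 2 cores and 2 GB per executor
      ./bin/spark-submit \
        --master spark://master-host:7077 \
        --conf spark.cores.max=8 \
        --conf spark.executor.cores=2 \
        --executor-memory 2g \
        --class com.example.MyApp \
        path/to/my-app.jar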
  6. Monitoring: master web UI on port 8080 by default
  7. Logs: under SPARK_HOME/work by default, split into stdout and stderr files per executor
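    For example, on a worker machine the executor logs sit in one directory per application and executor; a sketch assuming the default work dir (the application ID is a placeholder):
    # stdout/stderr of executor 0 of one application
    ls $SPARK_HOME/work/app-20190801123456-0000/0/
    # -> stderr  stdout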
  8. Interacting with Hadoop: use an hdfs:// URL, e.g. hdfs://<namenode>:9000/path
  9. HA: run standby masters coordinated through ZooKeeper (a sketch follows)
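    A sketch of the usual ZooKeeper-based setup: every master points at the same ZooKeeper ensemble via SPARK_DAEMON_JAVA_OPTS, and applications list all masters in the master URL; the hostnames are placeholder assumptions:
    # on every master node, in conf/spark-env.sh
    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"

    # start a master on each of the master nodes, then connect listing them all
    ./sbin/start-master.sh
    ./bin/spark-shell --master spark://master1:7077,master2:7077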
Original post: https://www.cnblogs.com/liudingchao/p/11269606.html