Big Data Platform Learning (3) -- Spark Core Programming

Spark architecture principles

 

Architecture diagram:

Creating RDDs

1. Create an RDD from a collection in the program. This is mainly used for testing: before actually deploying to a cluster, you can build test data from a collection to exercise the downstream flow of the Spark application (all three ways are sketched below).

2. Create an RDD from a local file, mainly for temporarily processing files on the local machine that hold large amounts of data.

3. Create an RDD from an HDFS file, mainly for offline batch processing of big data stored on HDFS.
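A minimal sketch of the three creation methods (the file paths and host names below are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("CreateRDDExamples").setMaster("local[*]")
val sc = new SparkContext(conf)

// 1. From an in-program collection -- convenient for testing the rest of the flow
val fromCollection = sc.parallelize(1 to 100, numSlices = 4)

// 2. From a local file -- for ad-hoc processing of data on the local machine
val fromLocalFile = sc.textFile("file:///tmp/sample.txt")

// 3. From an HDFS file -- for offline batch processing of data stored on HDFS
val fromHdfs = sc.textFile("hdfs://namenode:9000/user/data/sample.txt")

println(fromCollection.reduce(_ + _))   // 5050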

Operating on RDDs

Transformations:

Actions:
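A minimal illustration of the difference, assuming the SparkContext sc from the sketch above: transformations are lazy and only extend the RDD lineage, while an action submits a job and actually runs the computation.

val nums    = sc.parallelize(Seq(1, 2, 3, 4, 5))
val doubled = nums.map(_ * 2)             // transformation: nothing is executed yet
val big     = doubled.filter(_ >= 6)      // transformation: still just building the lineage
val total   = big.reduce(_ + _)           // action: a job is submitted and executed
println(total)                            // 6 + 8 + 10 = 24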

RDD persistence

Why persist an RDD?

What persistence strategies are available?
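As a hedged example (again assuming an existing SparkContext sc; the path is hypothetical): persisting an RDD avoids recomputing it for every action, and the chosen StorageLevel decides whether the data lives in memory, on disk, serialized, and/or replicated.

import org.apache.spark.storage.StorageLevel

val lines  = sc.textFile("hdfs://namenode:9000/logs/app.log")        // path is hypothetical
val errors = lines.filter(_.contains("ERROR"))
                  .persist(StorageLevel.MEMORY_ONLY)                  // cache() is shorthand for this

println(errors.count())                                  // first action: computes and caches the partitions
println(errors.filter(_.contains("timeout")).count())    // later actions reuse the cached partitions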

In-depth analysis of Spark internals and source code

Overview diagram:

Wide dependencies and narrow dependencies

 

The two submission modes on YARN (yarn-client and yarn-cluster)

SparkContext internals

Master internals and source code analysis

The resource scheduling mechanism is very important (there are two resource scheduling modes; a simplified sketch follows the scheduling steps listed below)

Master failover (active/standby switchover) diagram:

Registration mechanism diagram:

Master resource scheduling algorithm:

The Master performs resource scheduling through its schedule() method and tells the Workers to launch Executors:

1. schedule()

2. startExecutorsOnWorkers(): starts executor processes on the workers

3. scheduleExecutorsOnWorkers(): schedules resources on each worker; it decides whether a worker can host one or more executors and, if so, assigns the CPU cores each executor needs

4. allocateWorkerResourceToExecutors(): allocates the concrete resources on a worker

5. launchDriver(): launches the driver

6. launchExecutor(): launches the executor
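The two modes mentioned above correspond to the standalone Master's spark.deploy.spreadOut setting: either spread an application's executors across as many workers as possible, or pack them onto as few workers as possible. The following is only a simplified sketch of the idea, not the actual Master code.

object SchedulingSketch {
  // workerFreeCores(i) = free cores on worker i; returns the cores assigned to each worker.
  def assignCores(workerFreeCores: Array[Int], coresNeeded: Int, spreadOut: Boolean): Array[Int] = {
    val assigned = Array.fill(workerFreeCores.length)(0)
    var remaining = coresNeeded

    def canUse(i: Int): Boolean = workerFreeCores(i) - assigned(i) > 0

    if (spreadOut) {
      // spreadOut = true: hand out one core at a time, round-robin, so the
      // application ends up spread over as many workers as possible.
      var i = 0
      while (remaining > 0 && workerFreeCores.indices.exists(canUse)) {
        if (canUse(i)) { assigned(i) += 1; remaining -= 1 }
        i = (i + 1) % workerFreeCores.length
      }
    } else {
      // spreadOut = false: fill each worker completely before moving to the next,
      // packing the application onto as few workers as possible.
      var i = 0
      while (remaining > 0 && i < workerFreeCores.length) {
        val take = math.min(remaining, workerFreeCores(i) - assigned(i))
        assigned(i) += take
        remaining -= take
        i += 1
      }
    }
    assigned
  }
}

// Example: 10 cores requested from three workers with 8 free cores each
// spreadOut = true  -> Array(4, 3, 3)     spreadOut = false -> Array(8, 2, 0)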

Worker internals:

Job trigger flow:

Ultimately this calls the runJob method of the DAGScheduler that was created when the SparkContext was initialized;

DAGScheduler internals:

Note: reduceByKey combines the values of identical keys so that each key ends up with a single record; it operates on RDDs of (key, value) pairs;
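For example (assuming an existing SparkContext sc):

val pairs  = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 2), ("b", 3)))
val summed = pairs.reduceByKey(_ + _)      // one record per key; triggers a shuffle (wide dependency)
summed.collect().foreach(println)          // (a,3) and (b,4), in some order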

TaskScheduler internals

Note: the locality levels are:

PROCESS_LOCAL: process-local; the RDD partition and the task are in the same executor;

NODE_LOCAL: node-local; the RDD partition and the task are not in the same executor, but are on the same node;

NO_PREF: no locality preference;

RACK_LOCAL: rack-local; the RDD partition and the task are at least on the same rack;

ANY: any location;
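The TaskScheduler waits a while at each locality level before degrading to the next one; the waits are configurable (the values below are only illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "3s")          // base wait before falling back to a worse locality level
  .set("spark.locality.wait.process", "3s")  // PROCESS_LOCAL -> NODE_LOCAL
  .set("spark.locality.wait.node", "3s")     // NODE_LOCAL -> RACK_LOCAL
  .set("spark.locality.wait.rack", "3s")     // RACK_LOCAL -> ANY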

Executor internals and source code analysis

1. The executor that the Worker starts for an application is actually a CoarseGrainedExecutorBackend process;

2. It obtains the driver's actor and sends it a RegisterExecutor message;

3. After the driver successfully registers the executor, it replies with a RegisteredExecutor message (at this point CoarseGrainedExecutorBackend creates the Executor handle; most of the functionality is implemented by Executor);

4. To start a task, the serialized task is deserialized and the Executor's launchTask() is called;

5. For each task a TaskRunner is created and put into an in-memory cache; the task is wrapped in a TaskRunner thread, which is handed to a thread pool (the thread pool queues work automatically), as sketched below.
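A minimal sketch of the "wrap the task in a runnable, cache it, hand it to a thread pool" pattern described in points 4 and 5 (illustrative code only, not the actual Executor implementation):

import java.util.concurrent.{ConcurrentHashMap, Executors}

class TaskRunnerSketch(val taskId: Long, body: () => Unit) extends Runnable {
  override def run(): Unit = body()          // the real TaskRunner deserializes and runs the Task here
}

object ExecutorSketch {
  private val threadPool   = Executors.newCachedThreadPool()            // queues work automatically
  private val runningTasks = new ConcurrentHashMap[Long, TaskRunnerSketch]()

  def launchTask(taskId: Long, body: () => Unit): Unit = {
    val runner = new TaskRunnerSketch(taskId, body)
    runningTasks.put(taskId, runner)   // kept in an in-memory cache, as point 5 describes
    threadPool.execute(runner)         // the pool runs it, queueing when all threads are busy
  }
}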

 

Task internals

1. A task is wrapped in a thread (TaskRunner).

2. Its run() method deserializes the serialized task data, then fetches the required files, resources and jar files over the network by calling updateDependencies(), and finally deserializes the whole Task object;

override def run(): Unit = {
  val threadMXBean = ManagementFactory.getThreadMXBean
  val taskMemoryManager = new TaskMemoryManager(env.memoryManager, taskId)
  val deserializeStartTime = System.currentTimeMillis()
  val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
    threadMXBean.getCurrentThreadCpuTime
  } else 0L
  Thread.currentThread.setContextClassLoader(replClassLoader)
  val ser = env.closureSerializer.newInstance()
  logInfo(s"Running $taskName (TID $taskId)")
  execBackend.statusUpdate(taskId, TaskState.RUNNING, EMPTY_BYTE_BUFFER)
  var taskStart: Long = 0
  var taskStartCpu: Long = 0
  startGCTime = computeTotalGcTime()

  try {
    // Deserialize the serialized task data
    val (taskFiles, taskJars, taskProps, taskBytes) =
      Task.deserializeWithDependencies(serializedTask)

    // Must be set before updateDependencies() is called, in case fetching dependencies
    // requires access to properties contained within (e.g. for access control).
    Executor.taskDeserializationProps.set(taskProps)
    // Fetch the required files, resources and jars over the network
    updateDependencies(taskFiles, taskJars)
    // Deserialize the whole Task object
    task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader)
    task.localProperties = taskProps
    task.setTaskMemoryManager(taskMemoryManager)

    // If this task has been killed before we deserialized it, let's quit now. Otherwise,
    // continue executing the task.
    if (killed) {
      // Throw an exception rather than returning, because returning within a try{} block
      // causes a NonLocalReturnControl exception to be thrown. The NonLocalReturnControl
      // exception will be caught by the catch block, leading to an incorrect ExceptionFailure
      // for the task.
      throw new TaskKilledException
    }

    logDebug("Task " + taskId + "'s epoch is " + task.epoch)
    env.mapOutputTracker.updateEpoch(task.epoch)

    // Run the actual task and measure its runtime.
    // Record the task start time
    taskStart = System.currentTimeMillis()
    taskStartCpu = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
      threadMXBean.getCurrentThreadCpuTime
    } else 0L
    var threwException = true
    val value = try {
      // Call the task's run()
      val res = task.run(
        taskAttemptId = taskId,
        attemptNumber = attemptNumber,
        metricsSystem = env.metricsSystem)
      threwException = false
      res
    } finally {
      val releasedLocks = env.blockManager.releaseAllLocksForTask(taskId)
      val freedMemory = taskMemoryManager.cleanUpAllAllocatedMemory()

      if (freedMemory > 0 && !threwException) {
        val errMsg = s"Managed memory leak detected; size = $freedMemory bytes, TID = $taskId"
        if (conf.getBoolean("spark.unsafe.exceptionOnMemoryLeak", false)) {
          throw new SparkException(errMsg)
        } else {
          logWarning(errMsg)
        }
      }

      if (releasedLocks.nonEmpty && !threwException) {
        val errMsg =
          s"${releasedLocks.size} block locks were not released by TID = $taskId:\n" +
            releasedLocks.mkString("[", ", ", "]")
        if (conf.getBoolean("spark.storage.exceptionOnPinLeak", false)) {
          throw new SparkException(errMsg)
        } else {
          logWarning(errMsg)
        }
      }
    }
    val taskFinish = System.currentTimeMillis()
    val taskFinishCpu = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
      threadMXBean.getCurrentThreadCpuTime
    } else 0L

    // If the task has been killed, let's fail it.
    if (task.killed) {
      throw new TaskKilledException
    }
    // ... (rest of run() omitted)

3. updateDependencies() mainly obtains the Hadoop configuration and uses Java's synchronized keyword for multi-threaded access: tasks run concurrently as Java threads inside one CoarseGrainedExecutorBackend process, so when the business logic touches shared resources, multi-threaded concurrency issues have to be dealt with;

private def updateDependencies(newFiles: HashMap[String, Long], newJars: HashMap[String, Long]) {
  // Get the Hadoop configuration
  lazy val hadoopConf = SparkHadoopUtil.get.newConfiguration(conf)
  // Uses Java's synchronized keyword for multi-threaded access:
  // tasks run concurrently as Java threads inside one CoarseGrainedExecutorBackend process,
  // so touching shared resources from the business logic raises thread-safety concerns
  synchronized {
    // Fetch missing dependencies
    // Iterate over the files to fetch
    for ((name, timestamp) <- newFiles if currentFiles.getOrElse(name, -1L) < timestamp) {
      logInfo("Fetching " + name + " with timestamp " + timestamp)
      // Fetch file with useCache mode, close cache for local mode.
      // Utils.fetchFile() pulls the file from the remote side over the network
      Utils.fetchFile(name, new File(SparkFiles.getRootDirectory()), conf,
        env.securityManager, hadoopConf, timestamp, useCache = !isLocal)
      currentFiles(name) = timestamp
    }
    // Iterate over the jars to fetch
    for ((name, timestamp) <- newJars) {
      val localName = name.split("/").last
      val currentTimeStamp = currentJars.get(name)
        .orElse(currentJars.get(localName))
        .getOrElse(-1L)
      if (currentTimeStamp < timestamp) {
        logInfo("Fetching " + name + " with timestamp " + timestamp)
        // Fetch file with useCache mode, close cache for local mode.
        Utils.fetchFile(name, new File(SparkFiles.getRootDirectory()), conf,
          env.securityManager, hadoopConf, timestamp, useCache = !isLocal)
        currentJars(name) = timestamp
        // Add it to our class loader
        val url = new File(SparkFiles.getRootDirectory(), localName).toURI.toURL
        if (!urlClassLoader.getURLs().contains(url)) {
          logInfo("Adding " + url + " to class loader")
          urlClassLoader.addURL(url)
        }
      }
    }
  }
}

4. The task's run() method creates a TaskContext, which records some global data about the task: how many times it has been retried, which stage it belongs to and which RDD partition it processes; it then calls the abstract method runTask();

final def run(
    taskAttemptId: Long,
    attemptNumber: Int,
    metricsSystem: MetricsSystem): T = {
  SparkEnv.get.blockManager.registerTask(taskAttemptId)
  // Create the TaskContext, which records some global data about the task:
  // how many times it was retried, which stage it belongs to, which RDD partition it processes
  context = new TaskContextImpl(
    stageId,
    partitionId,
    taskAttemptId,
    attemptNumber,
    taskMemoryManager,
    localProperties,
    metricsSystem,
    metrics)
  TaskContext.setTaskContext(context)
  taskThread = Thread.currentThread()

  if (_killed) {
    kill(interruptThread = false)
  }

  new CallerContext("TASK", appId, appAttemptId, jobId, Option(stageId), Option(stageAttemptId),
    Option(taskAttemptId), Option(attemptNumber)).setCurrentContext()

  try {
    // Call the abstract method runTask()
    runTask(context)
  } catch {
    case e: Throwable =>
      // Catch all errors; run task failure callbacks, and rethrow the exception.
      try {
        context.markTaskFailed(e)
      } catch {
        case t: Throwable =>
          e.addSuppressed(t)
      }
      throw e
  } finally {
    // Call the task completion callbacks.
    context.markTaskCompleted()
    try {
      Utils.tryLogNonFatalError {
        // Release memory used by this thread for unrolling blocks
        SparkEnv.get.blockManager.memoryStore.releaseUnrollMemoryForThisTask(MemoryMode.ON_HEAP)
        SparkEnv.get.blockManager.memoryStore.releaseUnrollMemoryForThisTask(MemoryMode.OFF_HEAP)
        // Notify any tasks waiting for execution memory to be freed to wake up and try to
        // acquire memory again. This makes impossible the scenario where a task sleeps forever
        // because there are no other tasks left to notify it. Since this is safe to do but may
        // not be strictly necessary, we should revisit whether we can remove this in the future.
        val memoryManager = SparkEnv.get.memoryManager
        memoryManager.synchronized { memoryManager.notifyAll() }
      }
    } finally {
      TaskContext.unset()
    }
  }
}

5. runTask() is an abstract method that Task subclasses implement: ShuffleMapTask and ResultTask;

def runTask(context: TaskContext): T

6. A ShuffleMapTask splits the elements of an RDD into multiple buckets, based on the partitioner specified in a ShuffleDependency;

private[spark] class ShuffleMapTask(
    stageId: Int,
    stageAttemptId: Int,
    taskBinary: Broadcast[Array[Byte]],
    partition: Partition,
    @transient private var locs: Seq[TaskLocation],
    metrics: TaskMetrics,
    localProperties: Properties,
    jobId: Option[Int] = None,
    appId: Option[String] = None,
    appAttemptId: Option[String] = None)
  extends Task[MapStatus](stageId, stageAttemptId, partition.index, metrics, localProperties, jobId,
    appId, appAttemptId)
  with Logging

7. runTask returns a MapStatus. It deserializes the data the task needs and obtains, through the broadcast variable, the part of the RDD to process; it then gets the shuffleManager and calls rdd.iterator() with the partition to be executed. rdd.iterator() runs the operators we defined against that partition of the RDD; the resulting data goes through the ShuffleWriter and, after the HashPartitioner, is written into the corresponding bucket. Finally a MapStatus is returned, which wraps the computed data stored in the BlockManager;

override def runTask(context: TaskContext): MapStatus = {
  // Deserialize the RDD using the broadcast variable.
  val threadMXBean = ManagementFactory.getThreadMXBean
  val deserializeStartTime = System.currentTimeMillis()
  val deserializeStartCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
    threadMXBean.getCurrentThreadCpuTime
  } else 0L
  // Deserialize the data the task needs to process;
  // the broadcast variable provides the part of the RDD to process
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
  _executorDeserializeCpuTime = if (threadMXBean.isCurrentThreadCpuTimeSupported) {
    threadMXBean.getCurrentThreadCpuTime - deserializeStartCpuTime
  } else 0L

  var writer: ShuffleWriter[Any, Any] = null
  try {
    // Get the shuffleManager
    val manager = SparkEnv.get.shuffleManager
    // Get a writer from the shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    // Call rdd.iterator(), passing in the partition to execute against;
    // iterator() runs our own operators against that partition of the RDD;
    // the returned data goes through the ShuffleWriter and, after the HashPartitioner,
    // is written into the corresponding bucket
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    // Finally return a MapStatus, which wraps the computed data stored in the BlockManager
    writer.stop(success = true).get
  } catch {
    case e: Exception =>
      try {
        if (writer != null) {
          writer.stop(success = false)
        }
      } catch {
        case e: Exception =>
          log.debug("Could not stop writer", e)
      }
      throw e
  }
}

8. MapPartitionsRDD

Runs the operator or function we supplied against a given partition of the RDD.

(Note: f can be understood as the user-defined operator or function, which Spark wraps internally; it performs the custom computation on the RDD's partition and returns the data of the new RDD's partition.)

private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev) {

  override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // Run the operator or function we supplied against a partition of the RDD.
  // f can be seen as the user-defined operator or function, wrapped internally by Spark;
  // it performs the custom computation on the partition and returns the new RDD's partition data
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))

  override def clearDependencies() {
    super.clearDependencies()
    prev = null
  }
}

9. statusUpdate() is an abstract method; the CoarseGrainedExecutorBackend override shown here sends a StatusUpdate message to the driver:

execBackend.statusUpdate(taskId, TaskState.FINISHED, serializedResult)

// statusUpdate
override def statusUpdate(taskId: Long, state: TaskState, data: ByteBuffer) {
  val msg = StatusUpdate(executorId, taskId, state, data)
  driver match {
    case Some(driverRef) => driverRef.send(msg)
    case None => logWarning(s"Drop $msg because has not yet connected to driver")
  }
}

10. TaskSchedulerImpl's statusUpdate() is then called; it looks up the corresponding TaskSet and, once the task has finished, removes it from the in-memory bookkeeping;

def statusUpdate(tid: Long, state: TaskState, serializedData: ByteBuffer) {
  var failedExecutor: Option[String] = None
  var reason: Option[ExecutorLossReason] = None
  synchronized {
    try {
      taskIdToTaskSetManager.get(tid) match {
        // Look up the corresponding TaskSet
        case Some(taskSet) =>
          if (state == TaskState.LOST) {
            // TaskState.LOST is only used by the deprecated Mesos fine-grained scheduling mode,
            // where each executor corresponds to a single task, so mark the executor as failed.
            val execId = taskIdToExecutorId.getOrElse(tid, throw new IllegalStateException(
              "taskIdToTaskSetManager.contains(tid) <=> taskIdToExecutorId.contains(tid)"))
            if (executorIdToRunningTaskIds.contains(execId)) {
              reason = Some(
                SlaveLost(s"Task $tid was lost, so marking the executor as lost as well."))
              removeExecutor(execId, reason.get)
              failedExecutor = Some(execId)
            }
          }
          if (TaskState.isFinished(state)) {
            // The task has finished; remove it from the in-memory bookkeeping
            cleanupTaskState(tid)
            taskSet.removeRunningTask(tid)
            if (state == TaskState.FINISHED) {
              taskResultGetter.enqueueSuccessfulTask(taskSet, tid, serializedData)
            } else if (Set(TaskState.FAILED, TaskState.KILLED, TaskState.LOST).contains(state)) {
              taskResultGetter.enqueueFailedTask(taskSet, tid, state, serializedData)
            }
          }
        case None =>
          logError(
            ("Ignoring update with state %s for TID %s because its task set is gone (this is " +
              "likely the result of receiving duplicate task finished status updates) or its " +
              "executor has been marked as failed.")
              .format(state, tid))
      }
    } catch {
      case e: Exception => logError("Exception in statusUpdate", e)
    }
  }
  // Update the DAGScheduler without holding a lock on this, since that can deadlock
  if (failedExecutor.isDefined) {
    assert(reason.isDefined)
    dagScheduler.executorLost(failedExecutor.get, reason.get)
    backend.reviveOffers()
  }
}

Analysis of the normal (unoptimized) shuffle operation

(Note: the shuffle process has a map side and a reduce side.)

First, the map side:

Next, the reduce-side shuffle:

The shuffle process as a whole:

Analysis of the optimized shuffle operation

BlockManager internals:

1. The Driver has a BlockManagerMaster, which maintains the metadata managed internally by the BlockManagers on each node;

2. Each node's BlockManager has a few key components: DiskStore reads and writes data on disk, MemoryStore reads and writes data in memory, ConnectionManager establishes network connections from this BlockManager to the BlockManagers on other remote nodes, and BlockTransferService reads and writes data against those remote BlockManagers;

3. After each BlockManager is created, it registers with the BlockManagerMaster, which creates a corresponding BlockManagerInfo for it;

4. When a BlockManager writes data, for example intermediate data produced while an RDD runs or data for an explicit persist(), it writes to memory first; if memory is insufficient, part of the in-memory data is spilled to disk;

5. If persist() requested replication, BlockTransferService is used to replicate the data to the BlockManager on another node (see the example after this list);

6. For reads, the BlockManager uses ConnectionManager to connect to the BlockManager that holds the data, and then uses BlockTransferService to read the data from the remote BlockManager;

7. Whenever data is added, removed or changed through a BlockManager, the block's BlockStatus must be reported so that the BlockStatus inside the corresponding BlockManagerInfo is added, removed or updated, keeping the metadata maintained;
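As a hedged illustration of point 5 (assuming an existing SparkContext sc; the path and host name are made up): choosing a replicated StorageLevel makes the BlockManager replicate each block to one other node via BlockTransferService.

import org.apache.spark.storage.StorageLevel

val data = sc.textFile("hdfs://namenode:9000/user/data/input.txt")   // hypothetical path
data.persist(StorageLevel.MEMORY_AND_DISK_2)   // the trailing "_2" asks for 2 replicas of every block
data.count()                                   // first action materializes (and replicates) the blocks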

CacheManager internals:

CacheManager source code analysis:

1. The CacheManager manages cached data; the cache can be memory-based or disk-based;

2. The CacheManager operates on the data through the BlockManager;

3. Whenever a Task runs, it calls the RDD's iterator() method, which in turn may call compute(); iterator() is final and cannot be overridden, while compute() is implemented by the subclasses. You can see that the RDD prefers memory: if the storage level is not NONE, the program first asks the CacheManager for the data; otherwise it checks whether the RDD has been checkpointed;

 

4. The cache keeps as much data as possible while it works, but the data is not guaranteed to be complete: if the machine needs memory, cached data must give way, because execution takes priority over caching. If the RDD was persisted with a level that also allows disk, part of the cached data can be moved from memory to disk; otherwise that data is lost.

When caching, the BlockManager manages the data for you, and previously cached data can be found in the BlockManager by its key;

 

If BlockManager.get() returns no data, the acquireLockForPartition method is called, because several threads may be operating on the data. Spark also has a mechanism for slow tasks ("stragglers"): speculative execution, which usually ends up running the same task on two machines;
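Speculative execution is the mechanism behind those duplicate straggler tasks; it is switched on with configuration like the following (the values are only illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")             // re-launch suspiciously slow tasks on another executor
  .set("spark.speculation.multiplier", "1.5")   // how many times slower than the median counts as slow
  .set("spark.speculation.quantile", "0.75")    // fraction of tasks that must finish before speculating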

 

 

In the end the data is still obtained through BlockManager.get;

5. Concretely, when the CacheManager fetches cached data it goes through the BlockManager, looking for the data locally first and fetching it remotely otherwise;

BlockManager.getLocal in turn calls doGetLocal; its implementation shows that the cache is not necessarily in memory: it can be in memory, on disk, or off-heap (Tachyon);

 

6. After getLocal is called in the previous step, it in turn calls doGetLocal;

 

 

 

 

 

7. If, in step 5, there is no cached copy locally, the getRemote method is called to fetch the data from a remote node;

 

8. If the CacheManager cannot obtain the cached content through the BlockManager, it falls back to the RDD's computeOrReadCheckpoint() method to obtain the data;

It first checks whether the current RDD has been checkpointed: if so, the checkpointed data is read directly, otherwise the data has to be computed;

Checkpointing itself is important; after the computation, putInBlockManager re-caches the data according to the StorageLevel;

 

9. If there is not enough space to cache the data, the MemoryStore's unrollSafely method is called, which loops while unrolling the data into memory;

Checkpoint internals:

 

Analysis:

1. SparkContext's setCheckpointDir sets a checkpoint directory;

def setCheckpointDir(directory: String) {

  // If we are running on a cluster, log a warning if the directory is local.
  // Otherwise, the driver may attempt to reconstruct the checkpointed RDD from
  // its own local file system, which is incorrect because the checkpoint files
  // are actually on the executor machines.
  if (!isLocal && Utils.nonLocalPaths(directory).isEmpty) {
    logWarning("Spark is not running in local mode, therefore the checkpoint directory " +
      s"must not be on the local filesystem. Directory '$directory' " +
      "appears to be on the local filesystem.")
  }
  // Create an HDFS directory using the Hadoop API
  checkpointDir = Option(directory).map { dir =>
    val path = new Path(dir, UUID.randomUUID().toString)
    val fs = path.getFileSystem(hadoopConfiguration)
    fs.mkdirs(path)
    fs.getFileStatus(path).getPath.toString
  }
}
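Typical usage of this API, assuming an existing SparkContext sc (the directory is only an example):

sc.setCheckpointDir("hdfs://namenode:9000/spark/checkpoints")   // example HDFS path

val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.cache()        // recommended before checkpointing, so the checkpoint job reuses the cached data
rdd.checkpoint()   // only marks the RDD; the data is written after the first job finishes
rdd.count()        // runs the job; rdd.doCheckpoint() then writes the partitions to HDFS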

2. The RDD's core checkpoint() method;

def checkpoint(): Unit = RDDCheckpointData.synchronized {
    // NOTE: we use a global lock here due to complexities downstream with ensuring
    // children RDD partitions point to the correct parent partitions. In the future
    // we should revisit this consideration.
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      // Create a ReliableRDDCheckpointData instance (a subclass of RDDCheckpointData)
      checkpointData = Some(new ReliableRDDCheckpointData(this))
    }
  }

3. A ReliableRDDCheckpointData instance is created;

// The parent class of ReliableRDDCheckpointData is RDDCheckpointData
private[spark] class ReliableRDDCheckpointData[T: ClassTag](@transient private val rdd: RDD[T])
  extends RDDCheckpointData[T](rdd) with Logging {

  // ...

}

4. The parent class RDDCheckpointData;

/**
 * An RDD has to go through the states
 *
 * [ Initialized --> CheckpointingInProgress --> Checkpointed ]
 *
 * before it is fully checkpointed.
 */
private[spark] abstract class RDDCheckpointData[T: ClassTag](@transient private val rdd: RDD[T])
  extends Serializable {

  import CheckpointState._

  // The checkpoint state of the associated RDD.
  // Marks the checkpoint state; on first initialization it is Initialized
  protected var cpState = Initialized

  // ...

}

Writing checkpoint data:

1. When a Spark job runs, it ultimately calls SparkContext's runJob method to submit the work to the Executors;

def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  // In a production setting this ends up calling ReliableRDDCheckpointData's doCheckpoint method
  rdd.doCheckpoint()
}

2. The rdd.doCheckpoint() method;

private[spark] def doCheckpoint(): Unit = {
  RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
    if (!doCheckpointCalled) {
      doCheckpointCalled = true
      if (checkpointData.isDefined) {
        if (checkpointAllMarkedAncestors) {
          // TODO We can collect all the RDDs that needs to be checkpointed, and then checkpoint
          // them in parallel.
          // Checkpoint parents first because our lineage will be truncated after we
          // checkpoint ourselves
          dependencies.foreach(_.rdd.doCheckpoint())
        }
        checkpointData.get.checkpoint()
      } else {
        // Walk the dependencies and call doCheckpoint on each parent RDD
        dependencies.foreach(_.rdd.doCheckpoint())
      }
    }
  }
}

3. checkpointData is of type RDDCheckpointData; its checkpoint() method:

final def checkpoint(): Unit = {
  // Guard against multiple threads checkpointing the same RDD by
  // atomically flipping the state of this RDDCheckpointData
  RDDCheckpointData.synchronized {
    if (cpState == Initialized) {
      // 1. Mark the current state as checkpointing-in-progress
      cpState = CheckpointingInProgress
    } else {
      return
    }
  }
  // 2. This calls the subclass's doCheckpoint()
  val newRDD = doCheckpoint()

  // Update our state and truncate the RDD lineage
  // 3. Mark the checkpoint as completed and clear the RDD's dependencies
  RDDCheckpointData.synchronized {
    cpRDD = Some(newRDD)
    cpState = Checkpointed
    rdd.markCheckpointed()
  }
}

4. The subclass ReliableRDDCheckpointData's doCheckpoint() method;

protected override def doCheckpoint(): CheckpointRDD[T] = {
  /**
   * Create a new dependency for this RDD whose parent is a CheckpointRDD.
   * That CheckpointRDD is responsible for later reading the checkpoint files
   * from the file system and producing this RDD's partitions.
   */
  val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)

  // Optionally clean our checkpoint files if the reference is out of scope
  if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
    rdd.context.cleaner.foreach { cleaner =>
      cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
    }
  }

  logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
  // Return the newly created RDD to the parent class
  newRDD
}

5. The ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir) method;

/**
 * Write RDD to checkpoint files and return a ReliableCheckpointRDD representing the RDD.
 *
 * Triggers runJob to write the current RDD's data into the checkpoint directory,
 * producing a ReliableCheckpointRDD instance at the same time.
 */
def writeRDDToCheckpointDirectory[T: ClassTag](
    originalRDD: RDD[T],
    checkpointDir: String,
    blockSize: Int = -1): ReliableCheckpointRDD[T] = {
  val checkpointStartTimeNs = System.nanoTime()

  val sc = originalRDD.sparkContext

  // Create the output path for the checkpoint
  // Use checkpointDir as our checkpoint directory
  val checkpointDirPath = new Path(checkpointDir)
  // Get the HDFS file system API
  val fs = checkpointDirPath.getFileSystem(sc.hadoopConfiguration)
  // Create the directory
  if (!fs.mkdirs(checkpointDirPath)) {
    throw new SparkException(s"Failed to create checkpoint path $checkpointDirPath")
  }

  // Save to file, and reload it as an RDD
  // Broadcast the configuration to all nodes
  val broadcastedConf = sc.broadcast(
    new SerializableConfiguration(sc.hadoopConfiguration))
  // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
  // Core code:
  /**
   * A new job is submitted here, which computes the RDD again. If the original RDD
   * was cached, a lot of computation is saved -- which is exactly why caching before
   * checkpointing is strongly recommended.
   */
  sc.runJob(originalRDD,
    writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)

  // If the RDD has a partitioner, write it to the checkpoint directory as well
  if (originalRDD.partitioner.nonEmpty) {
    writePartitionerToCheckpointDir(sc, originalRDD.partitioner.get, checkpointDirPath)
  }

  val checkpointDurationMs =
    TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - checkpointStartTimeNs)
  logInfo(s"Checkpointing took $checkpointDurationMs ms.")

  // Create a CheckpointRDD; it should have the same number of partitions as the original RDD
  val newRDD = new ReliableCheckpointRDD[T](
    sc, checkpointDirPath.toString, originalRDD.partitioner)
  if (newRDD.partitions.length != originalRDD.partitions.length) {
    throw new SparkException(
      "Checkpoint RDD has a different number of partitions from original RDD. Original " +
        s"RDD [ID: ${originalRDD.id}, num of partitions: ${originalRDD.partitions.length}]; " +
        s"Checkpoint RDD [ID: ${newRDD.id}, num of partitions: " +
        s"${newRDD.partitions.length}].")
  }
  newRDD
}

Finally, the new CheckpointRDD is returned; the parent class assigns it to the cpRDD member, marks the state as Checkpointed and clears the RDD's dependency chain. At this point the checkpoint data has been serialized onto HDFS;

Reading checkpoint data

1. The RDD's iterator();

/**
 * Suppose persist() is called first and then checkpoint().
 * The first time rdd.iterator() runs, storageLevel != StorageLevel.NONE, so the data is
 * obtained through the CacheManager; if the BlockManager has no data yet, the data is
 * computed for the first time and then persisted through the BlockManager.
 *
 * When the RDD's job finishes, a separate job is started to perform the checkpoint.
 * The next time rdd.iterator() runs, the persistence level is still non-NONE, so in the
 * normal case the persisted data is read from the BlockManager.
 *
 * In the abnormal case, computeOrReadCheckpoint is called: if isCheckpointed is true,
 * the parent RDD's iterator() is called, which reads the data from the external file system.
 */
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    getOrCompute(split, context)
  } else {
    // Compute the RDD partition (or read it from a checkpoint)
    computeOrReadCheckpoint(split, context)
  }
}

2. Which then calls computeOrReadCheckpoint;

private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
{
  if (isCheckpointedAndMaterialized) {
    firstParent[T].iterator(split, context)
  } else {
    // Abstract method; see the concrete implementations, e.g. MapPartitionsRDD
    compute(split, context)
  }
}

When rdd.iterator() is called to compute an RDD partition, computeOrReadCheckpoint(split: Partition) checks whether this RDD has already been checkpointed: if it has, it calls the parent RDD's iterator(), i.e. CheckpointRDD.iterator(); otherwise it calls the RDD's own compute directly.

3. The compute method of CheckpointRDD (ReliableCheckpointRDD);

// Read our checkpoint data from the Path
override def compute(split: Partition, context: TaskContext): Iterator[T] = {
  val file = new Path(checkpointPath, ReliableCheckpointRDD.checkpointFileName(split.index))
  ReliableCheckpointRDD.readCheckpointFile(file, broadcastedConf, context)
}

4. The readCheckpointFile method;

def readCheckpointFile[T](
    path: Path,
    broadcastedConf: Broadcast[SerializableConfiguration],
    context: TaskContext): Iterator[T] = {
  val env = SparkEnv.get
  // Read the data on HDFS with the Hadoop API
  val fs = path.getFileSystem(broadcastedConf.value.value)
  val bufferSize = env.conf.getInt("spark.buffer.size", 65536)
  val fileInputStream = {
    val fileStream = fs.open(path, bufferSize)
    if (env.conf.get(CHECKPOINT_COMPRESS)) {
      CompressionCodec.createCodec(env.conf).compressedInputStream(fileStream)
    } else {
      fileStream
    }
  }
  val serializer = env.serializer.newInstance()
  val deserializeStream = serializer.deserializeStream(fileInputStream)

  // Register an on-task-completion callback to close the input stream.
  context.addTaskCompletionListener[Unit](context => deserializeStream.close())
  // Deserialize the data and expose it as an Iterator
  deserializeStream.asIterator.asInstanceOf[Iterator[T]]
}
Original article (in Chinese): https://www.cnblogs.com/zhzJAVA11/p/10430833.html