Spark Applications: Task Division

Task Division

In the handleJobSubmitted method of the DAGScheduler class, there is a call that creates the final ResultStage and then submits it:

var finalStage: ResultStage = null
// ...
finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
// ...
submitStage(finalStage)

The submitStage method submits the final ResultStage. Since that ResultStage may depend on multiple ancestor stages, submitting it effectively submits every stage of the application. Here is the method's source:

private def submitStage(stage: Stage): Unit = {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug(s"submitStage($stage (name=${stage.name};" +
      s"jobs=${stage.jobIds.toSeq.sorted.mkString(",")}))")
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}

The core logic of this method is to first obtain all missing parent stages of the current stage. If there are none, submitMissingTasks is executed directly; otherwise, submitStage is called recursively on each parent stage, recursing upward until a stage with no missing parents is found. That topmost stage is the one that ultimately executes submitMissingTasks, while the current stage is parked in waitingStages. A minimal sketch of this traversal follows.
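To make the parent-first order concrete, here is a minimal standalone sketch — not Spark's actual code. DemoStage and submitDemoStage are hypothetical stand-ins for Stage and submitStage, and the parents list stands in for getMissingParentStages:

import scala.collection.mutable

// Hypothetical stand-in for Stage: an id plus its missing parent stages.
case class DemoStage(id: Int, parents: List[DemoStage])

val waitingStages = mutable.Set[Int]()

def submitDemoStage(stage: DemoStage): Unit = {
  val missing = stage.parents // stand-in for getMissingParentStages(stage)
  if (missing.isEmpty) {
    println(s"submitMissingTasks for stage ${stage.id}") // topmost stage runs first
  } else {
    missing.foreach(submitDemoStage) // recurse into parents first
    waitingStages += stage.id        // park until the parents finish
  }
}

// Lineage: stage 0 (no parents) <- stage 1 <- stage 2 (the final ResultStage).
submitDemoStage(DemoStage(2, List(DemoStage(1, List(DemoStage(0, Nil))))))
// Prints "submitMissingTasks for stage 0"; stages 1 and 2 sit in waitingStages
// (in real Spark, stage-completion events cause them to be resubmitted later).

Now let's look at the core part of submitMissingTasks: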

private def submitMissingTasks(stage: Stage, jobId: Int): Unit = {
  // ...
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
  // ...
  val tasks: Seq[Task[_]] = try {
    val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
    stage match {
      case stage: ShuffleMapStage =>
        stage.pendingPartitions.clear()
        partitionsToCompute.map { id =>
          val locs = taskIdToLocations(id)
          val part = partitions(id)
          stage.pendingPartitions += id
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptNumber,
            taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
            Option(sc.applicationId), sc.applicationAttemptId, stage.rdd.isBarrier())
        }

      case stage: ResultStage =>
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = partitions(p)
          val locs = taskIdToLocations(id)
          new ResultTask(stage.id, stage.latestInfo.attemptNumber,
            taskBinary, part, locs, id, properties, serializedTaskMetrics,
            Option(jobId), Option(sc.applicationId), sc.applicationAttemptId,
            stage.rdd.isBarrier())
        }
    }
  } catch {
    case NonFatal(e) =>
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }

  // ...
}

The core logic here pattern-matches on the stage argument: different Stage subtypes produce different Task types. First the partitions to compute are determined, yielding a collection of partition indices; then map creates one Task object per partition index — a ShuffleMapTask for a ShuffleMapStage, a ResultTask for a ResultStage — so there are exactly as many tasks as partition ids. partitionsToCompute is the return value of stage.findMissingPartitions(); since stage is a reference of the abstract Stage type, the concrete implementation of this method lives in the Stage subclasses. A quick sketch of the one-task-per-partition mapping follows.
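A minimal sketch of this mapping, with DemoTask as a hypothetical stand-in for ShuffleMapTask/ResultTask:

// Hypothetical stand-in for ShuffleMapTask / ResultTask.
case class DemoTask(stageId: Int, partitionId: Int)

val partitionsToCompute: Seq[Int] = Seq(0, 1, 2, 3) // e.g. four missing partitions

// Mirrors partitionsToCompute.map { id => new ShuffleMapTask(...) }:
val tasks: Seq[DemoTask] = partitionsToCompute.map(id => DemoTask(stageId = 0, partitionId = id))

assert(tasks.size == partitionsToCompute.size) // exactly one task per partition id

Let's look at the implementation of findMissingPartitions in ResultStage and in ShuffleMapStage: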

ResultStage:

override def findMissingPartitions(): Seq[Int] = {
  val job = activeJob.get
  (0 until job.numPartitions).filter(id => !job.finished(id))
}

ShuffleMapStage:

override def findMissingPartitions(): Seq[Int] = {
  mapOutputTrackerMaster
    .findMissingPartitions(shuffleDep.shuffleId)
    .getOrElse(0 until numPartitions)
}
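To see what each implementation returns, here is a small sketch (not Spark code) with hypothetical values for job.finished and the map-output tracker's reply:

// ResultStage case: keep the partitions whose result is not finished yet.
val finished = Array(false, true, false, false) // hypothetical job.finished
val missingResult = (0 until 4).filter(id => !finished(id)) // Vector(0, 2, 3)

// ShuffleMapStage case: if the map-output tracker has no record of the
// shuffle (None), every partition 0 until numPartitions counts as missing.
val trackerReply: Option[Seq[Int]] = None // hypothetical tracker state
val missingShuffle = trackerReply.getOrElse(0 until 4) // Range(0, 1, 2, 3)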

So partitionsToCompute is indeed a collection of partition indices. The numPartitions values for ResultStage and ShuffleMapStage are computed in essentially the same way: both derive from the number of partitions of the last RDD in their respective stage:

The value of job.numPartitions:

val numPartitions = finalStage match {
  case r: ResultStage => r.partitions.length
  case m: ShuffleMapStage => m.rdd.partitions.length
}

ShuffleMapStage's numPartitions:

val numPartitions = rdd.partitions.length

To summarize: the total number of tasks in an application equals the sum, over all stages, of the partition count of each stage's last RDD.
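As a concrete illustration — a hedged example, assuming a SparkContext named sc (e.g. in spark-shell): reduceByKey introduces one shuffle, so the job below splits into a ShuffleMapStage whose last RDD has 4 partitions and a ResultStage whose last RDD has 2, giving 4 + 2 = 6 tasks.

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)), 4) // ShuffleMapStage's last RDD: 4 partitions
val reduced = pairs.reduceByKey(_ + _, 2) // ResultStage's last RDD: 2 partitions

// ShuffleMapStage -> 4 ShuffleMapTasks; ResultStage -> 2 ResultTasks.
reduced.collect() // triggers the job: 4 + 2 = 6 tasks in total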

Original article: https://www.cnblogs.com/yxym2016/p/14254225.html