Spark Shuffle

  1. Spark Shuffle Concepts

    • reduceByKey aggregates all the values of each key in the parent RDD into a single value and produces a new RDD whose elements are <key, value> pairs, so that every key maps to one aggregated value (a short example follows this list).
      • Problem:
      • Before aggregation, the values of a given key are not necessarily in the same partition, and are unlikely to be on the same node, because an RDD is a distributed, resilient dataset whose partitions are very likely spread across different nodes.
      • How is the aggregation done?
        • Shuffle Write:
          • Each map task of the previous stage must write the records of its current partition that share the same key into the same partition file; a single task may therefore write several different partition files.
        • Shuffle Read:
          • Each reduce task fetches the partition files that belong to it from the machines where the tasks of the previous stage ran, which guarantees that all values of the same key are gathered on one node for processing and aggregation.
          • Spark has two shuffle implementations: HashShuffle and SortShuffle.
            • Before Spark 1.2 there was only HashShuffle; Spark 1.2 introduced SortShuffle.
            • From Spark 1.2 to Spark 1.6, HashShuffle and SortShuffle coexisted.
            • Spark 2.0 removed HashShuffle, leaving only SortShuffle.
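      • Example

        A minimal, self-contained sketch of a job whose reduceByKey call triggers exactly this Shuffle Write / Shuffle Read sequence; the object name and app name below are illustrative, not taken from the original text.

          import org.apache.spark.{SparkConf, SparkContext}

          object ReduceByKeyShuffleExample {
            def main(args: Array[String]): Unit = {
              val conf = new SparkConf().setAppName("ReduceByKeyShuffleExample").setMaster("local[2]")
              val sc = new SparkContext(conf)

              // The values of a given key may start out in different partitions / on different nodes.
              val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 2), ("b", 3)), numSlices = 4)

              // reduceByKey forces a shuffle: map tasks perform the Shuffle Write,
              // reduce tasks perform the Shuffle Read and the final aggregation.
              val summed = pairs.reduceByKey(_ + _)
              summed.collect().foreach(println)   // e.g. (a,3), (b,4)

              sc.stop()
            }
          }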
  2. HashShuffle

    • Normal mechanism

      • Diagram

        image-20191025183916922

      • Execution flow

        • Each map task writes its results for different reduce partitions into different buffers; each buffer is 32 KB.
        • Each buffer ultimately corresponds to one small file on disk.
        • Reduce tasks then pull the small disk files that belong to them.
      • Summary

        • The output of each map task is routed by the partitioner (hashPartitioner by default) to decide which small disk file it is written to (see the routing sketch after this list).
        • Reduce tasks pull the corresponding small disk files from the map side.
        • Number of small disk files produced: M (number of map tasks) * R (number of reduce tasks)
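      • Partition-routing sketch

        A spark-shell style sketch of how the default hash partitioner decides which of the R small files a record is routed to; the helper name targetPartition is illustrative, and its logic only mirrors the idea of HashPartitioner.getPartition.

          // Illustrative helper: key.hashCode modulo the number of reduce tasks, kept non-negative.
          def targetPartition(key: Any, numReduceTasks: Int): Int = {
            val raw = key.hashCode % numReduceTasks
            if (raw < 0) raw + numReduceTasks else raw
          }

          // With M map tasks and R reduce tasks, unoptimized HashShuffle produces M * R small files.
          val m = 4                                       // map tasks
          val r = 3                                       // reduce tasks
          val smallFiles = m * r                          // 12 small files in this example
          val fileIndex  = targetPartition("user-42", r)  // which of a task's R files this key goes to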
      • Problems

        • Producing so many small disk files leads to the following problems:
          • Shuffle Write creates a large number of objects for writing the small disk files.
          • Shuffle Read creates a large number of objects for reading the small disk files.
          • Too many objects in the JVM heap cause frequent GC; if GC still cannot free enough memory for the job to run, an OOM (Out Of Memory) error occurs.
          • Data transfer involves frequent network communication, which greatly increases the chance of a communication failure; once the network fails, a "shuffle file cannot be found" error occurs and the task fails. The TaskScheduler is not responsible for retrying it; instead the DAGScheduler retries the whole stage.
    • Consolidation mechanism

      • Consolidation mechanism diagram

        image-20191025185742499

      • Source code

        • ShuffleManager(Trait)

          image-20191025210244267

          package org.apache.spark.shuffle
          
          import org.apache.spark.{ShuffleDependency, TaskContext}
          
          /**
           * Pluggable interface for shuffle systems. A ShuffleManager is created in SparkEnv on the
           * driver and on each executor, based on the spark.shuffle.manager setting. The driver
           * registers shuffles with it, and executors (or tasks running locally in the driver) can
           * ask to read and write data.
           *
           * NOTE: this will be instantiated by SparkEnv so its constructor can take a SparkConf and
           * boolean isDriver as parameters.
           */
          private[spark] trait ShuffleManager {
          
            /**
             * Register a shuffle with the manager and obtain a handle for it to pass to tasks.
             */
            // Implemented by SortShuffleManager (the only built-in implementation since Spark 2.0)
            def registerShuffle[K, V, C](
                shuffleId: Int,
                numMaps: Int,
                dependency: ShuffleDependency[K, V, C]): ShuffleHandle
          
            /**
             * Get a writer for a given partition. Called on executors by map tasks.
             */
            def getWriter[K, V](handle: ShuffleHandle, mapId: Int, context: TaskContext): ShuffleWriter[K, V]
          
            /**
             * Get a reader for a range of reduce partitions (startPartition to endPartition-1,
             * inclusive). Called on executors by reduce tasks.
             */
            def getReader[K, C](
                handle: ShuffleHandle,
                startPartition: Int,
                endPartition: Int,
                context: TaskContext): ShuffleReader[K, C]
          
            /**
             * Remove a shuffle's metadata from the ShuffleManager.
             * @return true if the metadata removed successfully, otherwise false.
             */
            def unregisterShuffle(shuffleId: Int): Boolean
          
            /**
             * Return a resolver capable of retrieving shuffle block data based on block coordinates.
             */
            def shuffleBlockResolver: ShuffleBlockResolver
          
            /** Shut down this ShuffleManager. */
            def stop(): Unit
          }
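
          Since the manager is chosen through the spark.shuffle.manager setting (as the doc comment above notes), the implementation can also be selected explicitly when building the SparkConf. A minimal sketch; "sort" maps to SortShuffleManager and is the default, and the app name is illustrative:

            import org.apache.spark.SparkConf

            val conf = new SparkConf()
              .setAppName("shuffle-manager-demo")    // illustrative app name
              .set("spark.shuffle.manager", "sort")  // or the fully-qualified class name of a custom ShuffleManager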
          
        • ShuffleReader(Trait)

          package org.apache.spark.shuffle
          
          /**
           * Obtained inside a reduce task to read combined records from the mappers.
           */
          private[spark] trait ShuffleReader[K, C] {
            /** Read the combined key-values for this reduce task */
            def read(): Iterator[Product2[K, C]]
          
            /**
             * Close this reader.
             * TODO: Add this back when we make the ShuffleReader a developer API that others can
             * implement (at which point this will likely be necessary).
             */
            // def stop(): Unit
          }
          
        • SortShuffleManager

          image-20191025210033845

          package org.apache.spark.shuffle.sort
          
          import java.util.concurrent.ConcurrentHashMap
          
          import org.apache.spark._
          import org.apache.spark.internal.Logging
          import org.apache.spark.shuffle._
          
          /**
           * In sort-based shuffle, incoming records are sorted according to their target partition
           * ids, then written to a single map output file. Reducers fetch contiguous regions of this
           * file in order to read their portion of the map output. In cases where the map output data
           * is too large to fit in memory, sorted subsets of the output can be spilled to disk and
           * those on-disk files are merged to produce the final output file.
           *
           * Sort-based shuffle has two different write paths for producing its map output files:
           *
           *  - Serialized sorting: used when all three of the following conditions hold:
           *    1. The shuffle dependency specifies no aggregation or output ordering.
           *    2. The shuffle serializer supports relocation of serialized values (this is currently
           *       supported by KryoSerializer and Spark SQL's custom serializers).
           *    3. The shuffle produces fewer than 16777216 output partitions.
           *  - Deserialized sorting: used to handle all other cases.
           *
           * -----------------------
           * Serialized sorting mode
           * -----------------------
           *
           * In the serialized sorting mode, incoming records are serialized as soon as they are
           * passed to the shuffle writer and are buffered in a serialized form during sorting.
           *
           * This write path implements several optimizations:
           *
           *  - Its sort operates on serialized binary data rather than Java objects, which reduces
           *    memory consumption and GC overheads. This optimization requires the record serializer
           *    to have certain properties to allow serialized records to be re-ordered without
           *    requiring deserialization. See SPARK-4550, where this optimization was first proposed
           *    and implemented, for more details.
           *
           *  - It uses a specialized cache-efficient sorter ([[ShuffleExternalSorter]]) that sorts
           *    arrays of compressed record pointers and partition ids. By using only 8 bytes of space
           *    per record in the sorting array, this fits more of the array into cache.
           *
           *  - The spill merging procedure operates on blocks of serialized records that belong to
           *    the same partition and does not need to deserialize records during the merge.
           *
           *  - When the spill compression codec supports concatenation of compressed data, the spill
           *    merge simply concatenates the serialized and compressed spill partitions to produce
           *    the final output partition. This allows efficient data copying methods, like NIO's
           *    `transferTo`, to be used and avoids the need to allocate decompression or copying
           *    buffers during the merge.
           *
           * For more details on these optimizations, see SPARK-7081.
           */
          private[spark] class SortShuffleManager(conf: SparkConf) extends ShuffleManager with Logging {
          
            if (!conf.getBoolean("spark.shuffle.spill", true)) {
              logWarning(
                "spark.shuffle.spill was set to false, but this configuration is ignored as of Spark 1.6+." +
                  " Shuffle will continue to spill to disk when necessary.")
            }
          
            /**
             * A mapping from shuffle ids to the number of mappers producing output for those shuffles.
             */
            private[this] val numMapsForShuffle = new ConcurrentHashMap[Int, Int]()
            // Override the shuffle block resolver with the index-based resolver
            override val shuffleBlockResolver = new IndexShuffleBlockResolver(conf)
          
            /**
             * Obtains a [[ShuffleHandle]] to pass to tasks.
             */
            /**
              * SortShuffleManager has three important methods:
              *  getReader: reads shuffle data
              *  getWriter: writes shuffle data
              *  registerShuffle: registers a shuffle
              */
            override def registerShuffle[K, V, C](
                shuffleId: Int,
                numMaps: Int,
                dependency: ShuffleDependency[K, V, C]): ShuffleHandle = {
              // Decide whether the bypass-merge-sort path of sort shuffle can be used
              if (SortShuffleWriter.shouldBypassMergeSort(conf, dependency)) {
                // If there are fewer than spark.shuffle.sort.bypassMergeThreshold partitions and we don't
                // need map-side aggregation, then write numPartitions files directly and just concatenate
                // them at the end. This avoids doing serialization and deserialization twice to merge
                // together the spilled files, which would happen with the normal code path. The downside is
                // having multiple files open at a time and thus more memory allocated to buffers.
                new BypassMergeSortShuffleHandle[K, V](
                  shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
              } else if (SortShuffleManager.canUseSerializedShuffle(dependency)) {
                // Otherwise, try to buffer map outputs in a serialized form, since this is more efficient:
                new SerializedShuffleHandle[K, V](
                  shuffleId, numMaps, dependency.asInstanceOf[ShuffleDependency[K, V, V]])
              } else {
                // Otherwise, buffer map outputs in a deserialized form:
                new BaseShuffleHandle(shuffleId, numMaps, dependency)
              }
            }
          
            /**
             * Get a reader for a range of reduce partitions (startPartition to endPartition-1,
             * inclusive). Called on executors by reduce tasks.
             */
            override def getReader[K, C](
                handle: ShuffleHandle,
                startPartition: Int,
                endPartition: Int,
                context: TaskContext): ShuffleReader[K, C] = {
              new BlockStoreShuffleReader(
                handle.asInstanceOf[BaseShuffleHandle[K, _, C]], startPartition, endPartition, context)
            }
          
            /** Get a writer for a given partition. Called on executors by map tasks. */
            override def getWriter[K, V](
                handle: ShuffleHandle,
                mapId: Int,
                context: TaskContext): ShuffleWriter[K, V] = {
              numMapsForShuffle.putIfAbsent(
                handle.shuffleId, handle.asInstanceOf[BaseShuffleHandle[_, _, _]].numMaps)
              val env = SparkEnv.get
              handle match {
                case unsafeShuffleHandle: SerializedShuffleHandle[K @unchecked, V @unchecked] =>
                  new UnsafeShuffleWriter(
                    env.blockManager,
                    shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
                    context.taskMemoryManager(),
                    unsafeShuffleHandle,
                    mapId,
                    context,
                    env.conf)
                case bypassMergeSortHandle: BypassMergeSortShuffleHandle[K @unchecked, V @unchecked] =>
                  new BypassMergeSortShuffleWriter(
                    env.blockManager,
                    shuffleBlockResolver.asInstanceOf[IndexShuffleBlockResolver],
                    bypassMergeSortHandle,
                    mapId,
                    context,
                    env.conf)
                case other: BaseShuffleHandle[K @unchecked, V @unchecked, _] =>
                  new SortShuffleWriter(shuffleBlockResolver, other, mapId, context)
              }
            }
          
            /** Remove a shuffle's metadata from the ShuffleManager. */
            override def unregisterShuffle(shuffleId: Int): Boolean = {
              Option(numMapsForShuffle.remove(shuffleId)).foreach { numMaps =>
                (0 until numMaps).foreach { mapId =>
                  shuffleBlockResolver.removeDataByMap(shuffleId, mapId)
                }
              }
              true
            }
          
            /** Shut down this ShuffleManager. */
            override def stop(): Unit = {
              shuffleBlockResolver.stop()
            }
          }
          
          
          private[spark] object SortShuffleManager extends Logging {
          
            /**
             * The maximum number of shuffle output partitions that SortShuffleManager supports when
             * buffering map outputs in a serialized form. This is an extreme defensive programming
             * measure, since it's extremely unlikely that a single shuffle produces over 16 million
             * output partitions.
             */
            val MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE =
              PackedRecordPointer.MAXIMUM_PARTITION_ID + 1
          
            /**
             * Helper method for determining whether a shuffle should use an optimized serialized
             * shuffle path or whether it should fall back to the original path that operates on
             * deserialized objects.
             */
            def canUseSerializedShuffle(dependency: ShuffleDependency[_, _, _]): Boolean = {
              val shufId = dependency.shuffleId
              val numPartitions = dependency.partitioner.numPartitions
              if (!dependency.serializer.supportsRelocationOfSerializedObjects) {
                log.debug(s"Can't use serialized shuffle for shuffle $shufId because the serializer, " +
                  s"${dependency.serializer.getClass.getName}, does not support object relocation")
                false
              } else if (dependency.aggregator.isDefined) {
                log.debug(
                  s"Can't use serialized shuffle for shuffle $shufId because an aggregator is defined")
                false
              } else if (numPartitions > MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE) {
                log.debug(s"Can't use serialized shuffle for shuffle $shufId because it has more than " +
                  s"$MAX_SHUFFLE_OUTPUT_PARTITIONS_FOR_SERIALIZED_MODE partitions")
                false
              } else {
                log.debug(s"Can use serialized shuffle for shuffle $shufId")
                true
              }
            }
          }
          
          /**
           * Subclass of [[BaseShuffleHandle]], used to identify when we've chosen to use the
           * serialized shuffle.
           */
          private[spark] class SerializedShuffleHandle[K, V](
            shuffleId: Int,
            numMaps: Int,
            dependency: ShuffleDependency[K, V, V])
            extends BaseShuffleHandle(shuffleId, numMaps, dependency) {
          }
          
          /**
           * Subclass of [[BaseShuffleHandle]], used to identify when we've chosen to use the bypass
           * merge sort shuffle path.
           */
          private[spark] class BypassMergeSortShuffleHandle[K, V](
            shuffleId: Int,
            numMaps: Int,
            dependency: ShuffleDependency[K, V, V])
            extends BaseShuffleHandle(shuffleId, numMaps, dependency) {
          }
          
        • SortShuffleWriter

          package org.apache.spark.shuffle.sort
          
          import org.apache.spark._
          import org.apache.spark.internal.Logging
          import org.apache.spark.scheduler.MapStatus
          import org.apache.spark.shuffle.{BaseShuffleHandle, IndexShuffleBlockResolver, ShuffleWriter}
          import org.apache.spark.storage.ShuffleBlockId
          import org.apache.spark.util.Utils
          import org.apache.spark.util.collection.ExternalSorter
          
          private[spark] class SortShuffleWriter[K, V, C](
              shuffleBlockResolver: IndexShuffleBlockResolver,
              handle: BaseShuffleHandle[K, V, C],
              mapId: Int,
              context: TaskContext)
            extends ShuffleWriter[K, V] with Logging {
          
            private val dep = handle.dependency
          
            private val blockManager = SparkEnv.get.blockManager
          
            private var sorter: ExternalSorter[K, V, _] = null
          
            // Are we in the process of stopping? Because map tasks can call stop() with success = true
            // and then call stop() with success = false if they get an exception, we want to make sure
            // we don't try deleting files, etc twice.
            private var stopping = false
          
            private var mapStatus: MapStatus = null
          
            private val writeMetrics = context.taskMetrics().shuffleWriteMetrics
          
            /** Write a bunch of records to this task's output */
            override def write(records: Iterator[Product2[K, V]]): Unit = {
              sorter = if (dep.mapSideCombine) {
                require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
                new ExternalSorter[K, V, C](
                  context, dep.aggregator, Some(dep.partitioner), dep.keyOrdering, dep.serializer)
              } else {
                // In this case we pass neither an aggregator nor an ordering to the sorter, because we don't
                // care whether the keys get sorted in each partition; that will be done on the reduce side
                // if the operation being run is sortByKey.
                new ExternalSorter[K, V, V](
                  context, aggregator = None, Some(dep.partitioner), ordering = None, dep.serializer)
              }
              // Insert all records; the sorter may spill to disk multiple times along the way
              sorter.insertAll(records)
          
              // Don't bother including the time to open the merged output file in the shuffle write time,
              // because it just opens a single file, so is typically too fast to measure accurately
              // (see SPARK-3570).
              // Below: write the sorted data to a single data file and create the matching index file
              val output = shuffleBlockResolver.getDataFile(dep.shuffleId, mapId)
              val tmp = Utils.tempFileWith(output)
              try {
                val blockId = ShuffleBlockId(dep.shuffleId, mapId, IndexShuffleBlockResolver.NOOP_REDUCE_ID)
                val partitionLengths = sorter.writePartitionedFile(blockId, tmp)
                shuffleBlockResolver.writeIndexFileAndCommit(dep.shuffleId, mapId, partitionLengths, tmp)
                mapStatus = MapStatus(blockManager.shuffleServerId, partitionLengths)
              } finally {
                if (tmp.exists() && !tmp.delete()) {
                  logError(s"Error while deleting temp file ${tmp.getAbsolutePath}")
                }
              }
            }
          
            /** Close this writer, passing along whether the map completed */
            override def stop(success: Boolean): Option[MapStatus] = {
              try {
                if (stopping) {
                  return None
                }
                stopping = true
                if (success) {
                  return Option(mapStatus)
                } else {
                  return None
                }
              } finally {
                // Clean up our sorter, which may have its own intermediate files
                if (sorter != null) {
                  val startTime = System.nanoTime()
                  sorter.stop()
                  writeMetrics.incWriteTime(System.nanoTime - startTime)
                  sorter = null
                }
              }
            }
          }
          
          private[spark] object SortShuffleWriter {
            def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean = {
              // We cannot bypass sorting if we need to do map-side aggregation.
              if (dep.mapSideCombine) {
                // Map-side combine is required, so the bypass path cannot be used
                require(dep.aggregator.isDefined, "Map-side combine without Aggregator specified!")
                false
              } else {
                // No map-side combine: bypass is used only when the number of partitions is at most spark.shuffle.sort.bypassMergeThreshold (default 200)
                val bypassMergeThreshold: Int = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
                dep.partitioner.numPartitions <= bypassMergeThreshold
              }
            }
          }
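
          A driver-side sketch of which write path shouldBypassMergeSort and registerShuffle end up selecting, assuming the default spark.shuffle.sort.bypassMergeThreshold of 200 (app and variable names are illustrative):

            import org.apache.spark.{SparkConf, SparkContext}

            val sc = new SparkContext(new SparkConf().setAppName("bypass-demo").setMaster("local[2]"))
            val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))

            // reduceByKey declares map-side combine, so shouldBypassMergeSort returns false; since an
            // aggregator is defined, canUseSerializedShuffle is also false and SortShuffleWriter is used.
            pairs.reduceByKey(_ + _, numPartitions = 100).count()

            // groupByKey needs no map-side combine and 100 <= 200 partitions,
            // so the BypassMergeSortShuffleHandle / BypassMergeSortShuffleWriter path can be chosen.
            pairs.groupByKey(numPartitions = 100).count()

            sc.stop()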
          
      • Summary

        • Number of small disk files produced: C (number of CPU cores used by the map tasks) * R (number of reduce tasks)
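
        A worked example of the two file-count formulas (numbers chosen purely for illustration):

          // Illustrative numbers: 100 map tasks running on 10 CPU cores in total, 1000 reduce tasks.
          val mapTasks    = 100
          val cores       = 10
          val reduceTasks = 1000

          val normalHashShuffleFiles       = mapTasks * reduceTasks  // M * R = 100,000 small files
          val consolidatedHashShuffleFiles = cores * reduceTasks     // C * R =  10,000 small files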
  3. SortShuffle

    • Normal mechanism

      • Diagram

        image-20191025185950490

      • Execution flow

        • The results of each map task are written into an in-memory data structure, which is 5 MB by default.
        • During the shuffle a timer periodically estimates the size of this in-memory structure; when the data in it exceeds 5 MB (call the estimated size N MB), it requests (N * 2 - 5) MB of additional memory for the structure (see the sketch after this list).
        • If the request succeeds, no spill takes place; if it fails, the data is spilled to disk.
        • Before spilling, the data in the memory structure is sorted and partitioned.
        • The spill is written to disk in batches; one batch is 10,000 records.
        • When the map task finishes, these small spill files are merged into one large disk file, and an index file is generated at the same time.
        • When a reduce task pulls data from the map side, it first parses the index file and then fetches the corresponding data according to it.
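      • Spill-decision sketch

        A simplified sketch of the memory-request rule above, modeled loosely on the maybeSpill logic of Spark's internal Spillable class; the function below is illustrative, not the real implementation.

          // myMemoryThreshold starts at 5 MB (spark.shuffle.spill.initialMemoryThreshold).
          // currentMemory is the estimated size of the in-memory data structure, in bytes.
          def shouldSpill(currentMemory: Long,
                          myMemoryThreshold: Long,
                          tryToAcquire: Long => Long): Boolean = {
            if (currentMemory >= myMemoryThreshold) {
              // Request 2 * currentMemory - myMemoryThreshold more memory,
              // i.e. the "(N * 2 - 5) MB" rule while the threshold is still 5 MB.
              val amountToRequest = 2 * currentMemory - myMemoryThreshold
              val granted = tryToAcquire(amountToRequest)
              // Spill only if the grant still leaves the structure larger than its allowed threshold.
              currentMemory >= myMemoryThreshold + granted
            } else {
              false
            }
          }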
      • Summary

        • Number of small disk files produced: 2 * M (number of map tasks)
    • Bypass mechanism

      • Diagram

        image-20191025191221953

      • Summary

        • The bypass path is used when the number of shuffle reduce tasks is no greater than the spark.shuffle.sort.bypassMergeThreshold parameter (default 200) and the operator needs no map-side aggregation; a config sketch follows this list.
        • Number of small disk files produced: 2 * M (number of map tasks)
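
        If needed, the threshold can be adjusted when building the SparkConf. A sketch; the value 400 is only an example:

          import org.apache.spark.SparkConf

          val conf = new SparkConf()
            .setAppName("bypass-threshold-demo")                    // illustrative app name
            .set("spark.shuffle.sort.bypassMergeThreshold", "400")  // default is 200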
  4. Shuffle File Addressing

    • MapOutputTracker

      • MapOutputTracker is a module in the Spark architecture with a master-slave structure.
      • It manages the addresses of the small shuffle disk files.
      • MapOutputTrackerMaster is the master object and lives in the Driver.
      • MapOutputTrackerWorker is the slave object and lives in each Executor.
    • BlockManager

      • BlockManager, the block manager, is also a module in the Spark architecture and likewise has a master-slave structure.
      • BlockManagerMaster
      • BlockManagerWorker
      • The BlockManager on the Driver side and the BlockManager on each Executor both contain four components:
        • DiskStore
        • MemoryStore
        • ConnectionManager
        • BlockTransferService
    • Shuffle file addressing diagram

      image-20191025193053897

    • Shuffle file addressing flow

      • When a map task finishes, its execution status and the addresses of its small disk files are packed into a MapStatus object and reported to the MapOutputTrackerMaster in the Driver through the MapOutputTrackerWorker object.
      • Once all map tasks have finished, the Driver knows the addresses of all the small disk files.
      • Before a reduce task runs, the MapOutputTrackerWorker in its Executor asks the Driver-side MapOutputTrackerMaster for the addresses of the small disk files.
      • With those addresses, the ConnectionManager inside the BlockManager connects to the ConnectionManager on the node holding the data, and the data is then transferred via BlockTransferService.
      • By default BlockTransferService starts 5 tasks to pull data from a node, and by default these 5 tasks together may not fetch more than 48 MB at a time.
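
      The 48 MB limit corresponds to the spark.reducer.maxSizeInFlight setting (default "48m"); a sketch of raising it, assuming the extra executor memory and network headroom are available:

        import org.apache.spark.SparkConf

        val conf = new SparkConf()
          .setAppName("shuffle-fetch-demo")             // illustrative app name
          .set("spark.reducer.maxSizeInFlight", "96m")  // default "48m": max size of map output fetched concurrently per reduce task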
Original article: https://www.cnblogs.com/ronnieyuan/p/11741768.html