HDFS源码分析(三)-----数据块关系基本结构

前言

正如我在前面的文章中曾经写过，在HDFS中存在着两大关系模块，一个是文件与block数据块的关系，简称为第一关系，但是相比于第一个关系清晰的结构关系，HDFS的第二关系就没有这么简单了，第二关系自然是与数据节点相关，就是数据块与数据节点的映射关系，里面的有些过程的确是错综复杂的，这个也很好理解嘛，本身block块就很多，而且还有副本设置，然后一旦集群规模扩大，数据节点的数量也将会变大，如何处理此时的数据块与对应数据节点的映射就必然不是简单的事情了，所以这里有一点是比较特别的，随着系统的运行，数据节点与他所包含的块列表时动态建立起来的，相应的名字节点也需要通过心跳来获取信息，并不断更新元数据信息。同样在第二关系中,有诸多巧妙的设计,望读者在下面的阅读中细细体会

BlocksMap

映射关系图类,首先这一定是类似于map这种数据类型的,所以肯定有类似于put(key,value),get(key)这样的操作,的确是这样,但是他并没有直接沿用hashMap这样的现成的类,而是自己实现了一个轻量级的集合类,什么叫做轻量级呢,与我们平常见的又有区别呢.先看BlocksMap的主要变量定义

/**
 * This class maintains the map from a block to its metadata.
 * block's metadata currently includes INode it belongs to and
 * the datanodes that store the block.
 */
class BlocksMap {
/**
   * Internal class for block metadata.
   * blockMap内部类保存block信息元数据
   */
  static class BlockInfo extends Block implements LightWeightGSet.LinkedElement {
...
}
/** Constant {@link LightWeightGSet} capacity. */
  private final int capacity;
  
  private GSet<Block, BlockInfo> blocks;

  BlocksMap(int initialCapacity, float loadFactor) {
    this.capacity = computeCapacity();
    //用轻量级的GSet实现block与blockInfo的映射存储
    this.blocks = new LightWeightGSet<Block, BlockInfo>(capacity);
  }
....

在构造函数的上面就是这个图类,新定义的集合类型,GSet,看里面的变量关系,就很明了,从Block到BlockInfo的映射,其中注意到后者是前者的子类,所以在这里,HDFS构造了一个父类到子类的映射.于是我们要跑到GSet类下看看具体的定义声明

/**
 * A {@link GSet} is set,
 * which supports the {@link #get(Object)} operation.
 * The {@link #get(Object)} operation uses a key to lookup an element.
 * 
 * Null element is not supported.
 * 
 * @param <K> The type of the keys.
 * @param <E> The type of the elements, which must be a subclass of the keys.
 * 定义了特殊集合类GSet，能够通过key来寻找目标对象，也可以移除对象，对象类同时也是key的子类
 */
public interface GSet<K, E extends K> extends Iterable<E> {
  /**
   * @return The size of this set.
   */
  int size();
  boolean contains(K key);
  E get(K key);
  E put(E element);
  E remove(K key);
}

答案的确是这样,定义的方法也是类似于map类的方法.OK,GSet的定义明白,但是这还不够,因为在构造函数,HDFS用的是他的子类LightWeightGSet,在此类的介绍文字中说,这是轻量级的集合类,可以低开销的实现元素的查找

/**
 * A low memory footprint {@link GSet} implementation,
 * which uses an array for storing the elements
 * and linked lists for collision resolution.
 *
 * No rehash will be performed.
 * Therefore, the internal array will never be resized.
 *
 * This class does not support null element.
 *
 * This class is not thread safe.
 *
 * @param <K> Key type for looking up the elements
 * @param <E> Element type, which must be
 *       (1) a subclass of K, and
 *       (2) implementing {@link LinkedElement} interface.
 * 轻量级的集合类，低内存的实现用于寻找对象类，不允许存储null类型，存储的类型必须为key的子类
 * 在里面用了哈希算法做了映射，没有进行哈希重映射，因此大小不会调整
 */
public class LightWeightGSet<K, E extends K> implements GSet<K, E> {

看起来非常神秘哦,接下来看变量定义

public class LightWeightGSet<K, E extends K> implements GSet<K, E> {
  /**
   * Elements of {@link LightWeightGSet}.
   */
  public static interface LinkedElement {
    /** Set the next element. */
    public void setNext(LinkedElement next);

    /** Get the next element. */
    public LinkedElement getNext();
  }

  public static final Log LOG = LogFactory.getLog(GSet.class);
  //最大数组长度2的30次方
  static final int MAX_ARRAY_LENGTH = 1 << 30; //prevent int overflow problem
  static final int MIN_ARRAY_LENGTH = 1;

  /**
   * An internal array of entries, which are the rows of the hash table.
   * The size must be a power of two.
   * 存储对象数组，也就是存储哈希表，必须为2的幂次方
   */
  private final LinkedElement[] entries;
  /** A mask for computing the array index from the hash value of an element. */
  //哈希计算掩码
  private final int hash_mask;
  /** The size of the set (not the entry array). */
  private int size = 0;
  /** Modification version for fail-fast.
   * @see ConcurrentModificationException
   */
  private volatile int modification = 0;

在最上方的LinkedElement接口中有next相关的方法,可以预见到,这是一个链表结构的关系,entries数组就是存储这个链表的,链表的最大长度非常大,达到2的30次方,不过这里还有一个比较奇怪的变量hash_mask哈希掩码,这是干嘛的呢,猜测是在做取哈希索引值的时候用的.然后看下此集合的构造函数.

/**
   * @param recommended_length Recommended size of the internal array.
   */
  public LightWeightGSet(final int recommended_length) {
  	//传入初始长度，但是需要经过适配判断
    final int actual = actualArrayLength(recommended_length);
    LOG.info("recommended=" + recommended_length + ", actual=" + actual);

    entries = new LinkedElement[actual];
    //把数组长度减1作为掩码
    hash_mask = entries.length - 1;
  }

在构造函数中传入的数组长度,不过需要经过方法验证,要在给定的合理范围之内,在这里,掩码值被设置成了长度值小1,这个得看了后面才能知道.在这个类Map的方法中,一定要看的方法有2个,put和get方法

 @Override
  public E get(final K key) {
    //validate key
    if (key == null) {
      throw new NullPointerException("key == null");
    }

    //find element
    final int index = getIndex(key);
    for(LinkedElement e = entries[index]; e != null; e = e.getNext()) {
      if (e.equals(key)) {
        return convert(e);
      }
    }
    //element not found
    return null;
  }

做法是先找到第一个索引值,一般元素如果没有被remove,第一个找到的元素就是目标值,这里重点看下哈希索引值的计算,这时候如果是我做的话,一般都是哈希取余数,就是hashCode()%N,在LightWeightGSet中给出略微不同的实现

//构建索引的方法，取哈希值再与掩码进行计算，这里采用的并不是哈希取余的算法
  //与掩码计算的目的就是截取掉高位多余的数字部分使索引值落在数组存储长度范围之内
  private int getIndex(final K key) {
    return key.hashCode() & hash_mask;
  }

简单理解为就是把哈希值排成全部由0和1组成的二进制数,然后直接数组长度段内低位的数字,因为高位的都被掩码计算成0舍掉了,用于位运算的好处是避免了十进制数字与二进制数字直接的来回转换.这个方法与哈希取余数的方法具体有什么样的效果上的差别这个我到时没有仔细比较过,不过在日后可以作为一种新的哈希链表算法的选择.put方法就是标准的链表操作方法

/**
   * Remove the element corresponding to the key,
   * given key.hashCode() == index.
   *
   * @return If such element exists, return it.
   *         Otherwise, return null.
   */
  private E remove(final int index, final K key) {
    if (entries[index] == null) {
      return null;
    } else if (entries[index].equals(key)) {
      //remove the head of the linked list
      modification++;
      size--;
      final LinkedElement e = entries[index];
      entries[index] = e.getNext();
      e.setNext(null);
      return convert(e);
    } else {
      //head != null and key is not equal to head
      //search the element
      LinkedElement prev = entries[index];
      for(LinkedElement curr = prev.getNext(); curr != null; ) {
      	 //下面是标准的链表操作
        if (curr.equals(key)) {
          //found the element, remove it
          modification++;
          size--;
          prev.setNext(curr.getNext());
          curr.setNext(null);
          return convert(curr);
        } else {
          prev = curr;
          curr = curr.getNext();
        }
      }
      //element not found
      return null;
    }
  }

在这里不做过多介绍.再回过头来看BlockInfo内部，有一个Object对象数组

static class BlockInfo extends Block implements LightWeightGSet.LinkedElement {
    //数据块信息所属的INode文件节点
    private INodeFile          inode;

    /** For implementing {@link LightWeightGSet.LinkedElement} interface */
    private LightWeightGSet.LinkedElement nextLinkedElement;

    /**
     * This array contains triplets of references.
     * For each i-th data-node the block belongs to
     * triplets[3*i] is the reference to the DatanodeDescriptor
     * and triplets[3*i+1] and triplets[3*i+2] are references 
     * to the previous and the next blocks, respectively, in the 
     * list of blocks belonging to this data-node.
     * triplets对象数组保存同一数据节点上的连续的block块。triplets[3*i]保存的是当前的数据节点
     * triplets[3*i+1]和triplets[3*i+2]保存的则是一前一后的block信息
     */
    private Object[] triplets;

在这个对象数组中存储入3个对象，第一个是数据块所在数据节点对象，第二个对象保存的是数据块的前一数据块，第三对象是保存数据块的后一数据块，形成了一个双向链表的结构，也就是说，顺着这个数据块遍历，你可以遍历完某个数据节点的所有的数据块，triplets的数组长度受block数据块的副本数限制，因为不同的副本一般位于不同的数据节点。

BlockInfo(Block blk, int replication) {
      super(blk);
      //因为还需要同时保存前后的数据块信息，所以这里会乘以3
      this.triplets = new Object[3*replication];
      this.inode = null;
    }

因为存储对象的不同，所以这里用了Object作为对象数组，而不是用具体的类名。里面的许多操作也是完全类似于链表的操作

/**
     * Remove data-node from the block.
     * 移除数据节点操作，把目标数据块移除，移除位置用最后一个块的信息替代，然后移除末尾块
     */
    boolean removeNode(DatanodeDescriptor node) {
      int dnIndex = findDatanode(node);
      if(dnIndex < 0) // the node is not found
        return false;
      assert getPrevious(dnIndex) == null && getNext(dnIndex) == null : 
        "Block is still in the list and must be removed first.";
      // find the last not null node
      int lastNode = numNodes()-1; 
      // replace current node triplet by the lastNode one 
      //用末尾块替代当前块
      setDatanode(dnIndex, getDatanode(lastNode));
      setNext(dnIndex, getNext(lastNode)); 
      setPrevious(dnIndex, getPrevious(lastNode)); 
      //设置末尾块为空
      // set the last triplet to null
      setDatanode(lastNode, null);
      setNext(lastNode, null); 
      setPrevious(lastNode, null); 
      return true;
    }

DatanodeDescriptor

下面来看另外一个关键类DatanodeDescriptor,可以说是对数据节点的抽象，首先在此类中定义了数据块到数据节点的映射类

/**************************************************
 * DatanodeDescriptor tracks stats on a given DataNode,
 * such as available storage capacity, last update time, etc.,
 * and maintains a set of blocks stored on the datanode. 
 *
 * This data structure is a data structure that is internal
 * to the namenode. It is *not* sent over-the-wire to the Client
 * or the Datnodes. Neither is it stored persistently in the
 * fsImage.
 DatanodeDescriptor数据节点描述类跟踪描述了一个数据节点的状态信息
 **************************************************/
public class DatanodeDescriptor extends DatanodeInfo {
  
  // Stores status of decommissioning.
  // If node is not decommissioning, do not use this object for anything.
  //下面这个对象只与decomission撤销工作相关
  DecommissioningStatus decommissioningStatus = new DecommissioningStatus();

  /** Block and targets pair */
  //数据块以及目标数据节点列表映射类
  public static class BlockTargetPair {
  	//目标数据块
    public final Block block;
    //该数据块的目标数据节点
    public final DatanodeDescriptor[] targets;    

    BlockTargetPair(Block block, DatanodeDescriptor[] targets) {
      this.block = block;
      this.targets = targets;
    }
  }

然后定义了这样的队列类

/** A BlockTargetPair queue. */
  //block块目标数据节点类队列
  private static class BlockQueue {
  	//此类维护了BlockTargetPair列表对象
    private final Queue<BlockTargetPair> blockq = new LinkedList<BlockTargetPair>();

然后下面又声明了一系列的与数据块相关的列表

  /** A queue of blocks to be replicated by this datanode */
  //此数据节点上待复制的block块列表
  private BlockQueue replicateBlocks = new BlockQueue();
  /** A queue of blocks to be recovered by this datanode */
  //此数据节点上待租约恢复的块列表
  private BlockQueue recoverBlocks = new BlockQueue();
  /** A set of blocks to be invalidated by this datanode */
  //此数据节点上无效待删除的块列表
  private Set<Block> invalidateBlocks = new TreeSet<Block>();

  /* Variables for maintaning number of blocks scheduled to be written to
   * this datanode. This count is approximate and might be slightly higger
   * in case of errors (e.g. datanode does not report if an error occurs 
   * while writing the block).
   */
  //写入这个数据节点的块的数目统计变量
  private int currApproxBlocksScheduled = 0;
  private int prevApproxBlocksScheduled = 0;
  private long lastBlocksScheduledRollTime = 0;
  private static final int BLOCKS_SCHEDULED_ROLL_INTERVAL = 600*1000; //10min

下面看一个典型的添加数据块的方法，在这里添加的步骤其实应该拆分成2个步骤来看，第一个步骤是将当前数据节点添加到此block数据块管理的数据节点列表中，第二是反向关系，将数据块加入到数据节点的块列表关系中，代码如下

 /**
   * Add data-node to the block.
   * Add block to the head of the list of blocks belonging to the data-node.
   * 将数据节点加入到block块对应的数据节点列表中
   */
  boolean addBlock(BlockInfo b) {
  	//添加新的数据节点
    if(!b.addNode(this))
      return false;
    // add to the head of the data-node list
    //将此数据块添加到数据节点管理的数据块列表中，并于当前数据块时相邻位置
    blockList = b.listInsert(blockList, this);
    return true;
  }

数据块操作例子-副本

在这里举一个与数据块相关的样本操作过程，副本请求过程。副本相关的很多操作的协调处理都是在FSNamesystem这个大类中的，因此在这个变量中就定义了很多的相关变量

public class FSNamesystem implements FSConstants, FSNamesystemMBean,
    NameNodeMXBean, MetricsSource {
  ...
  
  //等待复制的数据块副本数
  volatile long pendingReplicationBlocksCount = 0L;
  //已损坏的副本数数目
  volatile long corruptReplicaBlocksCount = 0L;
  //正在被复制的块副本数
  volatile long underReplicatedBlocksCount = 0L;
  //正在被调度的副本数
  volatile long scheduledReplicationBlocksCount = 0L;
  //多余数据块副本数
  volatile long excessBlocksCount = 0L;
  //等待被删除的副本数
  volatile long pendingDeletionBlocksCount = 0L;

各种状态的副本状态计数，以及下面的相应的对象设置

//
  // Keeps a Collection for every named machine containing
  // blocks that have recently been invalidated and are thought to live
  // on the machine in question.
  // Mapping: StorageID -> ArrayList<Block>
  //
  //最近无效的块列表,数据节点到块的映射
  private Map<String, Collection<Block>> recentInvalidateSets = 
    new TreeMap<String, Collection<Block>>();

  //
  // Keeps a TreeSet for every named node.  Each treeset contains
  // a list of the blocks that are "extra" at that location.  We'll
  // eventually remove these extras.
  // Mapping: StorageID -> TreeSet<Block>
  //
  // 过剩的数据副本块图,也是数据节点到数据块的映射
  Map<String, Collection<Block>> excessReplicateMap = 
    new TreeMap<String, Collection<Block>>();


/**
 * Store set of Blocks that need to be replicated 1 or more times.
 * Set of: Block
 */
  private UnderReplicatedBlocks neededReplications = new UnderReplicatedBlocks();
  // We also store pending replication-orders.
  //待复制的block
  private PendingReplicationBlocks pendingReplications;

在副本状态转移相关的类，主要关注2个，一个是UnderReplicatedBlocks，另外一个是PendingReplicaionBlocks，前者是准备生产副本复制请求，而后者就是待复制请求类，在UnderReplicatedBlocks里面，做的一个很关键的操作是确定副本复制请求的优先级，他会根据剩余副本数量以及是否是在decomission等状态的情况下，然后给出优先级，所以在变量中首先会有优先级队列设置

/* Class for keeping track of under replication blocks
 * Blocks have replication priority, with priority 0 indicating the highest
 * Blocks have only one replicas has the highest
 * 此类对正在操作的副本块进行了跟踪
 * 并对此设定了不同的副本优先级队列,只有1个队列拥有最高优先级
 */
class UnderReplicatedBlocks implements Iterable<BlockInfo> {
  //定义了4种level级别的队列
  static final int LEVEL = 4;
  //损坏副本数队列在最后一个队列中
  static public final int QUEUE_WITH_CORRUPT_BLOCKS = LEVEL-1;
  //定了副本优先级队列
  private List<LightWeightLinkedSet<BlockInfo>> priorityQueues
      = new ArrayList<LightWeightLinkedSet<BlockInfo>>();
  
  private final RaidMissingBlocks raidQueue;

在这里，大致上分出了2大类队列，一个是损坏副本队列，还有一个就是正常的情况，不过是优先级不同而已。level数字越小，代表优先级越高，优先级确定的核心函数

/* Return the priority of a block
   * 
   * If this is a Raided block and still has 1 replica left, not assign the highest priority.
   * 
   * @param block a under replication block
   * @param curReplicas current number of replicas of the block
   * @param expectedReplicas expected number of replicas of the block
   */
  private int getPriority(BlockInfo block, 
                          int curReplicas, 
                          int decommissionedReplicas,
                          int expectedReplicas) {
    //副本数为负数或是副本数超过预期值,都属于异常情况,归结为损坏队列中
    if (curReplicas<0 || curReplicas>=expectedReplicas) {
      return LEVEL; // no need to replicate
    } else if(curReplicas==0) {
      // If there are zero non-decommissioned replica but there are
      // some decommissioned replicas, then assign them highest priority
      //如果数据节点正位于撤销操作中,以及最高优先级
      if (decommissionedReplicas > 0) {
        return 0;
      }
      return QUEUE_WITH_CORRUPT_BLOCKS; // keep these blocks in needed replication.
    } else if(curReplicas==1) {
      //副本数只有1个的时候同样给予高优先级
      return isRaidedBlock(block) ? 1 : 0; // highest priority
    } else if(curReplicas*3<expectedReplicas) {
      return 1;
    } else {
      //其他情况给及普通优先级
      return 2;
    }
  }

这些副本请求产生之后，就会加入到PendingReplicationBlocks的类中，在里面有相应的变量会管理这些请求信息

/***************************************************
 * PendingReplicationBlocks does the bookkeeping of all
 * blocks that are getting replicated.
 *
 * It does the following:
 * 1)  record blocks that are getting replicated at this instant.
 * 2)  a coarse grain timer to track age of replication request
 * 3)  a thread that periodically identifies replication-requests
 *     that never made it.
 *
 ***************************************************/
//此类记录了待复制请求信息
class PendingReplicationBlocks {
  //block块到副本数据块复制请求信息的映射
  private Map<Block, PendingBlockInfo> pendingReplications;
  //记录超时复制请求列表
  private ArrayList<Block> timedOutItems;
  Daemon timerThread = null;
  private volatile boolean fsRunning = true;

就是上面的PendingBlockInfo，声明了请求产生的时间和当前此数据块的请求数

  /**
   * An object that contains information about a block that 
   * is being replicated. It records the timestamp when the 
   * system started replicating the most recent copy of this
   * block. It also records the number of replication
   * requests that are in progress.
   * 内部类包含了复制请求信息，记录复制请求的开始时间，以及当前对此块的副本请求数
   */
  static class PendingBlockInfo {
  	//请求产生时间
    private long timeStamp;
    //当前的进行复制请求数
    private int numReplicasInProgress;

产生时间用于进行超时检测，请求数会与预期副本数进行对比，在这个类中会对永远不结束的复制请求进行超时检测，默认时间5~10分钟

//
  // It might take anywhere between 5 to 10 minutes before
  // a request is timed out.
  //
  //默认复制请求超时检测5到10分钟
  private long timeout = 5 * 60 * 1000;
  private long defaultRecheckInterval = 5 * 60 * 1000;

下面是监控方法

/*
   * A periodic thread that scans for blocks that never finished
   * their replication request.
   * 一个周期性的线程监控不会结束的副本请求
   */
  class PendingReplicationMonitor implements Runnable {
    public void run() {
      while (fsRunning) {
        long period = Math.min(defaultRecheckInterval, timeout);
        try {
          //调用下面的函数进行检查
          pendingReplicationCheck();
          Thread.sleep(period);
        } catch (InterruptedException ie) {
          FSNamesystem.LOG.debug(
                "PendingReplicationMonitor thread received exception. " + ie);
        }
      }
    }

    /**
     * Iterate through all items and detect timed-out items
     */
    void pendingReplicationCheck() {
      synchronized (pendingReplications) {
        Iterator iter = pendingReplications.entrySet().iterator();
        long now = FSNamesystem.now();
        FSNamesystem.LOG.debug("PendingReplicationMonitor checking Q");
        while (iter.hasNext()) {
          Map.Entry entry = (Map.Entry) iter.next();
          PendingBlockInfo pendingBlock = (PendingBlockInfo) entry.getValue();
          //取出请求开始时间比较时间是否超时
          if (now > pendingBlock.getTimeStamp() + timeout) {
            Block block = (Block) entry.getKey();
            synchronized (timedOutItems) {
              //加入超时队列中
              timedOutItems.add(block);
            }
            FSNamesystem.LOG.warn(
                "PendingReplicationMonitor timed out block " + block);
            iter.remove();
          }
        }
      }
    }
  }

超时的请求会被加入到timeOutItems，这些请求一般在最好又会被插入到UnderReplicatedBlocks中。OK,简单的阐述了一下副本数据块关系流程分析。

全部代码的分析请点击链接https://github.com/linyiqun/hadoop-hdfs,后续将会继续更新HDFS其他方面的代码分析。

参考文献

《Hadoop技术内部–HDFS结构设计与实现原理》.蔡斌等