Why the Balancer is needed

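Data in HDFS is often unevenly distributed across nodes, especially after nodes are added or decommissioned, or after replica counts are changed by hand: the fullest nodes can reach 80-90% usage while the emptiest stay below 50%. In that situation we normally use the balancer tool that ships with HDFS to even out the data across nodes.
Uneven disk usage forces the compute engine to copy data across nodes frequently (the data needed by a task running on node A lives on another node), which costs unnecessary time and bandwidth. In addition, when some nodes are heavily used but not yet full (around 90%), tasks scheduled on those nodes are at risk of failing. A balancing strategy that keeps disk usage evenly distributed across the cluster is therefore essential.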

Starting the Balancer

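To start:
     start-balancer.sh    # start the balancer with the default threshold of 10%
     hdfs balancer -threshold 3    # run the balancer in the foreground with a 3% threshold
     start-balancer.sh -threshold 3    # start the balancer with a 3% threshold
To stop:
     stop-balancer.sh
To adjust speed:
    hdfs dfsadmin -setBalancerBandwidth 104857600    # 100 MB/s, given in bytes per second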
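
The value passed to -setBalancerBandwidth is a plain byte count per second, so it is easy to get the number of zeros wrong. A minimal sketch of the conversion follows; the class and method names are made up for illustration and are not part of Hadoop.

```java
// Hypothetical helper, not part of Hadoop: converts MiB/s into the
// bytes-per-second value that -setBalancerBandwidth expects.
final class BalancerBandwidth {
    static long mibPerSecondToBytes(long mibPerSecond) {
        return mibPerSecond * 1024L * 1024L;
    }

    public static void main(String[] args) {
        System.out.println(mibPerSecondToBytes(100)); // prints 104857600
    }
}
```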

Balancer parameters

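hdfs --config /hadoop-client/conf balancer
-threshold  10                    # balance criterion: allowed gap, in percent, between a datanode's disk usage and the cluster average; range 1~100
                                  # the threshold sets the target that decides whether the cluster counts as balanced
-policy datanode                  # default is datanode: balance at the datanode level
-exclude  -f  /tmp/ip1.txt        # empty by default; the listed IPs are excluded from balancing; -f: read the list from a file
-include  -f  /tmp/ip2.txt        # empty by default; only the listed IPs take part in balancing; -f: read the list from a file
-idleiterations  5                # number of consecutive idle iterations (no data moved) before the balancer exits; default 5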

Balancer design principles and steps

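When writing the Balancer, the Hadoop developers followed these principles:

1 While data is being redistributed, no data may be lost, the replica count may not change, and the number of blocks held in each rack may not change.
2 A system administrator can start or stop the redistribution with a single command.
3 Moving blocks must not consume too many resources, such as network bandwidth.
4 The redistribution must not interfere with the normal operation of the name node.
5 The rebalance program runs as an independent process, separate from the name node.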

Steps

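1 The Rebalance Server gets the state of every Data Node from the Name Node, namely each Data Node's disk usage.

2 The Rebalance Server works out which machines need to move data away and which machines can accept it, and gets the distribution of the data to be moved from the Name Node.

3 The Rebalance Server decides which machine's blocks can be moved to which other machine.

4,5,6 The machine that has to give up a block copies it to the destination machine and then deletes its own copy of the block.

7 The Rebalance Server collects the result of this round of moves and repeats the process until there is no more data to move or the HDFS cluster has reached the balance criterion.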
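
The steps above can be pictured as a loop. The following is a toy model under simplifying assumptions, not Hadoop code: it ignores racks, replicas and bandwidth, treats each node as two numbers (used bytes and capacity), and moves data only from the fullest node to the emptiest one in each iteration. All names are hypothetical.

```java
// Toy model of the iterate-until-balanced loop; not Hadoop code.
final class RebalanceLoopSketch {
    static double utilization(long[] used, long[] capacity, int i) {
        return used[i] * 100.0 / capacity[i];
    }

    /**
     * Moves bytes from the fullest node to the emptiest one until every node
     * is within thresholdPct of the cluster average or nothing more can move.
     * Expects at least one node and small numbers (no overflow handling).
     * Modifies used[] in place; returns the number of iterations that moved data.
     */
    static int rebalance(long[] used, long[] capacity,
                         double thresholdPct, long maxBytesPerIteration) {
        long totalUsed = 0, totalCapacity = 0;
        for (int i = 0; i < used.length; i++) {
            totalUsed += used[i];
            totalCapacity += capacity[i];
        }
        double average = totalUsed * 100.0 / totalCapacity;

        int iterations = 0;
        while (true) {
            // Step 1: look at the current usage of every node.
            int fullest = 0, emptiest = 0;
            for (int i = 1; i < used.length; i++) {
                if (utilization(used, capacity, i) > utilization(used, capacity, fullest)) fullest = i;
                if (utilization(used, capacity, i) < utilization(used, capacity, emptiest)) emptiest = i;
            }
            // Steps 2-3: stop once both extremes are within the threshold.
            boolean balanced =
                utilization(used, capacity, fullest) - average <= thresholdPct
                && average - utilization(used, capacity, emptiest) <= thresholdPct;
            if (balanced) break;

            // Never push either node past the cluster average.
            long surplus = used[fullest] - capacity[fullest] * totalUsed / totalCapacity;
            long room = capacity[emptiest] * totalUsed / totalCapacity - used[emptiest];
            long bytes = Math.min(maxBytesPerIteration, Math.min(surplus, room));
            if (bytes <= 0) break; // nothing left that can be moved

            // Steps 4-6: copy to the target, delete from the source.
            used[fullest] -= bytes;
            used[emptiest] += bytes;
            // Step 7: record the round and go again.
            iterations++;
        }
        return iterations;
    }
}
```

With two 1000-byte nodes holding 900 and 100 bytes, a 10% threshold and a cap of 100 bytes per iteration, the loop stops after three iterations at 600 and 400 bytes, because both nodes are then exactly 10 points from the 50% average.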

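Hadoop's existing Balancer works well in the vast majority of cases, but in some situations it cannot achieve the desired result. For example:

1, Data is stored with 3 replicas.
2, HDFS consists of 2 racks.
3, The machines in the 2 racks have different disks: each machine in the first rack has 1 TB of disk space, and each machine in the second rack has 10 TB.
4, For most of the data, 2 of the replicas are stored in the first rack.
In this situation the data in the HDFS cluster is certainly unbalanced. If we now run the Balancer, we find that the cluster is still unbalanced when it finishes: the free disk space in rack1 is far smaller than in rack2.
This is a consequence of principle 1 of the Balancer's design.
Put simply, the Balancer will not move data from one rack to another while it runs, so it can never bring this HDFS cluster into balance.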
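
To put rough numbers on the two-rack scenario, here is a small worked example. The node count per rack and the data volume are assumptions chosen only to make the arithmetic easy; the 1 TB and 10 TB disk sizes and the 2-of-3 replica placement come from the scenario above.

```java
// Worked numbers for the two-rack scenario; node count and data volume are made up.
final class RackImbalanceExample {
    static double utilizationPct(double usedTb, double capacityTb) {
        return usedTb * 100.0 / capacityTb;
    }

    public static void main(String[] args) {
        int nodesPerRack = 10;          // assumption
        double logicalDataTb = 3.0;     // assumption: data volume before replication

        double rack1CapacityTb = nodesPerRack * 1.0;   // 1 TB per machine
        double rack2CapacityTb = nodesPerRack * 10.0;  // 10 TB per machine
        double rack1UsedTb = 2 * logicalDataTb;        // two replicas live in rack1
        double rack2UsedTb = 1 * logicalDataTb;        // one replica lives in rack2

        System.out.println("rack1: " + utilizationPct(rack1UsedTb, rack1CapacityTb) + "%"); // rack1: 60.0%
        System.out.println("rack2: " + utilizationPct(rack2UsedTb, rack2CapacityTb) + "%"); // rack2: 3.0%
    }
}
```

Under these assumptions rack1 is twenty times fuller than rack2, and moving blocks only within a rack cannot change either figure.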
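There are 2 ways to handle this situation:
1 Keep using the existing Balancer but change how machines are assigned to racks, spreading the machines with small disks across different racks.
2 Modify the Balancer so that it may change the number of blocks held in each rack, reducing the number of blocks stored in the rack that is short on disk space or moving them to racks with more free space.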

Source code (Apache Hadoop 2.7.3)

Path: inside the org.apache.hadoop.hdfs.server.balancer package.
Selecting the datanodes that take part in balancing:

  private boolean shouldIgnore(DatanodeInfo dn) {
    // ignore decommissioned nodes
    final boolean decommissioned = dn.isDecommissioned();
    // ignore decommissioning nodes
    final boolean decommissioning = dn.isDecommissionInProgress();
    // ignore nodes in exclude list (the nodes given with -exclude)
    final boolean excluded = Util.isExcluded(excludedNodes, dn);
    // ignore nodes not in the include list (if include list is not empty),
    // i.e. when -include is given, skip every node that is not listed there
    final boolean notIncluded = !Util.isIncluded(includedNodes, dn);

    if (decommissioned || decommissioning || excluded || notIncluded) {
      if (LOG.isTraceEnabled()) {
        LOG.trace("Excluding datanode " + dn
            + ": decommissioned=" + decommissioned
            + ", decommissioning=" + decommissioning
            + ", excluded=" + excluded
            + ", notIncluded=" + notIncluded);
      }
      return true;
    }
    return false;
  }
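
To see how the four flags combine, here is a standalone sketch of the same decision using plain string sets. It assumes nodes are identified by an exact hostname string; Hadoop's own Util class may match nodes more leniently, and the class and parameter names below are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Standalone sketch of the ignore decision; not Hadoop code.
final class NodeFilterSketch {
    static boolean shouldIgnore(String host,
                                boolean decommissioned,
                                boolean decommissioning,
                                Set<String> excludedHosts,
                                Set<String> includedHosts) {
        boolean excluded = excludedHosts.contains(host);
        // An empty include list means "no restriction".
        boolean notIncluded = !includedHosts.isEmpty() && !includedHosts.contains(host);
        return decommissioned || decommissioning || excluded || notIncluded;
    }

    public static void main(String[] args) {
        Set<String> exclude = new HashSet<>(Arrays.asList("dn3"));
        Set<String> include = Collections.emptySet();
        System.out.println(shouldIgnore("dn1", false, false, exclude, include)); // false
        System.out.println(shouldIgnore("dn3", false, false, exclude, include)); // true
    }
}
```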

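Cluster average utilization (formula): average = totalUsedSpaces * 100 / totalCapacities
totalUsedSpaces: the sum of the used space of all datanodes (dfsUsed, excluding non-DFS used);
totalCapacities: the sum of the total space of all datanodes (the disk directories configured for the DataNode);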

  void initAvgUtilization() {
    for(StorageType t : StorageType.asList()) {
      final long capacity = totalCapacities.get(t);
      if (capacity > 0L) {
        final double avg  = totalUsedSpaces.get(t)*100.0/capacity;
        avgUtilizations.set(t, avg);
      }
    }
  }

Utilization of a single datanode: utilization = dfsUsed * 100.0 / capacity
dfsUsed: the DFS space used on this datanode (dfsUsed, excluding non-DFS used);
capacity: the total space of this datanode (the disk directories configured for the DataNode);

    Double getUtilization(DatanodeStorageReport r, final StorageType t) {
      long capacity = 0L;
      long dfsUsed = 0L;
      for(StorageReport s : r.getStorageReports()) {
        if (s.getStorage().getStorageType() == t) {
          capacity += s.getCapacity();
          dfsUsed += s.getDfsUsed();
        }
      }
      return capacity == 0L? null: dfsUsed*100.0/capacity;
    }

Difference between a datanode's utilization and the cluster average: utilizationDiff = utilization - average
Difference between a datanode's |utilizationDiff| and the threshold: thresholdDiff = |utilizationDiff| - threshold
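
A short worked example may help here. It applies the formulas above to three made-up datanodes of equal capacity with a threshold of 10; the class is a standalone illustration rather than Balancer code.

```java
// Worked example of average, utilization, utilizationDiff and thresholdDiff.
final class UtilizationDiffExample {
    static double utilization(long dfsUsed, long capacity) {
        return dfsUsed * 100.0 / capacity;
    }

    static double average(long[] dfsUsed, long[] capacity) {
        long totalUsed = 0, totalCapacity = 0;
        for (int i = 0; i < dfsUsed.length; i++) {
            totalUsed += dfsUsed[i];
            totalCapacity += capacity[i];
        }
        return totalUsed * 100.0 / totalCapacity;
    }

    static double thresholdDiff(double utilization, double average, double threshold) {
        return Math.abs(utilization - average) - threshold;
    }

    public static void main(String[] args) {
        long[] dfsUsed = {900, 500, 100};      // made-up values, e.g. in GB
        long[] capacity = {1000, 1000, 1000};
        double threshold = 10.0;
        double avg = average(dfsUsed, capacity);
        for (int i = 0; i < dfsUsed.length; i++) {
            double u = utilization(dfsUsed[i], capacity[i]);
            System.out.println("node" + i + ": utilizationDiff=" + (u - avg)
                + ", thresholdDiff=" + thresholdDiff(u, avg, threshold));
        }
    }
}
```

Here the average is 50%, so the first node is 40 points above it (thresholdDiff 30), the second sits exactly on it (thresholdDiff -10), and the third is 40 points below it (thresholdDiff 30).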

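Space that needs to move out of, or can move into, a datanode: maxSize2Move = |utilizationDiff| / 100 * capacity (utilizationDiff is a percentage)

Space a target can take in: Math.min(remaining, maxSize2Move), which is then also capped by max
Space a source needs to move out: Math.min(max, maxSize2Move)
remaining: the remaining space on the datanode
max: the limit on how much a single datanode may move in one balance iteration (10 GB by default)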
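The 10 GB cap applies per datanode per iteration. The default of 5 for -idleiterations counts consecutive iterations in which no data was moved before the balancer gives up; it is not a limit on the total number of iterations, so one run of the balance script is not limited to 5 * 10 GB = 50 GB per datanode.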
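
The sizing rules above can be written as one small function. This is a sketch modeled on the description in this section, not a verbatim copy of the Hadoop method; the example values use GB as the unit only to keep the numbers readable.

```java
// Sketch of the per-iteration size limit for one datanode; not verbatim Hadoop code.
final class MaxSizeSketch {
    /**
     * @param capacity        total space of the datanode
     * @param remaining       free space of the datanode (only matters for targets)
     * @param utilizationDiff utilization - average, in percent; negative for targets
     * @param max             per-iteration cap for a single datanode
     */
    static long computeMaxSize2Move(long capacity, long remaining,
                                    double utilizationDiff, long max) {
        // Convert the percentage gap into an amount of space.
        long size = (long) (Math.abs(utilizationDiff) * 0.01 * capacity);
        if (utilizationDiff < 0) {
            // A target cannot accept more than it has free.
            size = Math.min(remaining, size);
        }
        return Math.min(max, size);
    }

    public static void main(String[] args) {
        // Source that is 40 points above average on 4000 GB, cap 10 GB: limited by the cap.
        System.out.println(computeMaxSize2Move(4000, 400, 40.0, 10));  // prints 10
        // Target that is 40 points below average but has only 5 GB free.
        System.out.println(computeMaxSize2Move(4000, 5, -40.0, 10));   // prints 5
    }
}
```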

    for(DatanodeStorageReport r : reports) {
      final DDatanode dn = dispatcher.newDatanode(r.getDatanodeInfo());
      final boolean isSource = Util.isIncluded(sourceNodes, dn.getDatanodeInfo());
      for(StorageType t : StorageType.getMovableTypes()) {
        final Double utilization = policy.getUtilization(r, t);
        if (utilization == null) { // datanode does not have such storage type 
          continue;
        }
        
        final double average = policy.getAvgUtilization(t);
        if (utilization >= average && !isSource) {
          LOG.info(dn + "[" + t + "] has utilization=" + utilization
              + " >= average=" + average
              + " but it is not specified as a source; skipping it.");
          continue;
        }

        final double utilizationDiff = utilization - average;
        final long capacity = getCapacity(r, t);
        final double thresholdDiff = Math.abs(utilizationDiff) - threshold;
        final long maxSize2Move = computeMaxSize2Move(capacity,
            getRemaining(r, t), utilizationDiff, maxSizeToMove);

        final StorageGroup g;
        if (utilizationDiff > 0) {
          final Source s = dn.addSource(t, maxSize2Move, dispatcher);
          if (thresholdDiff <= 0) { // within threshold
            aboveAvgUtilized.add(s);
          } else {
            overLoadedBytes += percentage2bytes(thresholdDiff, capacity);
            overUtilized.add(s);
          }
          g = s;
        } else {
          g = dn.addTarget(t, maxSize2Move);
          if (thresholdDiff <= 0) { // within threshold
            belowAvgUtilized.add(g);
          } else {
            underLoadedBytes += percentage2bytes(thresholdDiff, capacity);
            underUtilized.add(g);
          }
        }
        dispatcher.getStorageGroupMap().put(g);
      }
    }

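Queues that hold the datanodes after the comparison:

overUtilized: utilizationDiff > 0 && thresholdDiff > 0         <utilization above average, gap greater than the threshold>
aboveAvgUtilized: utilizationDiff > 0 && thresholdDiff <= 0    <utilization above average, gap less than or equal to the threshold>
belowAvgUtilized: utilizationDiff <= 0 && thresholdDiff <= 0   <utilization at or below average, gap less than or equal to the threshold>
underUtilized: utilizationDiff <= 0 && thresholdDiff > 0       <utilization below average, gap greater than the threshold>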
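
The four conditions can be checked with a small classifier. The enum and method names below are hypothetical; the branching mirrors the if/else structure of the Balancer excerpt above, where a node exactly at the average falls into the else branch and becomes a target.

```java
// Standalone classifier for the four queues; names are illustrative only.
final class QueueClassifier {
    enum Queue { OVER_UTILIZED, ABOVE_AVG_UTILIZED, BELOW_AVG_UTILIZED, UNDER_UTILIZED }

    static Queue classify(double utilization, double average, double threshold) {
        double utilizationDiff = utilization - average;
        double thresholdDiff = Math.abs(utilizationDiff) - threshold;
        if (utilizationDiff > 0) {
            // Above average: a source.
            return thresholdDiff <= 0 ? Queue.ABOVE_AVG_UTILIZED : Queue.OVER_UTILIZED;
        }
        // At or below average: a target.
        return thresholdDiff <= 0 ? Queue.BELOW_AVG_UTILIZED : Queue.UNDER_UTILIZED;
    }

    public static void main(String[] args) {
        System.out.println(classify(90.0, 50.0, 10.0)); // OVER_UTILIZED
        System.out.println(classify(55.0, 50.0, 10.0)); // ABOVE_AVG_UTILIZED
        System.out.println(classify(45.0, 50.0, 10.0)); // BELOW_AVG_UTILIZED
        System.out.println(classify(10.0, 50.0, 10.0)); // UNDER_UTILIZED
    }
}
```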

Pairing sources with targets (rules: 1. prefer the same rack, then other racks; 2. one-to-many pairing):

Step 1 [Source -> Target]: each overUtilized datanode => one or more underUtilized datanodes
Step 2 [Source -> Target]: each remaining overUtilized datanode => one or more belowAvgUtilized datanodes
Step 3 [Target -> Source]: each remaining underUtilized datanode (not fully matched with overUtilized nodes in step 1) => one or more aboveAvgUtilized datanodes

/** Decide all <source, target> pairs according to the matcher. */
  private void chooseStorageGroups(final Matcher matcher) {
    /* first step: match each overUtilized datanode (source) to
     * one or more underUtilized datanodes (targets).
     */
    LOG.info("chooseStorageGroups for " + matcher + ": overUtilized => underUtilized");
    chooseStorageGroups(overUtilized, underUtilized, matcher);
    
    /* match each remaining overutilized datanode (source) to 
     * below average utilized datanodes (targets).
     * Note only overutilized datanodes that haven't had that max bytes to move
     * satisfied in step 1 are selected
     */
    LOG.info("chooseStorageGroups for " + matcher + ": overUtilized => belowAvgUtilized");
    chooseStorageGroups(overUtilized, belowAvgUtilized, matcher);

    /* match each remaining underutilized datanode (target) to 
     * above average utilized datanodes (source).
     * Note only underutilized datanodes that have not had that max bytes to
     * move satisfied in step 1 are selected.
     */
    LOG.info("chooseStorageGroups for " + matcher + ": underUtilized => aboveAvgUtilized");
    chooseStorageGroups(underUtilized, aboveAvgUtilized, matcher);
  }

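When each <source, target> pair is built, the amount of space that can currently be moved out of the source or into the target has to be computed.
The dispatcher creates the dispatchExecutor thread pool to schedule and run the data moves.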

  private void matchSourceWithTargetToMove(Source source, StorageGroup target) {
    long size = Math.min(source.availableSizeToMove(), target.availableSizeToMove());
    final Task task = new Task(target, size);
    source.addTask(task);
    target.incScheduledSize(task.getSize());
    dispatcher.add(source, target);
    LOG.info("Decided to move "+StringUtils.byteDesc(size)+" bytes from "
        + source.getDisplayName() + " to " + target.getDisplayName());
  }

[Conclusion]

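For large HDFS clusters (where servers may be added or decommissioned at any time), the balance script needs to run as a long-lived background process;
Following the official recommendation, the script should be deployed on a relatively idle server;
The script is stopped by killing its process (it is better not to kill it: a background run stops by itself when it finishes, and if the script is started several times only one balancer instance runs at a time while the others fail automatically);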

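Datanode storage maintenance can be tuned in the following directions:
* Use the idleiterations parameter to let the balancer keep iterating for longer, so that more data can be moved off each datanode;
* Use the exclude and include parameters to choose sensibly which servers take part in balancing, for example only the 20% of servers with the lowest utilization and the 20% with the highest;
* Use the threshold parameter to set a sensible threshold;
<Note: ideally a program would discover and adjust these parameters automatically, with no human intervention>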
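
As one possible starting point for that kind of automation, the sketch below picks the hosts for an -include file. It reads "lowest 20% and highest 20%" as fractions of the node count, which is an assumption, and it expects the per-host utilization figures to be collected elsewhere; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Picks the least and most utilized hosts for an include list; illustrative only.
final class IncludeListSketch {
    static Set<String> pickExtremes(Map<String, Double> utilizationByHost, double fraction) {
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(utilizationByHost.entrySet());
        sorted.sort(Map.Entry.comparingByValue());   // ascending utilization
        int n = Math.max(1, (int) Math.round(sorted.size() * fraction));
        Set<String> picked = new LinkedHashSet<>();  // avoids duplicates on tiny clusters
        for (int i = 0; i < sorted.size(); i++) {
            if (i < n || i >= sorted.size() - n) {
                picked.add(sorted.get(i).getKey());
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        Map<String, Double> utilization = new LinkedHashMap<>();
        for (int i = 1; i <= 10; i++) {
            utilization.put("dn" + i, i * 10.0 - 5.0);   // made-up values: 5%, 15%, ..., 95%
        }
        System.out.println(pickExtremes(utilization, 0.2)); // [dn1, dn2, dn9, dn10]
    }
}
```

The resulting host names would then be written one per line to the file passed with -include -f.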

Link: https://www.jianshu.com/p/f7c1cd476601