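From Container Memory Limits to a CPU Usage Limit Scheme

Preface

While operating our department's Hadoop cluster recently, I noticed many jobs running into OOM. Checking on the machines with command-line tools showed that full GC was severe. As we all know, a full GC is costly because it is "stop the world": if a full GC takes several minutes, everything else in that JVM is paused for just as long. So once abnormal full GCs show up, you have to find the root cause and fix it. This article describes how we solved this kind of problem and what further optimization we built on top of it.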

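Where Full GC Comes From

When OOM occurs and leads to frequent full GCs, the first question is what is actually causing them. The first suspect is naturally the jobs running on the cluster. A job's runtime behavior comes from its tasks, each task shows up as one or more TaskAttempts, and each TaskAttempt runs inside the container allocated to it. Every container is a separate process: run jps on a datanode and you will see many processes named "YarnChild", which are the ones started by containers. Having found the source, the natural thought is that the JVM started by the container was given too little memory, so simply raising it should solve the problem. In fact it is not that simple; there is more depth here.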

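Why Full GC Happens

1. The first reason for full GC is the one everyone usually cites: the memory is configured too small, meaning the following two configuration items: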

public static final String MAP_MEMORY_MB = "mapreduce.map.memory.mb";

public static final String REDUCE_MEMORY_MB = "mapreduce.reduce.memory.mb";

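Both default to 1024 MB, that is, 1 GB.

2. The other reason probably occurs to fewer people, unless you have actually run into it. In one sentence: misconfiguration. The settings above do not configure the JVM that the container starts; they are the upper limit on the memory each task may use. There is a precondition, though: your JVM must actually be able to use that much memory. If the JVM's maximum heap is only 512 MB, setting the task's memory higher does no good, and the direct consequence is that full GC kicks in as soon as the heap runs out. The two values above are closer to theoretical values. What really controls the JVM are these two settings: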

public static final String MAP_JAVA_OPTS = "mapreduce.map.java.opts";

public static final String REDUCE_JAVA_OPTS = "mapreduce.reduce.java.opts";

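So ideally the heap size given in java.opts (-Xmx) should be sized to match memory.mb: large enough that the task can really use its allocation, yet somewhat below memory.mb, because the container limit also has to cover the JVM's non-heap memory. Setting -Xmx at or above memory.mb risks the container being killed by the memory check described in the next section. A mismatch between these two settings is therefore another source of frequent full GC.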
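
As a rough illustration of that sizing rule, the sketch below derives an -Xmx value from a container size. The 0.8 ratio is an assumption (a commonly used rule of thumb, not a value taken from our cluster), and the class and method names are hypothetical helpers, not Hadoop APIs.

```java
/** Illustrative helper for sizing the task JVM heap; not part of Hadoop. */
final class HeapSizing {
    private HeapSizing() {}

    /** Suggests a heap size in MB as a fraction of the container size (memory.mb). */
    static int suggestedHeapMb(int containerMb, double heapRatio) {
        if (containerMb <= 0 || heapRatio <= 0.0 || heapRatio > 1.0) {
            throw new IllegalArgumentException("invalid container size or ratio");
        }
        // Truncate toward zero so the heap never exceeds the intended fraction.
        return (int) (containerMb * heapRatio);
    }

    /** Builds the kind of string one might put in mapreduce.map.java.opts. */
    static String javaOpts(int containerMb, double heapRatio) {
        return "-Xmx" + suggestedHeapMb(containerMb, heapRatio) + "m";
    }

    public static void main(String[] args) {
        // Assumed ratio of 0.8: leave about 20% of the container for non-heap memory.
        System.out.println(javaOpts(1024, 0.8));
        System.out.println(javaOpts(4096, 0.8));
    }
}
```

The point of the sketch is only the direction of the relationship: the heap is derived from the container size, not the other way round.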

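Container Memory Monitoring

Fortunately, for the second problem listed above, Hadoop already does container-level monitoring. For the containers that have been started, it runs an extra thread, the container monitor, dedicated to watching their pmem (physical memory) and vmem (virtual memory). The relevant configuration items are: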

String org.apache.hadoop.yarn.conf.YarnConfiguration.NM_PMEM_CHECK_ENABLED = "yarn.nodemanager.pmem-check-enabled"
String org.apache.hadoop.yarn.conf.YarnConfiguration.NM_VMEM_CHECK_ENABLED = "yarn.nodemanager.vmem-check-enabled"

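Both are enabled by default. Memory monitoring means that once the memory used by a container exceeds the limit allotted to that container, the container is killed. Below is a brief source-level walk-through; it is not hard. Start with the class ContainersMonitorImpl.java.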

@Override
  protected void serviceInit(Configuration conf) throws Exception {
    this.monitoringInterval =
        conf.getLong(YarnConfiguration.NM_CONTAINER_MON_INTERVAL_MS,
            YarnConfiguration.DEFAULT_NM_CONTAINER_MON_INTERVAL_MS);

    ....
    pmemCheckEnabled = conf.getBoolean(YarnConfiguration.NM_PMEM_CHECK_ENABLED,
        YarnConfiguration.DEFAULT_NM_PMEM_CHECK_ENABLED);
    vmemCheckEnabled = conf.getBoolean(YarnConfiguration.NM_VMEM_CHECK_ENABLED,
        YarnConfiguration.DEFAULT_NM_VMEM_CHECK_ENABLED);
    LOG.info("Physical memory check enabled: " + pmemCheckEnabled);
    LOG.info("Virtual memory check enabled: " + vmemCheckEnabled);
    ....

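The serviceInit method reads from the configuration whether memory monitoring is enabled and logs the result. Next we go straight to the MonitoringThread class inside this class.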

  ....
  private class MonitoringThread extends Thread {
    public MonitoringThread() {
      super("Container Monitor");
    }

    @Override
    public void run() {

      while (true) {

        // Print the processTrees for debugging.
        if (LOG.isDebugEnabled()) {
          StringBuilder tmp = new StringBuilder("[ ");
          for (ProcessTreeInfo p : trackingContainers.values()) {
            tmp.append(p.getPID());
            tmp.append(" ");
          }
          ....

In the monitoring thread's run method, it iterates over the monitored containers and checks each one.

 @Override
    public void run() {

      while (true) {

        ....
        // Now do the monitoring for the trackingContainers
        // Check memory usage and kill any overflowing containers
        long vmemStillInUsage = 0;
        long pmemStillInUsage = 0;
        for (Iterator<Map.Entry<ContainerId, ProcessTreeInfo>> it =
            trackingContainers.entrySet().iterator(); it.hasNext();) {

          Map.Entry<ContainerId, ProcessTreeInfo> entry = it.next();
          ContainerId containerId = entry.getKey();
          ProcessTreeInfo ptInfo = entry.getValue();
          try {
            String pId = ptInfo.getPID();
            ....

Let's take physical memory monitoring as the example.

First it uses pTree to get the process's runtime information, such as memory and CPU usage.

 LOG.debug("Constructing ProcessTree for : PID = " + pId
                + " ContainerId = " + containerId);
            ResourceCalculatorProcessTree pTree = ptInfo.getProcessTree();
            pTree.updateProcessTree();    // update process-tree
            long currentVmemUsage = pTree.getVirtualMemorySize();
            long currentPmemUsage = pTree.getRssMemorySize();
            // if machine has 6 cores and 3 are used,
            // cpuUsagePercentPerCore should be 300% and
            // cpuUsageTotalCoresPercentage should be 50%
            float cpuUsagePercentPerCore = pTree.getCpuUsagePercent();
            float cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore /
                resourceCalculatorPlugin.getNumProcessors();

Then it obtains the memory limits

// Multiply by 1000 to avoid losing data when converting to int
            int milliVcoresUsed = (int) (cpuUsageTotalCoresPercentage * 1000
                * maxVCoresAllottedForContainers /nodeCpuPercentageForYARN);
            // as processes begin with an age 1, we want to see if there
            // are processes more than 1 iteration old.
            long curMemUsageOfAgedProcesses = pTree.getVirtualMemorySize(1);
            long curRssMemUsageOfAgedProcesses = pTree.getRssMemorySize(1);
            long vmemLimit = ptInfo.getVmemLimit();
            long pmemLimit = ptInfo.getPmemLimit();

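This pmemLimit does not come from pTree. It is passed in from outside when the container is started, and it corresponds to the memory allocated to the container (memory.mb for a MapReduce task), not to the heap size set in java.opts.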

ContainerId containerId = monitoringEvent.getContainerId();
    switch (monitoringEvent.getType()) {
    case START_MONITORING_CONTAINER:
      ContainerStartMonitoringEvent startEvent =
          (ContainerStartMonitoringEvent) monitoringEvent;
      synchronized (this.containersToBeAdded) {
        ProcessTreeInfo processTreeInfo =
            new ProcessTreeInfo(containerId, null, null,
                startEvent.getVmemLimit(), startEvent.getPmemLimit(),
                startEvent.getCpuVcores());
        this.containersToBeAdded.put(containerId, processTreeInfo);
      }
      break;
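
For orientation, here is a small sketch of how such limits relate to a container's allocation. It assumes the physical limit is the allocated memory in MB converted to bytes, and the virtual limit is that value times yarn.nodemanager.vmem-pmem-ratio (2.1 by default). The class and method names are hypothetical and only reproduce the arithmetic.

```java
/** Illustrative arithmetic for container memory limits; not Hadoop code. */
final class ContainerLimits {
    private ContainerLimits() {}

    /** Physical limit in bytes for a container allocated containerMb megabytes. */
    static long pmemLimitBytes(long containerMb) {
        return containerMb * 1024L * 1024L;
    }

    /** Virtual limit in bytes: the physical limit scaled by the vmem/pmem ratio. */
    static long vmemLimitBytes(long pmemBytes, float vmemPmemRatio) {
        return (long) (vmemPmemRatio * pmemBytes);
    }

    public static void main(String[] args) {
        long pmem = pmemLimitBytes(1024);        // a 1 GB container
        long vmem = vmemLimitBytes(pmem, 2.1f);  // assumed default ratio
        System.out.println("pmem limit bytes = " + pmem);
        System.out.println("vmem limit bytes = " + vmem);
    }
}
```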

Next comes the core logic of memory monitoring

....
            } else if (isPmemCheckEnabled()
                && isProcessTreeOverLimit(containerId.toString(),
                    currentPmemUsage, curRssMemUsageOfAgedProcesses,
                    pmemLimit)) {
              // Container (the root process) is still alive and overflowing
              // memory.
              // Dump the process-tree and then clean it up.
              msg = formatErrorMessage("physical",
                  currentVmemUsage, vmemLimit,
                  currentPmemUsage, pmemLimit,
                  pId, containerId, pTree);
              isMemoryOverLimit = true;
              containerExitStatus = ContainerExitStatus.KILLED_EXCEEDED_PMEM;
              ....

The current memory usage and the limit are passed in and compared; isProcessTreeOverLimit eventually calls the method below.

  /**
   * Check whether a container's process tree's current memory usage is over
   * limit.
   *
   * When a java process exec's a program, it could momentarily account for
   * double the size of it's memory, because the JVM does a fork()+exec()
   * which at fork time creates a copy of the parent's memory. If the
   * monitoring thread detects the memory used by the container tree at the
   * same instance, it could assume it is over limit and kill the tree, for no
   * fault of the process itself.
   *
   * We counter this problem by employing a heuristic check: - if a process
   * tree exceeds the memory limit by more than twice, it is killed
   * immediately - if a process tree has processes older than the monitoring
   * interval exceeding the memory limit by even 1 time, it is killed. Else it
   * is given the benefit of doubt to lie around for one more iteration.
   *
   * @param containerId
   *          Container Id for the container tree
   * @param currentMemUsage
   *          Memory usage of a container tree
   * @param curMemUsageOfAgedProcesses
   *          Memory usage of processes older than an iteration in a container
   *          tree
   * @param vmemLimit
   *          The limit specified for the container
   * @return true if the memory usage is more than twice the specified limit,
   *         or if processes in the tree, older than this thread's monitoring
   *         interval, exceed the memory limit. False, otherwise.
   */
  boolean isProcessTreeOverLimit(String containerId,
                                  long currentMemUsage,
                                  long curMemUsageOfAgedProcesses,
                                  long vmemLimit) {
    boolean isOverLimit = false;

    if (currentMemUsage > (2 * vmemLimit)) {
      LOG.warn("Process tree for container: " + containerId
          + " running over twice " + "the configured limit. Limit=" + vmemLimit
          + ", current usage = " + currentMemUsage);
      isOverLimit = true;
    } else if (curMemUsageOfAgedProcesses > vmemLimit) {
      LOG.warn("Process tree for container: " + containerId
          + " has processes older than 1 "
          + "iteration running over the configured limit. Limit=" + vmemLimit
          + ", current usage = " + curMemUsageOfAgedProcesses);
      isOverLimit = true;
    }

    return isOverLimit;
  }

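Two situations count as exceeding the limit. One is usage above twice the limit; the factor of two exists because when a Java process executes a program, the JVM does a fork() followed by exec(), and the fork momentarily copies the parent's memory. The other is that processes older than one monitoring iteration exceed the limit on their own. The Javadoc above explains the remaining details.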
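
To make the two rules concrete, here is a minimal standalone sketch that restates the check outside Hadoop. The class name and the sample numbers are made up for illustration; only the two comparisons mirror the method above.

```java
/** Standalone restatement of the two over-limit rules; illustrative only. */
final class OverLimitHeuristic {
    private OverLimitHeuristic() {}

    /**
     * @param currentUsage usage of the whole process tree
     * @param agedUsage    usage of processes older than one monitoring iteration
     * @param limit        the limit configured for the container
     */
    static boolean isOverLimit(long currentUsage, long agedUsage, long limit) {
        if (currentUsage > 2 * limit) {
            return true;          // more than twice the limit: kill immediately
        }
        return agedUsage > limit; // otherwise only aged processes count
    }

    public static void main(String[] args) {
        long limit = 1000;
        // A fresh fork() briefly pushes the total above the limit but below 2x:
        System.out.println(isOverLimit(1500, 0, limit));
        // The same total, but now caused by processes that survived an iteration:
        System.out.println(isOverLimit(1500, 1200, limit));
        // Far beyond twice the limit, regardless of process age:
        System.out.println(isOverLimit(2500, 0, limit));
    }
}
```

The first case is the benefit of the doubt described in the Javadoc: a momentary spike from a just-forked child is tolerated for one more iteration.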

Finally, if the container's memory usage really is over the limit, a container kill event is sent, which triggers the subsequent kill of the container.

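....
            // (This branch is the CPU check added later in this article.)
            } else if (isVcoresCheckEnabled()
                // cpuUsageTotalCoresPercentage is a percentage (0-100) while
                // vcoresLimitedRatio is a ratio (0-1), so scale before comparing
                && cpuUsageTotalCoresPercentage > vcoresLimitedRatio * 100) {
              msg =
                  String.format(
                      "Container [pid=%s,containerID=%s] is running beyond %s vcores limits."
                          + " Current usage: %s. Killing container.\n", pId,
                      containerId, vcoresLimitedRatio,
                      cpuUsageTotalCoresPercentage);
              isCpuVcoresOverLimit = true;
              containerExitStatus = ContainerExitStatus.KILLED_EXCEEDED_VCORES;
            }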

            if (isMemoryOverLimit) {
              // Virtual or physical memory over limit. Fail the container and
              // remove
              // the corresponding process tree
              LOG.warn(msg);
              // warn if not a leader
              if (!pTree.checkPidPgrpidForMatch()) {
                LOG.error("Killed container process with PID " + pId
                    + " but it is not a process group leader.");
              }
              // kill the container
              eventDispatcher.getEventHandler().handle(
                  new ContainerKillEvent(containerId,
                      containerExitStatus, msg));
              it.remove();
              LOG.info("Removed ProcessTree with root " + pId);
            } else {
              ...

That is the whole container memory monitoring process. We happened to have turned this feature off at the time, which is why we saw so many full GCs.
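
The overall shape of one monitoring pass can be sketched in a few lines. This is a simplified model, not the NodeManager code: the Sample type, the string container ids, and the returned list (standing in for dispatching kill events) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Simplified model of one pass of a container monitor; illustrative only. */
final class MonitorPassSketch {
    private MonitorPassSketch() {}

    /** Hypothetical per-container sample: current usage and limit in bytes. */
    static final class Sample {
        final long usage;
        final long limit;

        Sample(long usage, long limit) {
            this.usage = usage;
            this.limit = limit;
        }
    }

    /**
     * Walks all tracked containers, stops tracking those over their limit,
     * and returns their ids (a stand-in for sending a kill event per container).
     */
    static List<String> runOnePass(Map<String, Sample> tracking) {
        List<String> killed = new ArrayList<>();
        for (Iterator<Map.Entry<String, Sample>> it =
                 tracking.entrySet().iterator(); it.hasNext();) {
            Map.Entry<String, Sample> entry = it.next();
            Sample s = entry.getValue();
            if (s.usage > s.limit) {
                killed.add(entry.getKey());
                it.remove(); // safe removal while iterating, as in the real loop
            }
        }
        return killed;
    }
}
```

Removing through the iterator matters here: calling remove on the map itself inside the loop would throw a ConcurrentModificationException on most Map implementations.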

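Why Only Container Memory Is Monitored

I wonder whether any of you have thought about the question in this heading. After solving the problem above, I kept wondering: CPU usage is an equally critical metric, so why isn't it monitored in ContainersMonitorImpl as well? Below are the reasons I came up with.

1. Memory problems have bigger consequences than CPU usage problems, because once memory runs short (OOM), it triggers GC.

2. Memory problems are more common than CPU problems. In everyday life or when writing programs, people often hit "ah, out of memory" and relatively rarely hit "not enough CPU".

3. Memory problems are closely tied to job scale and data volume. Memory usage tracks how much data a job processes: a big job handling more data naturally uses more memory. CPU usage rises too, but less noticeably.

For these three reasons, CPU monitoring was not added to the monitoring code (this is only my personal analysis).

But the fact that Hadoop itself has no CPU monitoring does not mean we cannot add it. Some programs use little memory yet burn a lot of CPU, for example by starting a large number of threads that each do something trivial, which drives the machine's CPU load too high. With that in mind, I added monitoring of the CPU usage percentage.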

Container CPU Usage Monitoring

First, define the configuration that switches the feature on or off:

  /** Specifies whether cpu vcores check is enabled. */
  public static final String NM_VCORES_CHECK_ENABLED = NM_PREFIX
      + "vcores-check-enabled";
  public static final boolean DEFAULT_NM_VCORES_CHECK_ENABLED = false;

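Since it is a new feature, it is off by default. You also need to define a usage threshold between 0 and 1: once a container's CPU usage exceeds this fraction, the container is killed.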

  /** Limit ratio of Virtual CPU Cores which can be allocated for containers. */
  public static final String NM_VCORES_LIMITED_RATIO = NM_PREFIX
      + "resource.cpu-vcores.limited.ratio";
  public static final float DEFAULT_NM_VCORES_LIMITED_RATIO = 0.8f;

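The default is 0.8, and you can set it as you like. The monitoring logic is entirely analogous to memory monitoring, so I will go through it quickly.

Define two more fields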

private boolean pmemCheckEnabled;
  ...
  private boolean vcoresCheckEnabled;
  private float vcoresLimitedRatio;

Then do the configuration initialization in serviceInit

...
    pmemCheckEnabled = conf.getBoolean(YarnConfiguration.NM_PMEM_CHECK_ENABLED,
        YarnConfiguration.DEFAULT_NM_PMEM_CHECK_ENABLED);
    vmemCheckEnabled = conf.getBoolean(YarnConfiguration.NM_VMEM_CHECK_ENABLED,
        YarnConfiguration.DEFAULT_NM_VMEM_CHECK_ENABLED);
    vcoresCheckEnabled =
        conf.getBoolean(YarnConfiguration.NM_VCORES_CHECK_ENABLED,
            YarnConfiguration.DEFAULT_NM_VCORES_CHECK_ENABLED);
    LOG.info("Physical memory check enabled: " + pmemCheckEnabled);
    LOG.info("Virtual memory check enabled: " + vmemCheckEnabled);
    LOG.info("Cpu vcores check enabled: " + vcoresCheckEnabled);

    if (vcoresCheckEnabled) {
      vcoresLimitedRatio =
          conf.getFloat(YarnConfiguration.NM_VCORES_LIMITED_RATIO,
              YarnConfiguration.DEFAULT_NM_VCORES_LIMITED_RATIO);
      LOG.info("Vcores limited ratio: " + vcoresLimitedRatio);
    }
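
The initialization above accepts whatever value is configured. Since the threshold is meant to lie between 0 and 1, a small validation step could catch typos early. The sketch below is a hypothetical helper operating on the raw string value; it is not part of the patch and does not use the Hadoop Configuration API.

```java
/** Hypothetical validation of the vcores limit ratio; not part of the patch. */
final class VcoresRatioCheck {
    static final float DEFAULT_RATIO = 0.8f;

    private VcoresRatioCheck() {}

    /** Parses a raw config value, falling back to the default when it is unset. */
    static float parseRatio(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_RATIO;
        }
        float value = Float.parseFloat(raw.trim());
        // Written as a negated range check so that NaN is rejected as well.
        if (!(value > 0f && value <= 1f)) {
            throw new IllegalArgumentException(
                "vcores limited ratio must be in (0, 1], got " + raw);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parseRatio(null));
        System.out.println(parseRatio("0.5"));
    }
}
```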

Then reuse the CPU percentage variable that the monitoring code already computes

LOG.debug("Constructing ProcessTree for : PID = " + pId
                + " ContainerId = " + containerId);
            ResourceCalculatorProcessTree pTree = ptInfo.getProcessTree();
            pTree.updateProcessTree();    // update process-tree
            long currentVmemUsage = pTree.getVirtualMemorySize();
            long currentPmemUsage = pTree.getRssMemorySize();
            // if machine has 6 cores and 3 are used,
            // cpuUsagePercentPerCore should be 300% and
            // cpuUsageTotalCoresPercentage should be 50%
            float cpuUsagePercentPerCore = pTree.getCpuUsagePercent();
            float cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore /
                resourceCalculatorPlugin.getNumProcessors();

Finally, just compare the two values

....
            } else if (isVcoresCheckEnabled()
                // cpuUsageTotalCoresPercentage is a percentage (0-100) while
                // vcoresLimitedRatio is a ratio (0-1), so scale before comparing
                && cpuUsageTotalCoresPercentage > vcoresLimitedRatio * 100) {
              msg =
                  String.format(
                      "Container [pid=%s,containerID=%s] is running beyond %s vcores limits."
                          + " Current usage: %s. Killing container.\n", pId,
                      containerId, vcoresLimitedRatio,
                      cpuUsageTotalCoresPercentage);
              isCpuVcoresOverLimit = true;
              containerExitStatus = ContainerExitStatus.KILLED_EXCEEDED_VCORES;
            }

            if (isMemoryOverLimit || isCpuVcoresOverLimit) {
              // Virtual or physical memory over limit. Fail the container and
              // remove
              // the corresponding process tree
              LOG.warn(msg);
              // warn if not a leader
              if (!pTree.checkPidPgrpidForMatch()) {
                LOG.error("Killed container process with PID " + pId
                    + " but it is not a process group leader.");
              }
              // kill the container
              eventDispatcher.getEventHandler().handle(
                  new ContainerKillEvent(containerId,
                      containerExitStatus, msg));
              it.remove();
              LOG.info("Removed ProcessTree with root " + pId);
            } else {
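
The units in this comparison are easy to get wrong, so here is a small standalone sketch of the arithmetic. It assumes, as the comment in the monitoring code suggests, that the per-core figure is a percentage (300 for three busy cores) and that a negative value means the measurement is unavailable. The class and method names are hypothetical.

```java
/** Illustrative CPU limit arithmetic; names are hypothetical, not Hadoop APIs. */
final class CpuLimitSketch {
    private CpuLimitSketch() {}

    /** Converts a per-core percentage (e.g. 300 for 3 busy cores) to a whole-machine percentage. */
    static float totalCoresPercent(float perCorePercent, int numProcessors) {
        return perCorePercent / numProcessors;
    }

    /** True when usage exceeds the limit, with limitRatio given in the range 0 to 1. */
    static boolean exceedsLimit(float perCorePercent, int numProcessors, float limitRatio) {
        if (perCorePercent < 0) {
            return false; // usage unavailable: never kill on missing data
        }
        // Scale the ratio to a percentage so both sides use the same unit.
        return totalCoresPercent(perCorePercent, numProcessors) > limitRatio * 100f;
    }

    public static void main(String[] args) {
        // 6 cores with 3 fully used: 300% per core, 50% of the machine.
        System.out.println(totalCoresPercent(300f, 6));
        System.out.println(exceedsLimit(300f, 6, 0.8f));
        System.out.println(exceedsLimit(540f, 6, 0.8f));
    }
}
```

Without the scaling, a container using 50% of the machine would compare as 50 > 0.8 and be killed at once.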

Oh, and you also need to add a new exit code to ContainerExitStatus:

  /**
   * Container terminated because of exceeding allocated cpu vcores.
   */
  public static final int KILLED_EXCEEDED_VCORES = -108;

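That is all of the code changes for CPU monitoring. The complete code for this feature is available via the link at the end of the article. I should state clearly that my local machine does not support ProcfsBasedProcessTree, so the unit tests could not run and I have not fully tested this feature. In theory it works; feel free to try it and send me feedback. I hope this is useful to you.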