Implementing a Node Resource Oversell Scheme

Background

A Tencent Cloud write-up on optimizing Kubernetes cluster utilization describes a node resource oversell scheme, which can be summarized as: based on each node's real historical load data, dynamically adjust the node's total allocatable resources (Allocatable) in order to control how many pods are allowed to be scheduled onto that node. The article gives the overall technical design but also raises a number of details we have to work out ourselves; this post attempts a simple implementation of the scheme.

Implementation

The scheme described in the article is as follows:

- The oversell ratio for each node is stored in the Node's annotations; for example, the CPU oversell ratio corresponds to the annotation stke.platform/cpu-oversale-ratio.
- Each node's oversell ratio is adjusted dynamically/periodically by an in-house component, based on the node's historical monitoring data.
- The oversell feature must be possible to switch off and revert: setting the Node annotation stke.platform/mutate: "false" turns it off, and the node's resources are restored on its next heartbeat.
- A Mutating Admission Webhook registered with kube-apiserver intercepts Node Create and Status Update events, recomputes the node's Allocatable & Capacity resources according to the oversell ratio, and patches them back to the apiserver (a rough sketch of this mutation follows the list).
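As a reference, below is a minimal Go sketch (not from the Tencent article) of what the mutation core could look like: it reads the annotations named in the article and scales the node's allocatable CPU by the ratio. The function name, the CPU-only scope and the error handling are my own assumptions.

// oversell.go: a minimal sketch of the mutation core. The annotation keys come from
// the article; everything else (names, CPU-only handling) is assumed for illustration.
package oversell

import (
    "strconv"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

const (
    ratioAnnotation  = "stke.platform/cpu-oversale-ratio" // oversell ratio, e.g. "1.5"
    mutateAnnotation = "stke.platform/mutate"             // "false" disables overselling
)

// oversellCPU returns the oversold allocatable CPU for a node, or nil if the node
// should be left untouched (feature disabled, no ratio annotation, or a bad ratio).
func oversellCPU(node *corev1.Node) *resource.Quantity {
    if node.Annotations[mutateAnnotation] == "false" {
        return nil // overselling disabled; the next kubelet heartbeat restores the real values
    }
    ratioStr, ok := node.Annotations[ratioAnnotation]
    if !ok {
        return nil
    }
    ratio, err := strconv.ParseFloat(ratioStr, 64)
    if err != nil || ratio <= 0 {
        return nil
    }
    cpu := node.Status.Allocatable[corev1.ResourceCPU]
    // Scale the milli-CPU reported by the kubelet heartbeat by the oversell ratio.
    return resource.NewMilliQuantity(int64(float64(cpu.MilliValue())*ratio), resource.DecimalSI)
}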

The key here is to think through the following details:

1. How exactly does the kubelet register the Node with the apiserver (Kubelet Register Node To ApiServer), and is it feasible to patch Node Status directly through a webhook?

1.1 The heartbeat mechanisms in brief: Kubernetes currently supports two heartbeat mechanisms. The newer one, NodeLease, is much lighter — each heartbeat is only around 0.1 KB — which helps relieve the pressure that frequent heartbeats put on the apiserver when the cluster has many nodes. NodeLease is still a new feature that is not generally available and has to be enabled on the kubelet, so here we only consider the original heartbeat mechanism (how the two heartbeats cooperate is worth writing up separately).
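As a side illustration (my own addition, not from the article): the NodeLease heartbeat is backed by a Lease object in the kube-node-lease namespace, and dumping its spec with client-go shows how small that payload is compared with a full node.status update.

// leasedemo.go: a hedged sketch assuming a client-go version that takes context arguments.
package leasedemo

import (
    "context"
    "encoding/json"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// printNodeLease dumps the Lease object that backs the NodeLease heartbeat; it lives in
// the kube-node-lease namespace under the node's name, and its spec is only a few fields.
func printNodeLease(clientset kubernetes.Interface, nodeName string) error {
    lease, err := clientset.CoordinationV1().Leases("kube-node-lease").Get(
        context.TODO(), nodeName, metav1.GetOptions{})
    if err != nil {
        return err
    }
    spec, err := json.Marshal(lease.Spec)
    if err != nil {
        return err
    }
    fmt.Println(string(spec))
    return nil
}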

A quick look at the source, the tryUpdateNodeStatus method in https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_node_status.go: it first gets the Node object (the comment notes that this Get is served from the apiserver cache first, which is cheaper), then setNodeStatus fills in the node's status (the heartbeat payload is essentially the node.status field), e.g. MachineInfo (CPU and memory), the image list, conditions and so on (see pkg/kubelet/kubelet_node_status.go#defaultNodeStatusFuncs for exactly what gets filled in). Finally PatchNodeStatus sends the node object to the apiserver; the nodeStatusUpdateFrequency parameter controls how often this happens, 10s by default.

func (kl *Kubelet) tryUpdateNodeStatus(tryNumber int) error {
    // In large clusters, GET and PUT operations on Node objects coming
    // from here are the majority of load on apiserver and etcd.
    // To reduce the load on etcd, we are serving GET operations from
    // apiserver cache (the data might be slightly delayed but it doesn't
    // seem to cause more conflict - the delays are pretty small).
    // If it result in a conflict, all retries are served directly from etcd.
    opts := metav1.GetOptions{}
    if tryNumber == 0 {
        util.FromApiserverCache(&opts)
    }
    node, err := kl.heartbeatClient.CoreV1().Nodes().Get(context.TODO(), string(kl.nodeName), opts)
    if err != nil {
        return fmt.Errorf("error getting node %q: %v", kl.nodeName, err)
    }

    originalNode := node.DeepCopy()
    if originalNode == nil {
        return fmt.Errorf("nil %q node object", kl.nodeName)
    }

    podCIDRChanged := false
    if len(node.Spec.PodCIDRs) != 0 {
        // Pod CIDR could have been updated before, so we cannot rely on
        // node.Spec.PodCIDR being non-empty. We also need to know if pod CIDR is
        // actually changed.
        podCIDRs := strings.Join(node.Spec.PodCIDRs, ",")
        if podCIDRChanged, err = kl.updatePodCIDR(podCIDRs); err != nil {
            klog.Errorf(err.Error())
        }
    }

    kl.setNodeStatus(node)

    now := kl.clock.Now()
    if now.Before(kl.lastStatusReportTime.Add(kl.nodeStatusReportFrequency)) {
        if !podCIDRChanged && !nodeStatusHasChanged(&originalNode.Status, &node.Status) {
            // We must mark the volumes as ReportedInUse in volume manager's dsw even
            // if no changes were made to the node status (no volumes were added or removed
            // from the VolumesInUse list).
            //
            // The reason is that on a kubelet restart, the volume manager's dsw is
            // repopulated and the volume ReportedInUse is initialized to false, while the
            // VolumesInUse list from the Node object still contains the state from the
            // previous kubelet instantiation.
            //
            // Once the volumes are added to the dsw, the ReportedInUse field needs to be
            // synced from the VolumesInUse list in the Node.Status.
            //
            // The MarkVolumesAsReportedInUse() call cannot be performed in dsw directly
            // because it does not have access to the Node object.
            // This also cannot be populated on node status manager init because the volume
            // may not have been added to dsw at that time.
            kl.volumeManager.MarkVolumesAsReportedInUse(node.Status.VolumesInUse)
            return nil
        }
    }

    // Patch the current status on the API server
    updatedNode, _, err := nodeutil.PatchNodeStatus(kl.heartbeatClient.CoreV1(), types.NodeName(kl.nodeName), originalNode, node)
    if err != nil {
        return err
    }
    kl.lastStatusReportTime = now
    kl.setLastObservedNodeAddresses(updatedNode.Status.Addresses)
    // If update finishes successfully, mark the volumeInUse as reportedInUse to indicate
    // those volumes are already updated in the node's status
    kl.volumeManager.MarkVolumesAsReportedInUse(updatedNode.Status.VolumesInUse)
    return nil
}

1.2 Is it feasible to patch node.status through a webhook: verified — with a mutating webhook configured we can intercept the heartbeat and patch node.status.
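For reference, here is a hedged sketch of how the webhook handler could answer the AdmissionReview for a nodes/status update with a JSON patch. The function names are assumptions, and the computeCPU parameter is where something like the oversellCPU sketch above would plug in; TLS and HTTP serving are omitted.

// webhook.go: a sketch of building the AdmissionResponse, using admission/v1beta1 to match
// the admissionregistration.k8s.io/v1beta1 configuration shown below.
package webhook

import (
    "encoding/json"
    "fmt"

    admissionv1beta1 "k8s.io/api/admission/v1beta1"
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// mutateNodeStatus builds the AdmissionResponse for a nodes/status update. computeCPU
// decides the oversold allocatable CPU (e.g. the oversellCPU sketch above); returning
// nil means "leave the node untouched".
func mutateNodeStatus(req *admissionv1beta1.AdmissionRequest,
    computeCPU func(*corev1.Node) *resource.Quantity) *admissionv1beta1.AdmissionResponse {

    allow := &admissionv1beta1.AdmissionResponse{UID: req.UID, Allowed: true}

    node := corev1.Node{}
    if err := json.Unmarshal(req.Object.Raw, &node); err != nil {
        return allow // never block the heartbeat because of our own error
    }
    newCPU := computeCPU(&node)
    if newCPU == nil {
        return allow
    }

    // JSON Patch (RFC 6902) rewriting the allocatable cpu carried in this heartbeat.
    patch := fmt.Sprintf(`[{"op":"replace","path":"/status/allocatable/cpu","value":"%s"}]`,
        newCPU.String())
    pt := admissionv1beta1.PatchTypeJSONPatch
    return &admissionv1beta1.AdmissionResponse{
        UID:       req.UID,
        Allowed:   true,
        Patch:     []byte(patch),
        PatchType: &pt,
    }
}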

1.3 Webhook configuration details: a MutatingWebhookConfiguration is used to register the webhook with the apiserver, as shown below.

apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: demo-webhook
  labels:
    app: demo-webhook
    kind: mutator
webhooks:
  - name: demo-webhook.app.svc
    clientConfig:
      service:
        name: demo-webhook
        namespace: app
        path: "/mutate"
      caBundle:  ${CA_BUNDLE}
    rules:
      - operations: [ "UPDATE" ]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["nodes/status"]

Note the .webhooks[0].rules field in the YAML above. For a Kubernetes resource such as Pod, pod.status is called a subresource of the pod. We initially set rules.resources to ["nodes"] or ["*"], and in neither case did the webhook receive the 10-second heartbeats. The source code that decides whether a request arriving at the apiserver should be forwarded to a webhook explains why: as shown below, both the Resource and the SubResource are matched, which means only ["*/*"] matches every case.

func (r *Matcher) resource() bool {
    opRes, opSub := r.Attr.GetResource().Resource, r.Attr.GetSubresource()
    for _, res := range r.Rule.Resources {
        res, sub := splitResource(res)
        resMatch := res == "*" || res == opRes
        subMatch := sub == "*" || sub == opSub
        if resMatch && subMatch {
            return true
        }
    }
    return false
}

1.4 Which patch types does Kubernetes support: the apiserver distinguishes the patch type by the Content-Type header of the request.

JSON Patch, Content-Type: application/json-patch+json, see https://tools.ietf.org/html/rfc6902. This style supports a fairly rich set of operations, such as add, replace, remove and copy.

Merge Patch, Content-Type: application/merge-patch+json, see https://tools.ietf.org/html/rfc7386. With this style, whatever the patch specifies for a field simply replaces what was there.

Strategic Merge Patch, Content-Type: application/strategic-merge-patch+json. This style relies on metadata attached to the object's type definition to decide which fields should be merged rather than replaced by default. For example, the Containers field in pod.Spec declares via patchStrategy and patchMergeKey that the list is merged using name as the key:

Containers []Container `json:"containers" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,2,rep,name=containers"`
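To make the three Content-Types concrete, here is a small client-go illustration (my own sketch, assuming a clientset variable and a client-go version that takes context arguments) that applies the same label change with each patch type:

// patchdemo.go: a hedged sketch; the label key/value and node name are only examples.
package patchdemo

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
)

func patchExamples(clientset kubernetes.Interface, nodeName string) error {
    ctx := context.TODO()
    nodes := clientset.CoreV1().Nodes()

    // JSON Patch (application/json-patch+json): explicit operations such as add/replace/remove.
    // Assumes the labels map already exists on the node (it normally does).
    jsonPatch := []byte(`[{"op": "add", "path": "/metadata/labels/env", "value": "prod"}]`)
    if _, err := nodes.Patch(ctx, nodeName, types.JSONPatchType, jsonPatch, metav1.PatchOptions{}); err != nil {
        return err
    }

    // Merge Patch (application/merge-patch+json): leaf values given here replace the existing
    // ones; lists are replaced wholesale.
    mergePatch := []byte(`{"metadata": {"labels": {"env": "prod"}}}`)
    if _, err := nodes.Patch(ctx, nodeName, types.MergePatchType, mergePatch, metav1.PatchOptions{}); err != nil {
        return err
    }

    // Strategic Merge Patch (application/strategic-merge-patch+json): like a merge patch for
    // simple fields, but lists such as pod.spec.containers are merged by the key declared
    // via patchStrategy/patchMergeKey in the type definition.
    strategicPatch := []byte(`{"metadata": {"labels": {"env": "prod"}}}`)
    _, err := nodes.Patch(ctx, nodeName, types.StrategicMergePatchType, strategicPatch, metav1.PatchOptions{})
    return err
}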

2. Once a node's resources are oversold, does Kubernetes' dynamic cgroup adjustment still work correctly?

3. Node status updates are frequent, and every status update triggers the webhook; in a large cluster this can become a performance problem for the apiserver. How do we deal with that?

The NodeLease heartbeat mechanism can be enabled (we still need to check whether the NodeLease heartbeat carries the allocatable watermark); another option is to send heartbeats less frequently (a larger nodeStatusUpdateFrequency). In any case the webhook logic should be kept as simple as possible.

4. Does node resource overselling also inflate the Kubelet Eviction configuration, or does eviction still follow the node's real capacity and load? If eviction is affected, how do we fix it?

5. When the oversell ratio is lowered, we can end up with Sum(pods' request resource) > node's allocatable on a node. Is this risky, and how should it be handled?

A first-cut answer: when lowering the oversell ratio, make sure allocatable never drops below Sum(pods' request resource).
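A hedged Go sketch of that guard — the pod list and the function names are my own assumptions — flooring the oversold value at the CPU already requested by the pods on the node:

// clamp.go: a sketch of flooring the oversold allocatable at the node's current pod requests.
// The caller is assumed to pass the pods currently bound to this node.
package clamp

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

func clampedAllocatableCPU(oversold resource.Quantity, pods []corev1.Pod) resource.Quantity {
    requested := resource.NewMilliQuantity(0, resource.DecimalSI)
    for _, pod := range pods {
        for _, c := range pod.Spec.Containers {
            if req, ok := c.Resources.Requests[corev1.ResourceCPU]; ok {
                requested.Add(req)
            }
        }
    }
    // When the ratio is lowered, never report less than what is already requested,
    // otherwise Sum(pods' requests) would exceed the node's allocatable.
    if oversold.Cmp(*requested) < 0 {
        return *requested
    }
    return oversold
}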

6. Node monitoring depends on the node's Allocatable & Capacity resources, so once they are oversold the monitoring data for the node is no longer accurate and needs some correction. How can the monitoring system also become aware of the oversell ratio dynamically and correct its data and dashboards?

7. How exactly should Node Allocatable and Capacity each be oversold? How does overselling affect the node's reserved resources?

Notes

In an OCP environment, the MutatingAdmissionWebhook plugin has to be enabled by adding the following to the master configuration:

admissionConfig:
  pluginConfig:
    MutatingAdmissionWebhook:
      configuration:
        apiVersion: v1
        kind: DefaultAdmissionConfig
        disable: false
 

References:

https://blog.csdn.net/shida_csdn/article/details/84286058
