Implementing a Node Resource Oversell Scheme

Background

A Tencent Cloud write-up on optimizing Kubernetes cluster utilization describes a node resource oversell scheme, which can be summarized as: based on each node's real historical load data, dynamically adjust the node's total allocatable resources (Allocatable) in order to control how many pods are allowed to be scheduled onto that node. The article gives the overall technical design but also raises a number of details we have to work out ourselves; this post attempts a simple implementation of the scheme.

Implementation

The scheme described in the article is as follows:

- The oversell ratio for each node is stored in the Node's annotations; for example, the CPU oversell ratio corresponds to the annotation stke.platform/cpu-oversale-ratio.
- Each node's oversell ratio is adjusted dynamically/periodically by an in-house component, based on the node's historical monitoring data.
- The oversell feature must be possible to switch off and revert: setting the Node annotation stke.platform/mutate: "false" turns it off, and the node's resources are restored on its next heartbeat.
- A Mutating Admission Webhook registered with kube-apiserver intercepts Node Create and Status Update events, recomputes the node's Allocatable & Capacity resources according to the oversell ratio, and patches them back to the apiserver (a rough sketch of this mutation follows the list).
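As a reference, below is a minimal Go sketch (not from the Tencent article) of what the mutation core could look like: it reads the annotations named in the article and scales the node's allocatable CPU by the ratio. The function name, the CPU-only scope and the error handling are my own assumptions.

// oversell.go: a minimal sketch of the mutation core. The annotation keys come from
// the article; everything else (names, CPU-only handling) is assumed for illustration.
package oversell

import (
    "strconv"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

const (
    ratioAnnotation  = "stke.platform/cpu-oversale-ratio" // oversell ratio, e.g. "1.5"
    mutateAnnotation = "stke.platform/mutate"             // "false" disables overselling
)

// oversellCPU returns the oversold allocatable CPU for a node, or nil if the node
// should be left untouched (feature disabled, no ratio annotation, or a bad ratio).
func oversellCPU(node *corev1.Node) *resource.Quantity {
    if node.Annotations[mutateAnnotation] == "false" {
        return nil // overselling disabled; the next kubelet heartbeat restores the real values
    }
    ratioStr, ok := node.Annotations[ratioAnnotation]
    if !ok {
        return nil
    }
    ratio, err := strconv.ParseFloat(ratioStr, 64)
    if err != nil || ratio <= 0 {
        return nil
    }
    cpu := node.Status.Allocatable[corev1.ResourceCPU]
    // Scale the milli-CPU reported by the kubelet heartbeat by the oversell ratio.
    return resource.NewMilliQuantity(int64(float64(cpu.MilliValue())*ratio), resource.DecimalSI)
}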

The key here is to think through the following details:

1. How exactly does the kubelet register the Node with the apiserver (Kubelet Register Node To ApiServer), and is it feasible to patch Node Status directly through a webhook?

1.1 The heartbeat mechanisms in brief: Kubernetes currently supports two heartbeat mechanisms. The newer one, NodeLease, is much lighter — each heartbeat is only around 0.1 KB — which helps relieve the pressure that frequent heartbeats put on the apiserver when the cluster has many nodes. NodeLease is still a new feature that is not generally available and has to be enabled on the kubelet, so here we only consider the original heartbeat mechanism (how the two heartbeats cooperate is worth writing up separately).
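As a side illustration (my own addition, not from the article): the NodeLease heartbeat is backed by a Lease object in the kube-node-lease namespace, and dumping its spec with client-go shows how small that payload is compared with a full node.status update.

// leasedemo.go: a hedged sketch assuming a client-go version that takes context arguments.
package leasedemo

import (
    "context"
    "encoding/json"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// printNodeLease dumps the Lease object that backs the NodeLease heartbeat; it lives in
// the kube-node-lease namespace under the node's name, and its spec is only a few fields.
func printNodeLease(clientset kubernetes.Interface, nodeName string) error {
    lease, err := clientset.CoordinationV1().Leases("kube-node-lease").Get(
        context.TODO(), nodeName, metav1.GetOptions{})
    if err != nil {
        return err
    }
    spec, err := json.Marshal(lease.Spec)
    if err != nil {
        return err
    }
    fmt.Println(string(spec))
    return nil
}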

A quick look at the source, the tryUpdateNodeStatus method in https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_node_status.go: it first gets the Node object (the comment notes that this Get is served from the apiserver cache first, which is cheaper), then setNodeStatus fills in the node's status (the heartbeat payload is essentially the node.status field), e.g. MachineInfo (CPU and memory), the image list, conditions and so on (see pkg/kubelet/kubelet_node_status.go#defaultNodeStatusFuncs for exactly what gets filled in). Finally PatchNodeStatus sends the node object to the apiserver; the nodeStatusUpdateFrequency parameter controls how often this happens, 10s by default.

func (kl *Kubelet) tryUpdateNodeStatus(tryNumber int) error {
    // In large clusters, GET and PUT operations on Node objects coming
    // from here are the majority of load on apiserver and etcd.
    // To reduce the load on etcd, we are serving GET operations from
    // apiserver cache (the data might be slightly delayed but it doesn't
    // seem to cause more conflict - the delays are pretty small).
    // If it result in a conflict, all retries are served directly from etcd.
    opts := metav1.GetOptions{}
    if tryNumber == 0 {
        util.FromApiserverCache(&opts)
    }
    node, err := kl.heartbeatClient.CoreV1().Nodes().Get(context.TODO(), string(kl.nodeName), opts)
    if err != nil {
        return fmt.Errorf("error getting node %q: %v", kl.nodeName, err)
    }

    originalNode := node.DeepCopy()
    if originalNode == nil {
        return fmt.Errorf("nil %q node object", kl.nodeName)
    }

    podCIDRChanged := false
    if len(node.Spec.PodCIDRs) != 0 {
        // Pod CIDR could have been updated before, so we cannot rely on
        // node.Spec.PodCIDR being non-empty. We also need to know if pod CIDR is
        // actually changed.
        podCIDRs := strings.Join(node.Spec.PodCIDRs, ",")
        if podCIDRChanged, err = kl.updatePodCIDR(podCIDRs); err != nil {
            klog.Errorf(err.Error())
        }
    }

    kl.setNodeStatus(node)

    now := kl.clock.Now()
    if now.Before(kl.lastStatusReportTime.Add(kl.nodeStatusReportFrequency)) {
        if !podCIDRChanged && !nodeStatusHasChanged(&originalNode.Status, &node.Status) {
            // We must mark the volumes as ReportedInUse in volume manager's dsw even
            // if no changes were made to the node status (no volumes were added or removed
            // from the VolumesInUse list).
            //
            // The reason is that on a kubelet restart, the volume manager's dsw is
            // repopulated and the volume ReportedInUse is initialized to false, while the
            // VolumesInUse list from the Node object still contains the state from the
            // previous kubelet instantiation.
            //
            // Once the volumes are added to the dsw, the ReportedInUse field needs to be
            // synced from the VolumesInUse list in the Node.Status.
            //
            // The MarkVolumesAsReportedInUse() call cannot be performed in dsw directly
            // because it does not have access to the Node object.
            // This also cannot be populated on node status manager init because the volume
            // may not have been added to dsw at that time.
            kl.volumeManager.MarkVolumesAsReportedInUse(node.Status.VolumesInUse)
            return nil
        }
    }

    // Patch the current status on the API server
    updatedNode, _, err := nodeutil.PatchNodeStatus(kl.heartbeatClient.CoreV1(), types.NodeName(kl.nodeName), originalNode, node)
    if err != nil {
        return err
    }
    kl.lastStatusReportTime = now
    kl.setLastObservedNodeAddresses(updatedNode.Status.Addresses)
    // If update finishes successfully, mark the volumeInUse as reportedInUse to indicate
    // those volumes are already updated in the node's status
    kl.volumeManager.MarkVolumesAsReportedInUse(updatedNode.Status.VolumesInUse)
    return nil
}

1.2 Is it feasible to patch node.status through a webhook: verified — with a mutating webhook configured we can intercept the heartbeat and patch node.status.
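For reference, here is a hedged sketch of how the webhook handler could answer the AdmissionReview for a nodes/status update with a JSON patch. The function names are assumptions, and the computeCPU parameter is where something like the oversellCPU sketch above would plug in; TLS and HTTP serving are omitted.

// webhook.go: a sketch of building the AdmissionResponse, using admission/v1beta1 to match
// the admissionregistration.k8s.io/v1beta1 configuration shown below.
package webhook

import (
    "encoding/json"
    "fmt"

    admissionv1beta1 "k8s.io/api/admission/v1beta1"
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// mutateNodeStatus builds the AdmissionResponse for a nodes/status update. computeCPU
// decides the oversold allocatable CPU (e.g. the oversellCPU sketch above); returning
// nil means "leave the node untouched".
func mutateNodeStatus(req *admissionv1beta1.AdmissionRequest,
    computeCPU func(*corev1.Node) *resource.Quantity) *admissionv1beta1.AdmissionResponse {

    allow := &admissionv1beta1.AdmissionResponse{UID: req.UID, Allowed: true}

    node := corev1.Node{}
    if err := json.Unmarshal(req.Object.Raw, &node); err != nil {
        return allow // never block the heartbeat because of our own error
    }
    newCPU := computeCPU(&node)
    if newCPU == nil {
        return allow
    }

    // JSON Patch (RFC 6902) rewriting the allocatable cpu carried in this heartbeat.
    patch := fmt.Sprintf(`[{"op":"replace","path":"/status/allocatable/cpu","value":"%s"}]`,
        newCPU.String())
    pt := admissionv1beta1.PatchTypeJSONPatch
    return &admissionv1beta1.AdmissionResponse{
        UID:       req.UID,
        Allowed:   true,
        Patch:     []byte(patch),
        PatchType: &pt,
    }
}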

1.3 Webhook configuration details: a MutatingWebhookConfiguration is used to register the webhook with the apiserver, as shown below.

apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: demo-webhook
  labels:
    app: demo-webhook
    kind: mutator
webhooks:
  - name: demo-webhook.app.svc
    clientConfig:
      service:
        name: demo-webhook
        namespace: app
        path: "/mutate"
      caBundle:  ${CA_BUNDLE}
    rules:
      - operations: [ "UPDATE" ]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["nodes/status"]

Note the .webhooks[0].rules field in the YAML above. For a Kubernetes resource such as Pod, pod.status is called a subresource of the pod. We initially set rules.resources to ["nodes"] or ["*"], and in neither case did the webhook receive the 10-second heartbeats. The source code that decides whether a request arriving at the apiserver should be forwarded to a webhook explains why: as shown below, both the Resource and the SubResource are matched, which means only ["*/*"] matches every case.

func (r *Matcher) resource() bool {
    opRes, opSub := r.Attr.GetResource().Resource, r.Attr.GetSubresource()
    for _, res := range r.Rule.Resources {
        res, sub := splitResource(res)
        resMatch := res == "*" || res == opRes
        subMatch := sub == "*" || sub == opSub
        if resMatch && subMatch {
            return true
        }
    }
    return false
}

1.4 Which patch types does Kubernetes support: the apiserver distinguishes the patch type by the Content-Type header of the request.

JSON Patch, Content-Type: application/json-patch+json, see https://tools.ietf.org/html/rfc6902. This style supports a fairly rich set of operations, such as add, replace, remove and copy.

Merge Patch, Content-Type: application/merge-patch+json, see https://tools.ietf.org/html/rfc7386. With this style, whatever the patch specifies for a field simply replaces what was there.

Strategic Merge Patch, Content-Type: application/strategic-merge-patch+json. This style relies on metadata attached to the object's type definition to decide which fields should be merged rather than replaced by default. For example, the Containers field in pod.Spec declares via patchStrategy and patchMergeKey that the list is merged using name as the key:

Containers []Container `json:"containers" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,2,rep,name=containers"`
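To make the three Content-Types concrete, here is a small client-go illustration (my own sketch, assuming a clientset variable and a client-go version that takes context arguments) that applies the same label change with each patch type:

// patchdemo.go: a hedged sketch; the label key/value and node name are only examples.
package patchdemo

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/kubernetes"
)

func patchExamples(clientset kubernetes.Interface, nodeName string) error {
    ctx := context.TODO()
    nodes := clientset.CoreV1().Nodes()

    // JSON Patch (application/json-patch+json): explicit operations such as add/replace/remove.
    // Assumes the labels map already exists on the node (it normally does).
    jsonPatch := []byte(`[{"op": "add", "path": "/metadata/labels/env", "value": "prod"}]`)
    if _, err := nodes.Patch(ctx, nodeName, types.JSONPatchType, jsonPatch, metav1.PatchOptions{}); err != nil {
        return err
    }

    // Merge Patch (application/merge-patch+json): leaf values given here replace the existing
    // ones; lists are replaced wholesale.
    mergePatch := []byte(`{"metadata": {"labels": {"env": "prod"}}}`)
    if _, err := nodes.Patch(ctx, nodeName, types.MergePatchType, mergePatch, metav1.PatchOptions{}); err != nil {
        return err
    }

    // Strategic Merge Patch (application/strategic-merge-patch+json): like a merge patch for
    // simple fields, but lists such as pod.spec.containers are merged by the key declared
    // via patchStrategy/patchMergeKey in the type definition.
    strategicPatch := []byte(`{"metadata": {"labels": {"env": "prod"}}}`)
    _, err := nodes.Patch(ctx, nodeName, types.StrategicMergePatchType, strategicPatch, metav1.PatchOptions{})
    return err
}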

2. Once a node's resources are oversold, does Kubernetes' dynamic cgroup adjustment still work correctly?

3. Node status updates are frequent, and every status update triggers the webhook; in a large cluster this can become a performance problem for the apiserver. How do we deal with that?

The NodeLease heartbeat mechanism can be enabled (we still need to check whether the NodeLease heartbeat carries the allocatable watermark); another option is to send heartbeats less frequently (a larger nodeStatusUpdateFrequency). In any case the webhook logic should be kept as simple as possible.

4. Does node resource overselling also inflate the Kubelet Eviction configuration, or does eviction still follow the node's real capacity and load? If eviction is affected, how do we fix it?

5. When the oversell ratio is lowered, we can end up with Sum(pods' request resource) > node's allocatable on a node. Is this risky, and how should it be handled?

A first-cut answer: when lowering the oversell ratio, make sure allocatable never drops below Sum(pods' request resource).
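A hedged Go sketch of that guard — the pod list and the function names are my own assumptions — flooring the oversold value at the CPU already requested by the pods on the node:

// clamp.go: a sketch of flooring the oversold allocatable at the node's current pod requests.
// The caller is assumed to pass the pods currently bound to this node.
package clamp

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

func clampedAllocatableCPU(oversold resource.Quantity, pods []corev1.Pod) resource.Quantity {
    requested := resource.NewMilliQuantity(0, resource.DecimalSI)
    for _, pod := range pods {
        for _, c := range pod.Spec.Containers {
            if req, ok := c.Resources.Requests[corev1.ResourceCPU]; ok {
                requested.Add(req)
            }
        }
    }
    // When the ratio is lowered, never report less than what is already requested,
    // otherwise Sum(pods' requests) would exceed the node's allocatable.
    if oversold.Cmp(*requested) < 0 {
        return *requested
    }
    return oversold
}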

6. Node monitoring depends on the node's Allocatable & Capacity resources, so once they are oversold the monitoring data for the node is no longer accurate and needs some correction. How can the monitoring system also become aware of the oversell ratio dynamically and correct its data and dashboards?

7. How exactly should Node Allocatable and Capacity each be oversold? How does overselling affect the node's reserved resources?

Notes

In an OCP environment, the MutatingAdmissionWebhook plugin has to be enabled by adding the following to the master configuration:

admissionConfig:
  pluginConfig:
    MutatingAdmissionWebhook:
      configuration:
        apiVersion: v1
        kind: DefaultAdmissionConfig
        disable: false
 

References:

https://blog.csdn.net/shida_csdn/article/details/84286058
