Kubernetes Design Overview

Overview

Kubernetes builds on top of Docker to construct a clustered container scheduling service.  The goal of the project is to enable users to ask a Kubernetes cluster to run a set of containers.  The system will automatically pick a worker node to run those containers on.

As container-based applications and systems get larger, some tools are provided to facilitate sanity.  This includes ways for containers to find and communicate with each other, and ways to work with and manage sets of containers that do similar work.

When looking at the architecture of the system, we'll break it down into services that run on the worker node and services that play a "master" role.

Key Concept: Container Pod

While Docker itself works with individual containers, Kubernetes works with a pod.  A pod is a group of containers that are scheduled onto the same physical node.  In addition to defining the containers that run in the pod, a pod definition specifies a set of shared storage volumes, and the containers in the pod all use the same network namespace/IP.  Ports are also mapped on a per-pod basis.
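
As a rough illustration, a manifest for a two-container pod might look like the sketch below.  This is a minimal sketch assuming an early-style YAML schema; the field names and example images are illustrative assumptions, not a definitive reference.

    version: v1beta1
    id: example-pod                  # hypothetical pod name
    containers:
      - name: web
        image: example/web           # hypothetical image
        ports:
          - containerPort: 8080
            hostPort: 80             # ports are mapped per pod, not per container
        volumeMounts:
          - name: shared-data
            mountPath: /var/data
      - name: log-shipper
        image: example/shipper       # hypothetical image; shares the pod's network namespace/IP
        volumeMounts:
          - name: shared-data
            mountPath: /var/data
    volumes:
      - name: shared-data            # storage volume visible to both containers

Because the containers share a network namespace, they can also reach each other over localhost.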

The Kubernetes Node

The Kubernetes node has the services necessary to run Docker containers and be managed from the master systems.

The Kubernetes node design is an extension of the Container-optimized Google Compute Engine image.  Over time the plan is for these images/nodes to merge and become the same thing, used in different ways.

Each node runs Docker, of course.  Docker takes care of the details of downloading images and running containers.

Kubelet

The second component on the node is called the kubelet.  The Kubelet is the logical successor (and a rewrite in Go) of the Container Agent that is part of the Compute Engine image.

The Kubelet works in terms of a container manifest.  A container manifest is a YAML file that describes a pod.  The Kubelet takes a set of manifests that are provided via various mechanisms and ensures that the containers described in those manifests are started and continue running.

There are 4 ways that a container manifest can be provided to the Kubelet:

  • File: A path passed as a flag on the command line.  This file is rechecked every 20 seconds (configurable with a flag).

  • HTTP endpoint: An HTTP endpoint passed as a parameter on the command line.  This endpoint is checked every 20 seconds (also configurable with a flag).

  • etcd server: The Kubelet will reach out and do a watch on an etcd server.  The etcd path that is watched is /registry/hosts/$(hostname -f).  As this is a watch, changes are noticed and acted upon very quickly.

  • HTTP server: The kubelet can also listen for HTTP and respond to a simple API (currently under-specified) to submit a new manifest.

Kubernetes Proxy

Each node also runs a simple network proxy.  This reflects services as defined in the Kubernetes API on each node and can do simple TCP stream forwarding or round robin TCP forwarding across a set of backends.
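
For illustration, a service definition might look like the following sketch; the field names are assumed for illustration, not quoted from the API.  The essential parts are a name, a port for the per-node proxy to listen on, and a label query selecting the backend pods.

    id: frontend-svc          # hypothetical service name
    port: 9000                # the proxy on each node listens on this port
    selector:
      tier: frontend          # label query: connections are forwarded to pods labeled tier=frontend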

The Kubernetes Master

The Kubernetes master is split into a set of components.  These work together to provide a unified view of the cluster.

etcd

All persistent master state is stored in an instance of etcd.  This provides a great way to store configuration data reliably.  With watch support, coordinating components can be notified very quickly of changes.

Kubernetes API Server

This server serves up the main Kubernetes API.

It validates and configures data for 3 types of objects:

  • pod: Each pod has a representation at the Kubernetes API level.

  • service: A service is a configuration unit for the proxies that run on every worker node.  It is named and points to one or more Pods.

  • replicationController: A replication controller takes a template and ensures that there is a specified number of "replicas" of that template running at any one time.  If there are too many, it'll kill some.  If there are too few, it'll start more.  (See the sketch after this list.)
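
A minimal sketch of a replicationController declaration, again using assumed, illustrative field names rather than the authoritative schema:

    id: frontend-controller        # hypothetical name
    desiredState:
      replicas: 2                  # keep exactly 2 pods matching the selector running
      replicaSelector:
        tier: frontend             # label query used to find the controlled pods
      podTemplate:                 # the pod definition stamped out for each replica
        desiredState:
          manifest:
            version: v1beta1
            id: frontend
            containers:
              - name: web
                image: example/web # hypothetical image
        labels:
          tier: frontend           # labels applied to each created pod; must match the selector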

Beyond just servicing REST operations, validating them and storing them in etcd, the API Server does two other things:

  • Schedules pods to worker nodes.  Right now the scheduler is very simple.

  • Synchronizes pod information (where they are, what ports they are exposing) with the service configuration.

Kubernetes Controller Manager Server

The replicationController type described above isn't strictly necessary for Kubernetes to be useful.  It is really a service that is layered on top of the simple pod API.  To enforce this layering, the logic for the replicationController is actually broken out into another server.  This server watches etcd for changes to replicationController objects and then uses the public Kubernetes API to implement the replication algorithm.

Key Concept: Labels

Pods are organized using labels.  Each pod can have a set of key/value labels set on it.

Via a "label query" the user can identify a set of pods.  This simple mechanism is a key part of how bothservices and replicationControllers work.  The set of pods that a service points at is defined with a label query.  Similarly the population of pods that a replicationController is monitoring is also defined with a label query.

Label queries would typically be used to identify and group pods into, say, a tier in an application.  You could also identify the stack, such as dev, staging, or production.

These sets could be overlapping.  For instance, a service might point to all pods with tier in (frontend), stack in (prod).  Now say you have 10 replicated pods that make up this tier.  But you want to be able to 'canary' a new version of this component.  You could set up a replicationController (with replicas set to 9) for the bulk of the replicas with labels tier=frontend, stack=prod, canary=no and another replicationController (with replicas set to 1) for the canary with labels tier=frontend, stack=prod, canary=yes.  Now the service is covering both the canary and non-canary pods.  But you can mess with the replicationControllers separately to test things out, monitor the results, etc.
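
Sketched with the same illustrative schema as the earlier examples (field names assumed), the canary setup might look like this: one service whose selector omits the canary label, plus two replicationControllers that differ only in that label and in their replica counts.

    # Service: selector omits 'canary', so it covers canary and non-canary pods alike
    id: frontend-svc
    port: 9000
    selector:
      tier: frontend
      stack: prod
    ---
    # Bulk replicationController: 9 replicas of the stable version
    id: frontend-stable
    desiredState:
      replicas: 9
      replicaSelector: {tier: frontend, stack: prod, canary: "no"}
      # podTemplate (stable image) omitted for brevity
    ---
    # Canary replicationController: 1 replica of the new version
    id: frontend-canary
    desiredState:
      replicas: 1
      replicaSelector: {tier: frontend, stack: prod, canary: "yes"}
      # podTemplate (new image) omitted for brevity

Resizing or deleting the canary controller then affects only the canary pod, while the service keeps balancing across whichever matching pods exist.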

Network Model

Kubernetes expands the default Docker networking model.  The goal is to have each pod have an IP in a shared networking namespace that has full communication with other physical computers and containers across the network.  In this way, it becomes much less necessary to map ports.

For the Google Compute Engine cluster configuration scripts, advanced routing is set up so that each VM has an extra 256 IP addresses that get routed to it.  This is in addition to the 'main' IP address assigned to the VM, which is NAT-ed for Internet access.  The networking bridge (called cbr0 to differentiate it from docker0) is set up outside of Docker proper and only does NAT for egress network traffic that isn't aimed at the virtual network.

Ports mapped in from the 'main IP' (and hence the internet if the right firewall rules are set up) are proxied in user mode by Docker.  In the future, this should be done with iptables by either the Kubelet or Docker: Issue #15.

Release Process

Right now "building" or "releasing" Kubernetes consists of some scripts (in release/) to create a tar of the necessary data and then uploading it to Google Cloud Storage.   In the future we will generate Docker images for the bulk of the above described components: Issue #19.

Source: https://www.cnblogs.com/amoyzhu/p/5442762.html