Repository: zoux86/learning-k8s-source-code Branch: master Commit: 15e9fd4aa46c Files: 103 Total size: 1.8 MB Directory structure: gitextract_ays31j2l/ ├── .gitignore ├── README.md ├── docker/ │ ├── 0-docker章节介绍.md │ ├── 1. linux namespaces 知识准备.md │ ├── 10. 如何下载并二进制编译docker源码.md │ ├── 11. dockercli 源码分析-docker run为例.md │ ├── 12. dockerd源码分析-docker run为例.md │ ├── 2. linux cgroup 知识准备.md │ ├── 3. chroot 命令详解.md │ ├── 4. 如何用golang 实现一个 busybox的容器.md │ ├── 5. docker-overlay技术.md │ ├── 6. docker pull原理分析.md │ ├── 7. docker 命令详解.md │ ├── 8. docker核心组件介绍.md │ ├── 9. docker问题链路排查实例.md │ └── 其他/ │ ├── 补充-僵尸进程处理.md │ └── 补充-容器进程.md ├── etcd/ │ ├── 0. etcd常用操作.md │ └── 协议理论知识/ │ ├── 1. cap原理.md │ ├── 2. ACID理论.md │ ├── 3. base理论.md │ └── 4. raft协议.md └── k8s/ ├── README.md ├── client-go/ │ ├── 1- clientGo简介与章节安排.md │ ├── 10. Controller-runtime原理分析.md │ ├── 2-clientGo提供的四种客户端.md │ ├── 3. apiserver中的list-watch机制.md │ ├── 4. client informer机制简介.md │ ├── 5. SharedInformerFactory机制.md │ ├── 6. informer机制之cache.indexer机制.md │ ├── 7. informer机制详解.md │ ├── 8. client-go的workqueue详解.md │ └── 9.从0到1使用kubebuilder创建crd.md ├── cni/ │ ├── 0.章节介绍.md │ ├── 1. 网络基础知识.md │ ├── 2. docker 4种 网络模式.md │ ├── 3. docker容器网络的底层实现.md │ ├── 4.k8s pod通信原理介绍.md │ ├── 5. k8s 容器网络接口介绍.md │ ├── 6.如何订制自己的cni.md │ ├── 7. flannel原理浅析分析.md │ └── 8. calico原理浅析md.md ├── install-k8s-from source code/ │ ├── 1-debian二进制安装v1.17 k8s.md │ └── 2.window配置goland环境阅读kubernetes源码.md ├── kcm/ │ ├── 0-kcm启动流程.md │ ├── 1-rs controller-manager源码分析.md │ ├── 10-kcm-NodeLifecycleController源码分析.md │ ├── 11.k8s node状态更新机制 .md │ ├── 2-deployment controller-manager源码分析.md │ ├── 3-k8s gc源码分析.md │ ├── 3-k8s中以不同的策略删除资源时发生了什么.md │ ├── 4-hpa-自定义metric server.md │ ├── 4-hpa源码分析.md │ ├── 5-job controller-manager源码分析.md │ ├── 6-namespaces controller-manager源码分析.md │ ├── 9-kubernetes污点和容忍度概念介绍.md │ └── kcm篇源码分析总结.md ├── kube-apiserver/ │ ├── 0-apiserver笔记规划.md │ ├── 1-v1.17 kube-apiserver启动参数介绍.md │ ├── 10-kube-apiserver创建AggregatorServer.md │ ├── 11-kube-apiserver 启动http和https服务.md │ ├── 12-k8s之Authentication.md │ ├── 13-k8s之Authorization.md │ ├── 14-k8s之admission分析.md │ ├── 15-k8s之etcd存储实现.md │ ├── 16. 创建更新删除资源时apiserver做了什么工作.md │ ├── 17-k8s之serviceaccount.md │ ├── 18 event的定义.md │ ├── 19. secret对象详解.md │ ├── 2-kube-apiserver概述.md │ ├── 20. kubectl exec原理介绍.md │ ├── 21-kube-apiserver list-watch源码分析.md │ ├── 3-k8s之资源介绍.md │ ├── 4-scheme介绍.md │ ├── 5-kube-apiserver启动流程汇总.md │ ├── 6-kube-apiserver启动流程-资源注册+命令行初始.md │ ├── 7-kube-apiserver创建APIServer通用配置.md │ ├── 8-kube-apiserver创建APIExtensionsServer.md │ └── 9-kube-apiserver 创建KubeAPIServer.md ├── kube-scheduler/ │ ├── 1. 
kube-scheduler简介.md │ ├── 2-kube-scheduler源码分析.md │ └── 3-如何编写一个scheduler plugin.md ├── kubectl/ │ ├── 0-ReadMe.md │ ├── 1-kubectl 整体流程分析.md │ ├── 2-client-go中连接apiserver的4种client介绍.md │ ├── 3-kubectl Factory机制-上.md │ ├── 4-kubectl Factor机制-下.md │ ├── 5 visitor机制.md │ ├── 6-kubectl中的所有visitor.md │ ├── 7-kubectl create使用到的visitor.md │ ├── 8- kubectl printer分析.md │ └── 9-kubectl create整体流程分析.md └── kubelet/ ├── 0-readme.md ├── 1-kubelet 架构浅析.md ├── 10-k8s驱逐机制汇总.md ├── 2-kubelet初始化流程-上.md ├── 3-kubelet初始化流程-下.md ├── 4-kubelet 监听pod变化.md ├── 5-pod创建流程.md ├── 6-pod pleg更新流程.md ├── 7-pod delete流程.md ├── 8-kubelet gc流程.md └── 9-kubelet驱逐源码分析.md ================================================ FILE CONTENTS ================================================ ================================================ FILE: .gitignore ================================================ */.DS_Store .DS_Store: **/.DS_Store .DS_Store? ================================================ FILE: README.md ================================================ # learning-k8s-source-code 从源码角度出发,学习k8s的原理。 目前打算以`kube-apiserver` `kube-controller-manager` `kube-scheduler` `kubelet` `proxy` 和 `kubectl` 这6个组件为主线进行源码级别的学习。 同时还顺便记录一些平时用到和k8s相关的知识,例如etcd, docker, linux等相关知识。 其他相关k8s组件的分析笔记发布在以下博客: https://www.jianshu.com/c/b097c5e7eb9b https://www.zhihu.com/column/c_1523054529113579520 https://blog.csdn.net/zxyuliwuzhognzx11/category_11880534.html?spm=1001.2014.3001.5482 ================================================ FILE: docker/0-docker章节介绍.md ================================================ 容器技术是 云发展的一个重要基础,docker就是当前很火的一种容器技术。 之前就知道docker利用了linux的cgruop, namespace + chroot + 联合文件系统实现的。 本章力求从源码角度对docker进行分析, docker版本为:https://github.com/moby/moby/tree/v19.03.9 章节安排: (1)了解linux namespaces, cgroup, choot,联合文件系统的原理 (2)了解docker源码结构 (3)以常见的docker run nginx ls命令为主线, 从源码入手了解该命令背后的详细过程 ================================================ FILE: docker/1. linux namespaces 知识准备.md ================================================ * [1 namespace 简介](#1-namespace-简介) * [2\. pid namespace](#2-pid-namespace) * [2\.1 如何查看一个进程的 pid namespace](#21-如何查看一个进程的-pid-namespace) * [2\.2 子进程不共享父进程的pid namespaces](#22-子进程不共享父进程的pid-namespaces) * [2\.3 pid namespace的原理](#23-pid-namespace的原理) * [2\.4 task\_struct 结构图](#24-task_struct-结构图) * [3 总结](#3-总结) * [4\.参考](#4参考) ### 1 namespace 简介 `namespace(命名空间)` 是Linux提供的一种内核级别环境隔离的方法,很多编程语言也有 namespace 这样的功能,例如C++,Java等,编程语言的 namespace 是为了解决项目中能够在不同的命名空间里使用相同的函数名或者类名。而Linux的 namespace 也是为了实现资源能够在不同的命名空间里有相同的名称,譬如在 `A命名空间` 有个pid为1的进程,而在 `B命名空间` 中也可以有一个pid为1的进程。 有了 `namespace` 就可以实现基本的容器功能,著名的 `Docker` 也是使用了 namespace 来实现资源隔离的。 Linux支持6种资源的 `namespace`,分别为(文档): | Type | Parameter | Linux Version | | ------------------ | ------------- | ------------- | | Mount namespaces | CLONE_NEWNS | Linux 2.4.19 | | UTS namespaces | CLONE_NEWUTS | Linux 2.6.19 | | IPC namespaces | CLONE_NEWIPC | Linux 2.6.19 | | PID namespaces | CLONE_NEWPID | Linux 2.6.24 | | Network namespaces | CLONE_NEWNET | Linux 2.6.24 | | User namespaces | CLONE_NEWUSER | Linux 2.6.23 |
个人理解:namespace 就是对进程的内核资源(mount, uts, ipc, pid, network, user 这六种)进行隔离。接下来以 pid namespace 为例,介绍 namespace 是如何起作用的。
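在进入细节之前,可以先用一小段 Go 代码直观感受一下 pid 隔离的效果:在新的 pid namespace 里,子进程看到自己的 pid 是 1。下面是一段示意代码,只能在 Linux 上、以 root 身份运行:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// 通过 CLONE_NEWPID 让子进程运行在一个新的 pid namespace 中
	cmd := exec.Command("/bin/sh", "-c", "echo in new pid ns, my pid is $$")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWPID,
	}
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Println("run error:", err)
	}
	// 预期输出:in new pid ns, my pid is 1
}
```

这和后面 2.2 节里 `unshare --fork --pid` 做的事情是类似的:父进程留在原来的 pid namespace,子进程落在新的 pid namespace 里。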
### 2. pid namespace #### 2.1 如何查看一个进程的 pid namespace /proc/pid/ns 目录下目前可以看到pid namespace ``` ps ajxf 查看到一个父进程,和子进程 4556 4574 4574 4574 ? -1 Ss 0 0:00 \_ nginx: master process nginx -g daemon off; 4574 4621 4574 4574 ? -1 S 101 0:00 \_ nginx: worker process 4574 4629 4574 4574 ? -1 S 101 0:00 \_ nginx: worker process // 父进程 root@k8s-master:/proc/170/ns# ls -l /proc/4574/ns total 0 lrwxrwxrwx 1 root root 0 Dec 5 08:49 cgroup -> 'cgroup:[4026531835]' lrwxrwxrwx 1 root root 0 Dec 5 08:49 ipc -> 'ipc:[4026532263]' lrwxrwxrwx 1 root root 0 Dec 5 08:49 mnt -> 'mnt:[4026532331]' lrwxrwxrwx 1 root root 0 Dec 5 08:49 net -> 'net:[4026532266]' lrwxrwxrwx 1 root root 0 Dec 5 08:49 pid -> 'pid:[4026532333]' lrwxrwxrwx 1 root root 0 Dec 5 08:49 pid_for_children -> 'pid:[4026532333]' lrwxrwxrwx 1 root root 0 Dec 5 08:49 user -> 'user:[4026531837]' lrwxrwxrwx 1 root root 0 Dec 5 08:49 uts -> 'uts:[4026532332]' // 子进程和父进程有一样的namespaces root@k8s-master:/proc/170/ns# ls -l /proc/4621/ns total 0 lrwxrwxrwx 1 systemd-timesync systemd-journal 0 Dec 5 08:49 cgroup -> 'cgroup:[4026531835]' lrwxrwxrwx 1 systemd-timesync systemd-journal 0 Dec 5 08:49 ipc -> 'ipc:[4026532263]' lrwxrwxrwx 1 systemd-timesync systemd-journal 0 Dec 5 08:49 mnt -> 'mnt:[4026532331]' lrwxrwxrwx 1 systemd-timesync systemd-journal 0 Dec 5 08:49 net -> 'net:[4026532266]' lrwxrwxrwx 1 systemd-timesync systemd-journal 0 Dec 5 08:49 pid -> 'pid:[4026532333]' lrwxrwxrwx 1 systemd-timesync systemd-journal 0 Dec 5 08:49 pid_for_children -> 'pid:[4026532333]' lrwxrwxrwx 1 systemd-timesync systemd-journal 0 Dec 5 08:49 user -> 'user:[4026531837]' lrwxrwxrwx 1 systemd-timesync systemd-journal 0 Dec 5 08:49 uts -> 'uts:[4026532332] ```
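除了 ls -l,也可以在程序里读取 /proc/&lt;pid&gt;/ns/pid 这个符号链接来判断两个进程是否在同一个 pid namespace(链接内容形如 pid:[4026532333],内容相同即同一个 namespace)。下面是一段示意的 Go 代码,读取 /proc/1/ns 需要 root 权限:

```go
package main

import (
	"fmt"
	"os"
)

// pidNS 返回 /proc/<pid>/ns/pid 符号链接的内容,例如 "pid:[4026531836]"
func pidNS(pid string) string {
	link, err := os.Readlink("/proc/" + pid + "/ns/pid")
	if err != nil {
		panic(err)
	}
	return link
}

func main() {
	self := pidNS("self") // 当前进程
	one := pidNS("1")     // 1 号进程
	fmt.Println("self :", self)
	fmt.Println("pid 1:", one)
	fmt.Println("same pid namespace:", self == one)
}
```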
#### 2.2 子进程不共享父进程的pid namespaces ``` root@k8s-master:~# unshare --fork --pid --mount-proc sleep 100 ```
``` 1 701 701 701 ? -1 Ss 0 1:28 /usr/sbin/sshd -D 701 4462 4462 4462 ? -1 Ss 0 0:00 \_ sshd: root@pts/0,pts/1 4462 4497 4497 4497 pts/0 3994 Ss 0 0:00 \_ -bash 4497 3106 3106 4497 pts/0 3994 S 0 0:00 | \_ bash 3106 3994 3994 4497 pts/0 3994 S+ 0 0:00 | \_ unshare --fork --pid --mount- 3994 3995 3994 4497 pts/0 3994 S+ 0 0:00 | \_ sleep 100 ```
``` // 这个是 sleep 100的进程,所以他的子进程和它是公用pid的 root@k8s-master:~# ls -l /proc/3995/ns total 0 lrwxrwxrwx 1 root root 0 Dec 5 10:00 cgroup -> 'cgroup:[4026531835]' lrwxrwxrwx 1 root root 0 Dec 5 10:00 ipc -> 'ipc:[4026531839]' lrwxrwxrwx 1 root root 0 Dec 5 10:00 mnt -> 'mnt:[4026532334]' lrwxrwxrwx 1 root root 0 Dec 5 10:00 net -> 'net:[4026531992]' lrwxrwxrwx 1 root root 0 Dec 5 10:00 pid -> 'pid:[4026532335]' // 这里 pid 和pid_for_children是一样的 lrwxrwxrwx 1 root root 0 Dec 5 10:00 pid_for_children -> 'pid:[4026532335]' lrwxrwxrwx 1 root root 0 Dec 5 10:00 user -> 'user:[4026531837]' lrwxrwxrwx 1 root root 0 Dec 5 10:00 uts -> 'uts:[4026531838]' root@k8s-master:~# // 这个是 unshare 的进程,因为使用了 --pid mount,所以和父进程pid namespaces是不一样的 root@k8s-master:~# ls -l /proc/3994/ns total 0 lrwxrwxrwx 1 root root 0 Dec 5 10:01 cgroup -> 'cgroup:[4026531835]' lrwxrwxrwx 1 root root 0 Dec 5 10:01 ipc -> 'ipc:[4026531839]' lrwxrwxrwx 1 root root 0 Dec 5 10:01 mnt -> 'mnt:[4026532334]' lrwxrwxrwx 1 root root 0 Dec 5 10:01 net -> 'net:[4026531992]' lrwxrwxrwx 1 root root 0 Dec 5 10:01 pid -> 'pid:[4026531836]' // 这里就是不一样的,因为这个进程是 lrwxrwxrwx 1 root root 0 Dec 5 10:01 pid_for_children -> 'pid:[4026532335]' lrwxrwxrwx 1 root root 0 Dec 5 10:01 user -> 'user:[4026531837]' lrwxrwxrwx 1 root root 0 Dec 5 10:01 uts -> 'uts:[4026531838]' ``` #### 2.3 pid namespace的原理 为了让每个进程都可以从属于某一个namespace,Linux内核为进程描述符添加了一个 `struct nsproxy` 的结构,如下: ``` struct task_struct { ... /* namespaces */ struct nsproxy *nsproxy; ... } struct nsproxy { atomic_t count; struct uts_namespace *uts_ns; struct ipc_namespace *ipc_ns; struct mnt_namespace *mnt_ns; struct pid_namespace *pid_ns; struct user_namespace *user_ns; struct net *net_ns; }; ``` 从 `struct nsproxy` 结构的定义可以看出,Linux为每种不同类型的资源定义了不同的命名空间结构体进行管理。比如对于 `pid命名空间` 定义了 `struct pid_namespace` 结构来管理 。由于 namespace 涉及的资源种类比较多,所以本文主要以 `pid命名空间` 作为分析的对象。 我们先来看看管理 `pid命名空间` 的 `struct pid_namespace` 结构的定义: ``` struct pid_namespace { struct kref kref; struct pidmap pidmap[PIDMAP_ENTRIES]; int last_pid; struct task_struct *child_reaper; struct kmem_cache *pid_cachep; unsigned int level; struct pid_namespace *parent; #ifdef CONFIG_PROC_FS struct vfsmount *proc_mnt; #endif }; ``` 因为 `struct pid_namespace` 结构主要用于为当前 `pid命名空间` 分配空闲的pid,所以定义比较简单: - `kref` 成员是一个引用计数器,用于记录引用这个结构的进程数 - `pidmap` 成员用于快速找到可用pid的位图 - `last_pid` 成员是记录最后一个可用的pid - `level` 成员记录当前 `pid命名空间` 所在的层次 - `parent` 成员记录当前 `pid命名空间` 的父命名空间 由于 `pid命名空间` 是分层的,也就是说新创建一个 `pid命名空间` 时会记录父级 `pid命名空间` 到 `parent` 字段中,所以随着 `pid命名空间` 的创建,在内核中会形成一颗 `pid命名空间` 的树,如下图(图片来源): ![image-20220226155912906](./image/ns-1.png) 第0层的 `pid命名空间` 是 `init` 进程所在的命名空间。如果一个进程所在的 `pid命名空间` 为 `N`,那么其在 `0 ~ N 层pid命名空间` 都有一个唯一的pid号。也就是说 `高层pid命名空间` 的进程对 `低层pid命名空间` 的进程是可见的,但是 `低层pid命名空间`的进程对 `高层pid命名空间` 的进程是不可见的。 由于在 `第N层pid命名空间` 的进程其在 `0 ~ N层pid命名空间` 都有一个唯一的pid号,所以在进程描述符中通过 `pids` 成员来记录其在每个层的pid号,代码如下: ``` struct task_struct { ... struct pid_link pids[PIDTYPE_MAX]; ... 
} enum pid_type { PIDTYPE_PID, PIDTYPE_PGID, PIDTYPE_SID, PIDTYPE_MAX }; struct upid { int nr; struct pid_namespace *ns; struct hlist_node pid_chain; }; struct pid { atomic_t count; struct hlist_head tasks[PIDTYPE_MAX]; struct rcu_head rcu; unsigned int level; struct upid numbers[1]; }; struct pid_link { struct hlist_node node; struct pid *pid; }; ``` 这几个结构的关系如下图: ![image-20220226160353792](./image/ns-2.png) 我们主要关注 `struct pid` 这个结构,`struct pid` 有个类型为 `struct upid` 的成员 `numbers`,其定义为只有一个元素的数组,但是其实是一个动态的数据,它的元素个数与 `level` 的值一致,也就是说当 `level` 的值为5时,那么 `numbers` 成员就是一个拥有5个元素的数组。而每个元素记录了其在每层 `pid命名空间` 的pid号,而 `struct upid` 结构的 `nr` 成员就是用于记录进程在不同层级 `pid命名空间` 的pid号。 我们通过代码来看看怎么为进程分配pid号的,在内核中是用过 `alloc_pid()` 函数分配pid号的,代码如下: ``` struct pid *alloc_pid(struct pid_namespace *ns) { struct pid *pid; enum pid_type type; int i, nr; struct pid_namespace *tmp; struct upid *upid; pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL); if (!pid) goto out; tmp = ns; for (i = ns->level; i >= 0; i--) { nr = alloc_pidmap(tmp); // 为当前进程所在的不同层级pid命名空间分配一个pid if (nr < 0) goto out_free; pid->numbers[i].nr = nr; // 对应i层namespace中的pid数字 pid->numbers[i].ns = tmp; // 对应i层namespace的实体 tmp = tmp->parent; } get_pid_ns(ns); pid->level = ns->level; atomic_set(&pid->count, 1); for (type = 0; type < PIDTYPE_MAX; ++type) INIT_HLIST_HEAD(&pid->tasks[type]); spin_lock_irq(&pidmap_lock); for (i = ns->level; i >= 0; i--) { upid = &pid->numbers[i]; // 把upid连接到全局pid中, 用于快速查找pid hlist_add_head_rcu(&upid->pid_chain, &pid_hash[pid_hashfn(upid->nr, upid->ns)]); } spin_unlock_irq(&pidmap_lock); out: return pid; ... } ``` 上面的代码中,那个 `for (i = ns->level; i >= 0; i--)` 就是通过 `parent` 成员不断向上检索为不同层级的 `pid命名空间`分配一个唯一的pid号,并且保存到对应的 `nr` 字段中。另外,还会把进程所在各个层级的pid号添加到全局pid哈希表中,这样做是为了通过pid号快速找到进程。 现在我们来看看怎么通过pid号快速找到一个进程,在内核中 `find_get_pid()` 函数用来通过pid号查找对应的 `struct pid`结构,代码如下(find_get_pid() -> find_vpid() -> find_pid_ns()): ``` struct pid *find_get_pid(pid_t nr) { struct pid *pid; rcu_read_lock(); pid = get_pid(find_vpid(nr)); rcu_read_unlock(); return pid; } struct pid *find_vpid(int nr) { return find_pid_ns(nr, current->nsproxy->pid_ns); } struct pid *find_pid_ns(int nr, struct pid_namespace *ns) { struct hlist_node *elem; struct upid *pnr; hlist_for_each_entry_rcu(pnr, elem, &pid_hash[pid_hashfn(nr, ns)], pid_chain) if (pnr->nr == nr && pnr->ns == ns) return container_of(pnr, struct pid, numbers[ns->level]); return NULL; } ``` 通过pid号查找 `struct pid` 结构时,首先会把进程pid号和当前进程的 `pid命名空间` 传入到 `find_pid_ns()` 函数,而在 `find_pid_ns()` 函数中通过全局pid哈希表来快速查找对应的 `struct pid` 结构。获取到 `struct pid` 结构后就可以很容易地获取到进程对应的进程描述符,例如可以通过 `pid_task()` 函数来获取 `struct pid` 结构对应进程描述符,由于代码比较简单,这里就不再分析了。 #### 2.4 task_struct 结构图 ``` struct task_struct { /* 1. 
state: 进程执行时,它会根据具体情况改变状态。进程状态是进程调度和对换的依据。Linux中的进程主要有如下状态: 1) TASK_RUNNING: 可运行 处于这种状态的进程,只有两种状态: 1.1) 正在运行 正在运行的进程就是当前进程(由current所指向的进程) 1.2) 正准备运行 准备运行的进程只要得到CPU就可以立即投入运行,CPU是这些进程唯一等待的系统资源,系统中有一个运行队列(run_queue),用来容纳所有处于可运行状态的进程,调度程序执行时,从中选择一个进程投入运行 2) TASK_INTERRUPTIBLE: 可中断的等待状态,是针对等待某事件或其他资源的睡眠进程设置的,在内核发送信号给该进程表明事件已经发生时,进程状态变为TASK_RUNNING,它只要调度器选中该进程即可恢复执行 3) TASK_UNINTERRUPTIBLE: 不可中断的等待状态 处于该状态的进程正在等待某个事件(event)或某个资源,它肯定位于系统中的某个等待队列(wait_queue)中,处于不可中断等待态的进程是因为硬件环境不能满足而等待,例如等待特定的系统资源,它任何情况下都不能被打断,只能用特定的方式来唤醒它,例如唤醒函数wake_up()等      它们不能由外部信号唤醒,只能由内核亲自唤醒 4) TASK_ZOMBIE: 僵死 进程虽然已经终止,但由于某种原因,父进程还没有执行wait()系统调用,终止进程的信息也还没有回收。顾名思义,处于该状态的进程就是死进程,这种进程实际上是系统中的垃圾,必须进行相应处理以释放其占用的资源。 5) TASK_STOPPED: 暂停 此时的进程暂时停止运行来接受某种特殊处理。通常当进程接收到SIGSTOP、SIGTSTP、SIGTTIN或 SIGTTOU信号后就处于这种状态。例如,正接受调试的进程就处于这种状态           6) TASK_TRACED      从本质上来说,这属于TASK_STOPPED状态,用于从停止的进程中,将当前被调试的进程与常规的进程区分开来             7) TASK_DEAD      父进程wait系统调用发出后,当子进程退出时,父进程负责回收子进程的全部资源,子进程进入TASK_DEAD状态 8) TASK_SWAPPING: 换入/换出 */ volatile long state; /* 2. stack 进程内核栈,进程通过alloc_thread_info函数分配它的内核栈,通过free_thread_info函数释放所分配的内核栈 */ void *stack; /* 3. usage 进程描述符使用计数,被置为2时,表示进程描述符正在被使用而且其相应的进程处于活动状态 */ atomic_t usage; /* 4. flags flags是进程当前的状态标志(注意和运行状态区分) 1) #define PF_ALIGNWARN 0x00000001: 显示内存地址未对齐警告 2) #define PF_PTRACED 0x00000010: 标识是否是否调用了ptrace 3) #define PF_TRACESYS 0x00000020: 跟踪系统调用 4) #define PF_FORKNOEXEC 0x00000040: 已经完成fork,但还没有调用exec 5) #define PF_SUPERPRIV 0x00000100: 使用超级用户(root)权限 6) #define PF_DUMPCORE 0x00000200: dumped core 7) #define PF_SIGNALED 0x00000400: 此进程由于其他进程发送相关信号而被杀死 8) #define PF_STARTING 0x00000002: 当前进程正在被创建 9) #define PF_EXITING 0x00000004: 当前进程正在关闭 10) #define PF_USEDFPU 0x00100000: Process used the FPU this quantum(SMP only) #define PF_DTRACE 0x00200000: delayed trace (used on m68k) */ unsigned int flags; /* 5. ptrace ptrace系统调用,成员ptrace被设置为0时表示不需要被跟踪,它的可能取值如下: linux-2.6.38.8/include/linux/ptrace.h 1) #define PT_PTRACED 0x00000001 2) #define PT_DTRACE 0x00000002: delayed trace (used on m68k, i386) 3) #define PT_TRACESYSGOOD 0x00000004 4) #define PT_PTRACE_CAP 0x00000008: ptracer can follow suid-exec 5) #define PT_TRACE_FORK 0x00000010 6) #define PT_TRACE_VFORK 0x00000020 7) #define PT_TRACE_CLONE 0x00000040 8) #define PT_TRACE_EXEC 0x00000080 9) #define PT_TRACE_VFORK_DONE 0x00000100 10) #define PT_TRACE_EXIT 0x00000200 */ unsigned int ptrace; unsigned long ptrace_message; siginfo_t *last_siginfo; /* 6. lock_depth 用于表示获取大内核锁的次数,如果进程未获得过锁,则置为-1 */ int lock_depth; /* 7. oncpu 在SMP上帮助实现无加锁的进程切换(unlocked context switches) */ #ifdef CONFIG_SMP #ifdef __ARCH_WANT_UNLOCKED_CTXSW int oncpu; #endif #endif /* 8. 
进程调度 1) prio: 调度器考虑的优先级保存在prio,由于在某些情况下内核需要暂时提高进程的优先级,因此需要第三个成员来表示(除了static_prio、normal_prio之外),由于这些改变不是持久的,因此静态(static_prio)和普通(normal_prio)优先级不受影响 2) static_prio: 用于保存进程的"静态优先级",静态优先级是进程"启动"时分配的优先级,它可以用nice、sched_setscheduler系统调用修改,否则在进程运行期间会一直保持恒定 3) normal_prio: 表示基于进程的"静态优先级"和"调度策略"计算出的优先级,因此,即使普通进程和实时进程具有相同的静态优先级(static_prio),其普通优先级(normal_prio)也是不同的。进程分支时(fork),新创建的子进程会集成普通优先级 */ int prio, static_prio, normal_prio; /* 4) rt_priority: 表示实时进程的优先级,需要明白的是,"实时进程优先级"和"普通进程优先级"有两个独立的范畴,实时进程即使是最低优先级也高于普通进程,最低的实时优先级为0,最高的优先级为99,值越大,表明优先级越高 */ unsigned int rt_priority; /* 5) sched_class: 该进程所属的调度类,目前内核中有实现以下四种: 5.1) static const struct sched_class fair_sched_class; 5.2) static const struct sched_class rt_sched_class; 5.3) static const struct sched_class idle_sched_class; 5.4) static const struct sched_class stop_sched_class; */ const struct sched_class *sched_class; /* 6) se: 用于普通进程的调用实体   调度器不限于调度进程,还可以处理更大的实体,这可以实现"组调度",可用的CPU时间可以首先在一般的进程组(例如所有进程可以按所有者分组)之间分配,接下来分配的时间在组内再次分配   这种一般性要求调度器不直接操作进程,而是处理"可调度实体",一个实体有sched_entity的一个实例标识   在最简单的情况下,调度在各个进程上执行,由于调度器设计为处理可调度的实体,在调度器看来各个进程也必须也像这样的实体,因此se在task_struct中内嵌了一个sched_entity实例,调度器可据此操作各个task_struct */ struct sched_entity se; /* 7) rt: 用于实时进程的调用实体 */ struct sched_rt_entity rt; #ifdef CONFIG_PREEMPT_NOTIFIERS /* 9. preempt_notifier preempt_notifiers结构体链表 */ struct hlist_head preempt_notifiers; #endif /* 10. fpu_counter FPU使用计数 */ unsigned char fpu_counter; #ifdef CONFIG_BLK_DEV_IO_TRACE /* 11. btrace_seq blktrace是一个针对Linux内核中块设备I/O层的跟踪工具 */ unsigned int btrace_seq; #endif /* 12. policy policy表示进程的调度策略,目前主要有以下五种: 1) #define SCHED_NORMAL 0: 用于普通进程,它们通过完全公平调度器来处理 2) #define SCHED_FIFO 1: 先来先服务调度,由实时调度类处理 3) #define SCHED_RR 2: 时间片轮转调度,由实时调度类处理 4) #define SCHED_BATCH 3: 用于非交互、CPU使用密集的批处理进程,通过完全公平调度器来处理,调度决策对此类进程给与"冷处理",它们绝不会抢占CFS调度器处理的另一个进程,因此不会干扰交互式进程,如果不打算用nice降低进程的静态优先级,同时又不希望该进程影响系统的交互性,最适合用该调度策略 5) #define SCHED_IDLE 5: 可用于次要的进程,其相对权重总是最小的,也通过完全公平调度器来处理。要注意的是,SCHED_IDLE不负责调度空闲进程,空闲进程由内核提供单独的机制来处理 只有root用户能通过sched_setscheduler()系统调用来改变调度策略 */ unsigned int policy; /* 13. cpus_allowed cpus_allowed是一个位域,在多处理器系统上使用,用于控制进程可以在哪里处理器上运行 */ cpumask_t cpus_allowed; /* 14. RCU同步原语 */ #ifdef CONFIG_TREE_PREEMPT_RCU int rcu_read_lock_nesting; char rcu_read_unlock_special; struct rcu_node *rcu_blocked_node; struct list_head rcu_node_entry; #endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */ #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) /* 15. sched_info 用于调度器统计进程的运行信息 */ struct sched_info sched_info; #endif /* 16. tasks 通过list_head将当前进程的task_struct串联进内核的进程列表中,构建;linux进程链表 */ struct list_head tasks; /* 17. pushable_tasks limit pushing to one attempt */ struct plist_node pushable_tasks; /* 18. 进程地址空间 1) mm: 指向进程所拥有的内存描述符 2) active_mm: active_mm指向进程运行时所使用的内存描述符 对于普通进程而言,这两个指针变量的值相同。但是,内核线程不拥有任何内存描述符,所以它们的mm成员总是为NULL。当内核线程得以运行时,它的active_mm成员被初始化为前一个运行进程的active_mm值 */ struct mm_struct *mm, *active_mm; /* 19. exit_state 进程退出状态码 */ int exit_state; /* 20. 
判断标志 1) exit_code exit_code用于设置进程的终止代号,这个值要么是_exit()或exit_group()系统调用参数(正常终止),要么是由内核提供的一个错误代号(异常终止) 2) exit_signal exit_signal被置为-1时表示是某个线程组中的一员。只有当线程组的最后一个成员终止时,才会产生一个信号,以通知线程组的领头进程的父进程 */ int exit_code, exit_signal; /* 3) pdeath_signal pdeath_signal用于判断父进程终止时发送信号 */ int pdeath_signal; /* 4) personality用于处理不同的ABI,它的可能取值如下: enum { PER_LINUX = 0x0000, PER_LINUX_32BIT = 0x0000 | ADDR_LIMIT_32BIT, PER_LINUX_FDPIC = 0x0000 | FDPIC_FUNCPTRS, PER_SVR4 = 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, PER_SVR3 = 0x0002 | STICKY_TIMEOUTS | SHORT_INODE, PER_SCOSVR3 = 0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS | SHORT_INODE, PER_OSR5 = 0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS, PER_WYSEV386 = 0x0004 | STICKY_TIMEOUTS | SHORT_INODE, PER_ISCR4 = 0x0005 | STICKY_TIMEOUTS, PER_BSD = 0x0006, PER_SUNOS = 0x0006 | STICKY_TIMEOUTS, PER_XENIX = 0x0007 | STICKY_TIMEOUTS | SHORT_INODE, PER_LINUX32 = 0x0008, PER_LINUX32_3GB = 0x0008 | ADDR_LIMIT_3GB, PER_IRIX32 = 0x0009 | STICKY_TIMEOUTS, PER_IRIXN32 = 0x000a | STICKY_TIMEOUTS, PER_IRIX64 = 0x000b | STICKY_TIMEOUTS, PER_RISCOS = 0x000c, PER_SOLARIS = 0x000d | STICKY_TIMEOUTS, PER_UW7 = 0x000e | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, PER_OSF4 = 0x000f, PER_HPUX = 0x0010, PER_MASK = 0x00ff, }; */ unsigned int personality; /* 5) did_exec did_exec用于记录进程代码是否被execve()函数所执行 */ unsigned did_exec:1; /* 6) in_execve in_execve用于通知LSM是否被do_execve()函数所调用 */ unsigned in_execve:1; /* 7) in_iowait in_iowait用于判断是否进行iowait计数 */ unsigned in_iowait:1; /* 8) sched_reset_on_fork sched_reset_on_fork用于判断是否恢复默认的优先级或调度策略 */ unsigned sched_reset_on_fork:1; /* 21. 进程标识符(PID) 在CONFIG_BASE_SMALL配置为0的情况下,PID的取值范围是0到32767,即系统中的进程数最大为32768个 #define PID_MAX_DEFAULT (CONFIG_BASE_SMALL ? 0x1000 : 0x8000) 在Linux系统中,一个线程组中的所有线程使用和该线程组的领头线程(该组中的第一个轻量级进程)相同的PID,并被存放在tgid成员中。只有线程组的领头线程的pid成员才会被设置为与tgid相同的值。注意,getpid()系统调用 返回的是当前进程的tgid值而不是pid值。 */ pid_t pid; pid_t tgid; #ifdef CONFIG_CC_STACKPROTECTOR /* 22. stack_canary 防止内核堆栈溢出,在GCC编译内核时,需要加上-fstack-protector选项 */ unsigned long stack_canary; #endif /* 23. 表示进程亲属关系的成员 1) real_parent: 指向其父进程,如果创建它的父进程不再存在,则指向PID为1的init进程 2) parent: 指向其父进程,当它终止时,必须向它的父进程发送信号。它的值通常与real_parent相同 */ struct task_struct *real_parent; struct task_struct *parent; /* 3) children: 表示链表的头部,链表中的所有元素都是它的子进程(子进程链表) 4) sibling: 用于把当前进程插入到兄弟链表中(连接到父进程的子进程链表(兄弟链表)) 5) group_leader: 指向其所在进程组的领头进程 */ struct list_head children; struct list_head sibling; struct task_struct *group_leader; struct list_head ptraced; struct list_head ptrace_entry; struct bts_context *bts; /* 24. pids PID散列表和链表 */ struct pid_link pids[PIDTYPE_MAX]; /* 25. thread_group 线程组中所有进程的链表 */ struct list_head thread_group; /* 26. do_fork函数 1) vfork_done 在执行do_fork()时,如果给定特别标志,则vfork_done会指向一个特殊地址 2) set_child_tid、clear_child_tid 如果copy_process函数的clone_flags参数的值被置为CLONE_CHILD_SETTID或CLONE_CHILD_CLEARTID,则会把child_tidptr参数的值分别复制到set_child_tid和clear_child_tid成员。这些标志说明必须改变子 进程用户态地址空间的child_tidptr所指向的变量的值。 */ struct completion *vfork_done; int __user *set_child_tid; int __user *clear_child_tid; /* 27. 
记录进程的I/O计数(时间) 1) utime 用于记录进程在"用户态"下所经过的节拍数(定时器) 2) stime 用于记录进程在"内核态"下所经过的节拍数(定时器) 3) utimescaled 用于记录进程在"用户态"的运行时间,但它们以处理器的频率为刻度 4) stimescaled 用于记录进程在"内核态"的运行时间,但它们以处理器的频率为刻度 */ cputime_t utime, stime, utimescaled, stimescaled; /* 5) gtime 以节拍计数的虚拟机运行时间(guest time) */ cputime_t gtime; /* 6) prev_utime、prev_stime是先前的运行时间 */ cputime_t prev_utime, prev_stime; /* 7) nvcsw 自愿(voluntary)上下文切换计数 8) nivcsw 非自愿(involuntary)上下文切换计数 */ unsigned long nvcsw, nivcsw; /* 9) start_time 进程创建时间 10) real_start_time 进程睡眠时间,还包含了进程睡眠时间,常用于/proc/pid/stat, */ struct timespec start_time; struct timespec real_start_time; /* 11) cputime_expires 用来统计进程或进程组被跟踪的处理器时间,其中的三个成员对应着cpu_timers[3]的三个链表 */ struct task_cputime cputime_expires; struct list_head cpu_timers[3]; #ifdef CONFIG_DETECT_HUNG_TASK /* 12) last_switch_count nvcsw和nivcsw的总和 */ unsigned long last_switch_count; #endif struct task_io_accounting ioac; #if defined(CONFIG_TASK_XACCT) u64 acct_rss_mem1; u64 acct_vm_mem1; cputime_t acct_timexpd; #endif /* 28. 缺页统计 */ unsigned long min_flt, maj_flt; /* 29. 进程权能 */ const struct cred *real_cred; const struct cred *cred; struct mutex cred_guard_mutex; struct cred *replacement_session_keyring; /* 30. comm[TASK_COMM_LEN] 相应的程序名 */ char comm[TASK_COMM_LEN]; /* 31. 文件 1) fs 用来表示进程与文件系统的联系,包括当前目录和根目录 2) files 表示进程当前打开的文件 */ int link_count, total_link_count; struct fs_struct *fs; struct files_struct *files; #ifdef CONFIG_SYSVIPC /* 32. sysvsem 进程通信(SYSVIPC) */ struct sysv_sem sysvsem; #endif /* 33. 处理器特有数据 */ struct thread_struct thread; /* 34. nsproxy 命名空间 */ struct nsproxy *nsproxy; /* 35. 信号处理 1) signal: 指向进程的信号描述符 2) sighand: 指向进程的信号处理程序描述符 */ struct signal_struct *signal; struct sighand_struct *sighand; /* 3) blocked: 表示被阻塞信号的掩码 4) real_blocked: 表示临时掩码 */ sigset_t blocked, real_blocked; sigset_t saved_sigmask; /* 5) pending: 存放私有挂起信号的数据结构 */ struct sigpending pending; /* 6) sas_ss_sp: 信号处理程序备用堆栈的地址 7) sas_ss_size: 表示堆栈的大小 */ unsigned long sas_ss_sp; size_t sas_ss_size; /* 8) notifier 设备驱动程序常用notifier指向的函数来阻塞进程的某些信号 9) otifier_data 指的是notifier所指向的函数可能使用的数据。 10) otifier_mask 标识这些信号的位掩码 */ int (*notifier)(void *priv); void *notifier_data; sigset_t *notifier_mask; /* 36. 进程审计 */ struct audit_context *audit_context; #ifdef CONFIG_AUDITSYSCALL uid_t loginuid; unsigned int sessionid; #endif /* 37. secure computing */ seccomp_t seccomp; /* 38. 用于copy_process函数使用CLONE_PARENT标记时 */ u32 parent_exec_id; u32 self_exec_id; /* 39. alloc_lock 用于保护资源分配或释放的自旋锁 */ spinlock_t alloc_lock; /* 40. 中断 */ #ifdef CONFIG_GENERIC_HARDIRQS struct irqaction *irqaction; #endif #ifdef CONFIG_TRACE_IRQFLAGS unsigned int irq_events; int hardirqs_enabled; unsigned long hardirq_enable_ip; unsigned int hardirq_enable_event; unsigned long hardirq_disable_ip; unsigned int hardirq_disable_event; int softirqs_enabled; unsigned long softirq_disable_ip; unsigned int softirq_disable_event; unsigned long softirq_enable_ip; unsigned int softirq_enable_event; int hardirq_context; int softirq_context; #endif /* 41. pi_lock task_rq_lock函数所使用的锁 */ spinlock_t pi_lock; #ifdef CONFIG_RT_MUTEXES /* 42. 基于PI协议的等待互斥锁,其中PI指的是priority inheritance/9优先级继承) */ struct plist_head pi_waiters; struct rt_mutex_waiter *pi_blocked_on; #endif #ifdef CONFIG_DEBUG_MUTEXES /* 43. blocked_on 死锁检测 */ struct mutex_waiter *blocked_on; #endif /* 44. lockdep, */ #ifdef CONFIG_LOCKDEP # define MAX_LOCK_DEPTH 48UL u64 curr_chain_key; int lockdep_depth; unsigned int lockdep_recursion; struct held_lock held_locks[MAX_LOCK_DEPTH]; gfp_t lockdep_reclaim_gfp; #endif /* 45. 
journal_info JFS文件系统 */ void *journal_info; /* 46. 块设备链表 */ struct bio *bio_list, **bio_tail; /* 47. reclaim_state 内存回收 */ struct reclaim_state *reclaim_state; /* 48. backing_dev_info 存放块设备I/O数据流量信息 */ struct backing_dev_info *backing_dev_info; /* 49. io_context I/O调度器所使用的信息 */ struct io_context *io_context; /* 50. CPUSET功能 */ #ifdef CONFIG_CPUSETS nodemask_t mems_allowed; int cpuset_mem_spread_rotor; #endif /* 51. Control Groups */ #ifdef CONFIG_CGROUPS struct css_set *cgroups; struct list_head cg_list; #endif /* 52. robust_list futex同步机制 */ #ifdef CONFIG_FUTEX struct robust_list_head __user *robust_list; #ifdef CONFIG_COMPAT struct compat_robust_list_head __user *compat_robust_list; #endif struct list_head pi_state_list; struct futex_pi_state *pi_state_cache; #endif #ifdef CONFIG_PERF_EVENTS struct perf_event_context *perf_event_ctxp; struct mutex perf_event_mutex; struct list_head perf_event_list; #endif /* 53. 非一致内存访问(NUMA Non-Uniform Memory Access) */ #ifdef CONFIG_NUMA struct mempolicy *mempolicy; /* Protected by alloc_lock */ short il_next; #endif /* 54. fs_excl 文件系统互斥资源 */ atomic_t fs_excl; /* 55. rcu RCU链表 */ struct rcu_head rcu; /* 56. splice_pipe 管道 */ struct pipe_inode_info *splice_pipe; /* 57. delays 延迟计数 */ #ifdef CONFIG_TASK_DELAY_ACCT struct task_delay_info *delays; #endif /* 58. make_it_fail fault injection */ #ifdef CONFIG_FAULT_INJECTION int make_it_fail; #endif /* 59. dirties FLoating proportions */ struct prop_local_single dirties; /* 60. Infrastructure for displayinglatency */ #ifdef CONFIG_LATENCYTOP int latency_record_count; struct latency_record latency_record[LT_SAVECOUNT]; #endif /* 61. time slack values,常用于poll和select函数 */ unsigned long timer_slack_ns; unsigned long default_timer_slack_ns; /* 62. scm_work_list socket控制消息(control message) */ struct list_head *scm_work_list; /* 63. ftrace跟踪器 */ #ifdef CONFIG_FUNCTION_GRAPH_TRACER int curr_ret_stack; struct ftrace_ret_stack *ret_stack; unsigned long long ftrace_timestamp; atomic_t trace_overrun; atomic_t tracing_graph_pause; #endif #ifdef CONFIG_TRACING unsigned long trace; unsigned long trace_recursion; #endif }; ``` ### 3 总结 (1)pid 就是一个编号,通过pid namespace的引入,让每个进程,可以存在多个ns命名空间。比如2.2中的sleep,其实就是存在两层的ns。 第一层就是父进程所在的那层,父进程是直接fork和bash, 以及真正的1号进程的pid namespaces是一致的。 第二层就是 自己新创建的这层。 分配pid的时候,首先在第二层分配给sleep pid=1(这个没进去,看不见) 然后再再父进程的ns,给sleep 分配的pid = 3994 这样的话,就实现了进程隔离,因为子进程在第二层看到的 pid=1,所以它在第二层只能看到,自己产生的所有进程,从而达到了隔离。 ### 4.参考 [容器原理之 - namespace](https://mp.weixin.qq.com/s/FnuOMbWAhLQoiCBA_NFYXA) [Linux-进程描述符 task_struct 详解](https://www.cnblogs.com/JohnABC/p/9084750.html) ================================================ FILE: docker/10. 如何下载并二进制编译docker源码.md ================================================ * [1\. 如何下载docker源码](#1-如何下载docker源码) * [2\. docker源码目录解析](#2-docker源码目录解析) * [3\. 二进制编译docker源码](#3-二进制编译docker源码) * [3\.1 下载需要编译的源代码](#31-下载需要编译的源代码) * [3\.2 通过容器编译](#32-通过容器编译) ### 1. 如何下载docker源码 在下载docker源码的时候,发现有moby、docker-ce与docker-ee项目。 docker是一家公司,其中的一个产品就是docker。docker-ce是 免费版本。docker-ee的商用版本。目前docker-ee没有git repo。 docker-ce repo处于废弃状态。 docker将docker进行了开源,开源项目的名字是 moby。 至于为什么这么做,可以参考以下的issue。 https://www.zhihu.com/question/58805021 https://github.com/moby/moby/pull/32691 所以研究源码直接研究moby就可以了: ### 2. 
docker源码目录解析 ``` ├── AUTHORS ├── CHANGELOG.md ├── CONTRIBUTING.md ├── Dockerfile ├── Dockerfile.aarch64 ├── Dockerfile.armhf ├── Dockerfile.ppc64le ├── Dockerfile.s390x ├── Dockerfile.simple ├── Dockerfile.solaris ├── Dockerfile.windows ├── LICENSE ├── MAINTAINERS ├── Makefile ├── NOTICE ├── README.md ├── ROADMAP.md ├── VENDORING.md ├── VERSION ├── api api目录是docker cli或者第三方软件与docker daemon进行交互的api库,它是HTTP REST API. api/types:是被docker client和server共用的一些类型定义,比如多种对象,options, responses等。大部分是手工写的代码,也有部分是通过swagger自动生成的。 ├── builder docker build dockerfile实现相关代码 ├── cli Docker命令行接口,定义了docker支持的所有命令。例如docker stop等 ├── client docker client端(发送http请求)。定义所有命令的client请求 ├── cmd dockerd命令行实现,docker,dockerd的启动函数 ├── container 和容器相关的数据结构定义,比如容器状态,容器的io,容器的环境变量 ├── contrib 包括脚本,镜像和其它一些有用的工具,并不属于docker发布的一部分,正因为如此,它们可能会过时 ├── daemon docker daemon实现 ├── distribution docker镜像仓库相关功能代码,如docker push,docker pull ├── dockerversion docker镜像仓库相关功能代码,如docker push,docker pull ├── docs 文档相关 ├── experimental 开启docker实验特性的相关文档说明 ├── hack 与编译相关的工具目录 ├── hooks 编译相关的钩子 ├── image 镜像存储相关操作代码 ├── integration-cli 集成测试相关命令行 ├── keys 和测试相关的key ├── layer 镜像层相关操作代码 ├── libcontainerd 与containerd通信相关lib ├── man 生成docker手册相关的代码 ├── migrate 用于转换老的镜像层次,主要是转v1 ├── oci 支持oci相关实现(容器运行时标准) ├── opts 处理命令选项相关 ├── pkg 工具包。处理字符串,url,系统相关信号,锁相关工具 ├── plugin docker插件处理相关实现 ├── poule.yml ├── profiles linux下安全相关处理,apparmor和seccomp. ├── project 文档相关 ├── reference 镜像仓库reference管理 ├── registry 镜像仓库相关代码 ├── restartmanager 容器重启策略实现 ├── runconfig 容器运行相关配置操作 ├── vendor go语言的目录,依赖第三方库目录 ├── vendor.conf └── volume docker volume相关的代码实现 ``` ### 3. 二进制编译docker源码-17.05.0版本 直接看源码肯定会在一些地方卡住,所以最好的办法就是编译源码,通过打日志/调试的方式来确定具体实现细节。 #### 3.1 下载需要编译的源代码 这里我是下载的 https://github.com/moby/moby/blob/v17.05.0-ce ``` # git clone https://github.com/moby/moby.git -b v17.05.0-ce ``` 然后修改文件项目为: `/home/zoux/data/golang/src/github.com/docker/docker` #### 3.2 通过容器编译 docker开发环境本质上是创建一个docker镜像,镜像里包含了docker的所有开发运行环境,本地代码通过挂载的方式放到容器中运行。 dockercore/docker就是官方提供的编译镜像。 ``` docker run --rm -it --privileged -v /home/zoux/data/golang/src/github.com/docker/docker:/go/src/github.com/docker/docker dockercore/docker bash ## 进去之后可以直接运行 该命令进行编译 root@ab1bf697b6a6:/go/src/github.com/docker/docker# ./hack/make.sh binary bundles/17.05.0-ce already exists. Removing. ---> Making bundle: binary (in bundles/17.05.0-ce/binary) Building: bundles/17.05.0-ce/binary-client/docker-17.05.0-ce Created binary: bundles/17.05.0-ce/binary-client/docker-17.05.0-ce Building: bundles/17.05.0-ce/binary-daemon/dockerd-17.05.0-ce Created binary: bundles/17.05.0-ce/binary-daemon/dockerd-17.05.0-ce Copying nested executables into bundles/17.05.0-ce/binary-daemon ## 还可以自己设置tag root@ab1bf697b6a6:/go/src/github.com/docker/docker# export DOCKER_GITCOMMIT=v17.05-zx root@ab1bf697b6a6:/go/src/github.com/docker/docker# ./hack/make.sh binary bundles/17.05.0-ce already exists. Removing. 
---> Making bundle: binary (in bundles/17.05.0-ce/binary) Building: bundles/17.05.0-ce/binary-client/docker-17.05.0-ce Created binary: bundles/17.05.0-ce/binary-client/docker-17.05.0-ce Building: bundles/17.05.0-ce/binary-daemon/dockerd-17.05.0-ce Created binary: bundles/17.05.0-ce/binary-daemon/dockerd-17.05.0-ce Copying nested executables into bundles/17.05.0-ce/binary-daemon root@ab1bf697b6a6:/go/src/github.com/docker/docker# ./bundles/17.05.0-ce/binary-daemon/dockerd --version Docker version 17.05.0-ce, build v17.05-zx ## 下载到本地, 一定要是dockerd-17.05.0-ce,而不是dockerd, dockerd只是一个链接文件 docker cp ab1bf697b6a6:/go/src/github.com/docker/docker/bundles/17.05.0-ce/binary-daemon/dockerd-17.05.0-ce /home/zoux/dockerd root@ab1bf697b6a6:/go/src/github.com/docker/docker/bundles/17.05.0-ce/binary-daemon# ls -l total 68704 -rwxr-xr-x 1 root root 8997448 Feb 23 09:12 docker-containerd -rwxr-xr-x 1 root root 8448168 Feb 23 09:12 docker-containerd-ctr -rw-r--r-- 1 root root 56 Feb 23 09:12 docker-containerd-ctr.md5 -rw-r--r-- 1 root root 88 Feb 23 09:12 docker-containerd-ctr.sha256 -rwxr-xr-x 1 root root 3047240 Feb 23 09:12 docker-containerd-shim -rw-r--r-- 1 root root 57 Feb 23 09:12 docker-containerd-shim.md5 -rw-r--r-- 1 root root 89 Feb 23 09:12 docker-containerd-shim.sha256 -rw-r--r-- 1 root root 52 Feb 23 09:12 docker-containerd.md5 -rw-r--r-- 1 root root 84 Feb 23 09:12 docker-containerd.sha256 -rwxr-xr-x 1 root root 772400 Feb 23 09:12 docker-init -rw-r--r-- 1 root root 46 Feb 23 09:12 docker-init.md5 -rw-r--r-- 1 root root 78 Feb 23 09:12 docker-init.sha256 -rwxr-xr-x 1 root root 2530685 Feb 23 09:12 docker-proxy -rw-r--r-- 1 root root 47 Feb 23 09:12 docker-proxy.md5 -rw-r--r-- 1 root root 79 Feb 23 09:12 docker-proxy.sha256 -rwxr-xr-x 1 root root 7096504 Feb 23 09:12 docker-runc -rw-r--r-- 1 root root 46 Feb 23 09:12 docker-runc.md5 -rw-r--r-- 1 root root 78 Feb 23 09:12 docker-runc.sha256 lrwxrwxrwx 1 root root 18 Feb 23 09:12 dockerd -> dockerd-17.05.0-ce -rwxr-xr-x 1 root root 39392304 Feb 23 09:12 dockerd-17.05.0-ce -rw-r--r-- 1 root root 53 Feb 23 09:12 dockerd-17.05.0-ce.md5 -rw-r--r-- 1 root root 85 Feb 23 09:12 dockerd-17.05.0-ce.sha256 ``` ### 4. 二进制编译docker源码-19.03.9版本 这个版本和17.5版本的不同在于,docker和dockerd分离了。 在docker v17.06 之后,docker cli 和dockerd分离了, 单独拆成了https://github.com/docker/cli #### 4.1 docker编译 将该项目下载到 $GOPATH/src/github.com/docker 目录。然后有go环境,直接 `make binary`就可以编译docker源码。 ``` root:/home/zoux/data/golang/src/github.com/docker/cli# source /home/zouxiang/config // 设置go环境 root:/home/zoux/data/golang/src/github.com/docker/cli# make binary // 编译 WARNING: you are not in a container. Use "make -f docker.Makefile binary" or set DISABLE_WARN_OUTSIDE_CONTAINER=1 to disable this warning. Press Ctrl+C now to abort. WARNING: binary creates a Linux executable. Use cross for macOS or Windows. ./scripts/build/binary Building statically linked build/docker-linux-amd64 ``` #### 4.2 dockerd编译 同样通过二进制编译。将该项目下载到 $GOPATH/src/github.com/docker 目录。然后有go环境,直接 `./hack/make.sh binary`就可以编译docker源码。 ``` root:/home/zoux/data/golang/src/github.com/docker/docker# ./hack/make.sh binary #~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ # GITCOMMIT = 811a247d06-unsupported # The version you are building is listed as unsupported because # there are some files in the git repository that are in an uncommitted state. # Commit these changes, or add to .gitignore to remove the -unsupported from the version. 
# Here is the current list:
M cmd/dockerd/daemon.go
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Removing bundles/
---> Making bundle: binary (in bundles/binary)
Building: bundles/binary-daemon/dockerd-dev
GOOS="linux" GOARCH="amd64" GOARM=""
Created binary: bundles/binary-daemon/dockerd-dev
```

该过程可能会遇到报错,比如:

- `No package 'devmapper' found`
- `make binary` 报 `fatal error: btrfs/ioctl.h: No such file or directory`
这是一些基础的包没装好。apt-get 或者yum安装就好了。 ``` apt-get install -y libdevmapper-dev apt-get install -y install btrfs-progs apt-get install -y btrfs-progs-dev ``` ================================================ FILE: docker/11. dockercli 源码分析-docker run为例.md ================================================ * [0\. 章节目的](#0-章节目的) * [1\. docker run 客户端处理流程](#1-docker-run-客户端处理流程) * [1\.1 docker 函数入口](#11-docker-函数入口) * [2\. 初始化docker cli客户端](#2-初始化docker-cli客户端) * [3\. 实例化newDockerCommand对象](#3-实例化newdockercommand对象) * [3\.1 newDockerCommand](#31-newdockercommand) * [3\.2\. NewRunCommand](#32-newruncommand) * [3\.3 runContainer](#33-runcontainer) * [3\.4 ContainerCreate & ContainerStart](#34-containercreate--containerstart) * [3\.5 总结](#35-总结) ### 0. 章节目的 从本章节开始以 docker run niginx ls为例。从源码角度弄清楚docker run nginx ls具体过程。 本章节的目的就是弄清楚该命令运行时。 docker cli做了什么工作。 ``` root# docker run nginx ls bin boot dev docker-entrypoint.d docker-entrypoint.sh etc home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var ```
顺便补充一下:

在 docker v17.06 之前,docker cli(就是我们经常使用的 docker)和 dockerd 的源码是在一起的,都在 https://github.com/moby/moby 项目的 cmd 目录下:

cmd/docker: docker cli 的主函数目录

cmd/dockerd: dockerd 的主函数目录
在 docker v17.06 之后,docker cli 和 dockerd 分离了,docker cli 被单独拆成了 https://github.com/docker/cli 项目。所以,本节基于 https://github.com/docker/cli/tree/v19.03.9 进行研究。

将该项目下载到 $GOPATH/src/github.com/docker 目录,在有 go 环境的前提下,直接 `make binary` 就可以编译源码。
### 1. docker run 客户端处理流程

#### 1.1 docker 函数入口

docker cli 的 main 函数最终会走到下面的 runDocker 函数,newDockerCommand 就是在这里被实例化的;作为对比,dockerd 的 main 函数在 cmd/dockerd/docker.go。

```
func runDocker(dockerCli *command.DockerCli) error {
	tcmd := newDockerCommand(dockerCli)
	cmd, args, err := tcmd.HandleGlobalFlags()
	if err != nil {
		return err
	}
	if err := tcmd.Initialize(); err != nil {
		return err
	}
	args, os.Args, err = processAliases(dockerCli, cmd, args, os.Args)
	if err != nil {
		return err
	}
	if len(args) > 0 {
		if _, _, err := cmd.Find(args); err != nil {
			err := tryPluginRun(dockerCli, cmd, args[0])
			if !pluginmanager.IsNotFound(err) {
				return err
			}
			// For plugin not found we fall through to
			// cmd.Execute() which deals with reporting
			// "command not found" in a consistent way.
		}
	}
	// We've parsed global args already, so reset args to those
	// which remain.
	cmd.SetArgs(args)
	return cmd.Execute()
}
```

主要干了两件事:

(1)实例化 newDockerCommand 对象

(2)初始化 docker cli 客户端

先看看初始化客户端做了什么。

### 2. 初始化docker cli客户端

Initialize 函数进行了 cli 客户端的初始化。

docker 是 C/S 结构的框架,但是 client 和 server 基本都在同一台机器上,所以 docker 使用了 unix socket 进行进程间通信,好处就是快:本机通信省去了 TCP/IP 协议栈的开销。

dockerd 运行起来后,会创建一个 socket,默认是 /var/run/docker.sock,基于这个 sock 文件就可以构造一个客户端,用于交互。dockerd 运行起来后,会在 /var/run 目录增加两个文件:docker.pid(进程编号)和 docker.sock。

可参考:[golang中基于http 和unix socket的通信代码实现(服务端基于gin框架)](https://blog.csdn.net/qq_33399567/article/details/107691339)
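顺着这个思路,这里给出一个直接通过 /var/run/docker.sock 访问 dockerd 的最小 Go 示例(示意代码,假设本机 dockerd 正在运行且监听默认的 unix socket;`/_ping` 是 docker engine API 的健康检查接口):

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
)

func main() {
	// 自定义 DialContext:无论请求的 host 是什么,都连接到 dockerd 的 unix socket
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
				return net.Dial("unix", "/var/run/docker.sock")
			},
		},
	}

	// URL 中的 host(这里写 docker)只是占位,真正的连接走上面的 unix socket
	resp, err := client.Get("http://docker/_ping")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body)) // dockerd 正常时输出 200 OK 和 "OK"
}
```

docker cli 里的 client 本质上也是这样一个带 unix socket Transport 的 http 客户端。下面回到 Initialize 函数的源码: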
``` // Initialize the dockerCli runs initialization that must happen after command // line flags are parsed. func (cli *DockerCli) Initialize(opts *cliflags.ClientOptions, ops ...InitializeOpt) error { var err error for _, o := range ops { if err := o(cli); err != nil { return err } } cliflags.SetLogLevel(opts.Common.LogLevel) if opts.ConfigDir != "" { cliconfig.SetDir(opts.ConfigDir) logrus.Errorf("zoux Initialize opts.ConfigDir is: %v", opts.ConfigDir) } if opts.Common.Debug { debug.Enable() } cli.loadConfigFile() baseContextStore := store.New(cliconfig.ContextStoreDir(), cli.contextStoreConfig) logrus.Errorf("zoux Initialize baseContextStore is: %v", baseContextStore) cli.contextStore = &ContextStoreWithDefault{ Store: baseContextStore, Resolver: func() (*DefaultContext, error) { return ResolveDefaultContext(opts.Common, cli.ConfigFile(), cli.contextStoreConfig, cli.Err()) }, } cli.currentContext, err = resolveContextName(opts.Common, cli.configFile, cli.contextStore) if err != nil { return err } cli.dockerEndpoint, err = resolveDockerEndpoint(cli.contextStore, cli.currentContext) if err != nil { return errors.Wrap(err, "unable to resolve docker endpoint") } logrus.Errorf("zoux Initialize dockerEndpoint TLSData is %v: host is %v", cli.dockerEndpoint.TLSData, cli.dockerEndpoint.Host) if cli.client == nil { cli.client, err = newAPIClientFromEndpoint(cli.dockerEndpoint, cli.configFile) if tlsconfig.IsErrEncryptedKey(err) { passRetriever := passphrase.PromptRetrieverWithInOut(cli.In(), cli.Out(), nil) newClient := func(password string) (client.APIClient, error) { cli.dockerEndpoint.TLSPassword = password return newAPIClientFromEndpoint(cli.dockerEndpoint, cli.configFile) } cli.client, err = getClientWithPassword(passRetriever, newClient) } if err != nil { return err } } logrus.Errorf("zoux Initialize cli.client is %v", cli.client) return nil } ``` 在上面的核心函数增加了部分日志,可以看出来。 docker cli的构建核心就是,利用var/run/docker.sock 文件创建了go的客户端。 ``` root@k8s-node:~# docker run nginx ls ERRO[0000] zoux initialize opts.configDir is /root/.docker ERRO[0000] zoux initialize baseContextStore is &{0xc0002f4e80 0xc00005a3a0} ERRO[0000] zoux initialize dockerEndpooint TLSData is , host is unix:///var/run/docker.sock ERRO[0000] zoux initizlize cli.client is &{http unix:///var/run/docker.sock unix /var/run/docker.sock 0xc000368720 1.40 map[User-Agent:Docker-Client/unknown-version (linux)] false false false} bin boot dev docker-entrypoint.d docker-entrypoint.sh etc home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var ``` ### 3. 
实例化newDockerCommand对象 #### 3.1 newDockerCommand ``` func newDockerCommand(dockerCli *command.DockerCli) *cli.TopLevelCommand { var ( opts *cliflags.ClientOptions flags *pflag.FlagSet helpCmd *cobra.Command ) cmd := &cobra.Command{ Use: "docker [OPTIONS] COMMAND [ARG...]", Short: "A self-sufficient runtime for containers", SilenceUsage: true, SilenceErrors: true, TraverseChildren: true, RunE: func(cmd *cobra.Command, args []string) error { if len(args) == 0 { return command.ShowHelp(dockerCli.Err())(cmd, args) } return fmt.Errorf("docker: '%s' is not a docker command.\nSee 'docker --help'", args[0]) }, PersistentPreRunE: func(cmd *cobra.Command, args []string) error { return isSupported(cmd, dockerCli) }, Version: fmt.Sprintf("%s, build %s", version.Version, version.GitCommit), DisableFlagsInUseLine: true, } opts, flags, helpCmd = cli.SetupRootCommand(cmd) flags.BoolP("version", "v", false, "Print version information and quit") setFlagErrorFunc(dockerCli, cmd) setupHelpCommand(dockerCli, cmd, helpCmd) setHelpFunc(dockerCli, cmd) cmd.SetOutput(dockerCli.Out()) commands.AddCommands(cmd, dockerCli) cli.DisableFlagsInUseLine(cmd) setValidateArgs(dockerCli, cmd) // flags must be the top-level command flags, not cmd.Flags() return cli.NewTopLevelCommand(cmd, dockerCli, opts, flags) } ``` newDockerCommand函数的核心就是: (1)RunE (2)PersistentPreRunE (3)commands.AddCommands(cmd, dockerCli)
**RunE**就是打印help函数,这和实操是一样的。输入docker,后面什么都不带就是打印help。因为docker 本身是不能运行的,后面必须跟子命令。 **PersistentPreRunE**就是判断docker 输入的flags是否支持。 ``` func areFlagsSupported(cmd *cobra.Command, details versionDetails) error { errs := []string{} cmd.Flags().VisitAll(func(f *pflag.Flag) { if !f.Changed { return } if !isVersionSupported(f, details.Client().ClientVersion()) { errs = append(errs, fmt.Sprintf(`"--%s" requires API version %s, but the Docker daemon API version is %s`, f.Name, getFlagAnnotation(f, "version"), details.Client().ClientVersion())) return } if !isOSTypeSupported(f, details.ServerInfo().OSType) { errs = append(errs, fmt.Sprintf( `"--%s" is only supported on a Docker daemon running on %s, but the Docker daemon is running on %s`, f.Name, getFlagAnnotation(f, "ostype"), details.ServerInfo().OSType), ) return } if _, ok := f.Annotations["experimental"]; ok && !details.ServerInfo().HasExperimental { errs = append(errs, fmt.Sprintf(`"--%s" is only supported on a Docker daemon with experimental features enabled`, f.Name)) } if _, ok := f.Annotations["experimentalCLI"]; ok && !details.ClientInfo().HasExperimental { errs = append(errs, fmt.Sprintf(`"--%s" is only supported on a Docker cli with experimental cli features enabled`, f.Name)) } // buildkit-specific flags are noop when buildkit is not enabled, so we do not add an error in that case }) if len(errs) > 0 { return errors.New(strings.Join(errs, "\n")) } return nil } ``` **commands.AddCommands:** 就是增加子命令。 这里我们主要关键 NewContainerCommand。 而docker run就是对应了NewRunCommand子命令。 ``` // AddCommands adds all the commands from cli/command to the root command func AddCommands(cmd *cobra.Command, dockerCli command.Cli) { cmd.AddCommand( // checkpoint checkpoint.NewCheckpointCommand(dockerCli), // config config.NewConfigCommand(dockerCli), // container container.NewContainerCommand(dockerCli), container.NewRunCommand(dockerCli), // image image.NewImageCommand(dockerCli), image.NewBuildCommand(dockerCli), // builder builder.NewBuilderCommand(dockerCli), // manifest manifest.NewManifestCommand(dockerCli), // network network.NewNetworkCommand(dockerCli), // node node.NewNodeCommand(dockerCli), // plugin plugin.NewPluginCommand(dockerCli), // registry registry.NewLoginCommand(dockerCli), registry.NewLogoutCommand(dockerCli), registry.NewSearchCommand(dockerCli), // secret secret.NewSecretCommand(dockerCli), // service service.NewServiceCommand(dockerCli), // system system.NewSystemCommand(dockerCli), system.NewVersionCommand(dockerCli), // stack stack.NewStackCommand(dockerCli), // swarm swarm.NewSwarmCommand(dockerCli), // trust trust.NewTrustCommand(dockerCli), // volume volume.NewVolumeCommand(dockerCli), // context context.NewContextCommand(dockerCli), // legacy commands may be hidden hide(stack.NewTopLevelDeployCommand(dockerCli)), hide(system.NewEventsCommand(dockerCli)), hide(system.NewInfoCommand(dockerCli)), hide(system.NewInspectCommand(dockerCli)), hide(container.NewAttachCommand(dockerCli)), hide(container.NewCommitCommand(dockerCli)), hide(container.NewCopyCommand(dockerCli)), hide(container.NewCreateCommand(dockerCli)), hide(container.NewDiffCommand(dockerCli)), hide(container.NewExecCommand(dockerCli)), hide(container.NewExportCommand(dockerCli)), hide(container.NewKillCommand(dockerCli)), hide(container.NewLogsCommand(dockerCli)), hide(container.NewPauseCommand(dockerCli)), hide(container.NewPortCommand(dockerCli)), hide(container.NewPsCommand(dockerCli)), hide(container.NewRenameCommand(dockerCli)), hide(container.NewRestartCommand(dockerCli)), 
hide(container.NewRmCommand(dockerCli)), hide(container.NewStartCommand(dockerCli)), hide(container.NewStatsCommand(dockerCli)), hide(container.NewStopCommand(dockerCli)), hide(container.NewTopCommand(dockerCli)), hide(container.NewUnpauseCommand(dockerCli)), hide(container.NewUpdateCommand(dockerCli)), hide(container.NewWaitCommand(dockerCli)), hide(image.NewHistoryCommand(dockerCli)), hide(image.NewImagesCommand(dockerCli)), hide(image.NewImportCommand(dockerCli)), hide(image.NewLoadCommand(dockerCli)), hide(image.NewPullCommand(dockerCli)), hide(image.NewPushCommand(dockerCli)), hide(image.NewRemoveCommand(dockerCli)), hide(image.NewSaveCommand(dockerCli)), hide(image.NewTagCommand(dockerCli)), ) if runtime.GOOS == "linux" { // engine cmd.AddCommand(engine.NewEngineCommand(dockerCli)) } } ``` #### 3.2. NewRunCommand ``` // NewRunCommand create a new `docker run` command func NewRunCommand(dockerCli command.Cli) *cobra.Command { var opts runOptions var copts *containerOptions cmd := &cobra.Command{ Use: "run [OPTIONS] IMAGE [COMMAND] [ARG...]", Short: "Run a command in a new container", Args: cli.RequiresMinArgs(1), RunE: func(cmd *cobra.Command, args []string) error { copts.Image = args[0] if len(args) > 1 { copts.Args = args[1:] } return runRun(dockerCli, cmd.Flags(), &opts, copts) }, } flags := cmd.Flags() flags.SetInterspersed(false) // These are flags not stored in Config/HostConfig flags.BoolVarP(&opts.detach, "detach", "d", false, "Run container in background and print container ID") flags.BoolVar(&opts.sigProxy, "sig-proxy", true, "Proxy received signals to the process") flags.StringVar(&opts.name, "name", "", "Assign a name to the container") flags.StringVar(&opts.detachKeys, "detach-keys", "", "Override the key sequence for detaching a container") // Add an explicit help that doesn't have a `-h` to prevent the conflict // with hostname flags.Bool("help", false, "Print usage") command.AddPlatformFlag(flags, &opts.platform) command.AddTrustVerificationFlags(flags, &opts.untrusted, dockerCli.ContentTrustEnabled()) copts = addFlags(flags) return cmd } ```
和其他命令一样,这里设置了一堆 flags,还做了参数校验:比如 image 必须要有(`cli.RequiresMinArgs(1)`),第一个参数被当作 image,其余参数作为容器内要执行的命令。RunE 的核心是 runRun 函数,而 runRun 的核心就是 runContainer:

```
func runRun(dockerCli command.Cli, flags *pflag.FlagSet, ropts *runOptions, copts *containerOptions) error {
	proxyConfig := dockerCli.ConfigFile().ParseProxyConfig(dockerCli.Client().DaemonHost(), opts.ConvertKVStringsToMapWithNil(copts.env.GetAll()))
	newEnv := []string{}
	for k, v := range proxyConfig {
		if v == nil {
			newEnv = append(newEnv, k)
		} else {
			newEnv = append(newEnv, fmt.Sprintf("%s=%s", k, *v))
		}
	}
	copts.env = *opts.NewListOptsRef(&newEnv, nil)
	containerConfig, err := parse(flags, copts, dockerCli.ServerInfo().OSType)
	// just in case the parse does not exit
	if err != nil {
		reportError(dockerCli.Err(), "run", err.Error(), true)
		return cli.StatusError{StatusCode: 125}
	}
	if err = validateAPIVersion(containerConfig, dockerCli.Client().ClientVersion()); err != nil {
		reportError(dockerCli.Err(), "run", err.Error(), true)
		return cli.StatusError{StatusCode: 125}
	}
	return runContainer(dockerCli, ropts, copts, containerConfig)
}
```
#### 3.3 runContainer

从下面的函数逻辑可以看出来,run container 分为两个过程:createContainer 和 ContainerStart。在 `ContainerCreate()` 和 `ContainerStart()` 中分别向 daemon 发送了 create 和 start 命令。下一步,就需要到 docker daemon 中分析 daemon 对 create 和 start 的处理。

```
createResponse, err := createContainer(ctx, dockerCli, containerConfig, opts.name)
if err := client.ContainerStart(ctx, createResponse.ID, types.ContainerStartOptions{}); err != nil
```
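顺带补充:如果跳过 docker cli 的命令行处理,直接用 docker 官方的 client 包(也就是 dockerCli.Client() 背后的实现)来完成 create + start,大致是下面这样。这是一段示意代码,按照文中 19.03 版本的接口签名编写(新版本的 ContainerCreate 还多一个 platform 参数),并且假设本机 dockerd 正常运行、nginx 镜像已经存在(这里没有处理拉取镜像的逻辑):

```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()

	// FromEnv:默认就是 unix:///var/run/docker.sock
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	// 对应 create 阶段:POST /containers/create
	resp, err := cli.ContainerCreate(ctx,
		&container.Config{Image: "nginx", Cmd: []string{"ls"}},
		nil, nil, "")
	if err != nil {
		panic(err)
	}

	// 对应 start 阶段:POST /containers/{id}/start
	if err := cli.ContainerStart(ctx, resp.ID, types.ContainerStartOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("container started:", resp.ID)
}
```

可以看到,它和 runContainer 做的事情是对应的:先 POST /containers/create,再 POST /containers/{id}/start。runContainer 的完整实现如下: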
``` // nolint: gocyclo func runContainer(dockerCli command.Cli, opts *runOptions, copts *containerOptions, containerConfig *containerConfig) error { config := containerConfig.Config hostConfig := containerConfig.HostConfig stdout, stderr := dockerCli.Out(), dockerCli.Err() client := dockerCli.Client() config.ArgsEscaped = false // 1.更加配置初始化是否attach,运行的os等 if !opts.detach { if err := dockerCli.In().CheckTty(config.AttachStdin, config.Tty); err != nil { return err } } else { if copts.attach.Len() != 0 { return errors.New("Conflicting options: -a and -d") } config.AttachStdin = false config.AttachStdout = false config.AttachStderr = false config.StdinOnce = false } // Telling the Windows daemon the initial size of the tty during start makes // a far better user experience rather than relying on subsequent resizes // to cause things to catch up. if runtime.GOOS == "windows" { hostConfig.ConsoleSize[0], hostConfig.ConsoleSize[1] = dockerCli.Out().GetTtySize() } ctx, cancelFun := context.WithCancel(context.Background()) defer cancelFun() // 1.调用createContainer创建container createResponse, err := createContainer(ctx, dockerCli, containerConfig, &opts.createOptions) if err != nil { reportError(stderr, "run", err.Error(), true) return runStartContainerErr(err) } if opts.sigProxy { sigc := ForwardAllSignals(ctx, dockerCli, createResponse.ID) defer signal.StopCatch(sigc) } var ( waitDisplayID chan struct{} errCh chan error ) if !config.AttachStdout && !config.AttachStderr { // Make this asynchronous to allow the client to write to stdin before having to read the ID waitDisplayID = make(chan struct{}) go func() { defer close(waitDisplayID) fmt.Fprintln(stdout, createResponse.ID) }() } attach := config.AttachStdin || config.AttachStdout || config.AttachStderr if attach { if opts.detachKeys != "" { dockerCli.ConfigFile().DetachKeys = opts.detachKeys } close, err := attachContainer(ctx, dockerCli, &errCh, config, createResponse.ID) if err != nil { return err } defer close() } statusChan := waitExitOrRemoved(ctx, dockerCli, createResponse.ID, copts.autoRemove) //start the container // 3.调用ContainerStart,运行容器 if err := client.ContainerStart(ctx, createResponse.ID, types.ContainerStartOptions{}); err != nil { // If we have hijackedIOStreamer, we should notify // hijackedIOStreamer we are going to exit and wait // to avoid the terminal are not restored. if attach { cancelFun() <-errCh } reportError(stderr, "run", err.Error(), false) if copts.autoRemove { // wait container to be removed <-statusChan } return runStartContainerErr(err) } if (config.AttachStdin || config.AttachStdout || config.AttachStderr) && config.Tty && dockerCli.Out().IsTerminal() { if err := MonitorTtySize(ctx, dockerCli, createResponse.ID, false); err != nil { fmt.Fprintln(stderr, "Error monitoring TTY size:", err) } } if errCh != nil { if err := <-errCh; err != nil { if _, ok := err.(term.EscapeError); ok { // The user entered the detach escape sequence. return nil } logrus.Debugf("Error hijack: %s", err) return err } } // Detached mode: wait for the id to be displayed and return. 
if !config.AttachStdout && !config.AttachStderr { // Detached mode <-waitDisplayID return nil } status := <-statusChan if status != 0 { return cli.StatusError{StatusCode: status} } return nil } ``` #### 3.4 ContainerCreate & ContainerStart 从下面的代码很容易看出来。ContainerCreate核心逻辑如下: (1)通过配置获取镜像tag等信息 (2)调用dockercli客户端,创建container。 (3)如果创建失败,并且是因为image的问题,并且 --pull=always或者missing,就先pull image,然后再次创建 ``` --pull missing Pull image before running ("always"|"missing"|"never") ``` docker run参数详见: https://docs.docker.com/engine/reference/commandline/run/
``` func createContainer(ctx context.Context, dockerCli command.Cli, containerConfig *containerConfig, opts *createOptions) (*container.ContainerCreateCreatedBody, error) { config := containerConfig.Config hostConfig := containerConfig.HostConfig networkingConfig := containerConfig.NetworkingConfig stderr := dockerCli.Err() warnOnOomKillDisable(*hostConfig, stderr) warnOnLocalhostDNS(*hostConfig, stderr) var ( trustedRef reference.Canonical namedRef reference.Named ) containerIDFile, err := newCIDFile(hostConfig.ContainerIDFile) if err != nil { return nil, err } defer containerIDFile.Close() ref, err := reference.ParseAnyReference(config.Image) if err != nil { return nil, err } if named, ok := ref.(reference.Named); ok { namedRef = reference.TagNameOnly(named) if taggedRef, ok := namedRef.(reference.NamedTagged); ok && !opts.untrusted { var err error trustedRef, err = image.TrustedReference(ctx, dockerCli, taggedRef, nil) if err != nil { return nil, err } config.Image = reference.FamiliarString(trustedRef) } } //create the container response, err := dockerCli.Client().ContainerCreate(ctx, config, hostConfig, networkingConfig, opts.name) //if image not found try to pull it if err != nil { if apiclient.IsErrNotFound(err) && namedRef != nil { fmt.Fprintf(stderr, "Unable to find image '%s' locally\n", reference.FamiliarString(namedRef)) // we don't want to write to stdout anything apart from container.ID if err := pullImage(ctx, dockerCli, config.Image, opts.platform, stderr); err != nil { return nil, err } if taggedRef, ok := namedRef.(reference.NamedTagged); ok && trustedRef != nil { if err := image.TagTrusted(ctx, dockerCli, trustedRef, taggedRef); err != nil { return nil, err } } // Retry var retryErr error response, retryErr = dockerCli.Client().ContainerCreate(ctx, config, hostConfig, networkingConfig, opts.name) if retryErr != nil { return nil, retryErr } } else { return nil, err } } for _, warning := range response.Warnings { fmt.Fprintf(stderr, "WARNING: %s\n", warning) } err = containerIDFile.Write(response.ID) return &response, err } ``` ContainerCreate, ContainerStart 直接就是Post /containers/create 或者/start 请求创建, 运行。 ``` // ContainerCreate creates a new container based in the given configuration. // It can be associated with a name, but it's not mandatory. func (cli *Client) ContainerCreate(ctx context.Context, config *container.Config, hostConfig *container.HostConfig, networkingConfig *network.NetworkingConfig, containerName string) (container.ContainerCreateCreatedBody, error) { var response container.ContainerCreateCreatedBody if err := cli.NewVersionError("1.25", "stop timeout"); config != nil && config.StopTimeout != nil && err != nil { return response, err } // When using API 1.24 and under, the client is responsible for removing the container if hostConfig != nil && versions.LessThan(cli.ClientVersion(), "1.25") { hostConfig.AutoRemove = false } query := url.Values{} if containerName != "" { query.Set("name", containerName) } body := configWrapper{ Config: config, HostConfig: hostConfig, NetworkingConfig: networkingConfig, } serverResp, err := cli.post(ctx, "/containers/create", query, body, nil) defer ensureReaderClosed(serverResp) if err != nil { return response, err } err = json.NewDecoder(serverResp.body).Decode(&response) return response, err } // ContainerStart sends a request to the docker daemon to start a container. 
func (cli *Client) ContainerStart(ctx context.Context, containerID string, options types.ContainerStartOptions) error { query := url.Values{} if len(options.CheckpointID) != 0 { query.Set("checkpoint", options.CheckpointID) } if len(options.CheckpointDir) != 0 { query.Set("checkpoint-dir", options.CheckpointDir) } resp, err := cli.post(ctx, "/containers/"+containerID+"/start", query, nil, nil) ensureReaderClosed(resp) return err } ``` #### 3.5 总结 可以看出来 ContainerCreate和ContainerStart处理非常简单,就是 (1)利用var/run/docker.sock 文件创建了http的客户端 (2)调用cli 客户端发送post请求,创建和启动容器 ================================================ FILE: docker/12. dockerd源码分析-docker run为例.md ================================================ * [0\. 章节目的](#0-章节目的) * [1\. docker run服务器端处理流程](#1-docker-run服务器端处理流程) * [1\.1 dockerd 函数入口](#11-dockerd-函数入口) * [1\.2 runDaemon](#12-rundaemon) * [1\.3 daemonCli\.start](#13-daemonclistart) * [1\.4 NewDaemon](#14-newdaemon) * [1\.5 dockerd的路由设置 containers](#15--dockerd的路由设置-containers) * [2\. docker create container详细流程分析](#2-docker-create-container详细流程分析) * [2\.1 postContainersCreate](#21-postcontainerscreate) * [2\.2 containerCreate](#22-containercreate) * [2\.3 daemon\.create](#23-daemoncreate) * [2\.4 newContainer](#24--newcontainer) * [2\.5 实验](#25-实验) * [2\.5\.1 实验1\-观察目录变化](#251-实验1-观察目录变化) * [2\.5\.2 实验2\-查看配置](#252-实验2-查看配置) * [2\.6 总结](#26-总结) * [3\. Docker start container详细流程分析](#3-docker-start-container详细流程分析) * [3\.1 postContainerExecStart](#31-postcontainerexecstart) * [3\.2 ContainerStart](#32-containerstart) * [3\.3 containerStart](#33-containerstart) * [4\. docker start 创建的详细过程](#4-docker-start-创建的详细过程) * [4\.1 containerd的初始化](#41-containerd的初始化) * [4\.2 容器的网络设置](#42-容器的网络设置) * [4\.3 容器的spec设置\-createSpec函数](#43-容器的spec设置-createspec函数) * [4\.4 containerd创建容器的详细流程](#44-containerd创建容器的详细流程) * [5\. 总结](#5-总结) ### 0. 章节目的 以 docker run niginx ls为例。从源码角度弄清楚dockerd具体的执行过程。 源码版本:https://github.com/moby/moby/tree/v19.03.9-ce 从上一篇分析中,docker run 其实是分为了container create, container start这两个步骤。 ### 1. docker run服务器端处理流程 还是先从docker的main函数可以入手。在安装docker之后。查看docker的配置,发现docker运行没有带任何参数。 ``` root@k8s-node:~# ps -ef | grep docker root 6164 5604 0 21:11 pts/1 00:00:00 grep docker root 12493 1 0 17:40 ? 00:01:04 /usr/bin/dockerd root@k8s-node:~# cat /usr/lib/systemd/system/docker.service [Unit] Description=Docker Application Container Engine Documentation=https://docs.docker.com After=network-online.target firewalld.service Wants=network-online.target [Service] Type=notify ExecStart=/usr/bin/dockerd ExecReload=/bin/kill -s HUP LimitNOFILE=infinity LimitNPROC=infinity LimitCORE=infinity TimeoutStartSec=0 Delegate=yes KillMode=process Restart=on-failure StartLimitBurst=3 StartLimitInterval=60s [Install] WantedBy=multi-user.target ```
#### 1.1 dockerd 函数入口

dockerd 的main函数在cmd/dockerd/docker.go。还是熟悉的cobra框架,所以直接从newDaemonCommand入手。newDaemonCommand 注册的 RunE 最终调用的是runDaemon,从上面分析看,这里默认dockerd启动没有带任何flags。在之前配置镜像源的时候,经常在 `/etc/docker/daemon.json` 文件中进行如下配置,这个文件其实就是docker的默认配置文件。

```
root@k8s-node:/etc/docker# cat daemon.json
{
    "registry-mirrors": ["https://b9pmyelo.mirror.aliyuncs.com"]
}
```
``` func newDaemonCommand() (*cobra.Command, error) { opts := newDaemonOptions(config.New()) cmd := &cobra.Command{ Use: "dockerd [OPTIONS]", Short: "A self-sufficient runtime for containers.", SilenceUsage: true, SilenceErrors: true, Args: cli.NoArgs, RunE: func(cmd *cobra.Command, args []string) error { opts.flags = cmd.Flags() return runDaemon(opts) }, DisableFlagsInUseLine: true, Version: fmt.Sprintf("%s, build %s", dockerversion.Version, dockerversion.GitCommit), } cli.SetupRootCommand(cmd) flags := cmd.Flags() flags.BoolP("version", "v", false, "Print version information and quit") // 读取默认的配置文件。默认是 /etc/docker/daemon.json defaultDaemonConfigFile, err := getDefaultDaemonConfigFile() if err != nil { return nil, err } flags.StringVar(&opts.configFile, "config-file", defaultDaemonConfigFile, "Daemon configuration file") opts.InstallFlags(flags) if err := installConfigFlags(opts.daemonConfig, flags); err != nil { return nil, err } installServiceFlags(flags) return cmd, nil } ```
#### 1.2 runDaemon ``` func runDaemon(opts *daemonOptions) error { daemonCli := NewDaemonCli() return daemonCli.start(opts) } ``` 这里主要是 runDaemon -> daemonCli.start。 #### 1.3 daemonCli.start start函数的核心逻辑如下: 1. 设置默认的配置,以及从命令行、文件读取配置。从打印出来的日志来看,确实没什么启动参数。基本都是默认值,比如 默认的docker目录是/var/lib/docker, 默人的sock是const DefaultDockerHost = "unix:///var/run/docker.sock" 2. 检查一些配置,比如是否debug模式,是否开启实验模式,是否以root运行等等 3. 创建docker-root目录文件,默认在 /var/lib/docker目录下 4. 创建docker.pid文件 5. 创建sever config 6. 根据config,创建一个sever 7. daemon程序可以根据选项监控多个地址,loadListeners遍历这些地址,也监听了多个地址。 8. initcontainerD 初始化容器运行时, initContainerD会调用supervisor.Start然后调用 startContainerd,启动containerd。会在/var/run/docker/containerd目录下,pid和sock文件。 9. 初始化pluginStore,实际就是生成一个map用来保存有哪些plugins 10. 初始化Middlewares, http的中间件,这些中间件主要进行版本兼容性检查、添加CORS跨站点请求相关响应头、对请求进行认证。 11. 实例化Daemon对象,做好 sever端的一切准备,包括检查网络以及其他环境 12. 实例化metric server 13. docker可能以集群方式运行,开启 14. 运行 swarm containers 15. 配置路由,包括contianer,image, driver等等 16. 初始化路由,接下里会分析 1. 开启服务器,以及通知就绪等等 ``` func (cli *DaemonCli) start(opts *daemonOptions) (err error) { stopc := make(chan bool) defer close(stopc) // warn from uuid package when running the daemon uuid.Loggerf = logrus.Warnf // 1.设置默认的配置,以及从命令行、文件读取配置。从打印出来的日志来看,确实没什么启动参数。例如指定了 // root is /var/lib/docker, conf.TrustKeyPath is /etc/docker/key.json opts.SetDefaultOptions(opts.flags) // 增加日志打印输出。用于理解源码 logrus.Infof("zoux start flags.configFile is %v, damonConfig is %v, flags is %v, debug is %v, hosts is %v", opts.configFile, opts.daemonConfig,opts.Debug, opts.Hosts) if cli.Config, err = loadDaemonCliConfig(opts); err != nil { return err } if err := configureDaemonLogs(cli.Config); err != nil { return err } logrus.Info("Starting up") cli.configFile = &opts.configFile cli.flags = opts.flags // 2.检查一些配置,比如是否debug模式,是否开启实验模式,是否以root运行等等 if cli.Config.Debug { debug.Enable() } if cli.Config.Experimental { logrus.Warn("Running experimental build") if cli.Config.IsRootless() { logrus.Warn("Running in rootless mode. Cgroups, AppArmor, and CRIU are disabled.") } if rootless.RunningWithRootlessKit() { logrus.Info("Running with RootlessKit integration") if !cli.Config.IsRootless() { return fmt.Errorf("rootless mode needs to be enabled for running with RootlessKit") } } } else { if cli.Config.IsRootless() { return fmt.Errorf("rootless mode is supported only when running in experimental mode") } } // return human-friendly error before creating files if runtime.GOOS == "linux" && os.Geteuid() != 0 { return fmt.Errorf("dockerd needs to be started with root. To see how to run dockerd in rootless mode with unprivileged user, see the documentation") } system.InitLCOW(cli.Config.Experimental) if err := setDefaultUmask(); err != nil { return err } // 3. 
创建docker-root目录文件,默认在 /var/lib/docker目录下 // Create the daemon root before we create ANY other files (PID, or migrate keys) // to ensure the appropriate ACL is set (particularly relevant on Windows) if err := daemon.CreateDaemonRoot(cli.Config); err != nil { return err } if err := system.MkdirAll(cli.Config.ExecRoot, 0700, ""); err != nil { return err } potentiallyUnderRuntimeDir := []string{cli.Config.ExecRoot} // 4.创建docker.pid文件 if cli.Pidfile != "" { pf, err := pidfile.New(cli.Pidfile) if err != nil { return errors.Wrap(err, "failed to start daemon") } potentiallyUnderRuntimeDir = append(potentiallyUnderRuntimeDir, cli.Pidfile) defer func() { if err := pf.Remove(); err != nil { logrus.Error(err) } }() } if cli.Config.IsRootless() { // Set sticky bit if XDG_RUNTIME_DIR is set && the file is actually under XDG_RUNTIME_DIR if _, err := homedir.StickRuntimeDirContents(potentiallyUnderRuntimeDir); err != nil { // StickRuntimeDirContents returns nil error if XDG_RUNTIME_DIR is just unset logrus.WithError(err).Warn("cannot set sticky bit on files under XDG_RUNTIME_DIR") } } // 5.创建sever config serverConfig, err := newAPIServerConfig(cli) if err != nil { return errors.Wrap(err, "failed to create API server") } // 6.根据config,创建一个sever cli.api = apiserver.New(serverConfig) // 7.daemon程序可以根据选项监控多个地址,loadListeners遍历这些地址,也监听了多个地址。 hosts, err := loadListeners(cli, serverConfig) if err != nil { return errors.Wrap(err, "failed to load listeners") } ctx, cancel := context.WithCancel(context.Background()) // 8.initcontainerD 初始化容器运行时, initContainerD会调用supervisor.Start然后调用 startContainerd,启动containerd。会在/var/run/docker/containerd目录下,pid和sock文件。 waitForContainerDShutdown, err := cli.initContainerD(ctx) if waitForContainerDShutdown != nil { defer waitForContainerDShutdown(10 * time.Second) } if err != nil { cancel() return err } defer cancel() signal.Trap(func() { cli.stop() <-stopc // wait for daemonCli.start() to return }, logrus.StandardLogger()) // Notify that the API is active, but before daemon is set up. preNotifySystem() // 9.初始化pluginStore,实际就是生成一个map用来保存有哪些plugins pluginStore := plugin.NewStore() // 10.初始化Middlewares, http的中间件,这些中间件主要进行版本兼容性检查、添加CORS跨站点请求相关响应头、对请求进行认证。 if err := cli.initMiddlewares(cli.api, serverConfig, pluginStore); err != nil { logrus.Fatalf("Error creating middlewares: %v", err) } // 11.实例化Daemon对象,做好 sever端的一切准备,包括检查网络以及其他环境 d, err := daemon.NewDaemon(ctx, cli.Config, pluginStore) if err != nil { return errors.Wrap(err, "failed to start daemon") } d.StoreHosts(hosts) // validate after NewDaemon has restored enabled plugins. Don't change order. if err := validateAuthzPlugins(cli.Config.AuthorizationPlugins, pluginStore); err != nil { return errors.Wrap(err, "failed to validate authorization plugin") } cli.d = d // 12. 实例化metric server if err := cli.startMetricsServer(cli.Config.MetricsAddress); err != nil { return err } // 13.docker可能以集群方式运行,开启 c, err := createAndStartCluster(cli, d) if err != nil { logrus.Fatalf("Error starting cluster component: %v", err) } // Restart all autostart containers which has a swarm endpoint // and is not yet running now that we have successfully // initialized the cluster. // 14.运行 swarm containers d.RestartSwarmContainers() logrus.Info("Daemon has completed initialization") // 15.配置路由,包括contianer,image, driver等等 routerOptions, err := newRouterOptions(cli.Config, d) if err != nil { return err } routerOptions.api = cli.api routerOptions.cluster = c // 16.初始化路由,接下里会分析 initRouter(routerOptions) // 17. 
开启服务器,以及通知就绪等等 go d.ProcessClusterNotifications(ctx, c.GetWatchStream()) cli.setupConfigReloadTrap() // The serve API routine never exits unless an error occurs // We need to start it as a goroutine and wait on it so // daemon doesn't exit serveAPIWait := make(chan error) go cli.api.Wait(serveAPIWait) // after the daemon is done setting up we can notify systemd api notifySystem() // Daemon is fully initialized and handling API traffic // Wait for serve API to complete errAPI := <-serveAPIWait c.Cleanup() shutdownDaemon(d) // Stop notification processing and any background processes cancel() if errAPI != nil { return errors.Wrap(errAPI, "shutting down due to ServeAPI error") } logrus.Info("Daemon shutdown complete") return nil } 日志输出结果: logrus.Infof("zoux start flags.configFile is %v, damonConfig is %v, flags is %v, debug is %v, hosts is %v", opts.configFile, opts.daemonConfig,opts.Debug, opts.Hosts) Feb 28 16:56:58 k8s-node dockerd[28021]: time="2022-02-28T16:56:58.742186824+08:00" level=info msg="zoux start flags.configFile is /etc/docker/daemon.json, damonConfig is &{{ [] true map[] false [] [] [] 0 0 /var/run/docker.pid false /var/lib/docker /var/run/docker docker false map[] 0xc0005f3bd0 0xc0005f3bd8 15 false [] false false { } 0 0 {[] [] []} {json-file map[]} {{ } {0.0.0.0 true} false true true true true } {{[]} 1500} {[] [] []} {0 0} map[] false [] false map[] {{false [] } { }} moby plugins.moby} {map[] runc } false map[] 0 0 -500 false 67108864 false private false}, flags is false, debug is [], hosts is %!v(MISSING)" ``` **initContainerD**会调用supervisor.Start然后调用 startContainerd,启动containerd。会在/var/run/docker/containerd目录下,pid和sock文件。 ``` root@k8s-node:/var/run/docker/containerd# ls 0ea51049a3dde9b6ca6940f563b920997cc7ff05425bfe5174f2fbced72a9feb 7f1343294ac385c400b076a0d0c62979909cede65e90b2a0d8615ddba36c19cd containerd-debug.sock containerd.toml daemon root@k8s-node:/var/run/docker/containerd# root@k8s-node:/var/run/docker/containerd# root@k8s-node:/var/run/docker/containerd# systemctl start docker.service root@k8s-node:/var/run/docker/containerd# root@k8s-node:/var/run/docker/containerd# ls 492a9c1152120b8eafd70c476a04aa7d73b8ec359fbf01c55e55b70912872dfe 702cc9e5c234374195375cdc05bf34eb0484221f9da3e4288d2c37154f2325bd containerd-debug.sock containerd.pid containerd.sock containerd.toml daemon root@k8s-node:/var/run/docker/containerd# ls ```
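顺便可以验证一下:这个由 dockerd 拉起的 containerd 是一个完整的 containerd,直接用 containerd 官方 Go 客户端连上面 ls 看到的 containerd.sock,就能看到 docker 创建的容器(docker 默认把自己的容器放在名为 moby 的 containerd namespace 下)。下面是一个最小示意,socket 路径、namespace 都按上述假设来写:

```
package main

import (
	"context"
	"fmt"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	// 连接 dockerd 拉起的 containerd(即上面 ls 看到的 containerd.sock)
	client, err := containerd.New("/var/run/docker/containerd/containerd.sock")
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// docker 创建的容器默认都在 "moby" 这个 namespace 下
	ctx := namespaces.WithNamespace(context.Background(), "moby")

	containers, err := client.Containers(ctx)
	if err != nil {
		panic(err)
	}
	for _, c := range containers {
		fmt.Println("containerd 中的容器:", c.ID())
	}
}
```

下面回到 initContainerD 的源码: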
``` func (cli *DaemonCli) initContainerD(ctx context.Context) (func(time.Duration) error, error) { var waitForShutdown func(time.Duration) error if cli.Config.ContainerdAddr == "" { systemContainerdAddr, ok, err := systemContainerdRunning(honorXDG) if err != nil { return nil, errors.Wrap(err, "could not determine whether the system containerd is running") } if !ok { logrus.Debug("Containerd not running, starting daemon managed containerd") opts, err := cli.getContainerdDaemonOpts() if err != nil { return nil, errors.Wrap(err, "failed to generate containerd options") } r, err := supervisor.Start(ctx, filepath.Join(cli.Config.Root, "containerd"), filepath.Join(cli.Config.ExecRoot, "containerd"), opts...) if err != nil { return nil, errors.Wrap(err, "failed to start containerd") } logrus.Debug("Started daemon managed containerd") cli.Config.ContainerdAddr = r.Address() // Try to wait for containerd to shutdown waitForShutdown = r.WaitTimeout } else { cli.Config.ContainerdAddr = systemContainerdAddr } } return waitForShutdown, nil } ``` #### 1.4 NewDaemon NewDaemon核心就是为了接下来开启 服务端路由做准备。包括 (1)环境的检测调整 (2)用户空间重映射特性 (3)对存储目录进行必要的权限调整、对daemon进程的`oom_score_adj`参数进行必要的调整(减小daemon进程被OS杀掉的可能性)、创建临时目录。 (4)调整进程的最大线程数限制 * 安装AppArmor相关的配置 (5)创建初始化了一堆与镜像存储相关的目录及Store,有以下几个: `/var/lib/docker/containers` 这个目录是用来记录的是容器相关的信息,每运行一个容器,就在这个目录下面生成一个容器Id对应的子目录 `/var/lib/docker/image/${graphDriverName}/layerdb` 这个目录是用来记录layer元数据的 `/var/lib/docker/image/${graphDriverName}/imagedb` 这个目录是用来记录镜像元数据的 `/var/lib/docker/image/${graphDriverName}/distribution` 这个目录用来记录layer元数据与镜像元数据之间的关联关系 `/var/lib/docker/image/${graphDriverName}/repositories.json` 这个目录是用来记录镜像仓库元数据的 `/var/lib/docker/trust` 这个目录用来放一些证书文件 * `/var/lib/docker/volumes` 这个目录是用来记录卷元数据的 (6)如果配置了在集群中向外发布的访问地址,则需要初始化集群节点的服务发现Agent。一般来说就是定时向KV库报告自身的状态及公布访问地址 (7)再然后就是给Daemon对象的一系列属性赋上值。 (8)确保插件系统初始化完毕,然后根据`/var/lib/docker/containers`目录里容器目录还原部分容器、初始化容器依赖的网络环境,初始化容器之间的link关系等。 具体不一样对应了,看代码和注释就知道了。代码位置在:daemon/daemon.go #### 1.5 dockerd的路由设置 containers 在2.2中,initRouter就是负责路由规则。可以看出来包括image, contianer, plugins等等。这里我们只关注container路由。 ``` func initRouter(opts routerOptions) { 。。。 routers := []router.Router{ // we need to add the checkpoint router before the container router or the DELETE gets masked checkpointrouter.NewRouter(opts.daemon, decoder), container.NewRouter(opts.daemon, decoder, opts.daemon.RawSysInfo().CgroupUnified), image.NewRouter(opts.daemon.ImageService()), systemrouter.NewRouter(opts.daemon, opts.cluster, opts.buildkit, opts.features), volume.NewRouter(opts.daemon.VolumesService()), build.NewRouter(opts.buildBackend, opts.daemon, opts.features), sessionrouter.NewRouter(opts.sessionManager), swarmrouter.NewRouter(opts.cluster), pluginrouter.NewRouter(opts.daemon.PluginManager()), distributionrouter.NewRouter(opts.daemon.ImageService()), } 。。。 opts.api.InitRouter(routers...) 
} ``` 上述所有的路由实现都对应在 api/server/router目录。 可以看出来: container create: 对应了 r.postContainersCreate 这个实现函数 container start: 对应了 r.postContainerExecStart 这个实现函数 ``` api/server/router/container/container.go // NewRouter initializes a new container router func NewRouter(b Backend, decoder httputils.ContainerDecoder, cgroup2 bool) router.Router { r := &containerRouter{ backend: b, decoder: decoder, cgroup2: cgroup2, } r.initRoutes() return r } // Routes returns the available routes to the container controller func (r *containerRouter) Routes() []router.Route { return r.routes } // initRoutes initializes the routes in container router func (r *containerRouter) initRoutes() { r.routes = []router.Route{ // HEAD router.NewHeadRoute("/containers/{name:.*}/archive", r.headContainersArchive), // GET router.NewGetRoute("/containers/json", r.getContainersJSON), router.NewGetRoute("/containers/{name:.*}/export", r.getContainersExport), router.NewGetRoute("/containers/{name:.*}/changes", r.getContainersChanges), router.NewGetRoute("/containers/{name:.*}/json", r.getContainersByName), router.NewGetRoute("/containers/{name:.*}/top", r.getContainersTop), router.NewGetRoute("/containers/{name:.*}/logs", r.getContainersLogs), router.NewGetRoute("/containers/{name:.*}/stats", r.getContainersStats), router.NewGetRoute("/containers/{name:.*}/attach/ws", r.wsContainersAttach), router.NewGetRoute("/exec/{id:.*}/json", r.getExecByID), router.NewGetRoute("/containers/{name:.*}/archive", r.getContainersArchive), // POST //r.postContainersCreate 这个是 container create的实现函数 router.NewPostRoute("/containers/create", r.postContainersCreate), router.NewPostRoute("/containers/{name:.*}/kill", r.postContainersKill), router.NewPostRoute("/containers/{name:.*}/pause", r.postContainersPause), router.NewPostRoute("/containers/{name:.*}/unpause", r.postContainersUnpause), router.NewPostRoute("/containers/{name:.*}/restart", r.postContainersRestart), router.NewPostRoute("/containers/{name:.*}/start", r.postContainersStart), router.NewPostRoute("/containers/{name:.*}/stop", r.postContainersStop), router.NewPostRoute("/containers/{name:.*}/wait", r.postContainersWait), router.NewPostRoute("/containers/{name:.*}/resize", r.postContainersResize), router.NewPostRoute("/containers/{name:.*}/attach", r.postContainersAttach), router.NewPostRoute("/containers/{name:.*}/copy", r.postContainersCopy), // Deprecated since 1.8, Errors out since 1.12 router.NewPostRoute("/containers/{name:.*}/exec", r.postContainerExecCreate), router.NewPostRoute("/exec/{name:.*}/start", r.postContainerExecStart), router.NewPostRoute("/exec/{name:.*}/resize", r.postContainerExecResize), router.NewPostRoute("/containers/{name:.*}/rename", r.postContainerRename), router.NewPostRoute("/containers/{name:.*}/update", r.postContainerUpdate), router.NewPostRoute("/containers/prune", r.postContainersPrune), router.NewPostRoute("/commit", r.postCommit), // PUT router.NewPutRoute("/containers/{name:.*}/archive", r.putContainersArchive), // DELETE router.NewDeleteRoute("/containers/{name:.*}", r.deleteContainers), } } ``` ### 2. docker create container详细流程分析 docker create container在后端调用的是postContainersCreate,首先从源码角度分析详细流程 #### 2.1 postContainersCreate postContainersCreate 函数逻辑如下: 1. 对request进行校验 2. 从表单获取contaienr name 3. 获取容器hostConfig, 网络config等配置 4. 
传入配置信息,调用ContainerCreate进一步创建容器 看起来核心是backend.ContainerCreate 函数 ``` func (s *containerRouter) postContainersCreate(ctx context.Context, w http.ResponseWriter, r *http.Request, vars map[string]string) error { // 1.对request进行校验 if err := httputils.ParseForm(r); err != nil { return err } if err := httputils.CheckForJSON(r); err != nil { return err } // 2.从表单获取contaienr name name := r.Form.Get("name") // 3.获取容器hostConfig, 网络config等配置 config, hostConfig, networkingConfig, err := s.decoder.DecodeConfig(r.Body) if err != nil { return err } version := httputils.VersionFromContext(ctx) adjustCPUShares := versions.LessThan(version, "1.19") // When using API 1.24 and under, the client is responsible for removing the container if hostConfig != nil && versions.LessThan(version, "1.25") { hostConfig.AutoRemove = false } if hostConfig != nil && versions.LessThan(version, "1.40") { // Ignore BindOptions.NonRecursive because it was added in API 1.40. for _, m := range hostConfig.Mounts { if bo := m.BindOptions; bo != nil { bo.NonRecursive = false } } // Ignore KernelMemoryTCP because it was added in API 1.40. hostConfig.KernelMemoryTCP = 0 // Ignore Capabilities because it was added in API 1.40. hostConfig.Capabilities = nil // Older clients (API < 1.40) expects the default to be shareable, make them happy if hostConfig.IpcMode.IsEmpty() { hostConfig.IpcMode = container.IpcMode("shareable") } } if hostConfig != nil && hostConfig.PidsLimit != nil && *hostConfig.PidsLimit <= 0 { // Don't set a limit if either no limit was specified, or "unlimited" was // explicitly set. // Both `0` and `-1` are accepted as "unlimited", and historically any // negative value was accepted, so treat those as "unlimited" as well. hostConfig.PidsLimit = nil } // 4.传入配置信息,调用ContainerCreate进一步创建容器 ccr, err := s.backend.ContainerCreate(types.ContainerCreateConfig{ Name: name, Config: config, HostConfig: hostConfig, NetworkingConfig: networkingConfig, AdjustCPUShares: adjustCPUShares, }) if err != nil { return err } return httputils.WriteJSON(w, http.StatusCreated, ccr) } ``` backend.ContainerCreate最终调用的是 daemon.ContainerCreate ``` daemon/create.go // ContainerCreate creates a regular container func (daemon *Daemon) ContainerCreate(params types.ContainerCreateConfig) (containertypes.ContainerCreateCreatedBody, error) { return daemon.containerCreate(createOpts{ params: params, managed: false, ignoreImagesArgsEscaped: false}) } ```
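postContainersCreate 解码出来的 config、hostConfig、networkingConfig,正好就是客户端 SDK 在 ContainerCreate 里塞进请求体的那三个结构。下面用官方 Go SDK 从调用方的角度把 create + start 走一遍,两次调用分别对应服务端的 postContainersCreate 和 postContainersStart 这两个路由(示意用:容器名 nginx-demo 是随意取的;ContainerCreate 的签名按前面引用的 v19.03 客户端来写,新版本 SDK 多了一个 platform 参数):

```
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()

	// 默认通过 unix:///var/run/docker.sock 连接 dockerd
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		panic(err)
	}

	// POST /containers/create -> postContainersCreate -> daemon.ContainerCreate
	resp, err := cli.ContainerCreate(ctx, &container.Config{
		Image: "nginx",
		Cmd:   []string{"ls"},
	}, nil, nil, "nginx-demo")
	if err != nil {
		panic(err)
	}
	fmt.Println("created:", resp.ID)

	// POST /containers/{id}/start -> postContainersStart -> daemon.ContainerStart
	if err := cli.ContainerStart(ctx, resp.ID, types.ContainerStartOptions{}); err != nil {
		panic(err)
	}
}
```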
#### 2.2 containerCreate containerCreate的核心逻辑如下: 1. 一开始纪录时间,估计是统计耗时用的,接下来看返回条件就知道,是做一系列的验证 2. 如果指定了镜像,就调用imageService.GetImage获取 image对象。这里只是为了获取镜像信息,如果没有镜像并没有拉取。原因是客户端docker会拉去镜像再重试 3. 修改hostconfig的不正常值,例如CPUShares、Memory 4. 继续调用daemon.create创建容器 5. 纪录已经创建容器的时间 ``` func (daemon *Daemon) containerCreate(opts createOpts) (containertypes.ContainerCreateCreatedBody, error) { start := time.Now() if opts.params.Config == nil { return containertypes.ContainerCreateCreatedBody{}, errdefs.InvalidParameter(errors.New("Config cannot be empty in order to create a container")) } os := runtime.GOOS if opts.params.Config.Image != "" { img, err := daemon.imageService.GetImage(opts.params.Config.Image) if err == nil { os = img.OS } } else { // This mean scratch. On Windows, we can safely assume that this is a linux // container. On other platforms, it's the host OS (which it already is) if runtime.GOOS == "windows" && system.LCOWSupported() { os = "linux" } } warnings, err := daemon.verifyContainerSettings(os, opts.params.HostConfig, opts.params.Config, false) if err != nil { return containertypes.ContainerCreateCreatedBody{Warnings: warnings}, errdefs.InvalidParameter(err) } err = verifyNetworkingConfig(opts.params.NetworkingConfig) if err != nil { return containertypes.ContainerCreateCreatedBody{Warnings: warnings}, errdefs.InvalidParameter(err) } if opts.params.HostConfig == nil { opts.params.HostConfig = &containertypes.HostConfig{} } err = daemon.adaptContainerSettings(opts.params.HostConfig, opts.params.AdjustCPUShares) if err != nil { return containertypes.ContainerCreateCreatedBody{Warnings: warnings}, errdefs.InvalidParameter(err) } container, err := daemon.create(opts) if err != nil { return containertypes.ContainerCreateCreatedBody{Warnings: warnings}, err } containerActions.WithValues("create").UpdateSince(start) if warnings == nil { warnings = make([]string, 0) // Create an empty slice to avoid https://github.com/moby/moby/issues/38222 } return containertypes.ContainerCreateCreatedBody{ID: container.ID, Warnings: warnings}, nil } ```
#### 2.3 daemon.create create主要逻辑如下: 1. 定义一些全局变量 2. 看起來还是只是getImages 没有pull 3. 根据镜像信息,再一次校验信息是否有误 4. 调用daemon.newContainer创建容器 5. 判断是否设置容器特权。 noNewPrivileges:设置为true后可以防止进程获取额外的权限(如使得suid和文件capabilities失效),该标记位在内核4.10版本之后可以在/proc/$pid/status中查看NoNewPrivs的设置值。更多参见 https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt 6. 为容器设置 可读性 layer层 7. 以 root uid gid的属性创建目录,在/var/lib/docker/containers目录下创建容器文件,并在容器文件下创建checkpoints目录 8. 根据特定的OS创建容器,比如默认路径已经创建volume(这些特性和os有关) 9. 设置网络 10. 更新网络 ``` // Create creates a new container from the given configuration with a given name. func (daemon *Daemon) create(opts createOpts) (retC *container.Container, retErr error) { // 1. 定义一些全局变量 var ( container *container.Container img *image.Image imgID image.ID err error ) os := runtime.GOOS // 2. getImages 获取镜像信息 if opts.params.Config.Image != "" { img, err = daemon.imageService.GetImage(opts.params.Config.Image) if err != nil { return nil, err } if img.OS != "" { os = img.OS } else { // default to the host OS except on Windows with LCOW if runtime.GOOS == "windows" && system.LCOWSupported() { os = "linux" } } imgID = img.ID() if runtime.GOOS == "windows" && img.OS == "linux" && !system.LCOWSupported() { return nil, errors.New("operating system on which parent image was created is not Windows") } } else { if runtime.GOOS == "windows" { os = "linux" // 'scratch' case. } } // On WCOW, if are not being invoked by the builder to create this container (where // ignoreImagesArgEscaped will be true) - if the image already has its arguments escaped, // ensure that this is replicated across to the created container to avoid double-escaping // of the arguments/command line when the runtime attempts to run the container. if os == "windows" && !opts.ignoreImagesArgsEscaped && img != nil && img.RunConfig().ArgsEscaped { opts.params.Config.ArgsEscaped = true } // 3.根据镜像信息,再一次校验信息是否有误 if err := daemon.mergeAndVerifyConfig(opts.params.Config, img); err != nil { return nil, errdefs.InvalidParameter(err) } if err := daemon.mergeAndVerifyLogConfig(&opts.params.HostConfig.LogConfig); err != nil { return nil, errdefs.InvalidParameter(err) } // 4.调用daemon.newContainer创建容器 if container, err = daemon.newContainer(opts.params.Name, os, opts.params.Config, opts.params.HostConfig, imgID, opts.managed); err != nil { return nil, err } defer func() { if retErr != nil { if err := daemon.cleanupContainer(container, true, true); err != nil { logrus.Errorf("failed to cleanup container on create error: %v", err) } } }() // 5. 判断是否设置容器特权。 noNewPrivileges:设置为true后可以防止进程获取额外的权限(如使得suid和文件capabilities失效),该标记位在内核4.10版本 // 之后可以在/proc/$pid/status中查看NoNewPrivs的设置值。更多参见 https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt if err := daemon.setSecurityOptions(container, opts.params.HostConfig); err != nil { return nil, err } container.HostConfig.StorageOpt = opts.params.HostConfig.StorageOpt // Fixes: https://github.com/moby/moby/issues/34074 and // https://github.com/docker/for-win/issues/999. // Merge the daemon's storage options if they aren't already present. We only // do this on Windows as there's no effective sandbox size limit other than // physical on Linux. if runtime.GOOS == "windows" { if container.HostConfig.StorageOpt == nil { container.HostConfig.StorageOpt = make(map[string]string) } for _, v := range daemon.configStore.GraphOptions { opt := strings.SplitN(v, "=", 2) if _, ok := container.HostConfig.StorageOpt[opt[0]]; !ok { container.HostConfig.StorageOpt[opt[0]] = opt[1] } } } // 6. 
为容器设置 可读性 layer层 // Set RWLayer for container after mount labels have been set rwLayer, err := daemon.imageService.CreateLayer(container, setupInitLayer(daemon.idMapping)) if err != nil { return nil, errdefs.System(err) } container.RWLayer = rwLayer rootIDs := daemon.idMapping.RootPair() // 7. 以 root uid gid的属性创建目录,在/var/lib/docker/containers目录下创建容器文件,并在容器文件下创建checkpoints目录 if err := idtools.MkdirAndChown(container.Root, 0700, rootIDs); err != nil { return nil, err } if err := idtools.MkdirAndChown(container.CheckpointDir(), 0700, rootIDs); err != nil { return nil, err } if err := daemon.setHostConfig(container, opts.params.HostConfig); err != nil { return nil, err } // 8. 根据特定的OS创建容器,比如默认路径已经创建volume(这些特性和os有关) if err := daemon.createContainerOSSpecificSettings(container, opts.params.Config, opts.params.HostConfig); err != nil { return nil, err } // 9.设置网络 var endpointsConfigs map[string]*networktypes.EndpointSettings if opts.params.NetworkingConfig != nil { endpointsConfigs = opts.params.NetworkingConfig.EndpointsConfig } // Make sure NetworkMode has an acceptable value. We do this to ensure // backwards API compatibility. runconfig.SetDefaultNetModeIfBlank(container.HostConfig) // 10.更新网络 daemon.updateContainerNetworkSettings(container, endpointsConfigs) if err := daemon.Register(container); err != nil { return nil, err } stateCtr.set(container.ID, "stopped") daemon.LogContainerEvent(container, "create") return container, nil } ```
接下来继续看看第四步,daemon.newContainer做了什么 #### 2.4 newContainer 可以看出来new container只是创建容器这个对象。具体就是给对象赋值。而创建目录啥的在createContainerOSSpecificSettings做了 ``` func (daemon *Daemon) newContainer(name string, operatingSystem string, config *containertypes.Config, hostConfig *containertypes.HostConfig, imgID image.ID, managed bool) (*container.Container, error) { var ( id string err error noExplicitName = name == "" ) id, name, err = daemon.generateIDAndName(name) if err != nil { return nil, err } if hostConfig.NetworkMode.IsHost() { if config.Hostname == "" { config.Hostname, err = os.Hostname() if err != nil { return nil, errdefs.System(err) } } } else { daemon.generateHostname(id, config) } entrypoint, args := daemon.getEntrypointAndArgs(config.Entrypoint, config.Cmd) base := daemon.newBaseContainer(id) base.Created = time.Now().UTC() base.Managed = managed base.Path = entrypoint base.Args = args //FIXME: de-duplicate from config base.Config = config base.HostConfig = &containertypes.HostConfig{} base.ImageID = imgID base.NetworkSettings = &network.Settings{IsAnonymousEndpoint: noExplicitName} base.Name = name base.Driver = daemon.imageService.GraphDriverForOS(operatingSystem) base.OS = operatingSystem return base, err } ``` #### 2.5 实验 ##### 2.5.1 实验1-观察目录变化 在执行 `docker container create --name nginx` 命令的过程中,时刻观察/var/lib/docker的变化,发现在create的阶段镜像文件以及挂载都准备好了。 如果nginx镜像不存在,可以看到下载进行的整个过程。 ``` 03/03/22 15:08 /var/lib/docker/tmp/ GetImageBlob163641926 CREATE 03/03/22 15:08 /var/lib/docker/tmp/ GetImageBlob307902189 CREATE 03/03/22 15:08 /var/lib/docker/tmp/ GetImageBlob256086888 CREATE 03/03/22 15:08 /var/lib/docker/tmp/ GetImageBlob630460839 CREATE 03/03/22 15:08 /var/lib/docker/tmp/ GetImageBlob086739162 CREATE 03/03/22 15:08 /var/lib/docker/tmp/ GetImageBlob105444465 CREATE ```
``` root@k8s-node:~# inotifywait -mrq --timefmt '%d/%m/%y %H:%M' --format '%T %w %f %e' -e modify,delete,create,attrib /var/lib/docker 03/03/22 11:44 /var/lib/docker/overlay2/ 4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init CREATE,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/l/ GRIVPJLK7YAT3OXDTS4V2QFCUA CREATE 03/03/22 11:44 /var/lib/docker/overlay2/7a25fdc447cb19682434e15e2a721250a869eb3a75aa8d439bbd985e736f8ef4/ committed MODIFY 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/ merged CREATE,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/ work CREATE,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/ work ATTRIB,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/ work ATTRIB,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/work/ ATTRIB,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/ diff ATTRIB,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/ ATTRIB,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/work/ #3fed CREATE,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/work/ #3fed ATTRIB,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/ diff ATTRIB,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/ ATTRIB,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/ .dockerenv CREATE 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/ .dockerenv ATTRIB 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/work/ #3fee CREATE,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/work/ #3fee ATTRIB,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/ diff ATTRIB,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/ ATTRIB,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/work/ #3ff0 CREATE 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/work/ #3ff2 CREATE 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/dev/ shm CREATE,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/dev/ shm ATTRIB,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/dev/ console CREATE 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/dev/ console ATTRIB 03/03/22 11:44 
/var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/ merged DELETE,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/ 4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb CREATE,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/l/ PV2PZDA4VGO3PPNCMHCCT4YDVN CREATE 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/ link CREATE 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/ linkMODIFY 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/ work CREATE,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/ committed CREATE 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/ lower CREATE 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/ lower MODIFY 03/03/22 11:44 /var/lib/docker/image/overlay2/layerdb/mounts/ 15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52 CREATE,ISDIR 03/03/22 11:44 /var/lib/docker/containers/ 15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52 CREATE,ISDIR 03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-hostconfig.json389982635 ATTRIB 03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-config.v2.json466278670 CREATE 03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-config.v2.json466278670 MODIFY 03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-hostconfig.json070462229 CREATE 03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-hostconfig.json070462229 MODIFY 03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-hostconfig.json070462229 ATTRIB 03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-config.v2.json466278670 ATTRIB 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/ merged CREATE,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/work/ work CREATE,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/work/ work ATTRIB,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/work/ work ATTRIB,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/work/ work ATTRIB,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/work/work/ ATTRIB,ISDIR 03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/ merged DELETE,ISDIR 03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-config.v2.json231746416 CREATE 03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-config.v2.json231746416 MODIFY 03/03/22 11:44 
/var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-hostconfig.json732808207 CREATE 03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-hostconfig.json732808207 ATTRIB 03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-config.v2.json231746416 ATTRIB ``` ##### 2.5.2 实验2-查看配置 实际上docker create container 是制定了所有配置。包括运行命令。从inspect 就可以看出来。 ``` "Config": { "Hostname": "687c38e427a4", "Domainname": "", "User": "", "AttachStdin": false, "AttachStdout": true, "AttachStderr": true, "ExposedPorts": { "80/tcp": {} }, "Tty": false, "OpenStdin": false, "StdinOnce": false, "Env": [ "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "NGINX_VERSION=1.21.5", "NJS_VERSION=0.7.1", "PKG_RELEASE=1~bullseye" ], "Cmd": [ "ls" ], "Image": "nginx", "Volumes": null, "WorkingDir": "", "Entrypoint": [ "/docker-entrypoint.sh" ], "OnBuild": null, "Labels": { "maintainer": "NGINX Docker Maintainers " }, "StopSignal": "SIGQUIT" }, ``` #### 2.6 总结 docker create只是根据docker的配置(包括使用什么存储系统,root目录等),完成了所有的初始化。 主要是利用镜像层已有的数据。初始化container的所有数据。 主要是初始化这个目录:/var/lib/docker/containers/contaienrId ### 3. Docker start container详细流程分析 从上面的分析可以得出。docker create 就已经将所有的准备工作做好了,包括运行的参数。接下来看看docker start做了什么。 #### 3.1 postContainerExecStart 和create一样,这里主要是调用了postContainerExecStart进行start ``` func (s *containerRouter) postContainersStart(ctx context.Context, w http.ResponseWriter, r *http.Request, vars map[string]string) error { // If contentLength is -1, we can assumed chunked encoding // or more technically that the length is unknown // https://golang.org/src/pkg/net/http/request.go#L139 // net/http otherwise seems to swallow any headers related to chunked encoding // including r.TransferEncoding // allow a nil body for backwards compatibility version := httputils.VersionFromContext(ctx) var hostConfig *container.HostConfig // A non-nil json object is at least 7 characters. if r.ContentLength > 7 || r.ContentLength == -1 { if versions.GreaterThanOrEqualTo(version, "1.24") { return bodyOnStartError{} } if err := httputils.CheckForJSON(r); err != nil { return err } c, err := s.decoder.DecodeHostConfig(r.Body) if err != nil { return err } hostConfig = c } if err := httputils.ParseForm(r); err != nil { return err } checkpoint := r.Form.Get("checkpoint") checkpointDir := r.Form.Get("checkpoint-dir") if err := s.backend.ContainerStart(vars["name"], hostConfig, checkpoint, checkpointDir); err != nil { return err } w.WriteHeader(http.StatusNoContent) return nil } ```
#### 3.2 ContainerStart ContainerStart主要逻辑如下: (1)根据容器name, 判断容器状态,比如paused状态的容器不能start等等。 (2)判断hostconfig信息等,hostconfig必须在create的时候指定,start只管启动 (3)调用containerStart进行start。核心是这个函数 ``` // ContainerStart starts a container. func (daemon *Daemon) ContainerStart(name string, hostConfig *containertypes.HostConfig, checkpoint string, checkpointDir string) error { if checkpoint != "" && !daemon.HasExperimental() { return errdefs.InvalidParameter(errors.New("checkpoint is only supported in experimental mode")) } container, err := daemon.GetContainer(name) if err != nil { return err } validateState := func() error { container.Lock() defer container.Unlock() if container.Paused { return errdefs.Conflict(errors.New("cannot start a paused container, try unpause instead")) } if container.Running { return containerNotModifiedError{running: true} } if container.RemovalInProgress || container.Dead { return errdefs.Conflict(errors.New("container is marked for removal and cannot be started")) } return nil } if err := validateState(); err != nil { return err } // Windows does not have the backwards compatibility issue here. if runtime.GOOS != "windows" { // This is kept for backward compatibility - hostconfig should be passed when // creating a container, not during start. if hostConfig != nil { logrus.Warn("DEPRECATED: Setting host configuration options when the container starts is deprecated and has been removed in Docker 1.12") oldNetworkMode := container.HostConfig.NetworkMode if err := daemon.setSecurityOptions(container, hostConfig); err != nil { return errdefs.InvalidParameter(err) } if err := daemon.mergeAndVerifyLogConfig(&hostConfig.LogConfig); err != nil { return errdefs.InvalidParameter(err) } if err := daemon.setHostConfig(container, hostConfig); err != nil { return errdefs.InvalidParameter(err) } newNetworkMode := container.HostConfig.NetworkMode if string(oldNetworkMode) != string(newNetworkMode) { // if user has change the network mode on starting, clean up the // old networks. It is a deprecated feature and has been removed in Docker 1.12 container.NetworkSettings.Networks = nil if err := container.CheckpointTo(daemon.containersReplica); err != nil { return errdefs.System(err) } } container.InitDNSHostConfig() } } else { if hostConfig != nil { return errdefs.InvalidParameter(errors.New("Supplying a hostconfig on start is not supported. It should be supplied on create")) } } // check if hostConfig is in line with the current system settings. // It may happen cgroups are umounted or the like. if _, err = daemon.verifyContainerSettings(container.OS, container.HostConfig, nil, false); err != nil { return errdefs.InvalidParameter(err) } // Adapt for old containers in case we have updates in this function and // old containers never have chance to call the new function in create stage. if hostConfig != nil { if err := daemon.adaptContainerSettings(container.HostConfig, false); err != nil { return errdefs.InvalidParameter(err) } } return daemon.containerStart(container, checkpoint, checkpointDir, true) } ```
#### 3.3 containerStart 核心逻辑如下: (1)判断容器状态,是否已经running或者dead (2)通过defer函数进行收尾,然后start过程出现了错误,调用daemon.Cleanup,ContainerRm进行清理工作 (3)挂载目录。docker start过程也会很多目录的创建,mount (4)设置容器的网络模式,默认模式bridge:同一个host主机上容器的通信通过Linux bridge进行。与宿主机外部网络的通信需要通过宿主机端 口进行NAT (5)创建/proc /dev等spec文件,对容器所特有的属性都进行设置,例如:资源限制,命名空间,安全模式等等配置信息 (6)初始化libContainerd的 createOptions,到这里就是调用containerd了 (7)通过containerd创建容器 (8)通过containerd启动容器 (9)设置状态,已经running等等 ``` // containerStart prepares the container to run by setting up everything the // container needs, such as storage and networking, as well as links // between containers. The container is left waiting for a signal to // begin running. func (daemon *Daemon) containerStart(container *container.Container, checkpoint string, checkpointDir string, resetRestartManager bool) (err error) { start := time.Now() container.Lock() defer container.Unlock() // 1.判断容器状态,是否已经running或者dead if resetRestartManager && container.Running { // skip this check if already in restarting step and resetRestartManager==false return nil } if container.RemovalInProgress || container.Dead { return errdefs.Conflict(errors.New("container is marked for removal and cannot be started")) } if checkpointDir != "" { // TODO(mlaventure): how would we support that? return errdefs.Forbidden(errors.New("custom checkpointdir is not supported")) } // 2.通过defer函数进行收尾,然后start过程出现了错误,调用daemon.Cleanup,ContainerRm进行清理工作 // if we encounter an error during start we need to ensure that any other // setup has been cleaned up properly defer func() { if err != nil { container.SetError(err) // if no one else has set it, make sure we don't leave it at zero if container.ExitCode() == 0 { container.SetExitCode(128) } if err := container.CheckpointTo(daemon.containersReplica); err != nil { logrus.Errorf("%s: failed saving state on start failure: %v", container.ID, err) } container.Reset(false) daemon.Cleanup(container) // if containers AutoRemove flag is set, remove it after clean up if container.HostConfig.AutoRemove { container.Unlock() if err := daemon.ContainerRm(container.ID, &types.ContainerRmConfig{ForceRemove: true, RemoveVolume: true}); err != nil { logrus.Errorf("can't remove container %s: %v", container.ID, err) } container.Lock() } } }() // 3.挂载目录。docker start过程也会很多目录的创建,mount if err := daemon.conditionalMountOnStart(container); err != nil { return err } // 4.设置容器的网络模式,默认模式bridge:同一个host主机上容器的通信通过Linux bridge进行。与宿主机外部网络的通信需要通过宿主机端 口进行NAT if err := daemon.initializeNetworking(container); err != nil { return err } // 5. 创建/proc /dev等spec文件,对容器所特有的属性都进行设置,例如:资源限制,命名空间,安全模式等等配置信息 spec, err := daemon.createSpec(container) if err != nil { return errdefs.System(err) } if resetRestartManager { container.ResetRestartManager(true) container.HasBeenManuallyStopped = false } if err := daemon.saveApparmorConfig(container); err != nil { return err } if checkpoint != "" { checkpointDir, err = getCheckpointDir(checkpointDir, checkpoint, container.Name, container.ID, container.CheckpointDir(), false) if err != nil { return err } } // 6.初始化libContainerd的 createOptions,到这里就是调用containerd了 createOptions, err := daemon.getLibcontainerdCreateOptions(container) if err != nil { return err } ctx := context.TODO() // 7. 
通过containerd创建容器 err = daemon.containerd.Create(ctx, container.ID, spec, createOptions) if err != nil { if errdefs.IsConflict(err) { logrus.WithError(err).WithField("container", container.ID).Error("Container not cleaned up from containerd from previous run") // best effort to clean up old container object daemon.containerd.DeleteTask(ctx, container.ID) if err := daemon.containerd.Delete(ctx, container.ID); err != nil && !errdefs.IsNotFound(err) { logrus.WithError(err).WithField("container", container.ID).Error("Error cleaning up stale containerd container object") } err = daemon.containerd.Create(ctx, container.ID, spec, createOptions) } if err != nil { return translateContainerdStartErr(container.Path, container.SetExitCode, err) } } // 8. 通过containerd启动容器 // TODO(mlaventure): we need to specify checkpoint options here pid, err := daemon.containerd.Start(context.Background(), container.ID, checkpointDir, container.StreamConfig.Stdin() != nil || container.Config.Tty, container.InitializeStdio) if err != nil { if err := daemon.containerd.Delete(context.Background(), container.ID); err != nil { logrus.WithError(err).WithField("container", container.ID). Error("failed to delete failed start container") } return translateContainerdStartErr(container.Path, container.SetExitCode, err) } // 9.设置状态,已经running等等 container.SetRunning(pid, true) container.HasBeenStartedBefore = true daemon.setStateCounter(container) daemon.initHealthMonitor(container) if err := container.CheckpointTo(daemon.containersReplica); err != nil { logrus.WithError(err).WithField("container", container.ID). Errorf("failed to store container") } daemon.LogContainerEvent(container, "start") containerActions.WithValues("start").UpdateSince(start) return nil } ```
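第 7、8 步里的 daemon.containerd.Create / Start 是 dockerd 对 libcontainerd 的封装,语义上大致等价于 containerd 客户端的两步:先在 containerd 里登记一个 container(元数据 + OCI spec),再为它创建并启动 task(真正的容器进程)。下面用 containerd 官方客户端示意这个语义(注意这不是 dockerd 内部的真实调用路径,socket 路径用的是 containerd 默认地址,镜像和容器 ID 也是随意取的):

```
package main

import (
	"context"
	"fmt"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
	"github.com/containerd/containerd/namespaces"
	"github.com/containerd/containerd/oci"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		panic(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "example")

	// 拉取并解压镜像
	image, err := client.Pull(ctx, "docker.io/library/nginx:latest", containerd.WithPullUnpack)
	if err != nil {
		panic(err)
	}

	// 大致对应 daemon.containerd.Create:登记 container 元数据 + 生成 OCI spec
	cont, err := client.NewContainer(ctx, "nginx-demo",
		containerd.WithNewSnapshot("nginx-demo-snapshot", image),
		containerd.WithNewSpec(oci.WithImageConfig(image)),
	)
	if err != nil {
		panic(err)
	}

	// 大致对应 daemon.containerd.Start:创建 task 并启动,容器进程从这里才真正跑起来
	task, err := cont.NewTask(ctx, cio.NewCreator(cio.WithStdio))
	if err != nil {
		panic(err)
	}
	if err := task.Start(ctx); err != nil {
		panic(err)
	}
	fmt.Println("task pid:", task.Pid())
}
```

真正 fork 出容器进程、调用 runc 的是 containerd(及 containerd-shim),这也是第 4 节想继续展开的内容。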
**docker start nginx 过程的目录变化** ``` root@k8s-node:~# inotifywait -mrq --timefmt '%d/%m/%y %H:%M' --format '%T %w %f %e' -e modify,delete,create,attrib /var/lib/docker 03/03/22 17:06 /var/lib/docker/overlay2/105b22191a32cf89aa1ffb96ee4a1a55032ed251a2877f5ea480c4d4921c5244/ merged CREATE,ISDIR 03/03/22 17:06 /var/lib/docker/overlay2/105b22191a32cf89aa1ffb96ee4a1a55032ed251a2877f5ea480c4d4921c5244/work/ work DELETE,ISDIR 03/03/22 17:06 /var/lib/docker/overlay2/105b22191a32cf89aa1ffb96ee4a1a55032ed251a2877f5ea480c4d4921c5244/work/ work CREATE,ISDIR 03/03/22 17:06 /var/lib/docker/overlay2/105b22191a32cf89aa1ffb96ee4a1a55032ed251a2877f5ea480c4d4921c5244/work/ work ATTRIB,ISDIR 03/03/22 17:06 /var/lib/docker/overlay2/105b22191a32cf89aa1ffb96ee4a1a55032ed251a2877f5ea480c4d4921c5244/work/ work ATTRIB,ISDIR 03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY 03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY 03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY 03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY 03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY 03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY 03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY 03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY 03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY 03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY 03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY 03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY 03/03/22 17:06 /var/lib/docker/containers/8f7318baf651f7a9539e1d41486151946b36af202f0b0ef4257954911fb77edc/ hosts MODIFY 03/03/22 17:06 /var/lib/docker/containers/8f7318baf651f7a9539e1d41486151946b36af202f0b0ef4257954911fb77edc/ hosts MODIFY 03/03/22 17:06 /var/lib/docker/containers/8f7318baf651f7a9539e1d41486151946b36af202f0b0ef4257954911fb77edc/ resolv.conf MODIFY 03/03/22 17:06 /var/lib/docker/containers/8f7318baf651f7a9539e1d41486151946b36af2 ```
### 4. docker start 创建的详细过程 上面已经知道了docker start的大致流程。接下里才是重点,就是containerd是如何创建容器的,以及runc是啥时候调用的等等。 这一节就是详细弄清楚整个过程,可能会拆分章节。 #### 4.1 containerd的初始化 在dockerd启动的时候,通过initContainerD函数启动了containerd #### 4.2 容器的网络设置 待补充,需要补充其他知识,可能会再开一章节 #### 4.3 容器的spec设置-createSpec函数 Linux 内核提供了一种通过`/proc`文件系统,在运行时访问内核内部数据结构、改变内核设置的机制。 proc文件系统是一个伪文件系统,它只存在内存当中,而不占用外存空间。 它以文件系统的方式为访问系统内核数据的操作提供接口。 ``` unc (daemon *Daemon) createSpec(c *container.Container) (retSpec *specs.Spec, err error) { var ( opts []coci.SpecOpts s = oci.DefaultSpec() ) opts = append(opts, WithCommonOptions(daemon, c), WithCgroups(daemon, c), WithResources(c), WithSysctls(c), WithDevices(daemon, c), WithUser(c), WithRlimits(daemon, c), WithNamespaces(daemon, c), WithCapabilities(c), WithSeccomp(daemon, c), WithMounts(daemon, c), WithLibnetwork(daemon, c), WithApparmor(c), WithSelinux(c), WithOOMScore(&c.HostConfig.OomScoreAdj), ) if c.NoNewPrivileges { opts = append(opts, coci.WithNoNewPrivileges) } // Set the masked and readonly paths with regard to the host config options if they are set. if c.HostConfig.MaskedPaths != nil { opts = append(opts, coci.WithMaskedPaths(c.HostConfig.MaskedPaths)) } if c.HostConfig.ReadonlyPaths != nil { opts = append(opts, coci.WithReadonlyPaths(c.HostConfig.ReadonlyPaths)) } if daemon.configStore.Rootless { opts = append(opts, WithRootless) } return &s, coci.ApplyOpts(context.Background(), nil, &containers.Container{ ID: c.ID, }, &s, opts...) } ``` #### 4.4 containerd创建容器的详细流程 待补充 ### 5. 总结 (1)docker run nginx ls 其实是分成了两个步骤。`docker create contianer nginx ls` 和 `docker start nginx` (2)docker create 做了前期的准备工作,包括下载镜像,准备所有的文件和目录 (3)docker start核心是调用containerd进行start,启动进程等等。这个过程涉及网络以及其他底层的知识。目前先了解到这里,还有很多细节比如第四章节还待补充。这个等补充一波知识后,再更新。 ================================================ FILE: docker/2. linux cgroup 知识准备.md ================================================ * [0\. 说明](#0-说明) * [1\. cgroup简介](#1-cgroup简介) * [2\. CGroup 使用](#2-cgroup-使用) * [3\. CGroup 基本概念](#3-cgroup-基本概念) * [4\. CGroup 操作规则](#4-cgroup-操作规则) * [5\. CGroup的原理实现](#5-cgroup的原理实现) * [5\.1 cgroup 结构体](#51-cgroup-结构体) * [5\.2 CGroup 的挂载](#52-cgroup-的挂载) * [5\.3 向 CGroup 添加要进行资源控制的进程](#53-向-cgroup-添加要进行资源控制的进程) * [5\.4 限制 CGroup 的资源使用](#54-限制-cgroup-的资源使用) * [5\.5 限制进程使用资源](#55-限制进程使用资源) * [6\.参考资料](#6参考资料) ### 0. 说明 本文章转载微信公众的一篇文章。地址如下:https://mp.weixin.qq.com/s/n796FnrKsfLLxcvV4-dAlg 该笔记绝大部分来源于上诉公众号,用于自己对cgroup的理解,当做笔记记录。 ### 1. cgroup简介 `CGroup` 全称 `Control Group` 中文意思为 `控制组`,用于控制(限制)进程对系统各种资源的使用,比如 `CPU`、`内存`、`网络` 和 `磁盘I/O` 等资源的限制,著名的容器引擎 `Docker` 就是使用 `CGroup` 来对容器进行资源限制。 ### 2. 
CGroup 使用 本文主要以 `内存子系统(memory subsystem)` 作为例子来阐述 `CGroup` 的原理,所以这里先介绍怎么通过 `内存子系统` 来限制进程对内存的使用。 > `子系统` 是 `CGroup` 用于控制某种资源(如内存或者CPU等)使用的逻辑或者算法 > > 在系统的开机阶段,systemd会把支持的子系统挂载到默认的 `/sys/fs/cgroup` 目录下面。 `CGroup` 使用了 `虚拟文件系统` 来进行管理限制的资源信息和被限制的进程列表等,例如要创建一个限制内存使用的 `CGroup` 可以使用下面命令: ``` $ mount -t cgroup -o memory memory /sys/fs/cgroup/memory ``` 上面的命令用于创建内存子系统的根 `CGroup`,如果系统已经存在可以跳过。然后我们使用下面命令在这个目录下面创建一个新的目录 `test`, ``` $ mkdir /sys/fs/cgroup/memory/test ``` 这样就在内存子系统的根 `CGroup` 下创建了一个子 `CGroup`,我们可以通过 `ls` 目录来查看这个目录下有哪些文件: ``` $ ls -l /sys/fs/cgroup/memory/test cgroup.clone_childrenmemory.kmem.max_usage_in_bytesmemory.limit_in_bytesmemory.numa_statmemory.use_hierarchy cgroup.event_controlmemory.kmem.slabinfomemory.max_usage_in_bytesmemory.oom_controlnotify_on_release cgroup.procsmemory.kmem.tcp.failcntmemory.memsw.failcntmemory.pressure_leveltasks memory.failcntmemory.kmem.tcp.limit_in_bytesmemory.memsw.limit_in_bytesmemory.soft_limit_in_bytes memory.force_emptymemory.kmem.tcp.max_usage_in_bytesmemory.memsw.max_usage_in_bytesmemory.stat memory.kmem.failcntmemory.kmem.tcp.usage_in_bytesmemory.memsw.usage_in_bytesmemory.swappiness memory.kmem.limit_in_bytesmemory.kmem.usage_in_bytesmemory.move_charge_at_immigratememory.usage_in_bytes ``` 可以看到在目录下有很多文件,每个文件都是 `CGroup` 用于控制进程组的资源使用。我们可以向 `memory.limit_in_bytes` 文件写入限制进程(进程组)使用的内存大小,单位为字节(bytes)。例如可以使用以下命令写入限制使用的内存大小为 `1MB`: ``` $ echo 1048576 > /sys/fs/cgroup/memory/test/memory.limit_in_bytes ``` 然后我们可以通过以下命令把要限制的进程加入到 `CGroup` 中: ``` $ echo task_pid > /sys/fs/cgroup/memory/test/tasks ``` 上面的 `task_pid` 为进程的 `PID`,把进程PID添加到 `tasks` 文件后,进程对内存的使用就受到此 `CGroup` 的限制。 ### 3. CGroup 基本概念 在介绍 `CGroup` 原理前,先介绍一下 `CGroup` 几个相关的概念,因为要理解 `CGroup` 就必须要理解他们: - `任务(task)`。任务指的是系统的一个进程,如上面介绍的 `tasks` 文件中的进程; - `控制组(control group)`。控制组就是受相同资源限制的一组进程。`CGroup` 中的资源控制都是以控制组为单位实现。一个进程可以加入到某个控制组,也从一个进程组迁移到另一个控制组。一个进程组的进程可以使用 `CGroup` 以控制组为单位分配的资源,同时受到 `CGroup` 以控制组为单位设定的限制; - `层级(hierarchy)`。由于控制组是以目录形式存在的,所以控制组可以组织成层级的形式,即一棵控制组组成的树。控制组树上的子节点控制组是父节点控制组的孩子,继承父控制组的特定的属性; - `子系统(subsystem)`。一个子系统就是一个资源控制器,比如 `CPU子系统` 就是控制 CPU 时间分配的一个控制器。子系统必须附加(attach)到一个层级上才能起作用,一个子系统附加到某个层级以后,这个层级上的所有控制组都受到这个子系统的控制。 他们之间的关系如下图: ![image-20220226162724297](./image/cgroup-1.png) 我们可以把 `层级` 中的一个目录当成是一个 `CGroup`,那么目录里面的文件就是这个 `CGroup` 用于控制进程组使用各种资源的信息(比如 `tasks` 文件用于保存这个 `CGroup` 控制的进程组所有的进程PID,而 `memory.limit_in_bytes` 文件用于描述这个 `CGroup` 能够使用的内存字节数)。 而附加在 `层级` 上的 `子系统` 表示这个 `层级` 中的 `CGroup` 可以控制哪些资源,每当向 `层级` 附加 `子系统` 时,`层级` 中的所有 `CGroup` 都会产生很多与 `子系统` 资源控制相关的文件。 ### 4. CGroup 操作规则 使用 `CGroup` 时,必须按照 `CGroup` 一些操作规则来进行操作,否则会出错。下面介绍一下关于 `CGroup` 的一些操作规则: 1. 一个 `层级` 可以附加多个 `子系统`,如下图: ![image-20220226162836054](./image/cgroup-2.png)2. 一个已经被挂载的 `子系统` 只能被再次挂载在一个空的 `层级` 上,不能挂载到已经挂载了其他 `子系统` 的 `层级`,如下图: ![image-20220226163153346](/Users/game-netease/k8sLearnNote/learning-k8s-source-code/docker/image/cgroup-3.png) 3. 每个 `任务` 只能在同一个 `层级` 的唯一一个 `CGroup` 里,并且可以在多个不同层级的 `CGroup` 中,如下图: ![image-20220226163311087](./image/cgroup-4.png) 4. 子进程在被 `fork` 出时自动继承父进程所在 `CGroup`,但是 `fork` 之后就可以按需调整到其他 `CGroup`,如下图: ![image-20220226163414589](/Users/game-netease/k8sLearnNote/learning-k8s-source-code/docker/image/cgroup-5.png) ### 5. 
CGroup的原理实现 #### 5.1 `cgroup` 结构体 前面介绍过,`cgroup` 是用来控制进程组对各种资源的使用,而在内核中,`cgroup` 是通过 `cgroup` 结构体来描述的,我们来看看其定义: ``` struct cgroup { unsigned long flags; /* "unsigned long" so bitops work */ atomic_t count; struct list_head sibling; /* my parent's children */ struct list_head children; /* my children */ struct cgroup *parent; /* my parent */ struct dentry *dentry; /* cgroup fs entry */ struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT]; struct cgroupfs_root *root; struct cgroup *top_cgroup; struct list_head css_sets; struct list_head release_list; }; ``` 下面我们来介绍一下 `cgroup` 结构体各个字段的用途: 1. `flags`: 用于标识当前 `cgroup` 的状态。 2. `count`: 引用计数器,表示有多少个进程在使用这个 `cgroup`。 3. `sibling、children、parent`: 由于 `cgroup` 是通过 `层级` 来进行管理的,这三个字段就把同一个 `层级` 的所有 `cgroup` 连接成一棵树。`parent` 指向当前 `cgroup` 的父节点,`sibling` 连接着所有兄弟节点,而 `children` 连接着当前 `cgroup` 的所有子节点。 4. `dentry`: 由于 `cgroup` 是通过 `虚拟文件系统` 来进行管理的,在介绍 `cgroup` 使用时说过,可以把 `cgroup` 当成是 `层级` 中的一个目录,所以 `dentry` 字段就是用来描述这个目录的。 5. `subsys`: 前面说过,`子系统` 能够附加到 `层级`,而附加到 `层级` 的 `子系统` 都有其限制进程组使用资源的算法和统计数据。所以 `subsys` 字段就是提供给各个 `子系统` 存放其限制进程组使用资源的统计数据。我们可以看到 `subsys` 字段是一个数组,而数组中的每一个元素都代表了一个 `子系统` 相关的统计数据。从实现来看,`cgroup` 只是把多个进程组织成控制进程组,而真正限制资源使用的是各个 `子系统`。 6. `root`: 用于保存 `层级` 的一些数据,比如:`层级` 的根节点,附加到 `层级` 的 `子系统` 列表(因为一个 `层级` 可以附加多个 `子系统`),还有这个 `层级` 有多少个 `cgroup` 节点等。 7. `top_cgroup`: `层级` 的根节点(根cgroup)。 我们通过下面图片来描述 `层级` 中各个 `cgroup` 组成的树状关系: ![图片](./image/cgroup-6.png) `cgroup_subsys_state` 结构体 每个 `子系统` 都有属于自己的资源控制统计信息结构,而且每个 `cgroup` 都绑定一个这样的结构,这种资源控制统计信息结构就是通过 `cgroup_subsys_state` 结构体实现的,其定义如下: ``` struct cgroup_subsys_state { struct cgroup *cgroup; atomic_t refcnt; unsigned long flags; }; ``` 下面介绍一下 `cgroup_subsys_state` 结构各个字段的作用: 1. `cgroup`: 指向了这个资源控制统计信息所属的 `cgroup`。 2. `refcnt`: 引用计数器。 3. `flags`: 标志位,如果这个资源控制统计信息所属的 `cgroup` 是 `层级` 的根节点,那么就会将这个标志位设置为 `CSS_ROOT` 表示属于根节点。 从 `cgroup_subsys_state` 结构的定义看不到各个 `子系统` 相关的资源控制统计信息,这是因为 `cgroup_subsys_state` 结构并不是真实的资源控制统计信息结构,比如 `内存子系统` 真正的资源控制统计信息结构是 `mem_cgroup`,那么怎样通过这个 `cgroup_subsys_state` 结构去找到对应的 `mem_cgroup` 结构呢?我们来看看 `mem_cgroup` 结构的定义: ``` struct mem_cgroup { struct cgroup_subsys_state css; // 注意这里 struct res_counter res; struct mem_cgroup_lru_info info; int prev_priority; struct mem_cgroup_stat stat; }; ``` 从 `mem_cgroup` 结构的定义可以发现,`mem_cgroup` 结构的第一个字段就是一个 `cgroup_subsys_state` 结构。下面的图片展示了他们之间的关系: ![图片](./image/cgroup-7.png) 从上图可以看出,`mem_cgroup` 结构包含了 `cgroup_subsys_state` 结构,`内存子系统` 对外暴露出 `mem_cgroup` 结构的 `cgroup_subsys_state` 部分(即返回 `cgroup_subsys_state` 结构的指针),而其余部分由 `内存子系统` 自己维护和使用。 由于 `cgroup_subsys_state` 部分在 `mem_cgroup` 结构的首部,所以要将 `cgroup_subsys_state` 结构转换成 `mem_cgroup` 结构,只需要通过指针类型转换即可。如下代码: `cgroup` 结构与 `cgroup_subsys_state` 结构之间的关系如下图: ![图片](./image/cgroup-8.png) `css_set` 结构体 由于一个进程可以同时添加到不同的 `cgroup` 中(前提是这些 `cgroup` 属于不同的 `层级`)进行资源控制,而这些 `cgroup` 附加了不同的资源控制 `子系统`。所以需要使用一个结构把这些 `子系统` 的资源控制统计信息收集起来,方便进程通过 `子系统ID` 快速查找到对应的 `子系统` 资源控制统计信息,而 `css_set` 结构体就是用来做这件事情。`css_set` 结构体定义如下: ``` struct css_set { struct kref ref; struct list_head list; struct list_head tasks; struct list_head cg_links; struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT]; }; ``` 下面介绍一下 `css_set` 结构体各个字段的作用: 1. `ref`: 引用计数器,用于计算有多少个进程在使用此 `css_set`。 2. `list`: 用于连接所有 `css_set`。 3. `tasks`: 由于可能存在多个进程同时受到相同的 `cgroup` 控制,所以用此字段把所有使用此 `css_set` 的进程连接起来。 4. `subsys`: 用于收集各种 `子系统` 的统计信息结构。 进程描述符 `task_struct` 有两个字段与此相关,如下: ``` struct task_struct { ... struct css_set *cgroups; struct list_head cg_list; ... 
} ``` 可以看出,`task_struct` 结构的 `cgroups` 字段就是指向 `css_set` 结构的指针,而 `cg_list` 字段用于连接所有使用此 `css_set` 结构的进程列表。 `task_struct` 结构与 `css_set` 结构的关系如下图: ![图片](./image/cgroup-9.png) `cgroup_subsys` 结构 `CGroup` 通过 `cgroup_subsys` 结构操作各个 `子系统`,每个 `子系统` 都要实现一个这样的结构,其定义如下: ``` struct cgroup_subsys { struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss, struct cgroup *cgrp); void (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp); void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp); int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp, struct task_struct *tsk); void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp, struct cgroup *old_cgrp, struct task_struct *tsk); void (*fork)(struct cgroup_subsys *ss, struct task_struct *task); void (*exit)(struct cgroup_subsys *ss, struct task_struct *task); int (*populate)(struct cgroup_subsys *ss, struct cgroup *cgrp); void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp); void (*bind)(struct cgroup_subsys *ss, struct cgroup *root); int subsys_id; int active; int disabled; int early_init; const char *name; struct cgroupfs_root *root; struct list_head sibling; void *private; }; ``` `cgroup_subsys` 结构包含了很多函数指针,通过这些函数指针,`CGroup` 可以对 `子系统` 进行一些操作。比如向 `CGroup` 的 `tasks` 文件添加要控制的进程PID时,就会调用 `cgroup_subsys` 结构的 `attach()` 函数。当在 `层级` 中创建新目录时,就会调用 `create()` 函数创建一个 `子系统` 的资源控制统计信息对象 `cgroup_subsys_state`,并且调用 `populate()` 函数创建 `子系统` 相关的资源控制信息文件。 除了函数指针外,`cgroup_subsys` 结构还包含了很多字段,下面说明一下各个字段的作用: 1. `subsys_id`: 表示了子系统的ID。 2. `active`: 表示子系统是否被激活。 3. `disabled`: 子系统是否被禁止。 4. `name`: 子系统名称。 5. `root`: 被附加到的层级挂载点。 6. `sibling`: 用于连接被附加到同一个层级的所有子系统。 7. `private`: 私有数据。 `内存子系统` 定义了一个名为 `mem_cgroup_subsys` 的 `cgroup_subsys` 结构,如下: ``` struct cgroup_subsys mem_cgroup_subsys = { .name = "memory", .subsys_id = mem_cgroup_subsys_id, .create = mem_cgroup_create, .pre_destroy = mem_cgroup_pre_destroy, .destroy = mem_cgroup_destroy, .populate = mem_cgroup_populate, .attach = mem_cgroup_move_task, .early_init = 0, }; ``` 另外 Linux 内核还定义了一个 `cgroup_subsys` 结构的数组 `subsys`,用于保存所有 `子系统` 的 `cgroup_subsys` 结构,如下: ``` static struct cgroup_subsys *subsys[] = { cpuset_subsys, debug_subsys, ns_subsys, cpu_cgroup_subsys, cpuacct_subsys, mem_cgroup_subsys }; ``` #### 5.2 `CGroup` 的挂载 前面介绍了 `CGroup` 相关的几个结构体,接下来我们分析一下 `CGroup` 的实现。 要使用 `CGroup` 功能首先必须先进行挂载操作,比如使用下面命令挂载一个 `CGroup`: ``` $ mount -t cgroup -o memory memory /sys/fs/cgroup/memory ``` 在上面的命令中,`-t` 参数指定了要挂载的文件系统类型为 `cgroup`,而 `-o` 参数表示要附加到此 `层级` 的子系统,上面表示附加了 `内存子系统`,当然可以附加多个 `子系统`。而紧随 `-o` 参数后的 `memory` 指定了此 `CGroup` 的名字,最后一个参数表示要挂载的目录路径。 挂载过程最终会调用内核函数 `cgroup_get_sb()` 完成,由于 `cgroup_get_sb()` 函数比较长,所以我们只分析重要部分: ``` static int cgroup_get_sb(struct file_system_type *fs_type, int flags, const char *unused_dev_name, void *data, struct vfsmount *mnt) { ... struct cgroupfs_root *root; ... root = kzalloc(sizeof(*root), GFP_KERNEL); ... ret = rebind_subsystems(root, root->subsys_bits); ... struct cgroup *cgrp = &root->top_cgroup; cgroup_populate_dir(cgrp); ... } ``` `cgroup_get_sb()` 函数会调用 `kzalloc()` 函数创建一个 `cgroupfs_root` 结构。`cgroupfs_root` 结构主要用于描述这个挂载点的信息,其定义如下: ``` struct cgroupfs_root { struct super_block *sb; unsigned long subsys_bits; unsigned long actual_subsys_bits; struct list_head subsys_list; struct cgroup top_cgroup; int number_of_cgroups; struct list_head root_list; unsigned long flags; char release_agent_path[PATH_MAX]; }; ``` 下面介绍一下 `cgroupfs_root` 结构的各个字段含义: 1. `sb`: 挂载的文件系统超级块。 2. `subsys_bits/actual_subsys_bits`: 附加到此层级的子系统标志。 3. 
`subsys_list`: 附加到此层级的子系统(cgroup_subsys)列表。 4. `top_cgroup`: 此层级的根cgroup。 5. `number_of_cgroups`: 层级中有多少个cgroup。 6. `root_list`: 连接系统中所有的cgroupfs_root。 7. `flags`: 标志位。 其中最重要的是 `subsys_list` 和 `top_cgroup` 字段,`subsys_list` 表示了附加到此 `层级` 的所有 `子系统`,而 `top_cgroup` 表示此 `层级` 的根 `cgroup`。 接着调用 `rebind_subsystems()` 函数把挂载时指定要附加的 `子系统` 添加到 `cgroupfs_root` 结构的 `subsys_list` 链表中,并且为根 `cgroup` 的 `subsys` 字段设置各个 `子系统` 的资源控制统计信息对象,最后调用 `cgroup_populate_dir()` 函数向挂载目录创建 `cgroup` 的管理文件(如 `tasks` 文件)和各个 `子系统` 的管理文件(如 `memory.limit_in_bytes` 文件)。 #### 5.3 向 `CGroup` 添加要进行资源控制的进程 通过向 `CGroup` 的 `tasks` 文件写入要进行资源控制的进程PID,即可以对进程进行资源控制。例如下面命令: ``` $ echo 123012 > /sys/fs/cgroup/memory/test/tasks ``` 向 `tasks` 文件写入进程PID是通过 `attach_task_by_pid()` 函数实现的,代码如下: ``` static int attach_task_by_pid(struct cgroup *cgrp, char *pidbuf) { pid_t pid; struct task_struct *tsk; int ret; if (sscanf(pidbuf, "%d", &pid) != 1) // 读取进程pid return -EIO; if (pid) { // 如果有指定进程pid ... tsk = find_task_by_vpid(pid); // 通过pid查找对应进程的进程描述符 if (!tsk || tsk->flags & PF_EXITING) { rcu_read_unlock(); return -ESRCH; } ... } else { tsk = current; // 如果没有指定进程pid, 就使用当前进程 ... } ret = cgroup_attach_task(cgrp, tsk); // 调用 cgroup_attach_task() 把进程添加到cgroup中 ... return ret; } ``` `attach_task_by_pid()` 函数首先会判断是否指定了进程pid,如果指定了就通过进程pid查找到进程描述符,如果没指定就使用当前进程,然后通过调用 `cgroup_attach_task()` 函数把进程添加到 `cgroup` 中。 我们接着看看 `cgroup_attach_task()` 函数的实现: ``` int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk) { int retval = 0; struct cgroup_subsys *ss; struct cgroup *oldcgrp; struct css_set *cg = tsk->cgroups; struct css_set *newcg; struct cgroupfs_root *root = cgrp->root; ... newcg = find_css_set(cg, cgrp); // 根据新的cgroup查找css_set对象 ... rcu_assign_pointer(tsk->cgroups, newcg); // 把进程的cgroups字段设置为新的css_set对象 ... // 把进程添加到css_set对象的tasks列表中 write_lock(&css_set_lock); if (!list_empty(&tsk->cg_list)) { list_del(&tsk->cg_list); list_add(&tsk->cg_list, &newcg->tasks); } write_unlock(&css_set_lock); // 调用各个子系统的attach函数 for_each_subsys(root, ss) { if (ss->attach) ss->attach(ss, cgrp, oldcgrp, tsk); } ... 
return 0; } ``` `cgroup_attach_task()` 函数首先会调用 `find_css_set()` 函数查找或者创建一个 `css_set` 对象。前面说过 `css_set` 对象用于收集不同 `cgroup` 上附加的 `子系统` 资源统计信息对象。 因为一个进程能够被加入到不同的 `cgroup` 进行资源控制,所以 `find_css_set()` 函数就是收集进程所在的所有 `cgroup` 上附加的 `子系统` 资源统计信息对象,并返回一个 `css_set` 对象。接着把进程描述符的 `cgroups` 字段设置为这个 `css_set` 对象,并且把进程添加到这个 `css_set` 对象的 `tasks` 链表中。 最后,`cgroup_attach_task()` 函数会调用附加在 `层级` 上的所有 `子系统` 的 `attach()` 函数对新增进程进行一些其他的操作(这些操作由各自 `子系统` 去实现)。 #### 5.4 限制 `CGroup` 的资源使用 本文主要是使用 `内存子系统` 作为例子,所以这里分析内存限制的原理。 可以向 `cgroup` 的 `memory.limit_in_bytes` 文件写入要限制使用的内存大小(单位为字节),如下面命令限制了这个 `cgroup` 只能使用 1MB 的内存: ``` $ echo 1048576 > /sys/fs/cgroup/memory/test/memory.limit_in_bytes ``` 向 `memory.limit_in_bytes` 写入数据主要通过 `mem_cgroup_write()` 函数实现的,其实现如下: ``` static ssize_t mem_cgroup_write(struct cgroup *cont, struct cftype *cft, struct file *file, const char __user *userbuf, size_t nbytes, loff_t *ppos) { return res_counter_write(&mem_cgroup_from_cont(cont)->res, cft->private, userbuf, nbytes, ppos, mem_cgroup_write_strategy); } ``` 其主要工作就是把 `内存子系统` 的资源控制对象 `mem_cgroup` 的 `res.limit` 字段设置为指定的数值。 #### 5.5 限制进程使用资源 当设置好 `cgroup` 的资源使用限制信息,并且把进程添加到这个 `cgroup` 的 `tasks` 列表后,进程的资源使用就会受到这个 `cgroup` 的限制。这里使用 `内存子系统` 作为例子,来分析一下内核是怎么通过 `cgroup` 来限制进程对资源的使用的。 当进程要使用内存时,会调用 `do_anonymous_page()` 来申请一些内存页,而 `do_anonymous_page()` 函数会调用 `mem_cgroup_charge()` 函数来检测进程是否超过了 `cgroup` 设置的资源限制。而 `mem_cgroup_charge()` 最终会调用 `mem_cgroup_charge_common()` 函数进行检测,`mem_cgroup_charge_common()` 函数实现如下: ``` static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm, gfp_t gfp_mask, enum charge_type ctype) { struct mem_cgroup *mem; ... mem = rcu_dereference(mm->mem_cgroup); // 获取进程对应的内存限制对象 ... while (res_counter_charge(&mem->res, PAGE_SIZE)) { // 判断进程使用内存是否超出限制 if (!(gfp_mask & __GFP_WAIT)) goto out; if (try_to_free_mem_cgroup_pages(mem, gfp_mask)) // 如果超出限制, 就释放一些不用的内存 continue; if (res_counter_check_under_limit(&mem->res)) continue; if (!nr_retries--) { mem_cgroup_out_of_memory(mem, gfp_mask); // 如果尝试过5次后还是超出限制, 那么发出oom信号 goto out; } ... } ... } ``` `mem_cgroup_charge_common()` 函数会对进程内存使用情况进行检测,如果进程已经超过了 `cgroup` 设置的限制,那么就会尝试进行释放一些不用的内存,如果还是超过限制,那么就会发出 `OOM (out of memory)` 的信号。 ### 6.参考资料 [容器三把斧之 | cgroup原理与实现](https://mp.weixin.qq.com/s/n796FnrKsfLLxcvV4-dAlg) [CGroup 介绍](https://mp.weixin.qq.com/s/66MKhzWTVCZ_nJ07fPrVIw) ================================================ FILE: docker/3. chroot 命令详解.md ================================================ * [1\. chroot命令介绍](#1-chroot命令介绍) * [2\. chroot实践](#2-chroot实践) * [2\.1 执行bash, ls命令](#21-执行bash-ls命令) * [2\.2 执行ps命令](#22-执行ps命令) * [2\.3 如何实现容器内pid 隔离](#23-如何实现容器内pid-隔离) * [1\. 在容器外面证明可以做到](#1-在容器外面证明可以做到) * [2\. 先取消之前的proc挂载](#2-先取消之前的proc挂载) * [3\. 提取docker镜像中的rootfs文件](#3-提取docker镜像中的rootfs文件) * [4\. 参考文档](#4-参考文档) ### 1. chroot命令介绍 把根目录换成指定的目的目录 **chroot命令** 用来在指定的根目录下运行指令。chroot,即 change root directory (更改 root 目录)。在 linux 系统中,系统默认的目录结构都是以`/`,即是以根 (root) 开始的。而在使用 chroot 之后,系统的目录结构将以指定的位置作为`/`位置。 在经过 chroot 命令之后,系统读取到的目录和文件将不在是旧系统根下的而是新根下(即被指定的新的位置)的目录结构和文件,因此它带来的好处大致有以下3个: **增加了系统的安全性,限制了用户的权力:** 在经过 chroot 之后,在新根下将访问不到旧系统的根目录结构和文件,这样就增强了系统的安全性。这个一般是在登录 (login) 前使用 chroot,以此达到用户不能访问一些特定的文件。 **建立一个与原系统隔离的系统目录结构,方便用户的开发:** 使用 chroot 后,系统读取的是新根下的目录和文件,这是一个与原系统根下文件不相关的目录结构。在这个新的环境中,可以用来测试软件的静态编译以及一些与系统不相关的独立开发。 **切换系统的根目录位置,引导 Linux 系统启动以及急救系统等:** chroot 的作用就是切换系统的根位置,而这个作用最为明显的是在系统初始引导磁盘的处理过程中使用,从初始 RAM 磁盘 (initrd) 切换系统的根位置并执行真正的 init。另外,当系统出现一些问题时,我们也可以使用 chroot 来切换到一个临时的系统。
### 2. chroot实践

直接对一个空目录执行 chroot 是不行的,因为新根下还没有 /bin/bash 可执行文件,所以需要先构建好 test 目录:

```
root@k8s-master:~# chroot test
chroot: failed to run command ‘/bin/bash’: No such file or directory
```
#### 2.1 执行bash, ls命令 ``` root@k8s-master:~/test# tree . ├── bin │ ├── bash // bin目录下要有bash可执行文件 │ └── ls ├── lib │ ├── libc.so.6 //还要有ddl │ ├── libdl.so.2 │ └── libtinfo.so.6 └── lib64 └── ld-linux-x86-64.so.2 // 还不能执行ls,因为没有ls对应的ddl root@k8s-master:~/test/bin# chroot /root/test ls ls: error while loading shared libraries: libselinux.so.1: cannot open shared object file: No such file or directory // 通过ldd 查看ls依赖哪些动态链接库,然后拷贝到lib目录 root@k8s-master:~/test/bin# ldd ls linux-vdso.so.1 (0x00007ffff6bb8000) libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007f9580683000) libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f95804c2000) libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3 (0x00007f958044e000) libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f9580449000) /lib64/ld-linux-x86-64.so.2 (0x00007f95808d6000) libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f9580428000) root@k8s-master:~/test/bin# root@k8s-master:~/test/bin# root@k8s-master:~/test/bin# cp /lib/x86_64-linux-gnu/libselinux.so.1 /root/test/lib 有了这些,就可以chroot,执行bash, ls了 root@k8s-master:~/test# pwd /root/test root@k8s-master:~/test# tree . ├── bin │ ├── bash │ └── ls ├── lib │ ├── libc.so.6 │ ├── libdl.so.2 │ ├── libpcre.so.3 │ ├── libpthread.so.0 │ ├── libselinux.so.1 │ └── libtinfo.so.6 └── lib64 └── ld-linux-x86-64.so.2 3 directories, 9 files ```
``` 成功chroot了,并且可以执行ls root@k8s-master:~/test# chroot /root/test bash-5.0# ls bin lib lib64 ``` #### 2.2 执行ps命令 ps命令有点特殊,除了需要拷贝ddl文件之外,还需要mount ``` root@k8s-master:~/test# chroot . ps Error, do this: mount -t proc proc /proc // 只能这样用, 其实最正确的做法应该是 mount -t proc proc /root/test/proc root@k8s-master:~/test# mount -t proc proc proc root@k8s-master:~/test# root@k8s-master:~/test# pwd /root/test root@k8s-master:~/test# ls bin lib lib64 proc // 可以看到,这里ps是看到了所有的 进程 root@k8s-master:~# chroot test bash bash-5.0# l ps bash-5.0# ps PID TTY TIME CMD 20877 ? 00:00:00 bash 20929 ? 00:00:00 ps 32545 ? 00:00:00 bash bash: history: /root/.bash_history: cannot create: No such file or directory bash-5.0# ps -ef UID PID PPID C STIME TTY TIME CMD 0 1 0 0 Oct23 ? 00:07:35 /sbin/init nopti nospectre_v2 nospec_store_bypass_disable 0 2 0 0 Oct23 ? 00:00:00 [kthreadd] 0 3 2 0 Oct23 ? 00:00:00 [rcu_gp] 0 4 2 0 Oct23 ? 00:00:00 [rcu_par_gp] 0 6 2 0 Oct23 ? 00:00:00 [kworker/0:0H-kblockd] 0 8 2 0 Oct23 ? 00:00:00 [mm_percpu_wq] 0 9 2 0 Oct23 ? 00:03:21 [ksoftirqd/0] 0 10 2 0 Oct23 ? 00:25:33 [rcu_sched] 。。。。。 bash-5.0# cd proc bash-5.0# ls 1 16 192 212 24 279 381 666 cmdline kmsg swaps 10 17 193 213 240 28 3856 669 consoles kpagecgroup sys 10696 170 194 214 241 281 3873 670 cpuinfo kpagecount sysrq-trigger 10738 171 195 215 242 28614 3928 671 crypto kpageflags sysvipc 11 172 196 216 243 29 3937 685 devices loadavg thread-self 11292 173 19646 21635 244 3 4 688 diskstats locks timer_list 11310 174 19654 217 245 30 455 692 dma meminfo tty 115 175 197 22 246 31 4556 693 driver misc uptime 116 176 198 224 247 32 4574 701 execdomains modules version 11681 177 2 225 248 32521 4621 714 fb mounts vmallocinfo 11700 178 20 226 249 32529 4629 718 filesystems mtrr vmstat 118 179 200 227 25 32530 492 732 fs net zoneinfo 119 180 206 228 250 32545 5271 8 interrupts pagetypeinfo 12 181 207 229 251 32560 54 8371 iomem partitions 122 187 208 230 26 33 5447 9 ioports sched_debug 14 188 20877 231 27 337 55 9586 irq schedstat 1485 189 209 232 27530 34 555 acpi kallsyms self 15 19 21 233 276 35 6 buddyinfo kcore slabinfo 1505 190 210 234 278 36 6134 bus key-users softirqs 15434 191 211 235 27808 3728 65 cgroups keys stat ``` ### 2.3 如何实现容器内pid 隔离 ##### 1. 在容器外面证明可以做到 ``` root@k8s-master:~# unshare --fork --pid --mount-proc /bin/bash root@k8s-master:~# root@k8s-master:~# ps -ef UID PID PPID C STIME TTY TIME CMD root 1 0 1 19:25 pts/0 00:00:00 /bin/bash root 11 1 0 19:25 pts/0 00:00:00 ps -ef root@k8s-master:~# ```
##### 2. 先取消之前的proc挂载 ``` root@k8s-master:~/test# cd proc/ root@k8s-master:~/test/proc# ls 1 16 190 211 235 27530 34 555 acpi kallsyms self 10 16679 191 212 23982 276 35 6 buddyinfo kcore slabinfo 10696 16776 192 213 23983 278 36 65 bus keys softirqs 10738 17 193 214 24 27808 3728 666 cgroups key-users stat 11 170 194 215 240 279 381 669 cmdline kmsg swaps 11292 171 195 216 241 28 3856 670 consoles kpagecgroup sys 11310 172 196 21635 242 281 3873 671 cpuinfo kpagecount sysrq-trigger 115 173 19646 217 243 28614 3928 685 crypto kpageflags sysvipc 116 174 19654 22 244 29 3937 688 devices loadavg thread-self 11681 175 197 224 245 3 4 692 diskstats locks timer_list 11700 176 198 225 246 30 455 693 dma meminfo tty 118 177 2 226 24640 31 4556 701 driver misc uptime 119 178 20 227 247 32 4574 714 execdomains modules version 12 179 200 228 248 32521 4621 718 fb mounts vmallocinfo 122 180 206 229 249 32529 4629 732 filesystems mtrr vmstat 14 181 207 230 25 32530 492 8 fs net zoneinfo 1485 187 208 231 250 32545 5271 9 interrupts pagetypeinfo 15 188 209 232 251 32560 54 9362 iomem partitions 1505 189 21 233 26 33 5447 9586 ioports sched_debug 15434 19 210 234 27 337 55 9647 irq schedstat root@k8s-master:~/test/proc# root@k8s-master:~/test/proc# root@k8s-master:~/test/proc# cd .. root@k8s-master:~/test# ls bin lib lib64 proc root@k8s-master:~/test# umount /root/test/proc/ root@k8s-master:~/test# root@k8s-master:~/test# ls bin lib lib64 proc root@k8s-master:~/test# ls proc/ ```
``` // 先通过unshare 隔离出来pid,就是这个/bin/bash 就是新的shell进程 root@k8s-master:~# unshare --fork --pid --mount-proc /bin/bash // 这个时候文件目录还是系统 root@k8s-master:~# ls apiserver-to-kubelet-rbac.yaml c.txt kubernetes-server-linux-amd64.tar.gz test1 a.sh cup pod.yaml test.sh a.txt kubectl pod.yaml-1 testYaml b.txt kube-flannel.yml svc TLS cni-plugins-linux-amd64-v0.8.6.tgz kubernetes test root@k8s-master:~# root@k8s-master:~# ls test/proc/ root@k8s-master:~# root@k8s-master:~# mount -t proc proc /root/test/proc // 修改root root@k8s-master:~# chroot test bash-5.0# l ps PID TTY TIME CMD 1 ? 00:00:00 bash 21 ? 00:00:00 bash 23 ? 00:00:00 ps bash: history: /root/.bash_history: cannot create: No such file or directory // 进程已经改变了,只能看到自己的进程 bash-5.0# ps -ef UID PID PPID C STIME TTY TIME CMD 0 1 0 0 11:36 ? 00:00:00 /bin/bash 0 21 1 0 11:38 ? 00:00:00 /bin/bash -i 0 24 21 0 11:38 ? 00:00:00 ps -ef ```
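上面这套 unshare + chroot + mount proc 的组合,也可以在一个程序里一次完成。下面是一个最小的 Go 草图(假设 /root/test 就是前面构建好的、带有 bash 和空 proc 目录的 rootfs,需要 root 权限运行),思路和后面《如何用golang 实现一个 busybox的容器》一节是一样的:

```
package main

// 草图:父进程以新的 PID + Mount namespace 重新执行自身("child" 参数),
// 子进程中把挂载传播设为 private、chroot 到 /root/test 并挂载 proc,
// 效果等价于 unshare --fork --pid --mount-proc + chroot + mount -t proc。

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	if len(os.Args) > 1 && os.Args[1] == "child" {
		child()
		return
	}

	cmd := exec.Command("/proc/self/exe", "child")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	// 新建 PID 和 Mount namespace,对应 unshare --fork --pid --mount-proc
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	if err := cmd.Run(); err != nil {
		log.Fatalf("run child: %v", err)
	}
}

func child() {
	const rootfs = "/root/test" // 假设是前面手工构建好的目录

	// 把挂载传播设为 private,避免 proc 挂载影响宿主机视图
	if err := syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""); err != nil {
		log.Fatalf("make / private: %v", err)
	}
	if err := syscall.Chroot(rootfs); err != nil {
		log.Fatalf("chroot: %v", err)
	}
	if err := os.Chdir("/"); err != nil {
		log.Fatalf("chdir: %v", err)
	}
	// 在新的 mount namespace 内挂载 proc,这样 ps 只能看到本 namespace 的进程
	if err := syscall.Mount("proc", "/proc", "proc", 0, ""); err != nil {
		log.Fatalf("mount proc: %v", err)
	}
	if err := syscall.Exec("/bin/bash", []string{"bash"}, os.Environ()); err != nil {
		log.Fatalf("exec bash: %v", err)
	}
}
```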
**如何查看默认的shell** ``` root# echo ${SHELL} /bin/bash ```
### 3. 提取docker镜像中的rootfs文件

参考: https://www.cnblogs.com/sparkdev/p/8556075.html

下面以通过 chroot 运行 busybox 为例。busybox 包含了丰富的工具,我们可以把这些工具放置在一个目录下,然后通过 chroot 构造出一个 mini 系统。简单起见,我们直接使用 docker 的 busybox 镜像打包的文件系统。先在当前目录下创建一个目录 rootfs:
``` root# mkdir rootfs // 提取busybox镜像的rootfs到当前目录 root# (docker export $(docker create busybox) | tar -C rootfs -xvf -) .dockerenv bin/ bin/[ bin/[[ bin/acpid bin/add-shell bin/addgroup bin/adduser bin/adjtimex bin/ar bin/arch bin/arp bin/arping bin/ash bin/awk bin/base32 bin/base64 bin/basename bin/bc bin/beep bin/blkdiscard bin/blkid bin/blockdev bin/bootchartd bin/brctl bin/bunzip2 bin/busybox bin/bzcat bin/bzip2 bin/cal bin/cat bin/chat bin/chattr bin/chgrp bin/chmod bin/chown bin/chpasswd bin/chpst bin/chroot bin/chrt bin/chvt bin/cksum bin/clear bin/cmp bin/comm bin/conspy bin/cp bin/cpio bin/crond bin/crontab bin/cryptpw bin/cttyhack bin/cut bin/date bin/dc bin/dd bin/deallocvt bin/delgroup bin/deluser bin/depmod bin/devmem bin/df bin/dhcprelay bin/diff bin/dirname bin/dmesg bin/dnsd bin/dnsdomainname bin/dos2unix bin/dpkg bin/dpkg-deb bin/du bin/dumpkmap bin/dumpleases bin/echo bin/ed bin/egrep bin/eject bin/env bin/envdir bin/envuidgid bin/ether-wake bin/expand bin/expr bin/factor bin/fakeidentd bin/fallocate bin/false bin/fatattr bin/fbset bin/fbsplash bin/fdflush bin/fdformat bin/fdisk bin/fgconsole bin/fgrep bin/find bin/findfs bin/flock bin/fold bin/free bin/freeramdisk bin/fsck bin/fsck.minix bin/fsfreeze bin/fstrim bin/fsync bin/ftpd bin/ftpget bin/ftpput bin/fuser bin/getconf bin/getopt bin/getty bin/grep bin/groups bin/gunzip bin/gzip bin/halt bin/hd bin/hdparm bin/head bin/hexdump bin/hexedit bin/hostid bin/hostname bin/httpd bin/hush bin/hwclock bin/i2cdetect bin/i2cdump bin/i2cget bin/i2cset bin/i2ctransfer bin/id bin/ifconfig bin/ifdown bin/ifenslave bin/ifplugd bin/ifup bin/inetd bin/init bin/insmod bin/install bin/ionice bin/iostat bin/ip bin/ipaddr bin/ipcalc bin/ipcrm bin/ipcs bin/iplink bin/ipneigh bin/iproute bin/iprule bin/iptunnel bin/kbd_mode bin/kill bin/killall bin/killall5 bin/klogd bin/last bin/less bin/link bin/linux32 bin/linux64 bin/linuxrc bin/ln bin/loadfont bin/loadkmap bin/logger bin/login bin/logname bin/logread bin/losetup bin/lpd bin/lpq bin/lpr bin/ls bin/lsattr bin/lsmod bin/lsof bin/lspci bin/lsscsi bin/lsusb bin/lzcat bin/lzma bin/lzop bin/makedevs bin/makemime bin/man bin/md5sum bin/mdev bin/mesg bin/microcom bin/mim bin/mkdir bin/mkdosfs bin/mke2fs bin/mkfifo bin/mkfs.ext2 bin/mkfs.minix bin/mkfs.vfat bin/mknod bin/mkpasswd bin/mkswap bin/mktemp bin/modinfo bin/modprobe bin/more bin/mount bin/mountpoint bin/mpstat bin/mt bin/mv bin/nameif bin/nanddump bin/nandwrite bin/nbd-client bin/nc bin/netstat bin/nice bin/nl bin/nmeter bin/nohup bin/nologin bin/nproc bin/nsenter bin/nslookup bin/ntpd bin/nuke bin/od bin/openvt bin/partprobe bin/passwd bin/paste bin/patch bin/pgrep bin/pidof bin/ping bin/ping6 bin/pipe_progress bin/pivot_root bin/pkill bin/pmap bin/popmaildir bin/poweroff bin/powertop bin/printenv bin/printf bin/ps bin/pscan bin/pstree bin/pwd bin/pwdx bin/raidautorun bin/rdate bin/rdev bin/readahead bin/readlink bin/readprofile bin/realpath bin/reboot bin/reformime bin/remove-shell bin/renice bin/reset bin/resize bin/resume bin/rev bin/rm bin/rmdir bin/rmmod bin/route bin/rpm bin/rpm2cpio bin/rtcwake bin/run-init bin/run-parts bin/runlevel bin/runsv bin/runsvdir bin/rx bin/script bin/scriptreplay bin/sed bin/sendmail bin/seq bin/setarch bin/setconsole bin/setfattr bin/setfont bin/setkeycodes bin/setlogcons bin/setpriv bin/setserial bin/setsid bin/setuidgid bin/sh bin/sha1sum bin/sha256sum bin/sha3sum bin/sha512sum bin/showkey bin/shred bin/shuf bin/slattach bin/sleep bin/smemcap bin/softlimit bin/sort bin/split bin/ssl_client 
bin/start-stop-daemon bin/stat bin/strings bin/stty bin/su bin/sulogin bin/sum bin/sv bin/svc bin/svlogd bin/svok bin/swapoff bin/swapon bin/switch_root bin/sync bin/sysctl bin/syslogd bin/tac bin/tail bin/tar bin/taskset bin/tc bin/tcpsvd bin/tee bin/telnet bin/telnetd bin/test bin/tftp bin/tftpd bin/time bin/timeout bin/top bin/touch bin/tr bin/traceroute bin/traceroute6 bin/true bin/truncate bin/ts bin/tty bin/ttysize bin/tunctl bin/ubiattach bin/ubidetach bin/ubimkvol bin/ubirename bin/ubirmvol bin/ubirsvol bin/ubiupdatevol bin/udhcpc bin/udhcpc6 bin/udhcpd bin/udpsvd bin/uevent bin/umount bin/uname bin/unexpand bin/uniq bin/unix2dos bin/unlink bin/unlzma bin/unshare bin/unxz bin/unzip bin/uptime bin/users bin/usleep bin/uudecode bin/uuencode bin/vconfig bin/vi bin/vlock bin/volname bin/w bin/wall bin/watch bin/watchdog bin/wc bin/wget bin/which bin/who bin/whoami bin/whois bin/xargs bin/xxd bin/xz bin/xzcat bin/yes bin/zcat bin/zcip dev/ dev/console dev/pts/ dev/shm/ etc/ etc/group etc/hostname etc/hosts etc/localtime etc/mtab etc/network/ etc/network/if-down.d/ etc/network/if-post-down.d/ etc/network/if-pre-up.d/ etc/network/if-up.d/ etc/passwd etc/resolv.conf etc/shadow home/ proc/ root/ sys/ tmp/ usr/ usr/sbin/ var/ var/spool/ var/spool/mail/ var/www/ root# ls rootfs bin dev etc home proc root sys tmp usr var // proc是空的 root/rootfs# cd proc/ root /rootfs/proc# ls root /rootfs/proc# 没有任何进程() root # chroot rootfs /bin/ps PID USER TIME COMMAND root # chroot rootfs /bin/sh / # ps -ef PID USER TIME COMMAND / # / # ps ajxf PID USER TIME COMMAND / # / # ``` ### 4. 参考文档 [chroot介绍和使用](https://wangchujiang.com/linux-command/c/chroot.html) [浅析Linux中的.a、.so、和.o文件](https://oldpan.me/archives/linux-a-so-o-tell) 用linux命令实现容器: https://juejin.cn/post/6951639064843911175 unshare详解: unshare 就是使用与父进程不共享的命名空间运行 子进程 https://juejin.cn/post/6987564689606180900 ================================================ FILE: docker/4. 如何用golang 实现一个 busybox的容器.md ================================================ * [1\. 背景](#1-背景) * [2\. 如何运行](#2-如何运行) * [3\. 参考](#3-参考) ### 1. 背景 在入手docker源码之前,这里先用一个例子先理解一下,上面提到的Linux原理。 主要参考这个repo:https://github.com/jiajunhuang/cup/blob/master/README.md 原repo中需要准备工作为: (1)创建rootfs,并且自己下载 busybox 二进制文件 但是我按照要求,下载好这个二进制文件,放入rootfs/bin 目录后一直报错: ``` root /data/golang/src/cup/cup# ./cup \ > 2021/12/05 15:21:44 main start... 2021/12/05 15:21:44 path is : 2021/12/05 15:21:44 childProcess start...uid: 0, gid: 0 2021/12/05 15:21:44 child: hostname: kmaster 2021/12/05 15:21:44 child: hostname: cup-host 2021/12/05 15:21:44 failed to run command: fork/exec /bin/busybox: no such file or directory panic: failed to run command: fork/exec /bin/busybox: no such file or directory ``` 因此为了更好的应用,和理解原理,这里做了一些修改。主要是修改了rootfs。rootfs的内容直接从busybox提取出来。 ``` root@zoux:/home/zoux/data/golang/src/cup/cup# (docker export $(docker create busybox) | tar -C rootfs -xvf -) .dockerenv bin/ bin/[ ... ``` 最终的目录结构: ``` root /data/golang/src/cup/cup# tree -L 1 . ├── cup ├── LICENSE ├── main.go ├── Makefile ├── README.md └── rootfs 1 directory, 5 files ```
### 2. 如何运行 (1) make 生成二进制文件 cup (2) ./cup 即可 ``` root /data/golang/src/cup/cup# ./cup 2021/12/05 18:28:16 main start... 2021/12/05 18:28:16 childProcess start...uid: 0, gid: 0 2021/12/05 18:28:16 child: hostname: kmaster 2021/12/05 18:28:16 child: hostname: cup-host / # ps ajxf PID USER TIME COMMAND 1 root 0:00 {exe} childProcess 6 root 0:00 /bin/busybox sh 7 root 0:00 ps ajxf / # ls bin dev etc home proc root sys tmp usr var ``` ### 3. 参考 [Linux Namespace 技术与 Docker 原理浅析](https://www.cnblogs.com/dream397/p/13999018.html) ================================================ FILE: docker/5. docker-overlay技术.md ================================================ * [0 背景](#0-背景) * [1 overlay介绍](#1-overlay介绍) * [2\. 实验\-通过实验来理解](#2-实验-通过实验来理解) * [2\.1 实验设置](#21-实验设置) * [2\.2 补充实验](#22-补充实验) * [2\.2 结论](#22-结论) * [2\.2\.1 workdir作用是什么](#221-workdir作用是什么) * [2\.2\.2 文件覆盖规则](#222-文件覆盖规则) * [3 源码分析\-通过原理来理解](#3-源码分析-通过原理来理解) * [4 总结](#4-总结) ### 0 背景 cgroup, namespaces, chroot都是Linux 已有功能。这些计算是可以做到了隔离。但是docker在这些基层上来,加上了联合文件系统,这个是docker image的基础,使得镜像可以分层继承。overlay是docker联合文件系统的一种。本节就是对overlay的基础知识进行整理总结。 ### 1 overlay介绍 ![image-20220226173105551](./image/overlay-1.png) `OverlayFS` 文件系统主要有三个角色,`lowerdir`、`upperdir` 和 `merged`。`lowerdir` 是只读层,用户不能修改这个层的文件;`upperdir` 是可读写层,用户能够修改这个层的文件;而 `merged` 是合并层,把 `lowerdir` 层和 `upperdir` 层的文件合并展示。
使用 `OverlayFS` 前需要进行挂载操作,挂载 `OverlayFS` 文件系统的基本命令如下: ``` $ mount -t overlay overlay -o lowerdir=lower1:lower2,upperdir=upper,workdir=work merged ``` 参数 `-t` 表示挂载的文件系统类型,这里设置为 `overlay` 表示文件系统类型为 `OverlayFS`,而参数 `-o` 指定的是 `lowerdir`、`upperdir` 和 `workdir`,最后的 `merged` 目录就是最终的挂载点目录。下面说明一下 `-o` 参数几个目录的作用: 1. `lowerdir`:指定用户需要挂载的lower层目录,指定多个目录可以使用 `:` 来分隔(最大支持500层)。 2. `upperdir`:指定用户需要挂载的upper层目录。 3. `workdir`:指定文件系统的工作基础目录,挂载后内容会被清空,且在使用过程中其内容用户不可见。 ### 2. 实验-通过实验来理解 #### 2.1 实验设置 ``` root@k8s-master:~/testOverlay# mkdir -p fileRoot A B C worker root@k8s-master:~/testOverlay# echo "from A" > A/a.txt root@k8s-master:~/testOverlay# echo "from B" > B/b.txt root@k8s-master:~/testOverlay# echo "from C" > C/c.txt root@k8s-master:~/testOverlay# mkdir -p A/aa root@k8s-master:~/testOverlay# tree . ├── A │ ├── aa │ └── a.txt ├── B │ └── b.txt ├── C │ └── c.txt ├── fileRoot └── worker ```
指定 A,B是 底层文件; C是 上层文件。 worker为工作目录。 入口函数为 fileRoot ``` mount -t overlay overlay -o lowerdir=A:B,upperdir=C,workdir=worker fileRoot ``` 查看fileRoot结果: ``` root@k8s-master:~/testOverlay# mount -t overlay overlay -o lowerdir=A:B,upperdir=C,workdir=worker fileRoot root@k8s-master:~/testOverlay# root@k8s-master:~/testOverlay# ls fileRoot/ aa a.txt b.txt c.txt // 1.对worker目录进行实验。 结果:worker目录可以写入,但是不会影响fileRoot文件 root@k8s-master:~/testOverlay# echo "from worker" > worker/work.txt root@k8s-master:~/testOverlay# root@k8s-master:~/testOverlay# ls fileRoot/ aa a.txt b.txt c.txt root@k8s-master:~/testOverlay# root@k8s-master:~/testOverlay# ls worker/ work work.txt // 2.文件覆盖规则实验; lowerdir可以手动修改 root@k8s-master:~/testOverlay# echo "from A1" > A/a.txt root@k8s-master:~/testOverlay# cat fileRoot/a.txt from A1 root@k8s-master:~/testOverlay# ls worker/ work work.txt root@k8s-master:~/testOverlay# ls worker/work // 3. 覆盖顺序测试: upperdir优先级最高,lowerdir按照mount时从左到右的顺序,权重依次降低,左边的覆盖右边的同名文件或者文件夹。 root@k8s-master:~/testOverlay# echo "from B" > B/a.txt root@k8s-master:~/testOverlay# root@k8s-master:~/testOverlay# ls fileRoot/ aa a.txt b.txt c.txt root@k8s-master:~/testOverlay# cat fileRoot/a.txt from A1 root@k8s-master:~/testOverlay# cat A/a.txt from A1 root@k8s-master:~/testOverlay# cat B/a.txt from B oot@k8s-master:~/testOverlay# echo "from A" > A/b.txt root@k8s-master:~/testOverlay# cat A/b.txt from A root@k8s-master:~/testOverlay# cat fileRoot/b.txt from A root@k8s-master:~/testOverlay# cat B/b.txt from B root@k8s-master:~/testOverlay# root@k8s-master:~/testOverlay# cat C/c.txt from C root@k8s-master:~/testOverlay# cat fileRoot/c.txt from C root@k8s-master:~/testOverlay# echo "from A" > A/c.txt root@k8s-master:~/testOverlay# root@k8s-master:~/testOverlay# cat A/c.txt from A root@k8s-master:~/testOverlay# cat C/c.txt from C root@k8s-master:~/testOverlay# cat fileRoot/c.txt from C // 目录中的文件也是一样,存在同名的时,以左边的A为准 root@k8s-master:~/testOverlay# mkdir B/aa root@k8s-master:~/testOverlay# echo "from bb" > B/aa/a.txt root@k8s-master:~/testOverlay# root@k8s-master:~/testOverlay# cat fileRoot/aa/a.txt from aa root@k8s-master:~/testOverlay# echo "from bb" > B/aa/b.txt // 为什么aa目录下没有b.txt root@k8s-master:~/testOverlay# ls fileRoot/aa a.txt root@k8s-master:~/testOverlay# ls fileRoot/aa a.txt root@k8s-master:~/testOverlay# ls B/aa/b.txt B/aa/b.txt root@k8s-master:~/testOverlay# ls A/aa/b.txt ls: cannot access 'A/aa/b.txt': No such file or directory root@k8s-master:~/testOverlay# ls fileRoot/aa a.txt root@k8s-master:~/testOverlay# root@k8s-master:~/testOverlay# ls fileRoot/ aa a.txt b.txt c.txt root@k8s-master:~/testOverlay# root@k8s-master:~/testOverlay# cd fileRoot/ root@k8s-master:~/testOverlay/fileRoot# ls aa a.txt b.txt c.txt root@k8s-master:~/testOverlay/fileRoot# cd aa/ root@k8s-master:~/testOverlay/fileRoot/aa# ls a.txt // 破案了,因为 A//aa 目录的优先级 比 B/aa高,所以fileRoot/aa = A/aa root@k8s-master:~/testOverlay# echo "from aa" > A/aa/e.txt root@k8s-master:~/testOverlay# root@k8s-master:~/testOverlay# ls fileRoot/a aa/ a.txt root@k8s-master:~/testOverlay# ls fileRoot/a aa/ a.txt root@k8s-master:~/testOverlay# ls fileRoot/aa/ a.txt e.txt root@k8s-master:~/testOverlay# root@k8s-master:~/testOverlay# echo "from bb" > B/aa/f.txt root@k8s-master:~/testOverlay# ls fileRoot/aa/ a.txt e.txt root@k8s-master:~/testOverlay# // 这个就有,所以 root@k8s-master:~/testOverlay# echo "from bb" > B/f.txt root@k8s-master:~/testOverlay# ls fileRoot/ aa a.txt b.txt c.txt f.txt root@k8s-master:~/testOverlay# cat fileRoot/f.txt from bb // 为啥这个f.txt 不是from aa ???, 
看起来又不是A为主??
root@k8s-master:~/testOverlay# echo "from bb" > B/f.txt
root@k8s-master:~/testOverlay# ls fileRoot/
aa  a.txt  b.txt  c.txt  f.txt
root@k8s-master:~/testOverlay# cat fileRoot/f.txt
from bb
root@k8s-master:~/testOverlay# echo "from aa" > A/f.txt
root@k8s-master:~/testOverlay# cat fileRoot/f.txt
from bb
root@k8s-master:~/testOverlay# cat fileRoot/f.txt
from bb

// 一个可能的解释:内核文档里明确说明,overlay 挂载之后再直接修改 lowerdir/upperdir,其行为是未定义的。
// f.txt 第一次被查找时 A 里还没有这个文件,之后再往 A 里补一个同名文件,merged 视图不保证按覆盖规则更新。

在merged文件夹所做的所有修改,最终都会存储到upperdir目录中
root@k8s-master:~/testOverlay# echo "from fileRoot" > fileRoot/fr.txt
root@k8s-master:~/testOverlay# ls C
c.txt  fr.txt
```
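上面 2.1 里的挂载命令,也可以直接通过 mount(2) 系统调用完成。下面是一个最小的 Go 草图(假设 A、B、C、worker、fileRoot 这几个目录已按上文创建好,并以 root 身份在 testOverlay 目录下运行):

```
package main

// 草图:等价于 mount -t overlay overlay -o lowerdir=A:B,upperdir=C,workdir=worker fileRoot

import (
	"log"
	"syscall"
)

func main() {
	opts := "lowerdir=A:B,upperdir=C,workdir=worker"
	if err := syscall.Mount("overlay", "fileRoot", "overlay", 0, opts); err != nil {
		log.Fatalf("mount overlay: %v", err)
	}
	log.Println("overlay mounted on fileRoot")
	// 实验结束后可用 syscall.Unmount("fileRoot", 0) 或 umount fileRoot 卸载
}
```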
#### 2.2 补充实验 ``` root@k8s-master:~/testOver# mkdir -p fileRoot A/aa B/aa C worker root@k8s-master:~/testOver# echo "from A" > A/a.txt root@k8s-master:~/testOver# echo "from A" > A/aa/a.txt root@k8s-master:~/testOver# echo "from B" > B/aa/a.txt root@k8s-master:~/testOver# echo "from B" > B/aa/b.txt root@k8s-master:~/testOver# echo "from B" > B/a.txt root@k8s-master:~/testOver# echo "from B" > B/b.txt root@k8s-master:~/testOver# echo "from C" > C/c.txt root@k8s-master:~/testOver# mount -t overlay overlay -o lowerdir=A:B,upperdir=C,workdir=worker fileRoot root@k8s-master:~/testOver# ls fileRoot/ aa a.txt b.txt c.txt root@k8s-master:~/testOver# ls fileRoot/a.txt fileRoot/a.txt root@k8s-master:~/testOver# cat fileRoot/a.txt from A root@k8s-master:~/testOver# ls fileRoot/aa/ a.txt b.txt root@k8s-master:~/testOver# cat fileRoot/aa/a.txt from A ``` #### 2.2 结论 ##### 2.2.1 workdir作用是什么 通过实验:wokrdir目录平时都是空的,但是可以手动写入文件,写入文件后不影响overlay文件(fileRoot); 通过实验没看出来,查询资料,解析是: workdir选项是必需的,用于在原子操作中将文件切换到覆盖目标之前准备文件(workdir必须与upperdir在同一文件系统上)。 资料来源:[http](http://windsock.io/the-overlay-filesystem/) : [//windsock.io/the-overlay-filesystem/](http://windsock.io/the-overlay-filesystem/) 我可能会猜测“覆盖目标”的意思`upperdir`。 所以...某些文件(也许是“ whiteout”文件?)是非原子创建和配置的`workdir`,然后原子移动到的`upperdir`。 链接:https://qastack.cn/unix/324515/linux-filesystem-overlay-what-is-workdir-used-for-overlayfs
##### 2.2.2 文件覆盖规则 (1)lowerdir的值可以是一些的文件夹列表,文件都可以读写 (2)merged文件夹是最终联合起来的文件系统,我们可以在merged文件夹中访问所有lowerdir和upperdir中的内容 (3)文件的覆盖顺序,upperdir目录拥有最高覆盖权限,lowerdir按照mount时从左到右的顺序,权重依次降低,左边的覆盖右边的同名文件或者文件夹。 (4)在merged文件夹所做的所有修改,最终都会存储到upperdir目录中 (5)workdir指定的目录需要和upperdir位于同一目录中 (6)mount的时候,lowerdir相同文件会被最左边的覆盖,不同的文件和合并到相同目录 (补充实验) ### 3 源码分析-通过原理来理解 目前暂时先不设计这一块代码,了解大概的使用即可。如果需要,后面参考这两个链接再仔细研究。 https://mp.weixin.qq.com/s/pgu0uXvokgBTXUNk1LpB6Q https://docs.docker.com/storage/storagedriver/overlayfs-driver/ ### 4 总结 从实验结果来看,docker image里面的应该是最新的在最左边。 ================================================ FILE: docker/6. docker pull原理分析.md ================================================ * [0\. 章节目标](#0-章节目标) * [1\. docker pull busybox 引入](#1-docker-pull-busybox-引入) * [1\.1 引入的问题](#11-引入的问题) * [2\. docker pull 原理](#2-docker-pull-原理) * [2\.1 查看docker 信息](#21-查看docker-信息) * [2\.2 Root Dir](#22-root-dir) * [2\.3 image目录](#23-image目录) * [2\.4 如何获取dockerhub镜像的manifest](#24-如何获取dockerhub镜像的manifest) * [3\. docker pull后的文件是如何存储的](#3-docker-pull后的文件是如何存储的) * [3\.1 查看image元数据信息\-imageConfig](#31-查看image元数据信息-imageconfig) * [3\.2 sha256sum 作用](#32-sha256sum-作用) * [3\.3 diff\_ids vs docker pull的layer\-id](#33-diff_ids-vs-docker-pull的layer-id) * [3\.4 如何查看每一层的layer在哪](#34-如何查看每一层的layer在哪) * [4\. 结论](#4-结论) * [5 参考](#5-参考) ### 0. 章节目标 从体验和原理入手, 弄清楚doker pull 镜像的过程; 弄清楚docker 镜像是如何存储的, 为后面docker pull 源码做准备。 ### 1. docker pull busybox 引入 ``` root@k8s-master:~# docker pull busybox Using default tag: latest latest: Pulling from library/busybox 3cb635b06aa2: Pull complete //该镜像只有一层 Digest: sha256:b5cfd4befc119a590ca1a81d6bb0fa1fb19f1fbebd0397f25fae164abe1e8a6a Status: Downloaded newer image for busybox:latest docker.io/library/busybox:latest // 镜像id 是 ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af root@k8s-master:/var/lib/docker/overlay2# docker images --no-trunc REPOSITORY TAG IMAGE ID CREATED SIZE busybox latest sha256:ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af 4 days ago 1.24MB root@k8s-master:~# docker pull busybox:latest latest: Pulling from library/busybox Digest: sha256:b5cfd4befc119a590ca1a81d6bb0fa1fb19f1fbebd0397f25fae164abe1e8a6a Status: Image is up to date for busybox:latest docker.io/library/busybox:latest root@k8s-master:~# docker rmi busybox:latest Untagged: busybox:latest Untagged: busybox@sha256:b5cfd4befc119a590ca1a81d6bb0fa1fb19f1fbebd0397f25fae164abe1e8a6a Deleted: sha256:ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af Deleted: sha256:64cac9eaf0da6a7ae6519b6c7198929f232324e0822b5e359ee0e27104e2d3ed root@k8s-master:~# docker pull zoux/pause-amd64:3.0 3.0: Pulling from zoux/pause-amd64 4f4fb700ef54: Pull complete ce150f7a21ec: Pull complete Digest: sha256:f04288efc7e65a84be74d4fc63e235ac3c6c603cf832e442e0bd3f240b10a91b Status: Downloaded newer image for zoux/pause-amd64:3.0 ``` #### 1.1 引入的问题 Q: Digest 是什么 ? A:镜像在服务器端的 sha256sum ID。 Q: rmi 的时候为什么还要delete: 64cac9eaf0da6a7ae6519b6c7198929f232324e0822b5e359ee0e27104e2d3ed ? A: 64cac9eaf0da6a7ae6519b6c7198929f232324e0822b5e359ee0e27104e2d3ed 是bosybox 的rootfs_id
### 2. docker pull 原理 ![image-20220226174337490](./image/image-1.png) 关键信息: (1)manifest 有什么信息 (2)image config是什么 (3)diff_ids是什么 #### 2.1 查看docker 信息 ``` root@k8s-master:~# docker info Client: Debug Mode: false Server: Containers: 11 Running: 4 Paused: 0 Stopped: 7 Images: 7 Server Version: 19.03.9 Storage Driver: overlay2 // 使用的是 overlay2文件系统 Backing Filesystem: extfs Supports d_type: true Native Overlay Diff: true Logging Driver: json-file Cgroup Driver: cgroupfs Plugins: Volume: local Network: bridge host ipvlan macvlan null overlay Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog Swarm: inactive Runtimes: runc Default Runtime: runc Init Binary: docker-init containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd init version: fec3683 Security Options: apparmor seccomp Profile: default Kernel Version: 4.19.0-17-amd64 Operating System: Debian GNU/Linux 10 (buster) OSType: linux Architecture: x86_64 CPUs: 2 Total Memory: 3.854GiB Name: k8s-master ID: DN3J:XOLZ:VIGR:W4E2:LK47:PCEH:43KP:LFCW:XPRG:NPEZ:4DRR:TPTE Docker Root Dir: /var/lib/docker // docker 关键文件 Debug Mode: false Registry: https://index.docker.io/v1/ Labels: Experimental: true Insecure Registries: 127.0.0.0/8 Registry Mirrors: https://b9pmyelo.mirror.aliyuncs.com/ Live Restore Enabled: false Product License: Community Engine WARNING: No swap limit support ``` #### 2.2 Root Dir 这里一个非常关键的就是: Docker Root Dir: /var/lib/docker ``` root@k8s-master:~# ls -l /var/lib/docker total 60 drwx------ 2 root root 4096 Oct 23 16:13 builder drwx--x--x 4 root root 4096 Oct 23 16:13 buildkit drwx------ 3 root root 4096 Oct 23 16:13 containerd drwx------ 13 root root 4096 Dec 12 16:51 containers drwx------ 3 root root 4096 Oct 23 16:13 image drwxr-x--- 3 root root 4096 Oct 23 16:13 network drwx------ 55 root root 12288 Dec 12 16:51 overlay2 drwx------ 4 root root 4096 Oct 23 16:13 plugins drwx------ 2 root root 4096 Dec 12 16:50 runtimes drwx------ 2 root root 4096 Oct 23 16:13 swarm drwx------ 2 root root 4096 Dec 12 16:50 tmp drwx------ 2 root root 4096 Oct 23 16:13 trust drwx------ 2 root root 4096 Oct 23 16:13 volume ``` 和镜像存储有关的信息如下: - overlay2: 镜像和容器的层信息 - image:存储镜像元相关信息 #### 2.3 image目录 ``` root@k8s-master:~# tree -L 1 /var/lib/docker/image/overlay2/ /var/lib/docker/image/overlay2/ ├── distribution ├── imagedb ├── layerdb └── repositories.json 3 directories, 1 file ``` repositories.json就是存储镜像信息,主要是name和image id的对应,digest和image id的对应。当pull镜像的时候会更新这个文件。 ``` root@k8s-master:/var/lib/docker# cat image/overlay2/repositories.json { "Repositories": { "busybox": { "busybox:latest": "sha256:ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af", "busybox@sha256:b5cfd4befc119a590ca1a81d6bb0fa1fb19f1fbebd0397f25fae164abe1e8a6a": "sha256:ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af" }, "zoux/pause-amd64": { "zoux/pause-amd64:3.0": "sha256:99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2", "zoux/pause-amd64@sha256:f04288efc7e65a84be74d4fc63e235ac3c6c603cf832e442e0bd3f240b10a91b": "sha256:99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2" }, "nginx": { "nginx:latest": "sha256:f652ca386ed135a4cbe356333e08ef0816f81b2ac8d0619af01e2b256837ed3e", "nginx@sha256:097c3a0913d7e3a5b01b6c685a60c03632fc7a2b50bc8e35bcaa3691d788226e": "sha256:ea335eea17ab984571cd4a3bcf90a0413773b559c75ef4cda07d0ce952b00291", "nginx@sha256:644a70516a26004c97d0d85c7fe1d0c3a67ea8ab7ddf4aff193d9f301670cf36": 
"sha256:87a94228f133e2da99cb16d653cd1373c5b4e8689956386c1c12b60a20421a02", "nginx@sha256:9522864dd661dcadfd9958f9e0de192a1fdda2c162a35668ab6ac42b465f0603": "sha256:f652ca386ed135a4cbe356333e08ef0816f81b2ac8d0619af01e2b256837ed3e" }, "quay.io/coreos/flannel": { "quay.io/coreos/flannel:v0.15.0": "sha256:09b38f011a29c697679aa10918b7514e22136b50ceb6cf59d13151453fe8b7a0", "quay.io/coreos/flannel@sha256:bf24fa829f753d20b4e36c64cf9603120c6ffec9652834953551b3ea455c4630": "sha256:09b38f011a29c697679aa10918b7514e22136b50ceb6cf59d13151453fe8b7a0" }, "rancher/mirrored-flannelcni-flannel-cni-plugin": { "rancher/mirrored-flannelcni-flannel-cni-plugin:v1.2": "sha256:98660e6e4c3ae49bf49cd640309f79626c302e1d8292e1971dcc2e6a6b7b8c4d", "rancher/mirrored-flannelcni-flannel-cni-plugin@sha256:b69fb2dddf176edeb7617b176543f3f33d71482d5d425217f360eca5390911dc": "sha256:98660e6e4c3ae49bf49cd640309f79626c302e1d8292e1971dcc2e6a6b7b8c4d" } } } ```
``` root@k8s-master:~# docker images --digests REPOSITORY TAG DIGEST IMAGE ID CREATED SIZE busybox latest sha256:b5cfd4befc119a590ca1a81d6bb0fa1fb19f1fbebd0397f25fae164abe1e8a6a ffe9d497c324 4 days ago 1.24MB ``` **查看docker image信息** ``` root@k8s-master:~# export DOCKER_CLI_EXPERIMENTAL=enabled //需要开启docker cli root@k8s-master:~# docker manifest inspect busybox:latest { "schemaVersion": 2, "mediaType": "application/vnd.docker.distribution.manifest.list.v2+json", "manifests": [ { "mediaType": "application/vnd.docker.distribution.manifest.v2+json", "size": 527, "digest": "sha256:50e44504ea4f19f141118a8a8868e6c5bb9856efa33f2183f5ccea7ac62aacc9", //这个为啥不一样 "platform": { "architecture": "amd64", "os": "linux" } }, { // 其他平台。。 "mediaType": "application/vnd.docker.distribution.manifest.v2+json", "size": 527, "digest": "sha256:0252da5f2df7425dcf48afb4bc337966dfeb2d87079ea3f7fe25051d5b9e9c26", "platform": { "architecture": "arm", "os": "linux", "variant": "v5" } }, ] } ``` **解答疑问:** 从这里就可以看出来,repositories.json存储了 镜像id和 digestsid的对应关系。 digestsid 就是存储在服务器远端的 所有镜像文件的 sha256值。 当第二次docker pull的时候,发现 busybox:latest 对应的 digestsid=b5cfd4befc119a590ca1a81d6bb0fa1fb19f1fbebd0397f25fae164abe1e8a6a。 一查看repositories.json,发现本地有这个镜像,所以不会再下载了。
digest是manifest的sha256:,因为manifest在本地没有,我们可以通过registry的结果去获取。 #### 2.4 如何获取dockerhub镜像的manifest https://stackoverflow.com/questions/55269256/how-to-get-manifests-using-http-api-v2 https://zhuanlan.zhihu.com/p/95900321 这个看起来可以的 https://gist.github.com/tnozicka/f46b37f57f7ac755fefa6a0f0c8a77bf ``` repo=openshift/origin && curl -H "Authorization: Bearer $(curl -sSL "https://auth.docker.io/token?service=registry.docker.io&scope=repository:${repo}:pull" | jq --raw-output .token)" "https://registry.hub.docker.com/v2/${repo}/manifests/latest" root@k8s-master:~# repo=zoux/pause-amd64 && curl -H "Authorization: Bearer $(curl -sSL "https://auth.docker.io/token?service=registry.docker.io&scope=repository:${repo}:pull" | jq --raw-output .token)" "https://registry.hub.docker.com/v2/${repo}/manifests/3.0" { "schemaVersion": 1, "name": "zoux/pause-amd64", "tag": "3.0", "architecture": "amd64", "fsLayers": [ { "blobSum": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" }, { "blobSum": "sha256:ce150f7a21ecb3a4150d71685079f2727057c1785323933f9fdd0750874e13e5" }, { "blobSum": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" } ], "history": [ { "v1Compatibility": "{\"architecture\":\"amd64\",\"config\":{\"Hostname\":\"95722352e41d\",\"Domainname\":\"\",\"User\":\"\",\"AttachStdin\":false,\"AttachStdout\":false,\"AttachStderr\":false,\"Tty\":false,\"OpenStdin\":false,\"StdinOnce\":false,\"Env\":null,\"Cmd\":null,\"Image\":\"f8e2eec424cf985b4e41d6423991433fb7a93c90f9acc73a5e7bee213b789c52\",\"Volumes\":null,\"WorkingDir\":\"\",\"Entrypoint\":[\"/pause\"],\"OnBuild\":null,\"Labels\":{}},\"container\":\"a9873535145fe72b464d3055efbac36aab70d059914e221cbbd7fe3cac53ef6b\",\"container_config\":{\"Hostname\":\"95722352e41d\",\"Domainname\":\"\",\"User\":\"\",\"AttachStdin\":false,\"AttachStdout\":false,\"AttachStderr\":false,\"Tty\":false,\"OpenStdin\":false,\"StdinOnce\":false,\"Env\":null,\"Cmd\":[\"/bin/sh\",\"-c\",\"#(nop) ENTRYPOINT \\u0026{[\\\"/pause\\\"]}\"],\"Image\":\"f8e2eec424cf985b4e41d6423991433fb7a93c90f9acc73a5e7bee213b789c52\",\"Volumes\":null,\"WorkingDir\":\"\",\"Entrypoint\":[\"/pause\"],\"OnBuild\":null,\"Labels\":{}},\"created\":\"2016-05-04T06:26:41.522308365Z\",\"docker_version\":\"1.9.1\",\"id\":\"3d2e5b3ef4b070401482a8161420136e75da9354ccfc7cece40b2b5ba8d0f1be\",\"os\":\"linux\",\"parent\":\"58ca451648f521bb9749d929fab33c76c1aec4ac54990f4d33fb86705682ec32\"}" }, { "v1Compatibility": "{\"id\":\"58ca451648f521bb9749d929fab33c76c1aec4ac54990f4d33fb86705682ec32\",\"parent\":\"00fa447be331f70e08ea0dfff0174e514aac7f0f089a6c4d3a8f58d855a10b3e\",\"created\":\"2016-05-04T06:26:41.091672218Z\",\"container_config\":{\"Cmd\":[\"/bin/sh -c #(nop) ADD file:b7eb6a5df9d5fbe509cac16ed89f8d6513a4362017184b14c6a5fae151eee5c5 in /pause\"]}}" }, { "v1Compatibility": "{\"id\":\"00fa447be331f70e08ea0dfff0174e514aac7f0f089a6c4d3a8f58d855a10b3e\",\"created\":\"2016-05-04T06:26:40.628395649Z\",\"container_config\":{\"Cmd\":[\"/bin/sh -c #(nop) ARG ARCH\"]}}" } ], "signatures": [ { "header": { "jwk": { "crv": "P-256", "kid": "W2RG:USLL:S22T:VLMH:PO66:FQVK:M5BQ:WYME:FDIC:TNX4:J4TE:LKIW", "kty": "EC", "x": "abyPWJMVZM6xBosAkf1sUh4D30sa-4XEjXNTuIv72_s", "y": "9miJIR5j2yXpcTaxqrFW491OEKc0npyWDYAa5KLxDNw" }, "alg": "ES256" }, "signature": "WZVTu9_Q2jFeNViqxIXUf_bLlLTjhH5tAjdcdCB0ohC1hgyxLIrt1hAeG2ZZkxg0wBuEaWm8ip6C1yt6Vad9SQ", "protected": "eyJmb3JtYXRMZW5ndGgiOjIzOTEsImZvcm1hdFRhaWwiOiJDbjAiLCJ0aW1lIjoiMjAyMS0xMi0yNVQwMjozNjowN1oifQ" } ] } ``` ### 3. 
docker pull后的文件是如何存储的 #### 3.1 查看image元数据信息-imageConfig 镜像元数据存储在了/var/lib/docker/image//imagedb/content/sha256/目录下,名称是以镜像ID命名的文件,镜像ID可通过docker images查看,这些文件以json的形式保存了该镜像的rootfs信息、镜像创建时间、构建历史信息、所用容器、包括启动的Entrypoint和CMD等等。 这里以bosybox镜像为例: 从docker pull的输出可以看出来,busybox只有一层, 3cb635b06aa2 ``` // docker pull busybox之前 root@k8s-master:/var/lib/docker/image/overlay2/imagedb/content/sha256# ls 09b38f011a29c697679aa10918b7514e22136b50ceb6cf59d13151453fe8b7a0 87a94228f133e2da99cb16d653cd1373c5b4e8689956386c1c12b60a20421a02 98660e6e4c3ae49bf49cd640309f79626c302e1d8292e1971dcc2e6a6b7b8c4d 99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2 ea335eea17ab984571cd4a3bcf90a0413773b559c75ef4cda07d0ce952b00291 f652ca386ed135a4cbe356333e08ef0816f81b2ac8d0619af01e2b256837ed3e // docker pull的时候,只有这个pull 3cb635b06aa2: Pull complete // 下载镜像之后 root@k8s-master:/var/lib/docker/image/overlay2/imagedb/content/sha256# ls 09b38f011a29c697679aa10918b7514e22136b50ceb6cf59d13151453fe8b7a0 87a94228f133e2da99cb16d653cd1373c5b4e8689956386c1c12b60a20421a02 98660e6e4c3ae49bf49cd640309f79626c302e1d8292e1971dcc2e6a6b7b8c4d 99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2 ea335eea17ab984571cd4a3bcf90a0413773b559c75ef4cda07d0ce952b00291 f652ca386ed135a4cbe356333e08ef0816f81b2ac8d0619af01e2b256837ed3e ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af // 多了这一层, 每个文件名就是一个imageid // 文件内容是镜像的详细信息 root@k8s-master:/var/lib/docker/image/overlay2/imagedb/content/sha256# cat ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af { "architecture": "amd64", "config": { "Hostname": "", "Domainname": "", "User": "", "AttachStdin": false, "AttachStdout": false, "AttachStderr": false, "Tty": false, "OpenStdin": false, "StdinOnce": false, "Env": [ "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" ], "Cmd": [ "sh" ], "Image": "sha256:47595422ea26649bce6768903b3f14aa220694e0811e1bdb5e5bd6fd3df852b2", "Volumes": null, "WorkingDir": "", "Entrypoint": null, "OnBuild": null, "Labels": null }, "container": "0234093c99ba42a97028378063ca32364ca85f74b6804ae65da0f874c16cff69", "container_config": { "Hostname": "0234093c99ba", "Domainname": "", "User": "", "AttachStdin": false, "AttachStdout": false, "AttachStderr": false, "Tty": false, "OpenStdin": false, "StdinOnce": false, "Env": [ "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" ], "Cmd": [ "/bin/sh", "-c", "#(nop) ", "CMD [\"sh\"]" ], "Image": "sha256:47595422ea26649bce6768903b3f14aa220694e0811e1bdb5e5bd6fd3df852b2", "Volumes": null, "WorkingDir": "", "Entrypoint": null, "OnBuild": null, "Labels": { } }, "created": "2021-12-08T00:22:34.424256906Z", "docker_version": "20.10.7", "history": [ { "created": "2021-12-08T00:22:34.228923742Z", "created_by": "/bin/sh -c #(nop) ADD file:e2d2d9591696b14787114bccd6c84033d8e8433ce416045672e2870b983b6029 in / " }, { "created": "2021-12-08T00:22:34.424256906Z", "created_by": "/bin/sh -c #(nop) CMD [\"sh\"]", "empty_layer": true } ], "os": "linux", "rootfs": { "type": "layers", "diff_ids": [ "sha256:64cac9eaf0da6a7ae6519b6c7198929f232324e0822b5e359ee0e27104e2d3ed" ] } } ```
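下一小节会用 sha256sum 验证镜像 ID 与这个 config 文件的关系;也可以用一小段 Go 程序来算(一个草图,文件路径沿用上面 busybox 的 config,实际路径以本机为准):

```
package main

// 草图:计算 image config 文件内容的 sha256,结果应与镜像的 IMAGE ID(即文件名)一致

import (
	"crypto/sha256"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	path := "/var/lib/docker/image/overlay2/imagedb/content/sha256/" +
		"ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af"

	f, err := os.Open(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("image id: %x\n", h.Sum(nil))
}
```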
#### 3.2 sha256sum 作用 sha256sum:计算文件的哈希值 ``` root@k8s-master:~# sha256sum a.sh 96a9988dd952b0910d4d808187b52a623fda2a45b86337b61a76589618f901bf a.sh 没看错,镜像id就是 该image-config文件的hash值 root@k8s-master:/var/lib/docker/image/overlay2/imagedb/content/sha256# sha256sum ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af // 该文件的hash值 ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af ``` **镜像id就是 该image-config文件的hash值!!!** #### 3.3 diff_ids vs docker pull的layer-id /var/lib/docker/image/overlay2/imagedb/content/sha256 目录存放了镜像的 config。并且指定了 diff_ids是: 64cac9eaf0da6a7ae6519b6c7198929f232324e0822b5e359ee0e27104e2d3ed 这个看起来就是 具体镜像文件了。 docker pull是: 3cb635b06aa2 diff_ids: 64cac9eaf0da6a7ae6519b6c7198929f232324e0822b5e359ee0e27104e2d3ed 这两为啥又不一样: 在pull镜像的时候显示的是各个layer的digest信息,在image config存的是diffid。要区分这两个,还要先回答为什么manifest的layer的表达和image config的layer的表达中不是一个东西。
**结论:** image config里面的diffid 就是本地解压后的 layer sha256sum值。 docker pull的是服务器端压缩的 layer sha256sum 当我们去registry上拉layer的时候,拉什么格式的呢,是根据请求中的media type决定的,因为layer存在本地的时候未压缩的,或者说是解压过的。 为了在网络上传输的更加快呢,所有media type一般会指定压缩格式的,比如gzip的,具体有哪些格式,见:[media type](https://link.zhihu.com/?target=https%3A//docs.docker.com/registry/spec/manifest-v2-2/%23media-types) 结合我最开始说的(manifest对应registry服务端的配置,image config针对本地存储端的),其实也就不难理解了。 当docker发现本地不存在某个layer的时候,就会通过manifest里面的digest + mediaType(一般是"application/vnd.docker.image.rootfs.diff.tar.gzip")去registry拉对应的leyer。 然后在image id存的对应的diff id就是上面拿到的tar.gz包解压为tar包的id。 ``` # curl -H "Accept:application/vnd.docker.image.rootfs.diff.tar.gzip" https://docker-search.4pd.io/v2/ubuntu/blobs/sha256:7ddbc47eeb70dc7f08e410a667948b87ff3883024eb41478b44ef9a81bf400c -o layer1.tar.gz # sha256sum layer1.tar.gz 7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c layer1.tar.gz # sha256sum layer1.tar cc967c529ced563b7746b663d98248bc571afdb3c012019d7f54d6c092793b8b layer1.tar ``` **distribution目录存放了对应的转换关系** v2metadata-by-diffid : 文件名是 diff_ids, 文件的值是digest diffid-by-digest: 文件名是digest, 文件值是 diff_ids ``` root@k8s-master:/var/lib/docker/image/overlay2/distribution/v2metadata-by-diffid/sha256# cat 64cac9eaf0da6a7ae6519b6c7198929f232324e0822b5e359ee0e27104e2d3ed [{"Digest":"sha256:3cb635b06aa273034d7080e0242e4b6628c59347d6ddefff019bfd82f45aa7d5","SourceRepository":"docker.io/library/busybox","HMAC":""}] root@k8s-master:/var/lib/docker/image/overlay2/distribution/diffid-by-digest/sha256# cat 3cb635b06aa273034d7080e0242e4b6628c59347d6ddefff019bfd82f45aa7d5 sha256:64cac9eaf0da6a7ae6519b6c7198929f232324e0822b5e359ee0e27104e2d3ed ``` #### 3.4 如何查看每一层的layer在哪 以curlimages/curl:7.75.0镜像为例: ``` { "architecture": "amd64", "config": { "User": "curl_user", "Env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "CURL_VERSION=7_75_0", "CURL_RELEASE_TAG=curl-7_75_0", "CURL_GIT_REPO=https://github.com/curl/curl.git", "CURL_CA_BUNDLE=/cacert.pem"], "Entrypoint": ["/entrypoint.sh"], "Cmd": ["curl"], "Labels": { "Maintainer": "James Fuller \u003cjim.fuller@webcomposite.com\u003e", "Name": "curl", "Version": "1.0.0", "docker.cmd": "docker run -it curl/curl:7.75.0 -s -L http://curl.haxx.se", "se.haxx.curl": "curl", "se.haxx.curl.description": "network utility", "se.haxx.curl.release_tag": "curl-7_75_0", "se.haxx.curl.version": "7_75_0" }, "ArgsEscaped": true, "OnBuild": null }, "created": "2021-02-03T10:22:09.59342396Z", "history": [{ "created": "2020-12-17T00:19:41.960367136Z", "created_by": "/bin/sh -c #(nop) ADD file:ec475c2abb2d46435286b5ae5efacf5b50b1a9e3b6293b69db3c0172b5b9658b in / " }, { "created": "2020-12-17T00:19:42.11518025Z", "created_by": "/bin/sh -c #(nop) CMD [\"/bin/sh\"]", "empty_layer": true }, { "created": "2021-02-03T10:18:02.868616268Z", "created_by": "ARG CURL_RELEASE_TAG=latest", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2021-02-03T10:18:02.868616268Z", "created_by": "ARG CURL_RELEASE_VERSION", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2021-02-03T10:18:02.868616268Z", "created_by": "ARG CURL_GIT_REPO=https://github.com/curl/curl.git", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2021-02-03T10:18:02.868616268Z", "created_by": "ENV CURL_VERSION=7_75_0", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2021-02-03T10:18:02.868616268Z", "created_by": "ENV CURL_RELEASE_TAG=curl-7_75_0", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, 
{ "created": "2021-02-03T10:18:02.868616268Z", "created_by": "ENV CURL_GIT_REPO=https://github.com/curl/curl.git", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2021-02-03T10:18:02.868616268Z", "created_by": "LABEL Maintainer=James Fuller \u003cjim.fuller@webcomposite.com\u003e", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2021-02-03T10:18:02.868616268Z", "created_by": "LABEL Name=curl", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2021-02-03T10:18:02.868616268Z", "created_by": "LABEL Version=", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2021-02-03T10:18:02.868616268Z", "created_by": "LABEL docker.cmd=docker run -it curl/curl:7.75.0 -s -L http://curl.haxx.se", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2021-02-03T10:18:02.868616268Z", "created_by": "RUN |3 CURL_RELEASE_TAG=curl-7_75_0 CURL_RELEASE_VERSION=7_75_0 CURL_GIT_REPO=https://github.com/curl/curl.git /bin/sh -c apk add --no-cache brotli brotli-dev libssh2 nghttp2-dev \u0026\u0026 rm -fr /var/cache/apk/* # buildkit", "comment": "buildkit.dockerfile.v0" }, { "created": "2021-02-03T10:18:03.050522395Z", "created_by": "RUN |3 CURL_RELEASE_TAG=curl-7_75_0 CURL_RELEASE_VERSION=7_75_0 CURL_GIT_REPO=https://github.com/curl/curl.git /bin/sh -c addgroup -S curl_group \u0026\u0026 adduser -S curl_user -G curl_group # buildkit", "comment": "buildkit.dockerfile.v0" }, { "created": "2021-02-03T10:22:08.691286411Z", "created_by": "COPY /cacert.pem /cacert.pem # buildkit", "comment": "buildkit.dockerfile.v0" }, { "created": "2021-02-03T10:22:08.691286411Z", "created_by": "ENV CURL_CA_BUNDLE=/cacert.pem", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2021-02-03T10:22:08.768815145Z", "created_by": "COPY /alpine/usr/local/lib/libcurl.so.4.7.0 /usr/lib/ # buildkit", "comment": "buildkit.dockerfile.v0" }, { "created": "2021-02-03T10:22:08.853211212Z", "created_by": "COPY /alpine/usr/local/bin/curl /usr/bin/curl # buildkit", "comment": "buildkit.dockerfile.v0" }, { "created": "2021-02-03T10:22:09.262850838Z", "created_by": "RUN |3 CURL_RELEASE_TAG=curl-7_75_0 CURL_RELEASE_VERSION=7_75_0 CURL_GIT_REPO=https://github.com/curl/curl.git /bin/sh -c ln -s /usr/lib/libcurl.so.4.7.0 /usr/lib/libcurl.so.4 # buildkit", "comment": "buildkit.dockerfile.v0" }, { "created": "2021-02-03T10:22:09.516766096Z", "created_by": "RUN |3 CURL_RELEASE_TAG=curl-7_75_0 CURL_RELEASE_VERSION=7_75_0 CURL_GIT_REPO=https://github.com/curl/curl.git /bin/sh -c ln -s /usr/lib/libcurl.so.4 /usr/lib/libcurl.so # buildkit", "comment": "buildkit.dockerfile.v0" }, { "created": "2021-02-03T10:22:09.516766096Z", "created_by": "USER curl_user", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2021-02-03T10:22:09.59342396Z", "created_by": "COPY entrypoint.sh /entrypoint.sh # buildkit", "comment": "buildkit.dockerfile.v0" }, { "created": "2021-02-03T10:22:09.59342396Z", "created_by": "CMD [\"curl\"]", "comment": "buildkit.dockerfile.v0", "empty_layer": true }, { "created": "2021-02-03T10:22:09.59342396Z", "created_by": "ENTRYPOINT [\"/entrypoint.sh\"]", "comment": "buildkit.dockerfile.v0", "empty_layer": true }], "os": "linux", "rootfs": { "type": "layers", "diff_ids": ["sha256:777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf", "sha256:019dd39b82bba02007b940007ee0662015ff0a11ddd55fb7b4a4f6f1e3f694f2", "sha256:ead19f98b65e2cb338cab0470d7ddadc8a23c32ccd34ab6511a35393c7b7335d", 
"sha256:bcbfcc5b87d4afa5cf8981569a2dcebfd01643a7ddbe82f191062cf677d024b2", "sha256:6e767bd912c28e4d667adfec7adcf1dab84f76ecf0b71cba76634b03a00e67e8", "sha256:9904f3d51f2e6e052fd2ce88494090739f23acec20f2a9c3b2d3deb86874dd0e", "sha256:56a8d17054bd206ae215f3b81ecbb2d2715b21f48966763fc8c9144ac8f8d46e", "sha256:939fe15ec48dad8528237a6330438426dd8627db92a891eb610e36075274e2f5", "sha256:3e7aa53fce9350e24217d0b33912c286a4748e36facfd174c32ec53303be025f"] } } ``` /var/lib/docker/image/overlay2/layerdb/sha256目录存放的diffids的最上层信息,也就是777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf,这是个目录 那这个里面到底是啥意思呢,这个里面是chainid,这个是因为chainid的一层是依赖上一层的,这就导致最后算出来的rootfs是统一的。 公式为(具体可见:[layer-chainid](https://link.zhihu.com/?target=https%3A//github.com/opencontainers/image-spec/blob/master/config.md%23layer-chainid)): chaninid(1) = diffid(1) chainid(n) = sha256(chain(n-1) diffid(n) ) ``` root@k8s-node:~# cd /var/lib/docker/image/overlay2/layerdb/sha256 root@k8s-node:/var/lib/docker/image/overlay2/layerdb/sha256# ls 02aca22ece6a3cd150e7df6e3a651c1386983a9cd525250e804957e5c8629a05 1be8816ebbd7f52290964aa6df8ff27825772a40baded5d91a152ded7c2534a3 1bfbb02dea047ad9341efddc61f0b8a9b473b86001bf7605df7a8880b157b8a9 4006d6bc83834f41eae67f73db4fd4ed3364b06362780a529b44dd5015711092 439f01e6ba92ba1e5b3be977f73014ab80e7997462b9ca86f44ae9b6cdc99cb7 4d4eb19da25f4f4649cf74c7028acd317962959e4b9b55aec27b4cfc3b867b93 5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef 666604249ff52593858b7716232097daa6d721b7b4825aac8bf8a3f45dfba1ce 722f29343eb01a012a210445f66fc22678ca5750ae3bba2cfde9a5c3b62c701d 777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf //第一层 7897c392c5f451552cd2eb20fdeadd1d557c6be8a3cd20d0355fb45c1f151738 7fcb75871b2101082203959c83514ac8a9f4ecfee77a0fe9aa73bbe56afdf1b4 8473ff61fb5a229cfb7e0410cc815321b3bbe7a88c22766fe4f3f643a7ea2e32 85e5b916bf35f12eeb78c6d89d1cba758c0a60d516401beef41a2aa65f8ddb76 b6b031f5155c8fdd924e4e2508b6ae4018ff646efa86734c8c34b0d61a82b5ea bfb718dadfd11e598f98dc1314421be5bdee044f417a4149bfd370083db78e6e d43d6edaff1c22bfd53fcb4b0aa1f00dcd987d45b38ac3971317350785c18574 d8546a51a3203d6ac8eb7b5b0f23a97e77aa706e0ee2136e8747c000538926bd e8f232ecf2faa5a124d8025eaea6861ff94fc1a5c7da17d7b9712aa24431293e eea7cd97478d04eff4f9fc36c229d9e9f3d42740e6dc02d6578104e945f38d9f f1dd685eb59e7d19dd353b02c4679d9fafd21ccffe1f51960e6c3645f3ceb0cd root@k8s-node:/var/lib/docker/image/overlay2/layerdb/sha256# root@k8s-node:/var/lib/docker/image/overlay2/layerdb/sha256# root@k8s-node:/var/lib/docker/image/overlay2/layerdb/sha256/777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf# ls -l total 32 -rw-r--r-- 1 root root 64 Dec 19 20:41 cache-id 真正对应的layer数据那个目录 -rw-r--r-- 1 root root 71 Dec 19 20:41 diff 该层的diffid -rw-r--r-- 1 root root 7 Dec 19 20:41 size 该层的大小 -rw-r--r-- 1 root root 19501 Dec 19 20:41 tar-split.json.gz layer压缩包的split文件 root@k8s-node:/var/lib/docker/image/overlay2/layerdb/sha256/777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf# cat cache-id 840c5d412d4af8d058a526074900c098c1469ecd2f08fb21c39d23ffd2a9d527 root@k8s-node:/var/lib/docker/image/overlay2/layerdb/sha256/777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf# cat diff sha256:777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf ```
/var/lib/docker/overlay2/就是layer数据存放的目录,比如每个chainid里面cache-id都回应这个目录下面的一个目录 diff 目录就是所有数据的目录 ``` // 没有lower, diff目录 root@k8s-node:/var/lib/docker/overlay2/840c5d412d4af8d058a526074900c098c1469ecd2f08fb21c39d23ffd2a9d527# ls committed diff link root@k8s-node:/var/lib/docker/overlay2/840c5d412d4af8d058a526074900c098c1469ecd2f08fb21c39d23ffd2a9d527/diff# ls -l total 68 drwxr-xr-x 2 root root 4096 Dec 16 2020 bin drwxr-xr-x 2 root root 4096 Dec 16 2020 dev drwxr-xr-x 15 root root 4096 Dec 16 2020 etc drwxr-xr-x 2 root root 4096 Dec 16 2020 home drwxr-xr-x 7 root root 4096 Dec 16 2020 lib drwxr-xr-x 5 root root 4096 Dec 16 2020 media drwxr-xr-x 2 root root 4096 Dec 16 2020 mnt drwxr-xr-x 2 root root 4096 Dec 16 2020 opt dr-xr-xr-x 2 root root 4096 Dec 16 2020 proc drwx------ 2 root root 4096 Dec 16 2020 root drwxr-xr-x 2 root root 4096 Dec 16 2020 run drwxr-xr-x 2 root root 4096 Dec 16 2020 sbin drwxr-xr-x 2 root root 4096 Dec 16 2020 srv drwxr-xr-x 2 root root 4096 Dec 16 2020 sys drwxrwxrwt 2 root root 4096 Dec 16 2020 tmp drwxr-xr-x 7 root root 4096 Dec 16 2020 usr drwxr-xr-x 12 root root 4096 Dec 16 2020 var ```
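chainid 目录 -> cache-id -> overlay2 数据目录这条查找链路,也可以用程序顺着读出来。下面是一个 Go 草图(以上文第一层的 chainid 777b2c64... 为例,路径均取默认的 /var/lib/docker):

```
package main

// 草图:给定 layerdb 下的一个 chainid 目录,读取 cache-id 和 diff,
// 拼出该层数据真正所在的 /var/lib/docker/overlay2/<cache-id>/diff 路径

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	chainID := "777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf"
	layerDir := filepath.Join("/var/lib/docker/image/overlay2/layerdb/sha256", chainID)

	cacheID, err := os.ReadFile(filepath.Join(layerDir, "cache-id"))
	if err != nil {
		log.Fatal(err)
	}
	diffID, err := os.ReadFile(filepath.Join(layerDir, "diff"))
	if err != nil {
		log.Fatal(err)
	}

	dataDir := filepath.Join("/var/lib/docker/overlay2", strings.TrimSpace(string(cacheID)), "diff")
	fmt.Printf("diffid : %s\n", strings.TrimSpace(string(diffID)))
	fmt.Printf("数据目录: %s\n", dataDir)
}
```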
这里很奇怪的一点就是:镜像的 diff_ids 为什么只有第一层出现在 /var/lib/docker/image/overlay2/layerdb/sha256 目录中,其他层都不在吗?

```
"diff_ids": ["sha256:777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf",
"sha256:019dd39b82bba02007b940007ee0662015ff0a11ddd55fb7b4a4f6f1e3f694f2",
"sha256:ead19f98b65e2cb338cab0470d7ddadc8a23c32ccd34ab6511a35393c7b7335d",
"sha256:bcbfcc5b87d4afa5cf8981569a2dcebfd01643a7ddbe82f191062cf677d024b2",
"sha256:6e767bd912c28e4d667adfec7adcf1dab84f76ecf0b71cba76634b03a00e67e8",
"sha256:9904f3d51f2e6e052fd2ce88494090739f23acec20f2a9c3b2d3deb86874dd0e",
"sha256:56a8d17054bd206ae215f3b81ecbb2d2715b21f48966763fc8c9144ac8f8d46e",
"sha256:939fe15ec48dad8528237a6330438426dd8627db92a891eb610e36075274e2f5",
"sha256:3e7aa53fce9350e24217d0b33912c286a4748e36facfd174c32ec53303be025f"]
```

其实不是的。layerdb 目录名用的不是 diffid,而是逐层累加计算出来的 chainid:每一层的 chainid 都依赖上一层,这样最后算出来的 rootfs 才是唯一确定的。比如想知道第二层对应的 overlay 文件,就可以按下面的方式算出来。

公式为(具体可见:[layer-chainid](https://link.zhihu.com/?target=https%3A//github.com/opencontainers/image-spec/blob/master/config.md%23layer-chainid)):

```
chainid(1) = diffid(1)
chainid(n) = sha256(chainid(n-1) + " " + diffid(n))
```

```
// 02aca22ece6a3cd150e7df6e3a651c1386983a9cd525250e804957e5c8629a05 就是第二层的目录,里面的cache-id就是 overlay-id
root@k8s-node:/var/lib/docker/image/overlay2/layerdb# echo -n "sha256:777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf sha256:019dd39b82bba02007b940007ee0662015ff0a11ddd55fb7b4a4f6f1e3f694f2" | sha256sum
02aca22ece6a3cd150e7df6e3a651c1386983a9cd525250e804957e5c8629a05  -

// 4d4eb19da25f4f4649cf74c7028acd317962959e4b9b55aec27b4cfc3b867b93 就是第三层的目录,里面的cache-id就是 overlay-id
root@k8s-node:/var/lib/docker/image/overlay2/layerdb/sha256# echo -n "sha256:02aca22ece6a3cd150e7df6e3a651c1386983a9cd525250e804957e5c8629a05 sha256:ead19f98b65e2cb338cab0470d7ddadc8a23c32ccd34ab6511a35393c7b7335d" | sha256sum
4d4eb19da25f4f4649cf74c7028acd317962959e4b9b55aec27b4cfc3b867b93  -
```

这样 b4c36536404c5e7e468080cabf0c664a45b68eece4a37ff09cac8395869131fc(即 02aca22ece6a3cd150e7df6e3a651c1386983a9cd525250e804957e5c8629a05 目录下 cache-id 的内容)就是第二层对应的 overlay 目录。

```
root@k8s-node:/var/lib/docker/overlay2/b4c36536404c5e7e468080cabf0c664a45b68eece4a37ff09cac8395869131fc/diff# ls
etc  lib  usr
root@k8s-node:/var/lib/docker/overlay2/0422e796ce6cdc75d11303c0018b65ca9285dc36b812f4e14c4f68dbc01bc6d9/diff# ls
etc  home

// 有lower, work目录
root@k8s-node:/var/lib/docker/overlay2/0422e796ce6cdc75d11303c0018b65ca9285dc36b812f4e14c4f68dbc01bc6d9# ls
committed  diff  link  lower  work
```
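chainid 的计算也可以写一小段 Go 程序来验证。下面是一个草图(diff_ids 只取上文 curl 镜像的前三层作为示例),算出的结果应与 layerdb/sha256 下的目录名一一对应:

```
package main

// 草图:按 chainid(1) = diffid(1)、chainid(n) = sha256(chainid(n-1) + " " + diffid(n)) 逐层计算

import (
	"crypto/sha256"
	"fmt"
)

func chainIDs(diffIDs []string) []string {
	ids := make([]string, 0, len(diffIDs))
	chain := diffIDs[0]
	ids = append(ids, chain)
	for _, diff := range diffIDs[1:] {
		sum := sha256.Sum256([]byte(chain + " " + diff))
		chain = fmt.Sprintf("sha256:%x", sum)
		ids = append(ids, chain)
	}
	return ids
}

func main() {
	diffIDs := []string{
		"sha256:777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf",
		"sha256:019dd39b82bba02007b940007ee0662015ff0a11ddd55fb7b4a4f6f1e3f694f2",
		"sha256:ead19f98b65e2cb338cab0470d7ddadc8a23c32ccd34ab6511a35393c7b7335d",
	}
	for i, id := range chainIDs(diffIDs) {
		fmt.Printf("第 %d 层 chainid: %s\n", i+1, id)
	}
}
```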
再找一个最简单的。或者直接比较镜像中文件和第一层layer文件,发现第一层layer文件是最基础的。 ``` root@# docker pull zoux/pause-amd64:3.0 3.0: Pulling from zoux/pause-amd64 4f4fb700ef54: Pull complete ce150f7a21ec: Pull complete Digest: sha256:f04288efc7e65a84be74d4fc63e235ac3c6c603cf832e442e0bd3f240b10a91b Status: Downloaded newer image for zoux/pause-amd64:3.0 "diff_ids":[ "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef", "sha256:41ff149e94f22c52b8f36c59cafe7538b70ea771e62d9fc6922dedac25392fdf", "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef"]}} echo -n "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef sha256:41ff149e94f22c52b8f36c59cafe7538b70ea771e62d9fc6922dedac25392fdf" | sha256sum ``` ### 4. 结论 (1)一个镜像有一个唯一的imageid 和 digestid。imageid 可以认为是本地image config的sha256sum值,digestid是服务器端该镜像config的sha256sum值。 例如本地image config 保存在 /var/lib/docker/image/overlay2/imagedb/content/sha256 目录。该目录下,一个文件就是一个image config。 对该文件内容计算sha256sum得出来的值就是 imageid, 也就是文件名。 /var/lib/docker/image/overlay2/repositories.json 存放了对应的转换关系。 (2)为什么有了imageid,还需要digestid。因为本地的image config一般都是解压后的,服务器端一般都是压缩打包的,所以可以认为digestid是服务器端压缩好的image config 的sha256sum (3)image config里面的diffid 就是本地解压后的 layer sha256sum值。 docker pull的是服务器端压缩的 layer sha256sum。 /var/lib/docker/image/overlay2/distribution/v2metadata-by-diffid/sha256 目录下存放了对应的转换关系。 (4)diffids是本地镜像每一层的sha256sum值。 pull 的是 服务器中每一层的sha256sum值。 (5)/var/lib/docker/image/overlay2/layerdb/sha256 存放了 diffids -> overlay(实际文件) 转换 (cacheid) 但是不是第一层的要经过转换。 (6)var/lib/docker/overlay2/0422e796ce6cdc75d11303c0018b65ca9285dc36b812f4e14c4f68dbc01bc6d9/diff 是实际每一层的文件内容 第一层没有 lower, work目录,因为从第二层开始才是联合文件。
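为了把上面结论 (3)(5)(6) 的对应关系串起来,下面给出一个极简的 Go 草图:按照 layer-chainid 的公式,从 image config 的 diff_ids 逐层算出 chainID,再去 /var/lib/docker/image/overlay2/layerdb/sha256/<chainID>/cache-id 里读出 cache-id,从而拼出 /var/lib/docker/overlay2/<cache-id>/diff 这个实际的 layer 目录。仅作演示,路径和 storage driver(overlay2)都沿用本文环境的假设,并不是 docker 的官方实现。

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// chainID 的计算公式(见 OCI image-spec 的 layer-chainid):
//   chainID(1) = diffID(1)
//   chainID(n) = sha256( chainID(n-1) + " " + diffID(n) )
func chainIDs(diffIDs []string) []string {
	ids := make([]string, 0, len(diffIDs))
	cur := diffIDs[0]
	ids = append(ids, cur)
	for _, d := range diffIDs[1:] {
		sum := sha256.Sum256([]byte(cur + " " + d))
		cur = "sha256:" + fmt.Sprintf("%x", sum)
		ids = append(ids, cur)
	}
	return ids
}

func main() {
	// diff_ids 取自 image config,这里仅示例前两层
	diffIDs := []string{
		"sha256:777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf",
		"sha256:019dd39b82bba02007b940007ee0662015ff0a11ddd55fb7b4a4f6f1e3f694f2",
	}
	layerdb := "/var/lib/docker/image/overlay2/layerdb/sha256"
	for i, cid := range chainIDs(diffIDs) {
		// layerdb 下的目录名不带 "sha256:" 前缀
		dir := filepath.Join(layerdb, strings.TrimPrefix(cid, "sha256:"))
		cacheID, err := os.ReadFile(filepath.Join(dir, "cache-id"))
		if err != nil {
			fmt.Printf("layer %d: chainID=%s (读取 cache-id 失败: %v)\n", i+1, cid, err)
			continue
		}
		fmt.Printf("layer %d: chainID=%s overlay目录=/var/lib/docker/overlay2/%s/diff\n",
			i+1, cid, string(cacheID))
	}
}
```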
举例说明: f04288efc7e65a84be74d4fc63e235ac3c6c603cf832e442e0bd3f240b10a91b 是该镜像config在 服务器的sha256sum值 4f4fb700ef54, ce150f7a21ec表示该镜像有俩层,是 layer在服务器压缩文件的 sha256sum ``` root@k8s-master: # docker pull zoux/pause-amd64:3.0 3.0: Pulling from zoux/pause-amd64 4f4fb700ef54: Pull complete ce150f7a21ec: Pull complete Digest: sha256:f04288efc7e65a84be74d4fc63e235ac3c6c603cf832e442e0bd3f240b10a91b Status: Downloaded newer image for zoux/pause-amd64:3.0 ```
``` root@k8s-master:# docker rmi zoux/pause-amd64:3.0 Untagged: zoux/pause-amd64:3.0 // untag服务器 digestid Untagged: zoux/pause-amd64@sha256:f04288efc7e65a84be74d4fc63e235ac3c6c603cf832e442e0bd3f240b10a91b // 删除镜像id Deleted: sha256:99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2 // 删除 layer-id (不是diff-ids, 是转换好的id,所以通过这个id,可以直接在 ) Deleted: sha256:666604249ff52593858b7716232097daa6d721b7b4825aac8bf8a3f45dfba1ce Deleted: sha256:7897c392c5f451552cd2eb20fdeadd1d557c6be8a3cd20d0355fb45c1f151738 // 找到真正的overlay目录 /var/lib/docker/image/overlay2/layerdb/sha256/7897c392c5f451552cd2eb20fdeadd1d557c6be8a3cd20d0355fb45c1f151738# cat cache-id d932ba5b6deb33a4933760be2010ffb5a81bfd874a42b36678fbcf5a3091f827 ``` ### 5 参考 https://zhuanlan.zhihu.com/p/95900321 ================================================ FILE: docker/7. docker 命令详解.md ================================================ * [1\.docker 常见命令行用法](#1docker-常见命令行用法) * [1\.1 docker 系统本身相关](#11-docker-系统本身相关) * [1\.1\.1 docker info](#111-docker-info) * [1\.1\.2 docker system](#112-docker-system) * [1\.1\.3 docker events](#113-docker-events) * [1\.2 docker image相关](#12-docker-image相关) * [1\-虚悬镜像](#1-虚悬镜像) * [2\-docker image ls 格式化展示](#2-docker-image-ls-格式化展示) * [3\-Untagged 和 Deleted](#3-untagged-和-deleted) * [1\.3 docke container相关](#13-docke-container相关) * [1\-docker diff](#1-docker-diff) * [2\-docker top](#2-docker-top) * [3\-docker attach](#3-docker-attach) * [4\-docker logs \-f containerId](#4-docker-logs--f-containerid) * [2\. docker api](#2-docker-api) * [2\.1 Unix domain socket介绍](#21--unix-domain-socket介绍) * [2\.2 如何通过 unix socket 使用docker](#22-如何通过-unix-socket-使用docker) * [2\.3 如何通过restful api 使用docker](#23-如何通过restful-api-使用docker) * [3\. 参考](#3-参考) ## 1.docker 常见命令行用法 ### 1.1 docker 系统本身相关 #### 1.1.1 docker info 查看docker 的详细信息,例如docker root目录,使用的联合文件系统等等 ``` root@k8s-node:~# docker info Client: Debug Mode: false Server: Containers: 9 Running: 4 Paused: 0 Stopped: 5 Images: 4 Server Version: 19.03.9 Storage Driver: overlay2 Backing Filesystem: extfs Supports d_type: true Native Overlay Diff: true Logging Driver: json-file Cgroup Driver: cgroupfs Plugins: Volume: local Network: bridge host ipvlan macvlan null overlay Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog Swarm: inactive Runtimes: runc Default Runtime: runc Init Binary: docker-init containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd init version: fec3683 Security Options: apparmor seccomp Profile: default Kernel Version: 4.19.0-17-amd64 Operating System: Debian GNU/Linux 10 (buster) OSType: linux Architecture: x86_64 CPUs: 2 Total Memory: 3.854GiB Name: k8s-node ID: FZUV:UMD7:U4L5:KUOH:WYWM:HI6I:HYOD:WSXF:E4D7:RUP2:4ETP:OQTY Docker Root Dir: /var/lib/docker Debug Mode: false Registry: https://index.docker.io/v1/ Labels: Experimental: false Insecure Registries: 127.0.0.0/8 Registry Mirrors: https://b9pmyelo.mirror.aliyuncs.com/ Live Restore Enabled: false Product License: Community Engine WARNING: No swap limit support ``` #### 1.1.2 docker system ``` Usage: docker system COMMAND Manage Docker Commands: df Show docker disk usage events Get real time events from the server info Display system-wide information prune Remove unused data Run 'docker system COMMAND --help' for more information on a command. 
// 查看镜像实际占用的磁盘空间 root@k8s-master:~# docker system df TYPE TOTAL ACTIVE SIZE RECLAIMABLE Images 7 5 491.1MB 140.1MB (28%) Containers 11 4 3.557kB 2.324kB (65%) Local Volumes 0 0 0B 0B Build Cache 0 0 0B 0B ``` #### 1.1.3 docker events 获取docker server的实时事件 ``` # docker events --since 112141543 2022-01-17T12:29:19.046917401+08:00 container die 78deadc2dcd6a3fafc9ac6f8380e1cd8853ffd6bc33796a224ece76d17dd1d92 (Maintainer=James Fuller , Name=curl, Version=1.0.0, annotation.io.kubernetes.container.hash=bef672e5, annotation.io.kubernetes.container.restartCount=686, annotation.io.kubernetes.container.terminationMessagePath=/dev/termination-log, annotation.io.kubernetes.container.terminationMessagePolicy=File, annotation.io.kubernetes.pod.terminationGracePeriod=10, docker.cmd=docker run -it curl/curl:7.75.0 -s -L http://curl.haxx.se, exitCode=0, image=sha256:26a9afb7027cca51ed4f7915474a04822a13e99fce2e1eecad3d43aab6199387, io.kubernetes.container.logpath=/var/log/pods/default_nginx1_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21/nginx/686.log, io.kubernetes.container.name=nginx, io.kubernetes.docker.type=container, io.kubernetes.pod.name=nginx1, io.kubernetes.pod.namespace=default, io.kubernetes.pod.uid=cc8a9cfb-872c-44ba-9899-b4c8bbc93a21, io.kubernetes.sandbox.id=e93a3ae70771ca0e4954fcb6ecf0ffd091eebfc64bcb3cbf461c94eb5474c9aa, name=k8s_nginx_nginx1_default_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21_686, se.haxx.curl=curl, se.haxx.curl.description=network utility, se.haxx.curl.release_tag=curl-7_75_0, se.haxx.curl.version=7_75_0) 2022-01-17T12:29:19.386628039+08:00 container destroy 98c26f5e6c744e7733eaf39fd4a0bfc3692d312213f0504664353157d5d446d9 (Maintainer=James Fuller , Name=curl, Version=1.0.0, annotation.io.kubernetes.container.hash=bef672e5, annotation.io.kubernetes.container.restartCount=685, annotation.io.kubernetes.container.terminationMessagePath=/dev/termination-log, annotation.io.kubernetes.container.terminationMessagePolicy=File, annotation.io.kubernetes.pod.terminationGracePeriod=10, docker.cmd=docker run -it curl/curl:7.75.0 -s -L http://curl.haxx.se, image=sha256:26a9afb7027cca51ed4f7915474a04822a13e99fce2e1eecad3d43aab6199387, io.kubernetes.container.logpath=/var/log/pods/default_nginx1_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21/nginx/685.log, io.kubernetes.container.name=nginx, io.kubernetes.docker.type=container, io.kubernetes.pod.name=nginx1, io.kubernetes.pod.namespace=default, io.kubernetes.pod.uid=cc8a9cfb-872c-44ba-9899-b4c8bbc93a21, io.kubernetes.sandbox.id=e93a3ae70771ca0e4954fcb6ecf0ffd091eebfc64bcb3cbf461c94eb5474c9aa, name=k8s_nginx_nginx1_default_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21_685, se.haxx.curl=curl, se.haxx.curl.description=network utility, se.haxx.curl.release_tag=curl-7_75_0, se.haxx.curl.version=7_75_0) 2022-01-17T12:29:19.448410928+08:00 container create 21c6aa12859cf40f78c0a80f6ef4b782e86b86a84a23fefb860b73cfed55cf31 (Maintainer=James Fuller , Name=curl, Version=1.0.0, annotation.io.kubernetes.container.hash=bef672e5, annotation.io.kubernetes.container.restartCount=687, annotation.io.kubernetes.container.terminationMessagePath=/dev/termination-log, annotation.io.kubernetes.container.terminationMessagePolicy=File, annotation.io.kubernetes.pod.terminationGracePeriod=10, docker.cmd=docker run -it curl/curl:7.75.0 -s -L http://curl.haxx.se, image=sha256:26a9afb7027cca51ed4f7915474a04822a13e99fce2e1eecad3d43aab6199387, io.kubernetes.container.logpath=/var/log/pods/default_nginx1_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21/nginx/687.log, io.kubernetes.container.name=nginx, 
io.kubernetes.docker.type=container, io.kubernetes.pod.name=nginx1, io.kubernetes.pod.namespace=default, io.kubernetes.pod.uid=cc8a9cfb-872c-44ba-9899-b4c8bbc93a21, io.kubernetes.sandbox.id=e93a3ae70771ca0e4954fcb6ecf0ffd091eebfc64bcb3cbf461c94eb5474c9aa, name=k8s_nginx_nginx1_default_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21_687, se.haxx.curl=curl, se.haxx.curl.description=network utility, se.haxx.curl.release_tag=curl-7_75_0, se.haxx.curl.version=7_75_0) ``` ### 1.2 docker image相关 | 命令 | 解释 | | ------- | --------------------------------------------- | | pull | 从某个registry拉取镜像或者仓库 | | history | 展示镜像历史信息 | | export | 打包一个容器文件系统到tar文件 | | build | 从一个Dockerfile构建镜像 | | commit | 从一个容器的修改创建一个新的镜像 | | images | 展示镜像列表 | | import | 用tar文件导入并创建镜像文件 | | load | 从tar文件或者标准输入载入镜像 | | login | 登录Docker registry | | logout | 从Docker registry退出 | | save | 打包一个或多个镜像到tar文件(默认是到标准输出) | | rmi | 移除一个或多个镜像 | | version | 显示Docker版本信息 | | tag | 标记一个镜像到仓库 | 补充说明 #### 1-虚悬镜像 镜像列表中,还可以看到一个特殊的镜像,这个镜像既没有仓库名,也没有标签,均 为 ``` 00285df0df87 5 days ago 342 MB ``` 这个镜像原本是有镜像名和标签的,原来为 mongo:3.2 ,随着官方镜像维护,发布了新版本 后,重新 docker pull mongo:3.2 时, mongo:3.2 这个镜像名被转移 到了新下载的镜像身 上,而旧的镜像上的这个名称则被取消,从而成为了虚悬镜像。除了 docker pull 可能导致 这种情况, docker build 也同样可以导致这种现 象。由于新旧镜像同名,旧镜像名称被取 消,从而出现仓库名、标签均为 的镜像。 这类无标签镜像也被称为 虚悬镜像 (dangling image) ,可以用下面的命令专门显示这类镜像: ``` $ docker image ls -f dangling=true REPOSITORY TAG IMAGE ID CREATED SIZE 00285df0df87 5 days ago 342 MB ``` 一般来说,虚悬镜像已经失去了存在的价值,是可以随意删除的,可以用下面的命令删除。 ``` $ docker image prune ``` #### 2-docker image ls 格式化展示 不加任何参数的情况下, docker image ls 会列出所有顶级镜像,但是有时候我们只希望列出 部分镜像。 docker image ls 有好几个参数可以帮助做到这个事情。 根据仓库名列出镜像 ``` $ docker image ls ubuntu REPOSITORY TAG IMAGE ID CREATED SIZE ubuntu 16.04 f753707788c5 4 weeks ago 127 MB ubuntu latest f753707788c5 4 weeks ago 127 MB ubuntu 14.04 1e0c3dd64ccd 4 weeks ago 188 MB ``` 列出特定的某个镜像,也就是说指定仓库名和标签 ``` docker image ls ubuntu:16.04 REPOSITORY TAG IMAGE ID CREATED SIZE ubuntu 16.04 f753707788c5 4 weeks ago 127 MB ``` 除此以外, docker image ls 还支持强大的过滤器参数 --filter ,或者简写 -f 。之前我们 已经看到了使用过滤器来列出虚悬镜像的用法,它还有更多的用法。比如,我们希望看到在 mongo:3.2 之后建立的镜像,可以用下面的命令: ``` docker image ls -f since=mongo:3.2 REPOSITORY TAG IMAGE ID CREATED SIZE redis latest 5f515359c7f8 5 days ago 183 MB nginx latest 05a60462f8ba 5 days ago 181 MB ``` 想查看某个位置之前的镜像也可以,只需要把 since 换成 before 即可。 此外,如果镜像构建时,定义了 LABEL ,还可以通过 LABEL 来过滤。 ``` $ docker image ls -f label=com.example.version=0.1 ``` **以特定格式显示** 默认情况下, docker image ls 会输出一个完整的表格,但是我们并非所有时候都会需要这些 内容。比如,刚才删除虚悬镜像的时候,我们需要利用 docker image ls 把所有的虚悬镜像 的 ID 列出来,然后才可以交给 docker image rm 命令作为参数来删除指定的这些镜像,这个 时候就用到了 -q 参数。 ``` $ docker image ls -q //展示所有镜像的id 5f515359c7f8 05a60462f8ba fe9198c04d62 00285df0df87 f753707788c5 f753707788c5 1e0c3dd64ccd ``` --filter 配合 -q 产生出指定范围的 ID 列表,然后送给另一个 docker 命令作为参数,从 而针对这组实体成批的进行某种操作的做法在 Docker 命令行使用过程中非常常见,不仅仅是 镜像,将来我们会在各个命令中看到这类搭配以完成很强大的功能。因此每次在文档看到过 滤器后,可以多注意一下它们的用法。 另外一些时候,我们可能只是对表格的结构不满意,希望自己组织列;或者不希望有标题, 这样方便其它程序解析结果等,这就用到了 Go 的模板语法。 比如,下面的命令会直接列出镜像结果,并且只包含镜像ID和仓库名: ``` $ docker image ls --format "{{.ID}}: {{.Repository}}" 5f515359c7f8: redis 05a60462f8ba: nginx fe9198c04d62: mongo 00285df0df87: f753707788c5: ubuntu f753707788c5: ubuntu 1e0c3dd64ccd: ubuntu ``` 或者打算以表格等距显示,并且有标题行,和默认一样,不过自己定义列: ``` $ docker image ls --format "table {{.ID}}\t{{.Repository}}\t{{.Tag}}" IMAGE ID REPOSITORY TAG 5f515359c7f8 redis latest 05a60462f8ba nginx latest fe9198c04d62 mongo 3.2 00285df0df87 f753707788c5 ubuntu 16.04 f753707788c5 ubuntu latest 1e0c3dd64ccd ubuntu 14.04 ``` #### 3-Untagged 和 Deleted 如果观察上面这几个命令的运行输出信息的话,你会注意到删除行为分为两类,一类是 
Untagged ,另一类是 Deleted 。 我们之前介绍过,镜像的唯一标识是其 ID 和摘要,而一个 镜像可以有多个标签。 因此当我们使用上面命令删除镜像的时候,实际上是在要求删除某个标签的镜像。所以首先 需要做的是将满足我们要求的所有镜像标签都取消,这就是我们看到的 Untagged 的信息。 因为一个镜像可以对应多个标签,因此当我们删除了所指定的标签后,可能还有别的标签指 向了这个镜像,如果是这种情况,那么 Delete 行为就不会发生。所以并非所有的 docker rmi 都会产生删除镜像的行为,有可能仅仅是取消了某个标签而已。 当该镜像所有的标签都被取消了,该镜像很可能会失去了存在的意义,因此会触发删除行 为。镜像是多层存储结构,因此在删除的时候也是从上层向基础层方向依次进行判断删除。 镜像的多层结构让镜像复用变动非常容易,因此很有可能某个其它镜像正依赖于当前镜像的 某一层。这种情况,依旧不会触发删除该层的行为。直到没有任何层依赖当前层时,才会真 实的删除当前层。这就是为什么,有时候会奇怪,为什么明明没有别的标签指向这个镜像, 但是它还是存在的原因,也是为什么有时候会发现所删除的层数和自己 docker pull 看到的 层数不一样的源。 除了镜像依赖以外,还需要注意的是容器对镜像的依赖。如果有用这个镜像启动的容器存在 (即使容器没有运行),那么同样不可以删除这个镜像。之前讲过,容器是以镜像为基础, 再加一层容器存储层,组成这样的多层存储结构去运行的。因此该镜像如果被这个容器所依 赖的,那么删除必然会导致故障。如果这些容器是不需要的,应该先将它们删除,然后再来 删除镜像。 ### 1.3 docke container相关 | 命令 | 解释 | | ------- | ----------------------------------------------- | | attach | 附加到一个运行的容器 | | cp | 在容器与本地文件系统之间复制文件/文件夹 | | create | 创建新的容器 | | diff | 检阅一个容器文件系统的修改 | | exec | 在运行的容器内执行命令 | | inspect | 展示一个容器/镜像或者任务的底层信息 | | kill | 终止一个或者多个运行中的容器 | | logs | 获取容器的日志 | | network | 管理Docker网络 | | node | 管理Docker Swarm节点 | | pause | 暂停一个或者多个容器的所有进程 | | port | 管理容器的端口映射 | | ps | 展示容器列表 | | rename | 重命名容器 | | restart | 重启容器 | | rm | 移除一个或多个容器 | | run | 运行一个新的容器 | | search | 在Docker Hub搜索镜像 | | service | 管理Docker services(和k8s svc 咋看起来差不多) | | top | 展示容器运行进程(方便查看container对应的Pid) | | unpause | 解除暂停一个或多个容器的所有进程 | | swarm | 管理Docker Swarm | | stop | 停止一个或多个运行容器 | | stats | 获取容器的实时资源使用统计 | | update | 更新一个或多个容器的配置 | | volume | 管理Docker volumes | | wait | 阻塞直到容器停止,然后打印退出代码 | | start | 启动一个或者多个容器 | 补充说明 #### 1-docker diff ``` root@k8s-node:~# docker diff 3596feb5ce62 C /run A /run/secrets A /run/secrets/kubernetes.io A /run/secrets/kubernetes.io/serviceaccount ``` A代表新增文件 C代表修改过的文件 D代表被删除的文件 #### 2-docker top 快速查看containerid 对应的pid ``` root@k8s-node:~# docker top c3a457fe7cc5 UID PID PPID C STIME TTY TIME CMD root 3709 3692 0 2021 ? 
00:20:16 /opt/bin/flanneld --ip-masq --kube-subnet-mgr ``` #### 3-docker attach Docker attach可以attach到一个已经运行的容器的stdin,然后进行命令执行的动作。 但是需要注意的是,如果从这个stdin中exit,会导致容器的停止。 (docker exec则不会) ``` root@k8s-master:~# docker run -d nginx:latest a66f0b29a030b4b0fbe9128faaa373b995526ea1cb8ca714db7e3b3dc821d09d root@k8s-master:~# root@k8s-master:~# docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES a66f0b29a030 nginx:latest "/docker-entrypoint.…" 5 seconds ago Up 4 seconds 80/tcp eager_cannon root@k8s-master:~# root@k8s-master:~# docker attach a66f0b29a030b4b0fbe9128faaa373b995526ea1cb8ca714db7e3b3dc821d09d ls /bahs exit ^Z^Z ^C2021/12/14 13:28:15 [notice] 1#1: signal 2 (SIGINT) received, exiting 2021/12/14 13:28:15 [notice] 32#32: exiting 2021/12/14 13:28:15 [notice] 31#31: exiting 2021/12/14 13:28:15 [notice] 31#31: exit 2021/12/14 13:28:15 [notice] 32#32: exit 2021/12/14 13:28:15 [notice] 1#1: signal 17 (SIGCHLD) received from 31 2021/12/14 13:28:15 [notice] 1#1: worker process 31 exited with code 0 2021/12/14 13:28:15 [notice] 1#1: worker process 32 exited with code 0 2021/12/14 13:28:15 [notice] 1#1: exit ^Zroot@k8s-master:~# root@k8s-master:~# docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS ``` #### 4-docker logs -f containerId ``` root@k8s-master:~# docker logs -f f051884c5784 /docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration /docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/ /docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh 10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf 10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf /docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh /docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh /docker-entrypoint.sh: Configuration complete; ready for start up 2021/12/12 08:51:21 [notice] 1#1: using the "epoll" event method 2021/12/12 08:51:21 [notice] 1#1: nginx/1.21.4 2021/12/12 08:51:21 [notice] 1#1: built by gcc 10.2.1 20210110 (Debian 10.2.1-6) 2021/12/12 08:51:21 [notice] 1#1: OS: Linux 4.19.0-17-amd64 2021/12/12 08:51:21 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576 2021/12/12 08:51:21 [notice] 1#1: start worker processes 2021/12/12 08:51:21 [notice] 1#1: start worker process 31 2021/12/12 08:51:21 [notice] 1#1: start worker process 32 ``` ## 2. docker api 在Docker生态系统中一共有3种API 。 (1)Registry API:提供了与来存储Docker镜像的Docker Registry集成 的功能。 (2)Docker Hub API:提供了与Docker Hub 集成的功能。 (3)Docker Remote API:提供与Docker守护进程进行集成的功能。 这里主要熟悉一下第三种,docker remote Api
### 2.1 Unix domain socket介绍 **Unix domain socket 又叫 IPC(inter-process communication 进程间通信) socket,用于实现同一主机上的进程间通信。**socket 原本是为网络通讯设计的,但后来在 socket 的框架上发展出一种 IPC 机制,就是 UNIX domain socket。虽然网络 socket 也可用于同一台主机的进程间通讯(通过 loopback 地址 127.0.0.1),但是 UNIX domain socket 用于 IPC 更有效率:不需要经过网络协议栈,不需要打包拆包、计算校验和、维护序号和应答等,只是将应用层数据从一个进程拷贝到另一个进程。这是因为,IPC 机制本质上是可靠的通讯,而网络协议是为不可靠的通讯设计的。 UNIX domain socket 是全双工的,API 接口语义丰富,相比其它 IPC 机制有明显的优越性,目前已成为使用最广泛的 IPC 机制,比如 X Window 服务器和 GUI 程序之间就是通过 UNIX domain socket 通讯的。Unix domain socket 是 POSIX 标准中的一个组件,所以不要被名字迷惑,linux 系统也是支持它的。 了解Docker的同学应该知道Docker daemon监听一个docker.sock文件,这个docker.sock文件的默认路径是`/var/run/docker.sock`,这个Socket就是一个Unix domain socket。 ### 2.2 如何通过 unix socket 使用docker 例如:参考这个, 查看所有的container信息 https://docs.docker.com/engine/api/sdk/examples/ ``` root@k8s-dnode:~# curl --unix-socket /var/run/docker.sock http://127.0.0.1/v1.40/containers/json [{"Id":"64a14bf3626b576f9fd7dd56555d0e091f770eb31926d48211dd604874805f92","Names":["/k8s_container-0_nginx-78f97d8d6d-8vtw8_default_dcf8f5c4-315b-4e43-a623-dc8842f36d36_0"],"Image":"nginx@sha256:9522864dd661dcadfd9958f9e0de192a1fdda2c162a35668ab6ac42b465f0603","ImageID":"sha256:f652ca386ed135a4cbe356333e08ef0816f81b2ac8d0619af01e2b256837ed3e","Command":"/docker-entrypoint.sh nginx -g 'daemon off;'","Created":1639901315,"Ports":[],"Labels":{"annotation.io.kubernetes.container.hash":"a36242a4","annotation.io.kubernetes.container.restartCount":"0","annotation.io.kubernetes.container.terminationMessagePath":"/dev/termination-log","annotation.io.kubernetes.container.terminationMessagePolicy":"File","annotation.io.kubernetes.pod.terminationGracePeriod":"30","io.kubernetes.container.logpath":"/var/log/pods/default_nginx-78f97d8d6d-8vtw8_dcf8f5c4-315b-4e43-a623-dc8842f36d36/container-0/0.log","io.kubernetes.container.name":"container-0","io.kubernetes.docker.type":"container","io.kubernetes.pod.name":"nginx-78f97d8d6d-8vtw8","io.kubernetes.pod.namespace":"default","io.kubernetes.pod.uid":"dcf8f5c4-315b-4e43-a623-dc8842f36d36","io.kubernetes.sandbox.id":"d35b5a6084bc009340f77a3594a7891c794bad76d2e80c8eafa4e0c95cd772cd","maintainer":"NGINX Docker Maintainers "},"State":"running","Status":"Up 17 minutes","HostConfig":{"NetworkMode":"container:d35b5a6084bc009340f77a3594a7891c794bad76d2e80c8eafa4e0c95cd772cd"},"NetworkSettings":{"Networks":{}},"Mounts":[{"Type":"bind","Source":"/var/lib/kubelet/pods/dcf8f5c4-315b-4e43-a623-dc8842f36d36/etc-hosts","Destination":"/etc/hosts","Mode":"","RW":true,"Propagation":"rprivate"},{"Type":"bind","Source":"/var/lib/kubelet/pods/dcf8f5c4-315b-4e43-a623-dc8842f36d36/volumes/kubernetes.io~secret/default-token-f8snr","Destination":"/var/run/secrets/kubernetes.io/serviceaccount","Mode":"ro","RW":false,"Propagation":"rprivate"},{"Type":"bind","Source":"/var/lib/kubelet/pods/dcf8f5c4-315b-4e ``` ### 2.3 如何通过restful api 使用docker 这里需要先将unix socker 和 tcp:port 绑定。操作如下: ``` root@k8s-master:~# cat /usr/lib/systemd/system/docker.service [Unit] Description=Docker Application Container Engine Documentation=https://docs.docker.com After=network-online.target firewalld.service Wants=network-online.target [Service] Type=notify ExecStart=/usr/bin/dockerd -H tcp://0.0.0.0:2375 //之前是 ExecStart=/usr/bin/dockerd ExecReload=/bin/kill -s HUP LimitNOFILE=infinity LimitNPROC=infinity LimitCORE=infinity TimeoutStartSec=0 Delegate=yes KillMode=process Restart=on-failure StartLimitBurst=3 StartLimitInterval=60s [Install] WantedBy=multi-user.target ```
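重启 docker 之后,除了用下面的 `docker -H` 验证以外,任何 HTTP 客户端都可以直接访问 Remote API。下面是一个极简的 Go 草图,分别对应 2.2 节的 unix socket 方式和 2.3 节的 tcp 方式(/var/run/docker.sock、192.168.0.4:2375、v1.40 均沿用本文前面的示例,仅作示意):

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
)

func main() {
	// 方式一:走 unix domain socket,等价于 curl --unix-socket /var/run/docker.sock
	unixClient := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
			},
		},
	}
	// URL 里的 host 随意写,真正的连接由上面的 DialContext 决定
	resp, err := unixClient.Get("http://docker/v1.40/containers/json")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))

	// 方式二:dockerd 绑定 tcp 之后,直接发 HTTP 请求即可
	resp2, err := http.Get("http://192.168.0.4:2375/v1.40/containers/json")
	if err != nil {
		panic(err)
	}
	defer resp2.Body.Close()
	body2, _ := io.ReadAll(resp2.Body)
	fmt.Println(string(body2))
}
```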
``` root@k8s-master:~# sudo docker -H 192.168.0.4:2375 info Client: Debug Mode: false Server: Containers: 13 Running: 0 Paused: 0 Stopped: 13 Images: 6 Server Version: 19.03.9 Storage Driver: overlay2 Backing Filesystem: extfs Supports d_type: true Native Overlay Diff: true Logging Driver: json-file Cgroup Driver: cgroupfs Plugins: Volume: local Network: bridge host ipvlan macvlan null overlay Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog Swarm: inactive ... ``` ## 3. 参考 [docker api](https://docs.docker.com/engine/api/v1.22/?spm=a2c6h.12873639.0.0.481a90afUqk0rt#2-endpoints) [手撕Linux Socket——Socket原理与实践分析](https://zhuanlan.zhihu.com/p/234806787) ================================================ FILE: docker/8. docker核心组件介绍.md ================================================ * [0\. 章节目的](#0-章节目的) * [1\.docker 组件介绍](#1docker-组件介绍) * [2\. docker 组件分析](#2-docker-组件分析) * [2\.1 docker](#21-docker) * [2\.2 docker proxy](#22-docker-proxy) * [2\.2 docker\-init](#22-docker-init) * [2\.4 runc](#24-runc) * [2\.5 dockerd](#25-dockerd) * [2\.6 containerd](#26-containerd) * [2\.7 containerd\-shim](#27-containerd-shim) * [2\.8 ctr](#28-ctr) * [2\.9 组件总结](#29-组件总结) * [3\. 进程关系](#3-进程关系) * [4\. docker为什么是这种结构](#4-docker为什么是这种结构) * [5\. 参考文档](#5-参考文档) ### 0. 章节目的 本节的目的就是为了弄清楚: (1)docker各组件有什么功能 (2)通过docker运行的容器,进程关系是什么样子,为什么会这样 ### 1.docker 组件介绍 二进制安装docker的时候,可以发现,docker由以下的组件组成。 ``` root@k8s-node:~# tar zxvf docker-19.03.9.tgz docker/ docker/docker-init docker/runc docker/docker docker/docker-proxy docker/containerd docker/ctr docker/dockerd docker/containerd-shim ```
### 2. docker 组件分析 #### 2.1 docker docker 是 Docker 客户端的一个完整实现,它是一个二进制文件,对用户可见的操作形式为 docker 命令,通过 docker 命令可以完成所有的 Docker 客户端与服务端的通信。 Docker 客户端与服务端的交互过程是:docker 组件向服务端发送请求后,服务端根据请求执行具体的动作并将结果返回给 docker,docker 解析服务端的返回结果,并将结果通过命令行标准输出展示给用户。这样一次完整的客户端服务端请求就完成了。 例如常见的命令 docker run/ps 等等
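docker 命令背后的客户端逻辑也被封装成了官方的 Go SDK(moby/moby 仓库里的 client 包),程序可以不经过 docker 命令行,直接向 dockerd 发送请求。下面是一个极简草图,大致等价于 `docker ps -a`(默认连接 /var/run/docker.sock,仅作演示):

```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	// FromEnv 默认使用 unix:///var/run/docker.sock,与 docker 命令行一致
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	// 等价于 docker ps -a
	containers, err := cli.ContainerList(context.Background(), types.ContainerListOptions{All: true})
	if err != nil {
		panic(err)
	}
	for _, c := range containers {
		fmt.Printf("%s  %s  %s\n", c.ID[:12], c.Image, c.Status)
	}
}
```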
#### 2.2 docker proxy docker-proxy 主要是用来做端口映射的。当我们使用 docker run 命令启动容器时,如果使用了 -p 参数,docker-proxy 组件就会把容器内相应的端口映射到主机上来,底层是依赖于 iptables 实现的。 ``` root@cld-dnode1-1051:/usr/bin# docker run --name=nginx -d -p 8080:80 nginx root@cld-dnode1-1051:/usr/bin# docker inspect --format '{{ .NetworkSettings.IPAddress }}' nginx 172.17.0.2 // 会多一个docker-proxy的进程 root@cld-dnode1-1051:/usr/bin# ps aux |grep docker-proxy root 1983163 0.0 0.0 105912 4252 ? Sl 15:42 0:00 /bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 8080 -container-ip 172.17.0.2 -container-port 80 root 1985160 0.0 0.0 13544 2600 pts/1 S+ 15:43 0:00 grep docker-proxy ``` #### 2.2 docker-init 在执行 docker run 启动容器时可以添加 --init 参数,此时 Docker 会使用 docker-init 作为1号进程,帮你管理容器内子进程,例如回收僵尸进程等。 ``` root@cld-dnode1-1051:/usr/bin# ls docker* docker dockerd dockerd-ce docker-init docker-proxy root@cld-dnode1-1051:/usr/bin# root@cld-dnode1-1051:/usr/bin# docker-init version [WARN tini (1973230)] Tini is not running as PID 1 and isn't registered as a child subreaper. Zombie processes will not be re-parented to Tini, so zombie reaping won't work. To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1. [FATAL tini (1973231)] exec version failed: No such file or directory root@cld-dnode1-1051:/usr/bin# docker run -it busybox sh / # ps aux PID USER TIME COMMAND 1 root 0:00 sh 6 root 0:00 ps aux // 容器里面的init就是docker-init(看起来就是tini) root@cld-dnode1-1051:/usr/bin# docker run -it --init busybox sh / # ps aux PID USER TIME COMMAND 1 root 0:00 /dev/init -- sh 6 root 0:00 sh 7 root 0:00 ps aux / # /dev/init version [WARN tini (8)] Tini is not running as PID 1 and isn't registered as a child subreaper. Zombie processes will not be re-parented to Tini, so zombie reaping won't work. To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1. [FATAL tini (9)] exec version failed: No such file or directory ``` #### 2.4 runc runc 是一个标准的 OCI 容器运行时的实现,它是一个命令行工具,可以直接用来创建和运行容器。接下来直接进行演示: (1) 准备容器运行时文件。可以看出来这里时和docker啥的都没有关系,都是一堆的基础目录和文件 ``` root@cld-dnode1-1051:/ cd /root root@cld-dnode1-1051:/ mkdir runc root@cld-dnode1-1051:/ mkdir rootfs && docker export $(docker create busybox) | tar -C rootfs -xvf - root@cld-dnode1-1051:/home/zouxiang/runc# tree -L 2 . 
└── rootfs ├── bin ├── dev ├── etc ├── home ├── proc ├── root ├── sys ├── tmp ├── usr └── var ``` (2)准备config文件 使用 runc spec 命令根据文件系统生成对应的 config.json 文件。 在config.json里指定了容器运行的args,env等等。 ``` root@cld-dnode1-1051:/home/zouxiang/runc# runc spec root@cld-dnode1-1051:/home/zouxiang/runc# cat config.json { "ociVersion": "1.0.1-dev", "process": { "terminal": true, "user": { "uid": 0, "gid": 0 }, "args": [ "sh" ], "env": [ "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "TERM=xterm" ], "cwd": "/", "capabilities": { "bounding": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ], "effective": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ], "inheritable": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ], "permitted": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ], "ambient": [ "CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE" ] }, "rlimits": [ { "type": "RLIMIT_NOFILE", "hard": 1024, "soft": 1024 } ], "noNewPrivileges": true }, "root": { "path": "rootfs", "readonly": true }, "hostname": "runc", "mounts": [ { "destination": "/proc", "type": "proc", "source": "proc" }, { "destination": "/dev", "type": "tmpfs", "source": "tmpfs", "options": [ "nosuid", "strictatime", "mode=755", "size=65536k" ] }, { "destination": "/dev/pts", "type": "devpts", "source": "devpts", "options": [ "nosuid", "noexec", "newinstance", "ptmxmode=0666", "mode=0620", "gid=5" ] }, { "destination": "/dev/shm", "type": "tmpfs", "source": "shm", "options": [ "nosuid", "noexec", "nodev", "mode=1777", "size=65536k" ] }, { "destination": "/dev/mqueue", "type": "mqueue", "source": "mqueue", "options": [ "nosuid", "noexec", "nodev" ] }, { "destination": "/sys", "type": "sysfs", "source": "sysfs", "options": [ "nosuid", "noexec", "nodev", "ro" ] }, { "destination": "/sys/fs/cgroup", "type": "cgroup", "source": "cgroup", "options": [ "nosuid", "noexec", "nodev", "relatime", "ro" ] } ], "linux": { "resources": { "devices": [ { "allow": false, "access": "rwm" } ] }, "namespaces": [ { "type": "pid" }, { "type": "network" }, { "type": "ipc" }, { "type": "uts" }, { "type": "mount" } ], "maskedPaths": [ "/proc/acpi", "/proc/asound", "/proc/kcore", "/proc/keys", "/proc/latency_stats", "/proc/timer_list", "/proc/timer_stats", "/proc/sched_debug", "/sys/firmware", "/proc/scsi" ], "readonlyPaths": [ "/proc/bus", "/proc/fs", "/proc/irq", "/proc/sys", "/proc/sysrq-trigger" ] } } 注意 config.json 和rootfs是同一级 root@cld-dnode1-1051:/home/zouxiang/runc# tree -L 1 . 
├── config.json └── rootfs ``` (3)运行容器 ``` root@cld-dnode1-1051:/home/zouxiang/runc# runc run container1 / # ps aux PID USER TIME COMMAND 1 root 0:00 sh 7 root 0:00 ps aux 另一个窗口就能看到 root@cld-dnode1-1051:/home/zouxiang# runc list ID PID STATUS BUNDLE CREATED OWNER container1 2040317 running /home/zouxiang/runc 2022-01-26T08:14:57.916602955Z root ``` #### 2.5 dockerd dockerd 是 Docker 服务端的后台常驻进程,用来接收客户端发送的请求,执行具体的处理任务,处理完成后将结果返回给客户端。 docker run/ps 是客户端。dockerd是服务器端。但是dockerd不是真正干活的,正在干活的是containerd。 #### 2.6 containerd containerd 组件是从 Docker 1.11 版本正式从 dockerd 中剥离出来的,它的诞生完全遵循 OCI 标准,是容器标准化后的产物。containerd 完全遵循了 OCI 标准,并且是完全社区化运营的,因此被容器界广泛采用。 containerd 不仅负责容器生命周期的管理,同时还负责一些其他的功能: - 镜像的管理,例如容器运行前从镜像仓库拉取镜像到本地; - 接收 dockerd 的请求,通过适当的参数调用 runc 启动容器; - 管理存储相关资源; - 管理网络相关资源。 containerd 包含一个后台常驻进程,默认的 socket 路径为 /run/containerd/containerd.sock,dockerd 通过 UNIX 套接字向 containerd 发送请求,containerd 接收到请求后负责执行相关的动作并把执行结果返回给 dockerd。 如果你不想使用 dockerd,也可以直接使用 containerd 来管理容器,由于 containerd 更加简单和轻量,生产环境中越来越多的人开始直接使用 containerd 来管理容器。 #### 2.7 **containerd-shim** containerd-shim 的意思是垫片,类似于拧螺丝时夹在螺丝和螺母之间的垫片。containerd-shim 的主要作用是将 containerd 和真正的容器进程解耦,使用 containerd-shim 作为容器进程的父进程,从而实现重启 containerd 不影响已经启动的容器进程。 ``` root@cld-dnode1-1051:/usr/bin# containerd-shim -h Usage of containerd-shim: -address string grpc address back to main containerd -containerd-binary containerd publish path to containerd binary (used for containerd publish) (default "containerd") -criu string path to criu binary -debug enable debug output in logs -namespace string namespace that owns the shim -runtime-root string root directory for the runtime (default "/run/containerd/runc") -socket string abstract socket path to serve -systemd-cgroup set runtime to use systemd-cgroup -workdir string path used to storge large temporary data ``` #### 2.8 ctr ctr 实际上是 containerd-ctr,它是 containerd 的客户端,主要用来开发和调试,在没有 dockerd 的环境中,ctr 可以充当 docker 客户端的部分角色,直接向 containerd 守护进程发送操作容器的请求。 ``` root@cld-dnode1-1051:/usr/bin# ctr -h NAME: ctr - __ _____/ /______ / ___/ __/ ___/ / /__/ /_/ / \___/\__/_/ containerd CLI USAGE: ctr [global options] command [command options] [arguments...] VERSION: 1.2.13 COMMANDS: plugins, plugin provides information about containerd plugins version print the client and server versions containers, c, container manage containers content manage content events, event display containerd events images, image, i manage images leases manage leases namespaces, namespace manage namespaces pprof provide golang pprof outputs for containerd run run a container snapshots, snapshot manage snapshots tasks, t, task manage tasks install install a new package shim interact with a shim directly cri interact with cri plugin help, h Shows a list of commands or help for one command GLOBAL OPTIONS: --debug enable debug output in logs --address value, -a value address for containerd's GRPC server (default: "/run/containerd/containerd.sock") --timeout value total timeout for ctr commands (default: 0s) --connect-timeout value timeout for connecting to containerd (default: 0s) --namespace value, -n value namespace to use with commands (default: "default") [$CONTAINERD_NAMESPACE] --help, -h show help --version, -v print the version ```
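ctr 只是 containerd 自带的调试客户端;如果想在程序里直接和 containerd 打交道(不经过 dockerd),也可以使用 containerd 提供的 Go 客户端库。下面是一个极简草图,列出 moby 这个 namespace(即 docker 托管的容器)下的容器,socket 路径按实际部署调整,仅作演示:

```go
package main

import (
	"context"
	"fmt"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	// socket 路径视部署方式而定:独立 containerd 默认是 /run/containerd/containerd.sock,
	// 由 dockerd 拉起的 containerd 则可能是 /var/run/docker/containerd/containerd.sock
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// docker 托管的容器都在 moby 这个 namespace 下(与 ctr -n moby 对应)
	ctx := namespaces.WithNamespace(context.Background(), "moby")
	containers, err := client.Containers(ctx)
	if err != nil {
		panic(err)
	}
	for _, c := range containers {
		fmt.Println(c.ID())
	}
}
```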
#### 2.9 组件总结

| 组件类别 | 组件名称 | 核心功能 |
| ------------------ | --------------- | ------------------------------------------------------------ |
| docker相关组件 | Docker | docker 的客户端,负责发送 docker 操作请求 |
| docker相关组件 | Dockerd | docker 服务端的入口,负责处理客户端请求 |
| docker相关组件 | Docker-init | 使用 docker-init 作为 1 号进程(业务 1 号进程往往没有回收僵尸进程的能力) |
| docker相关组件 | Docker-proxy | docker 端口映射的实现,通过操作 iptables 实现 |
| Containerd相关组件 | Containerd | 负责管理容器生命周期,接收 dockerd 的请求,执行启动或者销毁容器等操作 |
| Containerd相关组件 | Containerd-shim | 将真正运行的容器进程和 containerd 解耦,containerd-shim 作为容器进程的父进程 |
| Containerd相关组件 | Ctr | containerd 的客户端,可以直接向 containerd 发送容器操作的请求,主要用于开发和调试 |
| 容器运行时组件 | Runc | 通过 namespaces、cgroups 等内核机制,真正完成容器的创建和运行 |
### 3. 进程关系

查看进程树,可以发现各组件之间的进程关系为:

```
 docker            ctr
   |                |
   V                V
dockerd -> containerd ---> shim -> runc -> runc init -> process
                      |--> shim -> runc -> runc init -> process
                      +--> shim -> runc -> runc init -> process
```
```
root 3250772 ... /usr/bin/dockerd -p /var/run/docker.pid
root    2010 ... /usr/bin/containerd
root 3467567 ... containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/cabf53bfcd5f079159b8891520c2c2c0dee811568f7d0942b80dd8d12459ab06 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
```

dockerd、containerd 的父进程都是 1 号进程,从进程树来看,二者并没有直接的父子关系。

从这篇文章可以看出来,[docker进程模型,架构分析](https://segmentfault.com/a/1190000011294361),containerd 进程是由 docker 启动的。
### 4. docker为什么是这种结构 当 Kubelet 想要创建一个**容器**时, 有这么几步: 1. Kubelet 通过 **CRI 接口**(gRPC) 调用 dockershim, 请求创建一个容器. **CRI** 即容器运行时接口(Container Runtime Interface), 这一步中, Kubelet 可以视作一个简单的 CRI Client, 而 dockershim 就是接收请求的 Server. 目前 dockershim 的代码其实是内嵌在 Kubelet 中的, 所以接收调用的凑巧就是 Kubelet 进程; 2. dockershim 收到请求后, 转化成 Docker Daemon 能听懂的请求, 发到 Docker Daemon 上请求创建一个容器; 3. Docker Daemon 早在 1.12 版本中就已经将针对容器的操作移到另一个守护进程: containerd 中了, 因此 Docker Daemon 仍然不能帮我们创建容器, 而是要请求 containerd 创建一个容器; 4. containerd 收到请求后, 并不会自己直接去操作容器, 而是创建一个叫做 containerd-shim 的进程, 让 containerd-shim 去操作容器. 这是因为容器进程需要一个父进程来做诸如收集状态, 维持 stdin 等 fd 打开等工作. 而假如这个父进程就是 containerd, 那每次 containerd 挂掉或升级, 整个宿主机上所有的容器都得退出了. 而引入了 containerd-shim 就规避了这个问题(containerd 和 shim 并不是父子进程关系); 5. 我们知道创建容器需要做一些设置 namespaces 和 cgroups, 挂载 root filesystem 等等操作, 而这些事该怎么做已经有了公开的规范了, 那就是 [OCI(Open Container Initiative, 开放容器标准)](https://link.zhihu.com/?target=https%3A//github.com/opencontainers/runtime-spec). 它的一个参考实现叫做 [runc](https://link.zhihu.com/?target=https%3A//github.com/opencontainers/runc). 于是, containerd-shim 在这一步需要调用 `runc` 这个命令行工具, 来启动容器; 6. `runc` 启动完容器后本身会直接退出, containerd-shim 则会成为容器进程的父进程, 负责收集容器进程的状态, 上报给 containerd, 并在容器中 pid 为 1 的进程退出后接管容器中的子进程进行清理, 确保不会出现僵尸进程; ![image-20220226174613059](./image/struct-1.png) (1)为什么需要docker-shim? 因为k8s定义了CRI,这样可以和docker, rkt等容器运行时解藕。但是dockerd没有不支持CRI。虽有需要docker-shim进行一次转换。 (2)为什么需要containerd 其实 k8s 最开始的 Runtime 架构远没这么复杂: kubelet 想要创建容器直接跟 Docker Daemon 说一声就行, 而那时也不存在 containerd, Docker Daemon 自己调一下 `libcontainer` 这个库把容器跑起来, 整个过程就搞完了. 但是大佬们为了不让容器运行时标准被 Docker 一家公司控制, 于是就撺掇着搞了开放容器标准 OCI. Docker 则把 `libcontainer` 封装了一下, 变成 runC 捐献出来作为 OCI 的参考实现。 所以: libcontainer = runc containerd就变成了负责兼容的处理人,处理客户端请求。具体执行变成了runc (3)为什么需要containers-shim 主要作用是将 containerd 和真正的容器进程解耦,使用 containerd-shim 作为容器进程的父进程,从而实现重启 containerd 不影响已经启动的容器进程。 ### 5. 参考文档 [组件组成:剖析 Docker 组件作用及其底层工作原理](https://blog.csdn.net/qq_34556414/article/details/112247223) [系列好文 | Kubernetes 弃用 Docker,我们该何去何从?](http://blog.itpub.net/70002215/viewspace-2779207/) [docker进程模型,架构分析](https://segmentfault.com/a/1190000011294361) https://www.huweihuang.com/article/docker/code-analysis/code-analysis-of-docker-server/ [白话 Kubernetes Runtime](https://zhuanlan.zhihu.com/p/58784095) [Docker源码分析](https://www.huweihuang.com/article/docker/code-analysis/code-analysis-of-docker-server/) [docker exec 失败问题排查之旅](https://xyz.uscwifi.xyz/post/DdS5a690E/) [kubectl exec 是怎么工作的](https://www.techclone.cn/post/tech/k8s/k8s-exec-failure/) ================================================ FILE: docker/9. docker问题链路排查实例.md ================================================ * [1\. 确定问题](#1-确定问题) * [2\. 开始排查](#2-开始排查) * [2\.1 排除是否是dockerd出现了问题](#21-排除是否是dockerd出现了问题) * [2\.2 排除是否是containerd出现了问题](#22-排除是否是containerd出现了问题) * [3\.参考文档](#3参考文档) 再熟悉docker核心组件的基础上,以docker exec ls 执行失败为例。提供思路:排查docker哪个组件出现了问题。 ### 1. 
确定问题 以exec容器里面执行 ls为例 ``` root@k8s-node:~# docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 490008a7de69 26a9afb7027c "sleep 3600" 19 minutes ago Up 19 minutes k8s_nginx_nginx1_default_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21_1266 e93a3ae70771 lizhenliang/pause-amd64:3.0 "/pause" 7 weeks ago Up 7 weeks k8s_POD_nginx1_default_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21_0 c3a457fe7cc5 e6ea68648f0c "/opt/bin/flanneld -…" 7 weeks ago Up 7 weeks k8s_kube-flannel_kube-flannel-ds-97qn4_kube-system_a2533c24-53f2-4dea-86df-4c61de17415f_0 6fb0829d3f0d lizhenliang/pause-amd64:3.0 "/pause" 7 weeks ago Up 7 weeks k8s_POD_kube-flannel-ds-97qn4_kube-system_a2533c24-53f2-4dea-86df-4c61de17415f_0 ``` 假设执行 `docker exec -it 490008a7de69 ls`出现了问题。 ``` root@k8s-node:~# docker exec -it 490008a7de69 ls bin etc mnt run tmp cacert.pem home opt sbin usr dev lib proc srv var entrypoint.sh media root sys ``` ### 2. 开始排查 先弄清kubelet-> docker的调用链路 ![image.png](./image/struct-1.png) 一般而言Kubelet->docker 是不是有问题很好排除。这里主要介绍当docker出现了问题,定位到时哪里出现了问题。 #### 2.1 排除是否是dockerd出现了问题 dockerd只是一个服务器端,它其实就是一个工具人,最终请求的都是转发到containerd进行处理的。 这里利用了一个工具就是ctr。 之前叫docker-containerd-ctr,安装docker的时候会自动安装这个。 这个工具就是用来调试的。 ctr 常见操作如下: 注意:-a 是 address的意思。这个一定要指定socket。这个可以 `ps -ef | grep socket` 找出来。 ``` 查看有哪些命名空间的容器(和Pod的ns不是一个东西) root@k8s-node:~# ctr -a /var/run/docker/containerd/containerd.sock namespaces ls NAME LABELS moby 查看moby ns下有哪些容器,这个其实就是对应的docker ps的容器 root@k8s-node:~# ctr -a /var/run/docker/containerd/containerd.sock -n moby containers ls CONTAINER IMAGE RUNTIME 6fb0829d3f0dae7f8e0328ef88748ed1c7bdb8d6783059461c790031232da19d - io.containerd.runtime.v1.linux 97a519dcd3d6622b9650af95450fbb2b9e6c4761c277c43dd9e7b0e9f74e703d - io.containerd.runtime.v1.linux c3a457fe7cc56185375ff67faa34a0141712c09f7b12f740f4fe4ebf18023984 - io.containerd.runtime.v1.linux e93a3ae70771ca0e4954fcb6ecf0ffd091eebfc64bcb3cbf461c94eb5474c9aa - io.containerd.runtime.v1.linux root@k8s-node:~# docker ps CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES 97a519dcd3d6 26a9afb7027c "sleep 3600" 12 minutes ago Up 12 minutes k8s_nginx_nginx1_default_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21_1267 e93a3ae70771 lizhenliang/pause-amd64:3.0 "/pause" 7 weeks ago Up 7 weeks k8s_POD_nginx1_default_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21_0 c3a457fe7cc5 e6ea68648f0c "/opt/bin/flanneld -…" 7 weeks ago Up 7 weeks k8s_kube-flannel_kube-flannel-ds-97qn4_kube-system_a2533c24-53f2-4dea-86df-4c61de17415f_0 6fb0829d3f0d lizhenliang/pause-amd64:3.0 "/pause" 7 weeks ago Up 7 weeks k8s_POD_kube-flannel-ds-97qn4_kube-system_a2533c24-53f2-4dea-86df-4c61de17415f_0 ``` **排查是不是dockerd出现了问题** * 如果docker exec ls有问题。但是ctr exec ls没有问题那就是dockerd有问题,因为ctr替代了dockerd发送了命令。 * 如果docker exec ls有问题,ctr exec ls出现了同样的问题。那dockerd没有问题,是后面某一层出现了问题。 下面的参数介绍: (可以结合ctr -h查看) -a address 指定socket -n namespaces 指定ns t tasks 表示要执行一下任务 exec 表示是exec类型的任务 --exec-id 表示任务Id,后面stupig1随便一个名字。aa/bb都可以 ``` root@k8s-node:~# ctr -a /var/run/docker/containerd/containerd.sock -n moby t exec --exec-id stupig1 97a519dcd3d6622b9650af95450fbb2b9e6c4761c277c43dd9e7b0e9f74e703d ls bin cacert.pem dev entrypoint.sh etc home lib media mnt opt proc root run sbin srv sys tmp usr var root@k8s-node:~# docker exec 97a519dcd3d6622b9650af95450fbb2b9e6c4761c277c43dd9e7b0e9f74e703d ls bin cacert.pem dev entrypoint.sh etc home lib media mnt opt proc root run sbin srv sys tmp usr var ``` #### 2.2 排除是否是containerd出现了问题 docker -> container -> runc 由于container包含了containerd+container-shim。这两个工具都不好排查。所以直接使用runc排查。 /var/run/docker/runtime-runc/moby/ 
是root目录,这个是containerd运行的时候指定的 ``` root@k8s-node:~/docker# runc --root /var/run/docker/runtime-runc/moby/ exec 97a519dcd3d6622b9650af95450fbb2b9e6c4761c277c43dd9e7b0e9f74e703d ls bin cacert.pem dev entrypoint.sh etc home lib media mnt opt proc root run sbin srv sys tmp usr var root@k8s-node:~/docker# root@k8s-node:~/docker# ps -ef |grep moby 查看root目录。这个是containerd运行的时候指定的 ``` 如果runc执行没有问题,那就是containerd有问题,否则就是runc有问题。 runc有问题的时候会打印log日志。或者直接debug模式查看具体过程。 如果是docker有问题,docker是有日志输出的 ``` root@k8s-node:~# runc --debug --root /var/run/docker/runtime-runc/moby/ exec d6cef7d7206d22873050d3c5b303b32d962803bb53ddb6c3386e5b1ead3cbf5d ls DEBU[0000] nsexec:601 nsexec started DEBU[0000] child process in init() DEBU[0000] logging has already been configured bin cacert.pem dev entrypoint.sh etc home lib media mnt opt proc root run sbin srv sys tmp usr var DEBU[0000] log pipe has been closed: EOF DEBU[0000] process exited pid=3901 status=0 ``` ### 3.参考文档 [containerd的本地CLI工具ctr使用](https://www.mdnice.com/writing/78929e9fe39442fbba982009faf371b1) [docker exec 失败问题排查之旅](https://plpan.github.io/docker-exec-%E5%A4%B1%E8%B4%A5%E9%97%AE%E9%A2%98%E6%8E%92%E6%9F%A5%E4%B9%8B%E6%97%85/) ================================================ FILE: docker/其他/补充-僵尸进程处理.md ================================================ ### 1. 背景 再使用容器时用户不当的使用,可能会造成了大量的僵尸进程没有回收,从而导致容器kill失败。kill 9 , kill 15 singal都没有反应。 因此针对这个问题,输出一个处理报告。该报告分为两个部分:用户如何预防僵尸进程的产生?如果确定产生了僵尸进程,我们如何解决?
### 2. 如何预防僵尸进程的产生 #### 2.1 僵尸进程的产生 在UNIX 系统中,**任何一个子进程(init除外)在exit()之后,并非马上就消失掉,而是留下一个称为僵尸进程(Zombie)的数据结构,等待父进程处理。**这是每个 子进程在结束时都要经过的阶段。如果子进程在exit()之后,父进程没有来得及处理,这时用ps命令就能看到子进程的状态是“Z”。如果父进程能及时 处理,可能用ps命令就来不及看到子进程的僵尸状态,但这并不等于子进程不经过僵尸状态。 如果父进程在子进程结束之前退出,则子进程将由init接管。init将会以父进程的身份对僵尸状态的子进程进行处理。
#### 2.2 如何回收僵尸进程 **核心点:** 让容器的1号进程可以回收僵尸进程 ##### 方法1 用户层次解决 1、父进程通过wait和waitpid等函数等待子进程结束,这会导致父进程挂起 2、如果父进程很忙,那么可以用signal函数为SIGCHLD安装handler,因为子进程结束后,父进程会收到该信号,可以在handler中调用wait回收 3、如果父进程不关心子进程什么时候结束,那么可以用signal(SIGCHLD, SIG_IGN) 通知内核,自己对子进程的结束不感兴趣,那么子进程结束后,内核会回收,并不再给父进程发送信号
##### 方法2 容器层次解决 在镜像中替换1号进程 某些时候,用户运作在容器中的1号进程没办法处理僵尸进程,这个时候就需要引入init进程,让init进程为1号进程。用户需要运行的进程为子进程。这样用户进程创造出来的僵尸进程在用户进程死掉之后,init进程可以回收。 目前常见的在镜像中加入 [tini](https://github.com/krallin/tini) 或 [dumb-init](https://github.com/Yelp/dumb-init) 实现,范例如下(详细建议阅读官方 guied): ``` ## 使用tini作为1号进程 # Add Tini ENV TINI_VERSION v0.18.0 ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini RUN chmod +x /tini ENTRYPOINT ["/tini", "--"] ## 或者使用dumb-init作为1号进程 # Run your program under Tini CMD ["/your/program", "-and", "-its", "arguments"] # or docker run your-image /your/program ... # Runs "/usr/bin/dumb-init -- /my/script --with --args" ENTRYPOINT ["/usr/bin/dumb-init", "--"] ## 用户需要执行的代码 CMD ["/my/script", "--with", "--args"] ``` **实验:** **构造用户示例代码,该代码会产生一个僵尸进程** ``` import os import subprocess pid = os.fork() if pid == 0: # child pid2 = os.fork() if pid2 != 0: # parent print('The zombie pid will be: {}'.format(pid2)) else: # parent os.waitpid(pid, 0) subprocess.check_call(('ps', 'xawuf')) ``` **对应的Dockerfile** ``` FROM python:3 COPY test.sh /root/ CMD ["/root/test.sh"] ``` **运行后的结果** 出现 ``` root@cld-dnode1-1091:/home/zouxiang/DockerFiles# docker run --rm zoux/tini:sh2 The zombie pid will be: 7 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 1 14.0 0.0 14140 11476 ? Ss 08:04 0:00 python3 /root/test.sh root 7 0.0 0.0 0 0 ? Z 08:04 0:00 [python3] root 8 0.0 0.0 9392 3000 ? R 08:04 0:00 ps xawuf ```
**对比:使用tini**

用户代码不需要改变。修改Dockerfile如下:

```
FROM python:3
# 增加 tini,将其作为 1 号进程
ADD tini /
# 如果构建上下文中的 tini 没有可执行权限,需要加上这一步
RUN chmod +x /tini
ENTRYPOINT ["/tini", "--"]
COPY test.sh /root/
CMD ["/root/test.sh"]
```
**运行后的结果:** **8号僵尸进程已经被回收**

```
root@cld-dnode1-1091:/home/zouxiang/DockerFiles# docker run --rm zoux/tini:sh3
The zombie pid will be: 8
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 2280 756 ? Ss 08:04 0:00 /tini -- /root/test.sh /root/test.sh
root 6 0.0 0.0 14140 11580 ? S 08:04 0:00 python3 /root/test.sh /root/test.sh
root 9 0.0 0.0 9392 3044 ? R 08:04 0:00 \_ ps xawuf
```

除了 tini 或者 dumb-init 外,用户也可以自己定制 init 进程,例如:https://github.com/fpco/pid1
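如果想自己定制 init 进程,核心逻辑其实就是:作为 1 号进程启动真正的业务子进程,同时在收到 SIGCHLD 时循环调用 wait,把所有已退出的子进程(包括被托孤过来的僵尸进程)回收掉。下面是一个极简的 Go 草图,仅示意原理,其中的启动方式(`/my-init 业务程序 参数...`)是假设的用法,生产环境建议直接使用 tini/dumb-init:

```go
package main

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	// 真正要运行的业务进程由命令行参数传入,例如:/my-init /your/program args...
	cmd := exec.Command(os.Args[1], os.Args[2:]...)
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGCHLD, syscall.SIGTERM, syscall.SIGINT)

	for sig := range sigs {
		switch sig {
		case syscall.SIGCHLD:
			// 循环 wait,把所有已退出的子进程(含僵尸)都回收掉
			for {
				var ws syscall.WaitStatus
				pid, err := syscall.Wait4(-1, &ws, syscall.WNOHANG, nil)
				if pid <= 0 || err != nil {
					break
				}
				if pid == cmd.Process.Pid {
					os.Exit(ws.ExitStatus()) // 业务主进程退出,init 跟着退出
				}
			}
		default:
			// 其他信号透传给业务主进程
			_ = cmd.Process.Signal(sig)
		}
	}
}
```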
##### 方法三 k8s层次解决 通过pause成为init进程,并且回收僵尸进程。

每个 K8s Pod 有一个 [pause](https://github.com/kubernetes/kubernetes/blob/master/build/pause/pause.c) 容器组件,一般我们说起它的功能就是 Pod 内容器共享网络。其实它除了作为共享网络的载体(自身只是暂停等待)之外,还会捕获僵尸进程。默认 K8s Pod 内的 PID namespace 是不共享的,早期我们可以通过 kubelet `--docker-disable-shared-pid=false` 选项开启 Pod 内 PID namespace 共享,如此对应节点的 Pod 中 PID 为 1 的进程就是 pause 了,它便可以捕获处理僵尸进程了。这个 kubelet 选项有一个坏处,就是调度到节点的所有 Pod 都会共享 PID namespace,社区就觉得应该移除这个选项,改为在 Pod 层实现,社区讨论见 [Remove `--docker-disable-shared-pid` from kubelet](https://github.com/kubernetes/kubernetes/issues/41938) 。从 K8s 1.10 开始支持在 Pod Spec 中添加 `ShareProcessNamespace` 字段,在 Pod 层开启 PID namespace 共享。

硬性条件:docker >= 1.13.1(此后的 pause 才具备回收僵尸进程的能力)。
### 3. 处理僵尸进程

如果真的出现了僵尸进程,导致 pod kill 失败,应该如何处理?

目前调研来看,最常用的解决方法就是:

(1)kill 僵尸进程的父进程,这样僵尸进程会被 init 接管并回收。

(2)重启 docker。
原因:出现了信号屏蔽 如何检查进程正在监听的信号?https://qastack.cn/unix/85364/how-can-i-check-what-signals-a-process-is-listening-to SIGTERM(15)和 SIGKILL(9) | 1号进程 | kill -9 1 | kill 1 | | ------- | --------- | ------ | | bash | 不行 | 不行 | | c++ | 不行 | 不行 | | golang | 不行 | 行 |
第一个概念是 Linux 1 号进程。它是第一个用户态的进程。它直接或者间接创建了 Namespace 中的其他进程。 第二个概念是 Linux 信号。Linux 有 31 个基本信号,进程在处理大部分信号时有三个选择:忽略、捕获和缺省行为。其中两个特权信号 SIGKILL 和 SIGSTOP 不能被忽略或者捕获。 容器里 1 号进程对信号处理的两个要点,这也是这一讲里我想让你记住的两句话: (1) 在容器中,1 号进程永远不会响应 SIGKILL 和 SIGSTOP 这两个特权信号; (2) 对于其他的信号,如果用户自己注册了 handler,1 号进程可以响应。
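上面表格里 golang 程序对 `kill 1`(SIGTERM)"行",一般认为是因为 Go 运行时默认替程序注册了信号处理函数,正好对应第 (2) 点。下面用 Go 给出一个显式注册 handler 的极简草图:业务进程作为容器 1 号进程时,注册了 SIGTERM 处理函数就能被正常 kill,而 SIGKILL 永远无法被捕获(仅示意):

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	sigs := make(chan os.Signal, 1)
	// 显式注册 handler 之后,即使本进程是容器里的 1 号进程,也能响应 SIGTERM(kill 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)

	go func() {
		sig := <-sigs
		fmt.Println("received signal:", sig, ", exiting")
		// 这里可以做清理工作(优雅退出);SIGKILL 则永远走不到这里
		os.Exit(0)
	}()

	for {
		time.Sleep(time.Hour) // 模拟业务主循环
	}
}
```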
### 2. 如何通过找到容器的父进程 ``` ## 第一步:通过pod找到容器名字,这里容器名字为 zx-nginx ## 第二步:通过容器名字,找到容器id # docker ps | grep zx-nginx 8803c7c666d9 68cb644cdf30 "./main" 22 minutes ago Up 22 minutes k8s_zx-nginx_istio-ingressgateway-fc76bb8c9-667qv_test-zx_f24bdce8-277e-4e1f-8338-b6204068c6ec_1 ## 第三步:通过容器id,找到父进程id # ps -ef |grep 8803c7c666d9 root 961776 1703 0 14:45 ? 00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/8803c7c666d918c37cd81891e586d2173db45d173debe567fe5aa56df12111b0 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc root 997057 977180 0 15:01 pts/0 00:00:00 grep 8803c7c666d9 ## 父进程是containerd-shim, containerd-shim的父进程是 /usr/bin/containerd。 # ps -ef | grep 1703 root 1703 1 1 2020 ? 2-19:54:09 /usr/bin/containerd ## 通过父进程id还能反推回去找到 容器的1号进程 ./main # ps -ef | grep 961776 root 961776 1703 0 14:45 ? 00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/8803c7c666d918c37cd81891e586d2173db45d173debe567fe5aa56df12111b0 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc root 961793 961776 99 14:45 ? 00:19:07 ./main root 1003020 977180 0 15:04 pts/0 00:00:00 grep 961776 ## 直接通过进程名也可以找到 # ps -ef | grep /ma root 961793 961776 99 14:45 ? 00:15:49 ./main root 996535 977180 0 15:01 pts/0 00:00:00 grep /ma ```
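上面是用 ps 手工往上找父进程;也可以直接读 /proc 把这条父子链打印出来。下面是一个极简的 Go 草图:给定容器内业务进程在宿主机上的 pid(例如上面的 961793),沿着 /proc/<pid>/status 里的 PPid 一路向上,就能依次看到 业务进程 -> containerd-shim -> containerd -> 1 号进程(仅示意):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// 读取 /proc/<pid>/status 中的某个字段,例如 Name 或 PPid
func statusField(pid, field string) string {
	data, err := os.ReadFile("/proc/" + pid + "/status")
	if err != nil {
		return ""
	}
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, field+":") {
			return strings.TrimSpace(strings.TrimPrefix(line, field+":"))
		}
	}
	return ""
}

func main() {
	pid := os.Args[1] // 例如容器业务进程在宿主机上的 pid
	for pid != "" && pid != "0" {
		fmt.Printf("pid=%s  comm=%s\n", pid, statusField(pid, "Name"))
		pid = statusField(pid, "PPid")
		if pid == "1" {
			fmt.Println("pid=1  (init/systemd)")
			break
		}
	}
}
```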
### 3. 如何找到容器和pod的cgroup设置 #### 3.1 方法一 (1)查看pod的yaml. 获得以下信息 ``` podName: zx-hpa-7c669876bb-cv8gm nodeip: XXXXX uid: c7902b75-88ed-473c-806d-9c419bcff548 qosClass: Burstable ```
(2)在node节点的对应路径上就可以看见 ``` /sys/fs/cgroup/memory/kubepods/burstable# ls cgroup.clone_children memory.usage_in_bytes cgroup.event_control memory.use_hierarchy cgroup.procs notify_on_release memory.failcnt pod1a337b4b-0c5a-4197-bc1a-f725b126f9df memory.force_empty pod210e5803-331b-4e13-9bf7-66d4854f6e3f memory.kmem.failcnt pod2f287b5c-f979-48d5-b8dd-6399164952bc memory.kmem.limit_in_bytes pod392ecb7f-e16c-4dbd-aed7-445f066a17da memory.kmem.max_usage_in_bytes pod4000ac86-1aa1-47ff-b194-3e80a8073fd7 memory.kmem.slabinfo pod54d7ff80-cde1-47f1-b6ea-39ec0f34beb5 memory.kmem.tcp.failcnt pod7ece8e2d-1a67-4fc5-a1be-bde0b64cc8c6 memory.kmem.tcp.limit_in_bytes pod7f5ba05e-e962-40d3-a598-6e0a93a1b139 memory.kmem.tcp.max_usage_in_bytes pod8820126f-4278-4f56-babd-00b751e82520 memory.kmem.tcp.usage_in_bytes pod91cbcae8-1d2a-430a-a414-6d5f1c374c04 memory.kmem.usage_in_bytes pod9a56e405-e0c1-4851-9242-18d6526d1aa5 memory.limit_in_bytes poda08d46d1-0eeb-4cf9-bb01-7bdc07cfad08 memory.max_usage_in_bytes podae792623-f895-4726-97a4-aafbaab23fea memory.memsw.failcnt podafd1b6ea-158b-464d-816b-d307d7b67ba0 memory.memsw.limit_in_bytes podb57fe34b-5e44-472f-baba-837d85bb84fa memory.memsw.max_usage_in_bytes podb6628cac-fc22-4e21-a074-c54bd25a0204 memory.memsw.usage_in_bytes podb6c4f1fc-566d-455b-b4d2-4c135dfc41ca memory.move_charge_at_immigrate podc7902b75-88ed-473c-806d-9c419bcff548 //就是这个pod memory.numa_stat podcbf6dc19-5058-4016-b9f1-e17ebd41a751 memory.oom_control pode0651b8f-9c80-4c08-9bc9-8f92681cc6de memory.pressure_level podefa14886-4a76-4297-88da-07a6fb572ab5 memory.soft_limit_in_bytes podf304b287-417c-4812-82f6-e88d7bd010c0 memory.stat podf50a670b-058a-4ee9-945e-7a57cec8549b memory.swappiness tasks ```
(3) 进入改路径,就可以看见container的 ``` :/sys/fs/cgroup/memory/kubepods/burstable/podc7902b75-88ed-473c-806d-9c419bcff548# ls //这个就是container的。 docker ps的container id就是钱12位 5cf735569b3adcf74582e1a6082adf3ebebf0250bd68170b7ce980a952760b73 memory.max_usage_in_bytes cgroup.clone_children memory.memsw.failcnt cgroup.event_control memory.memsw.limit_in_bytes cgroup.procs memory.memsw.max_usage_in_bytes fc1c7dfcfa73e6ed23fbfb4011f20d24646ea2b3f1ce0fbbaa23802a5cdf7f79 memory.memsw.usage_in_bytes memory.failcnt memory.move_charge_at_immigrate memory.force_empty memory.numa_stat memory.kmem.failcnt memory.oom_control memory.kmem.limit_in_bytes memory.pressure_level memory.kmem.max_usage_in_bytes memory.soft_limit_in_bytes memory.kmem.slabinfo memory.stat memory.kmem.tcp.failcnt memory.swappiness memory.kmem.tcp.limit_in_bytes memory.usage_in_bytes memory.kmem.tcp.max_usage_in_bytes memory.use_hierarchy memory.kmem.tcp.usage_in_bytes notify_on_release memory.kmem.usage_in_bytes tasks memory.limit_in_bytes ```
#### 3.2 方法二 (1) docker ps 找出来 containerid (2)docker inspect containerid | grep \"Pid\", 找出来 pidId ``` docker inspect fc1c7dfcfa73 | grep "Pid" "Pid": 1139631, "PidMode": "", "PidsLimit": 0, ``` (3)cat /proc/pidId/cgroup | grep memory ``` # cat /proc/1139631/cgroup | grep memory 11:memory:/kubepods/burstable/podc7902b75-88ed-473c-806d-9c419bcff548/fc1c7dfcfa73e6ed23fbfb4011f20d24646ea2b3f1ce0fbbaa23802a5cdf7f79 ``` (3). 找出来 memory对应的 cgroup链接 前缀是/sys/fs/cgroup/memory/ 上一层就是pod的 ``` # ls /sys/fs/cgroup/memory/kubepods/burstable/podc7902b75-88ed-473c-806d-9c419bcff548/fc1c7dfcfa73e6ed23fbfb4011f20d24646ea2b3f1ce0fbbaa23802a5cdf7f79 cgroup.clone_children memory.kmem.tcp.max_usage_in_bytes memory.oom_control cgroup.event_control memory.kmem.tcp.usage_in_bytes memory.pressure_level cgroup.procs memory.kmem.usage_in_bytes memory.soft_limit_in_bytes memory.failcnt memory.limit_in_bytes memory.stat memory.force_empty memory.max_usage_in_bytes memory.swappiness memory.kmem.failcnt memory.memsw.failcnt memory.usage_in_bytes memory.kmem.limit_in_bytes memory.memsw.limit_in_bytes memory.use_hierarchy memory.kmem.max_usage_in_bytes memory.memsw.max_usage_in_bytes notify_on_release memory.kmem.slabinfo memory.memsw.usage_in_bytes tasks memory.kmem.tcp.failcnt memory.move_charge_at_immigrate memory.kmem.tcp.limit_in_bytes memory.numa_stat ```
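方法二的第 (3)、(4) 步也可以直接用程序完成:读 /proc/<pid>/cgroup 找到 memory 子系统对应的相对路径,再拼上 /sys/fs/cgroup/memory 前缀,就能读到该容器的 memory.limit_in_bytes 等文件。下面是一个极简的 Go 草图(cgroup v1 布局,路径前缀沿用本文环境,仅作演示):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	pid := os.Args[1] // docker inspect <containerId> | grep Pid 得到的进程号

	data, err := os.ReadFile("/proc/" + pid + "/cgroup")
	if err != nil {
		panic(err)
	}

	// /proc/<pid>/cgroup 每行格式: <id>:<subsystems>:<相对路径>
	// 例如 11:memory:/kubepods/burstable/pod<uid>/<containerId>
	for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
		parts := strings.SplitN(line, ":", 3)
		if len(parts) != 3 || !strings.Contains(parts[1], "memory") {
			continue
		}
		cgroupDir := filepath.Join("/sys/fs/cgroup/memory", parts[2])
		limit, _ := os.ReadFile(filepath.Join(cgroupDir, "memory.limit_in_bytes"))
		usage, _ := os.ReadFile(filepath.Join(cgroupDir, "memory.usage_in_bytes"))
		fmt.Println("cgroup dir :", cgroupDir)
		fmt.Println("mem limit  :", strings.TrimSpace(string(limit)))
		fmt.Println("mem usage  :", strings.TrimSpace(string(usage)))
	}
}
```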
#### 3.3 qos介绍 QoS(Quality of Service),大部分译为“服务质量等级”,又译作“服务质量保证”,是作用在 Pod 上的一个配置,当 Kubernetes 创建一个 Pod 时,它就会给这个 Pod 分配一个 QoS 等级,可以是以下等级之一: - **Guaranteed**:Pod 里的每个容器都必须有内存/CPU 限制和请求,而且值必须相等。 - **Burstable**:Pod 里至少有一个容器有内存或者 CPU 请求且不满足 Guarantee 等级的要求,即内存/CPU 的值设置的不同。 - **BestEffort**:容器必须没有任何内存或者 CPU 的限制或请求。 该配置不是通过一个配置项来配置的,而是通过配置 CPU/内存的 `limits` 与 `requests` 值的大小来确认服务质量等级的。使用 `kubectl get pod -o yaml` 可以看到 pod 的配置输出中有 `qosClass` 一项。该配置的作用是为了给资源调度提供策略支持,调度算法根据不同的服务质量等级可以确定将 pod 调度到哪些节点上。 ================================================ FILE: etcd/0. etcd常用操作.md ================================================ ### 1. 实用脚本 ``` ### cat zoux_etcdctl.sh #! /bin/bash # Already test in etcd V3.0.4 and V3.1.7 # Deal with ls command ENDPOINTS="http://7.33.96.71:24001,http://7.33.96.72:24001,http://7.33.96.73:24001" ETCDCTL_ABSPATH="/usr/local/bin/etcdctl-v3.4.3" CERT_ARGS="" export ETCDCTL_API=3 if [ $1 == "ls" ] then keys=$2 if [ -z $keys ] then keys="/" fi if [ ${keys: -1} != "/" ] then keys=$keys"/" fi num=`echo $keys | grep -o "/" | wc -l` (( num=$num+1 )) $ETCDCTL_ABSPATH --endpoints="$ENDPOINTS" get $keys --prefix=true --keys-only=true $CERT_ARGS | cut -d '/' -f 1-$num | grep -v "^$" | grep -v "compact_rev_key" | uniq | sort exit 0 fi # Deal with get command if [ $1 == "get" ] then $ETCDCTL_ABSPATH --endpoints="$ENDPOINTS" $* $CERT_ARGS #--print-value-only=true exit 0 fi # Deal with other command $ETCDCTL_ABSPATH --endpoints="$ENDPOINTS" $* $CERT_ARGS exit 0 eg. bash zoux_etcdctl.sh --debug=true ls bash zoux_etcdctl.sh endpoint status -w table bash zoux_etcdctl.sh --command-timeout=15s ls bash zoux_etcdctl.sh endpoint status ``` ### 2. etcd的基本操作 #### 2.1 查看所有的key ``` [root@k8s-master ssl]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=ca.pem --cert=server.pem --key=server-key.pem --endpoints="https://192.168.0.4:2379,https://192.168.0.5:2379" get / /registry/services/endpoints/kube-system/kube-controller-manager /registry/services/endpoints/kube-system/kube-scheduler /registry/services/specs/default/kubernetes ``` #### 2.2 查看某个pod的内容 ``` [root@k8s-master ssl]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=ca.pem --cert=server.pem --key=server-key.pem --endpoints="https://192.168.0.4:2379,https://192.168.0.5:2379" get /registry/pods/default/my-nginx-756f645cd7-4ws6k -w=json | jq . 
{ "header": { "cluster_id": 12138850119299830000, "member_id": 6539934570868143000, "revision": 7643164, "raft_term": 1892 }, "kvs": [ { "key": "L3JlZ2lzdHJ5L3BvZHMvZGVmYXVsdC9teS1uZ2lueC03NTZmNjQ1Y2Q3LTR3czZr", "create_revision": 7642432, "mod_revision": 7642554, "version": 4, "value": "azhzAAoJCgJ2MRIDUG9kEtUICvoBChlteS1uZ2lueC03NTZmNjQ1Y2Q3LTR3czZrEhRteS1uZ2lueC03NTZmNjQ1Y2Q3LRoHZGVmYXVsdCIAKiRjM2U2ZDE3ZS03ZjJlLTExZWItOTY4OC1mYTI3MDAwNGIwMGQyADgAQggI99GSggYQAFofChFwb2QtdGVtcGxhdGUtaGFzaBIKNzU2ZjY0NWNkN1oPCgNydW4SCG15LW5naW54alQKClJlcGxpY2FTZXQaE215LW5naW54LTc1NmY2NDVjZDciJGMzZTE4NzlkLTdmMmUtMTFlYi05Njg4LWZhMjcwMDA0YjAwZCoHYXBwcy92MTABOAF6ABKlAwoxChNkZWZhdWx0LXRva2VuLTY5Yzk1EhoyGAoTZGVmYXVsdC10b2tlbi02OWM5NRikAxKcAQoIbXktbmdpbngSBW5naW54KgAyDQoAEAAYUCIDVENQKgBCAEpIChNkZWZhdWx0LXRva2VuLTY5Yzk1EAEaLS92YXIvcnVuL3NlY3JldHMva3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudCIAahQvZGV2L3Rlcm1pbmF0aW9uLWxvZ3IGQWx3YXlzgAEAiAEAkAEAogEERmlsZRoGQWx3YXlzIB4yDENsdXN0ZXJGaXJzdEIHZGVmYXVsdEoHZGVmYXVsdFILMTkyLjE2OC4wLjVYAGAAaAByAIIBAIoBAJoBEWRlZmF1bHQtc2NoZWR1bGVysgE2Chxub2RlLmt1YmVybmV0ZXMuaW8vbm90LXJlYWR5EgZFeGlzdHMaACIJTm9FeGVjdXRlKKwCsgE4Ch5ub2RlLmt1YmVybmV0ZXMuaW8vdW5yZWFjaGFibGUSBkV4aXN0cxoAIglOb0V4ZWN1dGUorALCAQDIAQAarQMKB1J1bm5pbmcSIwoLSW5pdGlhbGl6ZWQSBFRydWUaACIICPfRkoIGEAAqADIAEh0KBVJlYWR5EgRUcnVlGgAiCAjL0pKCBhAAKgAyABInCg9Db250YWluZXJzUmVhZHkSBFRydWUaACIICMvSkoIGEAAqADIAEiQKDFBvZFNjaGVkdWxlZBIEVHJ1ZRoAIggI99GSggYQACoAMgAaACIAKgsxOTIuMTY4LjAuNTILMTcyLjE3LjgzLjI6CAj30ZKCBhAAQtgBCghteS1uZ2lueBIMEgoKCAjK0pKCBhAAGgAgASgAMgxuZ2lueDpsYXRlc3Q6X2RvY2tlci1wdWxsYWJsZTovL25naW54QHNoYTI1NjpmMzY5M2ZlNTBkNWIxZGYxZWNkMzE1ZDU0ODEzYTc3YWZkNTZiMDI0NWE0MDQwNTVhOTQ2NTc0ZGViNmIzNGZjQklkb2NrZXI6Ly9iNDc4NTBmYWY2NGM1YjFiZWRjNjg0M2EzNzZlZTA1YTVlOGFmZmU4Y2VlZGNlMzNhOWJjNzQxY2EzNDVlOGRjSgpCZXN0RWZmb3J0WgAaACIA" } ], "count": 1 } [root@k8s-master ssl]# [root@k8s-master ssl]# key 是 base64加密的 [root@k8s-master ssl]# [root@k8s-master ssl]# echo L3JlZ2lzdHJ5L3BvZHMvZGVmYXVsdC9teS1uZ2lueC03NTZmNjQ1Y2Q3LTR3czZr|base64 -d /registry/pods/default/my-nginx-756f645cd7-4ws6k[root@k8s-master ssl]# ^C 但是为什么值 还有乱码 [root@k8s-master ssl]# echo 
azhzAAoJCgJ2MRIDUG9kEtUICvoBChlteS1uZ2lueC03NTZmNjQ1Y2Q3LTR3czZrEhRteS1uZ2lueC03NTZmNjQ1Y2Q3LRoHZGVmYXVsdCIAKiRjM2U2ZDE3ZS03ZjJlLTExZWItOTY4OC1mYTI3MDAwNGIwMGQyADgAQggI99GSggYQAFofChFwb2QtdGVtcGxhdGUtaGFzaBIKNzU2ZjY0NWNkN1oPCgNydW4SCG15LW5naW54alQKClJlcGxpY2FTZXQaE215LW5naW54LTc1NmY2NDVjZDciJGMzZTE4NzlkLTdmMmUtMTFlYi05Njg4LWZhMjcwMDA0YjAwZCoHYXBwcy92MTABOAF6ABKlAwoxChNkZWZhdWx0LXRva2VuLTY5Yzk1EhoyGAoTZGVmYXVsdC10b2tlbi02OWM5NRikAxKcAQoIbXktbmdpbngSBW5naW54KgAyDQoAEAAYUCIDVENQKgBCAEpIChNkZWZhdWx0LXRva2VuLTY5Yzk1EAEaLS92YXIvcnVuL3NlY3JldHMva3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudCIAahQvZGV2L3Rlcm1pbmF0aW9uLWxvZ3IGQWx3YXlzgAEAiAEAkAEAogEERmlsZRoGQWx3YXlzIB4yDENsdXN0ZXJGaXJzdEIHZGVmYXVsdEoHZGVmYXVsdFILMTkyLjE2OC4wLjVYAGAAaAByAIIBAIoBAJoBEWRlZmF1bHQtc2NoZWR1bGVysgE2Chxub2RlLmt1YmVybmV0ZXMuaW8vbm90LXJlYWR5EgZFeGlzdHMaACIJTm9FeGVjdXRlKKwCsgE4Ch5ub2RlLmt1YmVybmV0ZXMuaW8vdW5yZWFjaGFibGUSBkV4aXN0cxoAIglOb0V4ZWN1dGUorALCAQDIAQAarQMKB1J1bm5pbmcSIwoLSW5pdGlhbGl6ZWQSBFRydWUaACIICPfRkoIGEAAqADIAEh0KBVJlYWR5EgRUcnVlGgAiCAjL0pKCBhAAKgAyABInCg9Db250YWluZXJzUmVhZHkSBFRydWUaACIICMvSkoIGEAAqADIAEiQKDFBvZFNjaGVkdWxlZBIEVHJ1ZRoAIggI99GSggYQACoAMgAaACIAKgsxOTIuMTY4LjAuNTILMTcyLjE3LjgzLjI6CAj30ZKCBhAAQtgBCghteS1uZ2lueBIMEgoKCAjK0pKCBhAAGgAgASgAMgxuZ2lueDpsYXRlc3Q6X2RvY2tlci1wdWxsYWJsZTovL25naW54QHNoYTI1NjpmMzY5M2ZlNTBkNWIxZGYxZWNkMzE1ZDU0ODEzYTc3YWZkNTZiMDI0NWE0MDQwNTVhOTQ2NTc0ZGViNmIzNGZjQklkb2NrZXI6Ly9iNDc4NTBmYWY2NGM1YjFiZWRjNjg0M2EzNzZlZTA1YTVlOGFmZmU4Y2VlZGNlMzNhOWJjNzQxY2EzNDVlOGRjSgpCZXN0RWZmb3J0WgAaACIA|base64 -d k8s v1Pod� � my-nginx-756f645cd7-4ws6kmy-nginx-756f645cd7-default"*$c3e6d17e-7f2e-11eb-9688-fa270004b00d2�ђ�Z pod-template-hash 756f645cd7Z rumy-nginxjT ReplicaSetmy-nginx-756f645cd7"$c3e1879d-7f2e-11eb-9688-fa270004b00d*apps/v108z� 1 default-token-69c952 default-token-69c95�� my-nginxnginx*2 P"TCP*BJH default-token-69c95-/var/run/secrets/kubernetes.io/serviceaccount"j/dev/termination-logrAlways����FileAlways 2 ClusterFirstBdefaultJdefaultR 192.168.0.5X`hr���default-scheduler�6 node.kubernetes.io/not-readyExists" NoExecute(��8 node.kubernetes.io/unreachableExists" NoExecute(���� Running# InitializedTru�ђ�*2 ReadyTru�Ғ�*2' ContainersReadyTru�Ғ�*2$ ``` https://github.com/openshift/origin/tree/master/tools/etcdhelper value是乱码的原因找到了。因为使用l proto存储。有个工具可以解决这个显示问题,见上面的链接。 **参考的操作链接:** https://jimmysong.io/kubernetes-handbook/guide/using-etcdctl-to-access-kubernetes-data.html https://yq.aliyun.com/articles/561888
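与上面 etcdctl 的 `get --prefix --keys-only` 等价,也可以用 etcd 官方的 Go 客户端 clientv3 直接读取 k8s 存在 etcd 里的数据。下面是一个极简草图(endpoints、证书等按自己的环境替换,示例里省略了 TLS 配置,仅作演示):

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // 按自己环境替换;启用 TLS 时还需设置 Config.TLS
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// 等价于 etcdctl get /registry/pods/ --prefix --keys-only
	resp, err := cli.Get(ctx, "/registry/pods/", clientv3.WithPrefix(), clientv3.WithKeysOnly())
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Println(string(kv.Key))
	}
	// 注意:value 是 k8s 用 protobuf 序列化过的,直接打印会是"乱码",
	// 需要借助上文提到的 etcdhelper 之类的工具反序列化
}
```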
================================================ FILE: etcd/协议理论知识/1. cap原理.md ================================================ ### 1.cap CAP 理论对分布式系统的特性做了高度抽象,形成了三个指标 * 一致性 * 可用性 * 分区容错性 一致性(Consistency):客户端每次读写操作,不管是访问哪个节点都是一样的数据 可用性(Availability):不管客户端访问的是哪个节点,都会给你返回数据 分区容错性(Partition Tolerance):当节点间出现任意数量的消息丢失或高延迟的时候,系统仍然可以继续提供服务。也就是说,分布式系统在告诉访问本系统的客户端:不管我的内部出现什么样的数据同步问题,我会一直运行,提供服务。 **注意点** (1)一致性并不代表完整性 (2)CAP不能三个都满足的条件在于,没有网络故障或者系统问题。当系统没问题的时候CAP是可以同时满足的。当有故障时,就要根据情况,选择A 和是 C ### 2.总结 (1) CA 模型,在分布式系统中不存在。因为舍弃 P,意味着舍弃分布式系统,就比如单机版关系型数据库 MySQL,如果 MySQL 要考虑主备或集群部署时,它必须考虑 P。 (2) CP 模型,采用 CP 模型的分布式系统,一旦因为消息丢失、延迟过高发生了网络分区,就影响用户的体验和业务的可用性。因为为了防止数据不一致,集群将拒绝新数据的写入,典型的应用是 ZooKeeper,Etcd 和 HBase。 (3) AP 模型,采用 AP 模型的分布式系统,实现了服务的高可用。用户访问系统的时候,都能得到响应数据,不会出现响应错误,但当出现分区故障时,相同的读操作,访问不同的节点,得到响应数据可能不一样。典型应用就比如 Cassandra 和 DynamoDB。 ================================================ FILE: etcd/协议理论知识/2. ACID理论.md ================================================ ### 1.ACID是什么 事务是由一组SQL语句组成的逻辑处理单元,事务具有以下4个属性,通常简称为事务的ACID属性。 ACID Atomic(原子性) Consistency(一致性) Isolation(隔离性) Durability(持久性)的英文缩写。 ### 2. 分布式系统如何实现ACID #### 2.1 二阶段提交协议 两阶段提交协议(2PC:Two-Phase Commit) 两阶段提交协议的目标在于为分布式系统保证数据的一致性,许多分布式系统采用该协议提供对分布式事务的支持。顾名思义,该协议将一个分布式的事务过程拆分成两个阶段: 投票 和 事务提交 。为了让整个数据库集群能够正常的运行,该协议指定了一个 协调者 单点,用于协调整个数据库集群各节点的运行。为了简化描述,我们将数据库集群中的各个节点称为 参与者 ,三阶段提交协议中同样包含协调者和参与者这两个角色定义。 ##### 2.1.1 原理 **第一阶段:投票** 该阶段的主要目的在于打探数据库集群中的各个参与者是否能够正常的执行事务,具体步骤如下: 协调者向所有的参与者发送事务执行请求,并等待参与者反馈事务执行结果; 事务参与者收到请求之后,执行事务但不提交,并记录事务日志; 参与者将自己事务执行情况反馈给协调者,同时阻塞等待协调者的后续指令。 **第二阶段:事务提交** 在经过第一阶段协调者的询盘之后,各个参与者会回复自己事务的执行情况,这时候存在 3 种可能性: (1)所有的参与者都回复能够正常执行事务。 (2)一个或多个参与者回复事务执行失败。 (3)协调者等待超时。
对于第 1 种情况,协调者将向所有的参与者发出提交事务的通知,具体步骤如下: 协调者向各个参与者发送 commit 通知,请求提交事务; 参与者收到事务提交通知之后执行 commit 操作,然后释放占有的资源; 参与者向协调者返回事务 commit 结果信息。 ![image-20220406164951426](../images/acid-1.png) 对于第 2 和第 3 种情况,协调者均认为参与者无法成功执行事务,为了整个集群数据的一致性,所以要向各个参与者发送事务回滚通知,具体步骤如下: 协调者向各个参与者发送事务 rollback 通知,请求回滚事务; 参与者收到事务回滚通知之后执行 rollback 操作,然后释放占有的资源; 参与者向协调者返回事务 rollback 结果信息。 ![image-20220406165445685](../images/acid-2.png) __两阶段提交协议解决的是分布式数据库数据强一致性问题__,实际应用中更多的是用来解决事务操作的原子性,下图描绘了协调者与参与者的状态转换。 ![image-20220406165528208](../images/acid-3.png) 站在协调者的角度,在发起投票之后就进入了 WAIT 状态,等待所有参与者回复各自事务执行状态,并在收到所有参与者的回复后决策下一步是发送 commit 或 rollback 信息。站在参与者的角度,当回复完协调者的投票请求之后便进入 READY 状态(能够正常执行事务),接下去就是等待协调者最终的决策通知,一旦收到通知便可依据决策执行 commit 或 rollback 操作。 ##### 2.1.2 优缺点 两阶段提交协议原理简单、易于实现,但是缺点也是显而易见的,包含如下: (1)单点问题 协调者在整个两阶段提交过程中扮演着举足轻重的作用,一旦协调者所在服务器宕机,就会影响整个数据库集群的正常运行。比如在第二阶段中,如果协调者因为故障不能正常发送事务提交或回滚通知,那么参与者们将一直处于阻塞状态,整个数据库集群将无法提供服务。 (2)同步阻塞 两阶段提交执行过程中,所有的参与者都需要听从协调者的统一调度,期间处于阻塞状态而不能从事其他操作,这样效率极其低下。 (3)数据不一致性 两阶段提交协议虽然是分布式数据强一致性所设计,但仍然存在数据不一致性的可能性。比如在第二阶段中,假设协调者发出了事务 commit 通知,但是因为网络问题该通知仅被一部分参与者所收到并执行了commit 操作,其余的参与者则因为没有收到通知一直处于阻塞状态,这时候就产生了数据的不一致性。 针对上述问题可以引入 超时机制 和 互询机制 在很大程度上予以解决。 对于协调者来说如果在指定时间内没有收到所有参与者的应答,则可以自动退出 WAIT 状态,并向所有参与者发送 rollback 通知。对于参与者来说如果位于 READY 状态,但是在指定时间内没有收到协调者的第二阶段通知,则不能武断地执行 rollback 操作,因为协调者可能发送的是 commit 通知,这个时候执行 rollback 就会导致数据不一致。 此时,我们可以介入互询机制,让参与者 A 去询问其他参与者 B 的执行情况。如果 B 执行了 rollback 或 commit 操作,则 A 可以大胆的与 B 执行相同的操作;如果 B 此时还没有到达 READY 状态,则可以推断出协调者发出的肯定是 rollback 通知;如果 B 同样位于 READY 状态,则 A 可以继续询问另外的参与者。只有当所有的参与者都位于 READY 状态时,此时两阶段提交协议无法处理,将陷入长时间的阻塞状态。 **三段式提交协议多了一个预询盘阶段** #### 2.2 TCC (Try-confirm-cancel) TCC 其实就是采用的补偿机制,其核心思想是:针对每个操作,都要注册一个与其对应的确认和补偿(撤销)操作。它分为三个阶段: (1)Try 阶段主要是对业务系统做检测及资源预留 (2)Confirm 阶段主要是对业务系统做确认提交,Try阶段执行成功并开始执行 Confirm阶段时,默认 Confirm阶段是不会出错的。即:只要Try成功,Confirm一定成功。 (3)Cancel 阶段主要是在业务执行错误,需要回滚的状态下执行的业务取消,预留资源释放。 举个例子,假入 Bob 要向 Smith 转账,思路大概是:我们有一个本地方法,里面依次调用 1、首先在 Try 阶段,要先调用远程接口把 Smith 和 Bob 的钱给冻结起来。 2、在 Confirm 阶段,执行远程调用的转账的操作,转账成功进行解冻。 3、如果第2步执行成功,那么转账成功,如果第二步执行失败,则调用远程冻结接口对应的解冻方法 (Cancel)。 #### 2.3 二阶段提交和TCC区别 经常在网络上看见有人介绍TCC时,都提一句,”TCC是两阶段提交的一种”。其理由是TCC将业务逻辑分成try、confirm/cancel在两个不同的阶段中执行。其实这个说法,是不正确的。 可能是因为既不太了解两阶段提交机制、也不太了解TCC机制的缘故,于是将两阶段提交机制的prepare、commit两个事务提交阶段和TCC机制的try、confirm/cancel两个业务执行阶段互相混淆,才有了这种说法。两阶段提交(Two Phase Commit,下文简称2PC),简单的说,是将事务的提交操作分成了prepare、commit两个阶段。 其事务处理方式为: 1、 在全局事务决定提交时, ​ a)逐个向RM发送prepare请求; ​ b)若所有RM都返回OK,则逐个发送commit请求最终提交事务;否则,逐个发送rollback请求来回滚事务; 2、 在全局事务决定回滚时,直接逐个发送rollback请求即可,不必分阶段。 需要注意的是:2PC机制需要RM提供底层支持(一般是兼容XA),而TCC机制则不需要。 TCC(Try-Confirm-Cancel),则是将业务逻辑分成try、confirm/cancel两个阶段执行,其事务处理方式为: 1、 在全局事务决定提交时,调用与try业务逻辑相对应的confirm业务逻辑; 2、 在全局事务决定回滚时,调用与try业务逻辑相对应的cancel业务逻辑。 可见,TCC在事务处理方式上,是很简单的:要么调用confirm业务逻辑,要么调用cancel逻辑。这里为什么没有提到try业务逻辑呢?因为try逻辑与全局事务处理无关。 当讨论2PC时,我们只专注于事务处理阶段,因而只讨论prepare和commit,所以,可能很多人都忘了,使用2PC事务管理机制时也是有业务逻辑阶段的。正是因为业务逻辑的执行,发起了全局事务,这才有其后的事务处理阶段。 实际上,使用2PC机制时 ————以提交为例———— 一个完整的事务生命周期是:begin -> 业务逻辑 -> prepare -> commit。 再看TCC,也不外乎如此。我们要发起全局事务,同样也必须通过执行一段业务逻辑来实现。该业务逻辑 一来通过执行触发TCC全局事务的创建;二来也需要执行部分数据写操作; 此外,还要通过执行来向TCC全局事务注册自己,以便后续TCC全局事务commit/rollback时回调其相应的confirm/cancel业务逻辑。 所以,使用TCC机制时 ————以提交为例———— 一个完整的事务生命周期是:begin -> 业务逻辑(try业务) -> commit(comfirm业务)。 综上,我们可以从执行的阶段上将二者一一对应起来: 1、 2PC机制的业务阶段 等价于 TCC机制的try业务阶段; 2、 2PC机制的提交阶段(prepare & commit) 等价于 TCC机制的提交阶段(confirm); 3、 2PC机制的回滚阶段(rollback) 等价于 TCC机制的回滚阶段(cancel)。 因此,可以看出,虽然TCC机制中有两个阶段都存在业务逻辑的执行,但其中try业务阶段其实是与全局事务处理无关的。认清了这一点,当我们再比较TCC和2PC时,就会很容易地发现,TCC不是两阶段提交,而只是它对事务的提交/回滚是通过执行一段confirm/cancel业务逻辑来实现,仅此而已。 ================================================ FILE: etcd/协议理论知识/3. 
base理论.md ================================================ BASE 理论是 CAP 理论中的 AP 的延伸,是对互联网大规模分布式系统的实践总结,强调可用性。几乎所有的互联网后台分布式系统都有 BASE 的支持,这个理论很重要,地位也很高。一旦掌握它,你就能掌握绝大部分场景的分布式系统的架构技巧,设计出适合业务场景特点的、高可用性的分布式系统。 而它的核心就是基本可用(Basically Available)和最终一致性(Eventually consistent)。也有人会提到软状态(Soft state),在我看来,软状态描述的是实现服务可用性的时候系统数据的一种过渡状态,也就是说不同节点间,数据副本存在短暂的不一致。你只需要知道软状态是一种过渡状态就可以了,我们不多说。 如何做到基本可用? 当发生系统故障时: 掌握流量削峰(不同项目、业务分时间断访问)、延迟响应、体验降级、过载保护(拒绝请求)这 4 板斧 还有可以考虑:重试、幂等、异步、负载均衡、故障隔离、流量切换、自动扩缩容、兜底(熔断限流降级)、容量规划 acid是数据库系统经典之作;base是在实践中受挫后的思想松绑,提出一种重要的指导,给人以信心 ================================================ FILE: etcd/协议理论知识/4. raft协议.md ================================================ [toc] ### 1. raft算法是如何初始化的 初始状态下,集群中所有的节点都是跟随者状态。 Raft 算法实现了随机超时时间的特性。也就是说,每个节点等待领导者节点心跳信息的超时时间间隔是随机的。通过下面的图片你可以看到,集群中没有领导者,而节点 A 的等待超时时间最小(150ms),它会最先因为没有等到领导者的心跳信息,发生超时。 所以A节点最先没有收到领导者的心跳。所以这个时候,节点 A 就增加自己的任期编号,并推举自己为候选人,先给自己投上一张选票,然后向其他节点发送请求投票 RPC 消息,请它们选举自己为领导者。 如果其他节点接收到候选人 A 的请求投票 RPC 消息,在编号为 1 的这届任期内,也还没有进行过投票,那么它将把选票投给节点 A,并增加自己的任期编号。 如果候选人在选举超时时间内赢得了大多数的选票,那么它就会成为本届任期内新的领导者。 节点 A 当选领导者后,他将周期性地发送心跳消息,通知其他服务器我是领导者,阻止跟随者发起新的选举,篡权。 ![image-20220406170608476](../images/raft-1.png) #### 1.1 节点之间是如何通信的 在 Raft 算法中,服务器节点间的沟通联络采用的是远程过程调用(RPC),在领导者选举中,需要用到这样两类的 RPC: 1. 请求投票(RequestVote)RPC,是由候选人在选举期间发起,通知各节点进行投票; 2. 日志复制(AppendEntries)RPC,是由领导者发起,用来复制日志和提供心跳消息。 我想强调的是,日志复制 RPC 只能由领导者发起,这是实现强领导者模型的关键之一,希望你能注意这一点,后续能更好地理解日志复制,理解日志的一致是怎么实现的。 #### 1.2 任期编号有什么用 任期编号是递增的,随着选举的举行而不断变化的。具体有: (1)跟随者在等待领导者心跳信息超时后,推举自己为候选人时,会增加自己的任期号,比如节点 A 的当前任期编号为 0,那么在推举自己为候选人时,会将自己的任期编号增加为 1。 (2)如果一个服务器节点,发现自己的任期编号比其他节点小,那么它会更新自己的编号到较大的编号值。比如节点 B 的任期编号是 0,当收到来自节点 A 的请求投票 RPC 消息时,因为消息中包含了节点 A 的任期编号,且编号为 1,那么节点 B 将把自己的任期编号更新为 1。 #### 1.3 选举规则 (1)领导者周期性地向所有跟随者发送心跳消息(即不包含日志项的日志复制 RPC 消息),通知大家我是领导者,阻止跟随者发起新的选举。 (2)如果在指定时间内,跟随者没有接收到来自领导者的消息,那么它就认为当前没有领导者,推举自己为候选人,发起领导者选举。 (3)在一次选举中,赢得大多数选票的候选人,将晋升为领导者。 (4)在一个任期内,领导者一直都会是领导者,直到它自身出现问题(比如宕机),或者因为网络延迟,其他节点发起一轮新的选举。 (5)在一次选举中,每一个服务器节点最多会对一个任期编号投出一张选票,并且按照“先来先服务”的原则进行投票。比如节点 C 的任期编号为 3,先收到了 1 个包含任期编号为 4 的投票请求(来自节点 A),然后又收到了 1 个包含任期编号为 4 的投票请求(来自节点 B)。那么节点 C 将会把唯一一张选票投给节点 A,当再收到节点 B 的投票请求 RPC 消息时,对于编号为 4 的任期,已没有选票可投了。 (6)当任期编号相同时,日志完整性高的跟随者(也就是最后一条日志项对应的任期编号值更大,索引号更大),拒绝投票给日志完整性低的候选人。比如节点 B、C 的任期编号都是 3,节点 B 的最后一条日志项对应的任期编号为 3,而节点 C 为 2,那么当节点 C 请求节点 B 投票给自己时,节点 B 将拒绝投票。 选举是跟随者发起的,推举自己为候选人;大多数选票是指集群成员半数以上的选票;大多数选票规则的目标,是为了保证在一个给定的任期内最多只有一个领导者。 #### 1.4 如何理解随机超时时间 在议会选举中,常出现未达到指定票数,选举无效,需要重新选举的情况。在 Raft 算法的选举中,也存在类似的问题,那它是如何处理选举无效的问题呢? 其实,Raft 算法巧妙地使用随机选举超时时间的方法,把超时时间都分散开来,在大多数情况下只有一个服务器节点先发起选举,而不是同时发起选举,这样就能减少因选票瓜分导致选举失败的情况。 随机超时包括两层含义: (1)跟随者等待领导者心跳信息超时的时间间隔,是随机的; (2)当没有候选人赢得过半票数,选举无效了,这时需要等待一个随机时间间隔,也就是说,等待选举超时的时间间隔,是随机的。 #### 1.5.疑问 (1)选举规则5,6有矛盾,先来先到 和根据日志判断。 看起来跟随者是有等待时间的吗,等待所有的候选人发的rpc到了之后,再选择一个投票吗? 目前看起来就是和自己比。如果当前节点收到了一个投票者的RPC。如果item比自己小直接拒绝;如果item一致,但是日志index比自己小也直接拒绝。因为集群写入好一个数据,会在大部分节点上都写好才算。所以候选人会和挂之前的leader保持一致的。 (2)如果A是候选人,B,C是跟随者。但是A->C 出现了网络问题。 C发起重新选举。并且C和A的日志是一样的,会怎么样? 目前看起来是会切换为C的。 #### 1.6. raft的选举机制的局限 关于raft的领导者选举限制和局限: 1.读写请求和数据转发压力落在领导者节点,导致领导者压力。 2.大规模跟随者的集群,领导者需要承担大量元数据维护和心跳通知的成本。 3.领导者单点问题,故障后直到新领导者选举出来期间集群不可用。 4.随着候选人规模增长,收集半数以上投票的成本更大。
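回到 1.4 节的随机超时时间,可以用一小段 Go 代码来示意这个思路(仅为帮助理解的示例,并非 etcd/raft 的真实实现):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// 随机选举超时:在 [base, 2*base) 区间内取值,
// 让各节点的超时时间错开,减少同时发起选举导致选票瓜分的概率
func randomElectionTimeout(base time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(base)))
}

func main() {
	rand.Seed(time.Now().UnixNano())
	// 例如基准为 150ms 时,实际超时落在 150ms ~ 300ms 之间
	for i := 0; i < 3; i++ {
		fmt.Println(randomElectionTimeout(150 * time.Millisecond))
	}
}
```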
### 2.raft日志机制 #### 2.1 什么是日志 日志项是一种数据格式,它主要包含用户指定的数据,也就是指令(Command),还包含一些附加信息,比如索引值(Log index)、任期编号(Term)。 ![image-20220406170705593](../images/raft-2.png) (1)指令:一条由客户端请求指定的、状态机需要执行的指令。你可以将指令理解成客户端指定的数据。 (2)索引值:日志项对应的整数索引值。它其实就是用来标识日志项的,是一个连续的、单调递增的整数号码。 (3)任期编号:创建这条日志项的领导者的任期编号。 从图中可以看到,一届领导者任期,往往有多条日志项。而且日志项的索引值是连续的。 上述四个节点的日志不一致的原因在于,由于网络原因或者服务器其他问题,导致某些节点的日志点没跟上。所以这个时候就要进行日志同步了。 #### 2.2 日志同步 ![image-20220406170731474](../images/raft-3.png) 正常的日志同步是这样的: (1)接收到客户端请求后,领导者基于客户端请求中的指令,创建一个新日志项,并附加到本地日志中。 (2)领导者通过日志复制 RPC,将新的日志项复制到其他的服务器。 (3)当领导者将日志项,成功复制到大多数的服务器上的时候,领导者会将这条日志项提交到它的状态机中。 (4)领导者将执行的结果返回给客户端。 (5)当跟随者接收到心跳信息,或者新的日志复制 RPC 消息后,如果跟随者发现领导者已经提交了某条日志项,而它还没提交,那么跟随者就将这条日志项提交到本地的状态机中。 但是由于网络或者其他原因,某些节点并不是大多数之一,所以日志就一直落后。这个时候就需要复制日志了。 在 Raft 算法中,领导者通过强制跟随者直接复制自己的日志项,处理不一致日志。也就是说,Raft 是通过以领导者的日志为准,来实现各节点日志的一致的。具体有 2 个步骤。 首先,领导者通过日志复制 RPC 的一致性检查,找到跟随者节点上,与自己相同日志项的最大索引值。也就是说,这个索引值之前的日志,领导者和跟随者是一致的,之后的日志是不一致的了。 然后,领导者强制跟随者更新覆盖的不一致日志项,实现日志的一致。 详细过程如下: PrevLogEntry:表示当前要复制的日志项,前面一条日志项的索引值。比如在图中,如果领导者将索引值为 8 的日志项发送给跟随者,那么此时 PrevLogEntry 值为 7。 PrevLogTerm:表示当前要复制的日志项,前面一条日志项的任期编号,比如在图中,如果领导者将索引值为 8 的日志项发送给跟随者,那么此时 PrevLogTerm 值为 4。 ![image-20220406170756696](../images/raft-4.png) 那么复制日志的过程如下: (1)领导者通过日志复制 RPC 消息,发送当前最新日志项到跟随者(为了演示方便,假设当前需要复制的日志项是最新的),这个消息的 PrevLogEntry 值为 7,PrevLogTerm 值为 4。 (2)如果跟随者在它的日志中,找不到与 PrevLogEntry 值为 7、PrevLogTerm 值为 4 的日志项,也就是说它的日志和领导者的不一致了,那么跟随者就会拒绝接收新的日志项,并返回失败信息给领导者。 (3)这时,领导者会递减要复制的日志项的索引值,并发送新的日志项到跟随者,这个消息的 PrevLogEntry 值为 6,PrevLogTerm 值为 3。 (4)如果跟随者在它的日志中,找到了 PrevLogEntry 值为 6、PrevLogTerm 值为 3 的日志项,那么日志复制 RPC 返回成功,这样一来,领导者就知道在 PrevLogEntry 值为 6、PrevLogTerm 值为 3 的位置,跟随者的日志项与自己相同。 (5)领导者通过日志复制 RPC,复制并更新覆盖该索引值之后的日志项(也就是不一致的日志项),最终实现了集群各节点日志的一致。 从上面步骤中你可以看到,领导者通过日志复制 RPC 一致性检查,找到跟随者节点上与自己相同日志项的最大索引值,然后复制并更新覆盖该索引值之后的日志项,实现了各节点日志的一致。需要你注意的是,跟随者中的不一致日志项会被领导者的日志覆盖,而且领导者从来不会覆盖或者删除自己的日志。
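上面日志复制中的一致性检查,可以用一小段 Go 代码来示意(仅为帮助理解的伪实现,与 etcd/raft 的真实代码无关):

```go
package main

import "fmt"

// LogEntry 对应文中的日志项:索引值 + 任期编号(指令字段省略)
type LogEntry struct {
	Index uint64
	Term  uint64
}

// 跟随者收到日志复制 RPC 时的一致性检查:
// 只有在本地日志中找到 (PrevLogEntry, PrevLogTerm) 对应的日志项,才接受新的日志
func hasMatchingEntry(log []LogEntry, prevLogIndex, prevLogTerm uint64) bool {
	if prevLogIndex == 0 { // 从头开始复制,视为匹配
		return true
	}
	for _, e := range log {
		if e.Index == prevLogIndex {
			return e.Term == prevLogTerm
		}
	}
	return false
}

func main() {
	followerLog := []LogEntry{{1, 1}, {2, 1}, {3, 2}, {4, 3}, {5, 3}, {6, 3}}
	fmt.Println(hasMatchingEntry(followerLog, 7, 4)) // false:找不到,领导者递减索引后重试
	fmt.Println(hasMatchingEntry(followerLog, 6, 3)) // true:从该位置之后开始覆盖复制
}
```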
当这个跟随者与leader恢复响应后,leader通过rpc日志检查一致性来进行日志同步,但是这里有个问题,如果跟随者跟leader的日志相差太多,会有很频繁的rpc日志检查。 这个只是思想,代码实现的时候可以优化,不是递增的寻找。 **etcd** 中就是leader节点定期向每个fllower节点发送 PrevLogEntry+PrevLogTerm用于判断日志同步 ### 3.raft集群成员变更 (1)成员变更的问题,主要在于进行成员变更时,可能存在新旧配置的 2 个“大多数”,导致集群中同时出现两个领导者,破坏了 Raft 的领导者的唯一性原则,影响了集群的稳定运行。 (2)单节点变更是利用“一次变更一个节点,不会同时存在旧配置和新配置 2 个‘大多数’”的特性,实现成员变更。 (3)因为联合共识实现起来复杂,不好实现,所以绝大多数 Raft 算法的实现,采用的都是单节点变更的方法(比如 Etcd、Hashicorp Raft)。其中,Hashicorp Raft 单节点变更的实现,是由 Raft 算法的作者迭戈·安加罗(Diego Ongaro)设计的,很有参考价值。 ================================================ FILE: k8s/README.md ================================================ ## 版本说明 如无特别说明,本章节所涉及的k8s源码版本皆为 V1.17.4 ================================================ FILE: k8s/client-go/1- clientGo简介与章节安排.md ================================================ Table of Contents ================= * [1. client-go简介](#1-client-go简介) * [1.1 client-go章节安排](#11-client-go章节安排) * [2. client-go如何使用kubeconfig配置](#2-client-go如何使用kubeconfig配置) * [2.1 kube-config介绍](#21-kube-config介绍) * [2.1 client-go加载kubeconfig](#21-client-go加载kubeconfig) * [2.2 BuildConfigFromFlags](#22-buildconfigfromflags) * [3.总结](#3总结) ### 1. client-go简介 client-go就是 Go client for Kubernetes。它提供了与k8s交互的各种方法。 Kubernetes官方从2016年8月份开始,将Kubernetes资源操作相关的核心源码抽取出来,独立出来一个项目Client-go,作为官方提供的Go client。Kubernetes的部分代码也是基于这个client实现的,所以对这个client的质量、性能等方面还是非常有信心的。 client-go是一个调用kubernetes集群资源对象API的客户端,即通过client-go实现对kubernetes集群中资源对象(包括deployment、service、ingress、replicaSet、pod、namespace、node等)的增删改查等操作。大部分对kubernetes进行前置API封装的二次开发都通过client-go这个第三方包来实现。 client-go的代码库已经集成到Kubernetes源码中了,无须考虑版本兼容性问题,源码结构示例如下。client-go源码目录结构如下所示: ``` [root@k8s-node client-go]# tree -L 1 . ├── code-of-conduct.md ├── CONTRIBUTING.md ├── discovery 提供discovery client客户端 ├── dynamic 提供dynamic客户端 ├── examples 几个常见的example示例 ├── Godeps godeps的简单说明 ├── informers 每种资源的informer实现 ├── INSTALL.md ├── kubernetes 提供clientset客户端 ├── LICENSE ├── listers 每种资源的list实现 ├── OWNERS ├── pkg ├── plugin 提供openstack, GCP, Azure等云服务商授权插件 ├── rest 提供restful客户端,执行restful操作 ├── restmapper ├── scale 提供scale客户端,用于deploy,rs,rc等的扩缩容。 ├── SECURITY_CONTACTS ├── testing ├── third_party ├── tools 提供常用的工具,例如cache,Indexers,DealtFIFO ├── transport 提供安全的TCP连接,支持Http Stream └── util 提供常用方法,例如workqueue,证书管理等。 ``` #### 1.1 client-go章节安排 打算主要从这三个方面入手,研究client-go的源码 (1)client-go提供四种连接apiserver的客户端 (2)client-go list-watch功能实现 (3)与之配套提供的cache,dealtFifo,queue等辅助功能 希望加强对这些部分更深入的了解,对k8s整理以及以后控制器的编写根据得心应手。 接下来的文章安排就是了解上述的功能如何使用,如何实现。
### 2. client-go如何使用kubeconfig配置 #### 2.1 kube-config介绍 kubeconfig用于管理访问kube-apiserver的配置信息,同时也支持访问多kube-apiserver的配置管理,可以在不同的环境下管理不同的kube-apiserver集群配置,不同的业务线也可以拥有不同的集群。Kubernetes的其他组件都使用kubeconfig配置信息来连接kube-apiserver组件,例如当kubectl访问kube-apiserver时,会默认加载kubeconfig配置信息。kubeconfig中存储了集群、用户、命名空间和身份验证等信息,在默认的情况下,kubeconfig存放在$HOME/.kube/config路径下。Kubeconfig配置信息如下: ``` cat /root/.kube/config apiVersion: v1 clusters: - cluster: server: https://39.98.210.73:6443 certificate-authority-data: name: kubernetes contexts: - context: cluster: kubernetes user: "kubernetes-admin" name: kubernetes-admin-cd0201255113548b782faa6fbf68c80cd current-context: kubernetes-admin-cd0201255113548b782faa6fbf68c80cd kind: Config preferences: {} users: - name: "kubernetes-admin" user: client-certificate-data: client-key-data: ``` kubeconfig配置信息通常包含3个部分,分别介绍如下。 ● clusters:定义Kubernetes集群信息,例如kube-apiserver的服务地址及集群的证书信息等。 ● users:定义Kubernetes集群用户身份验证的客户端凭据,例如client-certificate、client-key、token及username/password等。 ● contexts:定义Kubernetes集群用户信息和命名空间等,用于将请求发送到指定的集群。 这里其实就很好理解。就是定义 集群用户,和上下文。集群上下文可以有多个。例如 context1 <集群A,用于A> context2 <集群B,用户A> 这样使用 kubectl config指定 context2就能马上 以用户A的角色连接到 集群B。 #### 2.1 client-go加载kubeconfig client-go会读取kubeconfig配置信息并生成config对象,用于与kube-apiserver通信。这里主要就是通过 tools/clientcmd包实现的。更具体就是通过 clientcmd.BuildConfigFromFlags ``` 像kube-eventwatcher组件也是通过这个 连接集群。 func NewPodController(opt *config.Option) (*PodController, error) { cfg, err := clientcmd.BuildConfigFromFlags("", opt.KubeConfig) if err != nil { glog.Errorf("can not read the cfg: %v\n", err) return nil, err } ```
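下面补充一个最小的完整示例(kubeconfig 路径仅为示意,clientset 在后面章节会详细介绍,这里只用来验证加载出来的 config 是否可用):

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// 加载 kubeconfig,生成 *rest.Config(路径按实际环境调整)
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	// 用 config 初始化 clientset
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	pods, err := clientset.CoreV1().Pods("default").List(metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("default 命名空间下共有 %d 个 pod\n", len(pods.Items))
}
```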
#### 2.2 BuildConfigFromFlags 这个函数的主要作用就是,通过 path,或者命令行输入,实例化一个 restclient.Config对象。 从主函数BuildConfigFromFlags可以看出来。还是命令行输入的config优先使用 ```go // 1.主函数BuildConfigFromFlags // BuildConfigFromFlags is a helper function that builds configs from a master // url or a kubeconfig filepath. These are passed in as command line flags for cluster // components. Warnings should reflect this usage. If neither masterUrl or kubeconfigPath // are passed in we fallback to inClusterConfig. If inClusterConfig fails, we fallback // to the default config. func BuildConfigFromFlags(masterUrl, kubeconfigPath string) (*restclient.Config, error) { if kubeconfigPath == "" && masterUrl == "" { glog.Warningf("Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.") kubeconfig, err := restclient.InClusterConfig() if err == nil { return kubeconfig, nil } glog.Warning("error creating inClusterConfig, falling back to default config: ", err) } return NewNonInteractiveDeferredLoadingClientConfig( &ClientConfigLoadingRules{ExplicitPath: kubeconfigPath}, &ConfigOverrides{ClusterInfo: clientcmdapi.Cluster{Server: masterUrl}}).ClientConfig() } // 2. 调用了clientconfig()看起来就是合并,因为可能一份kubeconfig可能要操作多个集群。并且还是通过文件指定一部分集群,通过命令行指定一部分集群。 // ClientConfig implements ClientConfig func (config *DeferredLoadingClientConfig) ClientConfig() (*restclient.Config, error) { mergedClientConfig, err := config.createClientConfig() if err != nil { return nil, err } // load the configuration and return on non-empty errors and if the // content differs from the default config mergedConfig, err := mergedClientConfig.ClientConfig() switch { case err != nil: if !IsEmptyConfig(err) { // return on any error except empty config return nil, err } case mergedConfig != nil: // the configuration is valid, but if this is equal to the defaults we should try // in-cluster configuration if !config.loader.IsDefaultConfig(mergedConfig) { return mergedConfig, nil } } // check for in-cluster configuration and use it if config.icc.Possible() { glog.V(4).Infof("Using in-cluster configuration") return config.icc.ClientConfig() } // return the result of the merged client config return mergedConfig, err } // 3.调用NewNonInteractiveDeferredLoadingClientConfig函数 // NewNonInteractiveDeferredLoadingClientConfig creates a ConfigClientClientConfig using the passed context name func NewNonInteractiveDeferredLoadingClientConfig(loader ClientConfigLoader, overrides *ConfigOverrides) ClientConfig { return &DeferredLoadingClientConfig{loader: loader, overrides: overrides, icc: &inClusterClientConfig{overrides: overrides}} } // 4.最终就是实例化这样一个对象 // DeferredLoadingClientConfig is a ClientConfig interface that is backed by a client config loader. // It is used in cases where the loading rules may change after you've instantiated them and you want to be sure that // the most recent rules are used. This is useful in cases where you bind flags to loading rule parameters before // the parse happens and you want your calling code to be ignorant of how the values are being mutated to avoid // passing extraneous information down a call stack type DeferredLoadingClientConfig struct { loader ClientConfigLoader overrides *ConfigOverrides fallbackReader io.Reader clientConfig ClientConfig loadingLock sync.Mutex // provided for testing icc InClusterConfig } ``` 合并Kubeconfig的效果如下: ![merged-config.png](../images/merged-config.png) ### 3.总结 (1)简单了解client-go的结构和对kube-config的使用 ================================================ FILE: k8s/client-go/10. 
Controller-runtime原理分析.md ================================================ - [1. Controller-runtime结构介绍](#1-controller-runtime----) - [2. Controller-runtime 底层原理](#2-controller-runtime-----) * [2.1 manager相关结构体介绍](#21-manager-------) * [2.2 controller相关结构体介绍](#22-controller-------) * [2.3 controller启动流程](#23-controller----) * [2.4 manager是如何启动controller的](#24-manager-----controller-) + [2.4.1 第一步-manager的初始化](#241-----manager----) + [2.4.2 第二步-将controller绑定到manager](#242------controller---manager) + [2.4.3 第三步-启动manager.start](#243-------managerstart) * [2.5 runtime cache](#25-runtime-cache) + [2.5.1 cache是什么](#251-cache---) + [2.5.2 cache初始化逻辑](#252-cache-----) - [3.总结](#3--) - [4. 参考](#4---) ### 1. Controller-runtime结构介绍 kubebuilder底层使用的就是Controller-runtime,Controller-runtime为 Controller 的开发提供了各种功能模块,每个模块中包括了一个 或多个实现,通过这些模块,开发者可以灵活地构建自己的 Controller,主要包括以下内容: (1) Client:用于读写 Kubernetes 资源对象的客户端。 (2) Cache:本地缓存,用于保存需要监听的 Kubernetes 资源。缓存提供了只读客户端, 用于从缓存中读取对象。缓存还可以注册处理方法(EventHandler),以响应更新的事件。 (3) Manager:用于控制多个 Controller,提供 Controller 共用的依赖项,如 Client、 Cache、Schemes 等。通过调用 Manager.Start 方法,可以启动 Controller。 (4) Controller:控制器,响应事件(Kubernetes 资源对象的创建、更新、删除)并 确保对象规范(Spec 字段)中指定的状态与系统状态匹配,如果不匹配,则控制器需要根 据事件的对象,通过协调器(Reconciler)进行同步。在实现上,Controller 是用于处理 reconcile.Requests 的工作队列,reconcile.Requests 包含了需要匹配状态的资源对象。 ① Controller 需要提供 Reconciler 来处理从工作队列中获取的请求。 ② Controller 需要配置相应的资源监听,根据监听到的 Event 生成 reconcile.Requests 并加入队列。 (5) Reconciler:为 Controller 提供同步的功能,Controller 可以随时通过资源对象的 Name 和 Namespace 来调用 Reconciler,调用时,Reconciler 将确保系统状态与资源对象 所表示的状态相匹配。例如,当某个 ReplicaSet 的副本数为 5,但系统中只有 3 个 Pod 时, 同步 ReplicaSet 资源的 Reconciler 需要新建两个 Pod,并将它们的 OwnerReference 字段 指向对应的 ReplicaSet。 ① Reconciler 包含了 Controller 所有的业务逻辑。 ② Reconciler 通常只处理单个对象类型,例如只处理 ReplicaSets 的 Reconciler,不 处理其他的对象类型。如果需要处理多种对象类型,需要实现多个 Controller。如果你 希望通过其他类型来触发 Reconciler,例如,通过 Pod 对象的事件来触发 ReplicaSet 的 Recon- ciler,则可以提供一个映射,通过该映射将触发 Reconciler 的类型映射到需要匹 配的类型。 ③ 提供给 Reconciler 的参数是需要匹配的资源对象的 Name 和 Namespace。 ④ Reconciler 不关心触发它的事件的内容和类型。例如,对于同步 ReplicaSet 资源的 Reconciler 来说,触发它的是 ReplicaSet 的创建还是更新并不重要,Reconciler 总是会比 较系统中相应的 Pod 数量和 ReplicaSet 中指定的副本数量。 (6) WebHook:准 入 WebHook(Admission WebHook) 是 扩 展 Kubernetes API 的 一种机制,WebHook 可以根据事件类型进行配置,比如资源对象的创建、删除、更改等 事件,当配置的事件发生时,Kubernetes 的 APIServer 会向 WebHook 发送准入请求 (AdmissionRequests),WebHook 可以对请求中的资源对象进行更改或准入验证,然后将 处理结果响应给 APIServer。 (7) Source:resource.Source 是 Controller.Watch 的参数,提供事件,事件通常是来 自 Kubernetes 的 APIServer(如 Pod 创建、更新和删除)。例如,source.Kind 使用指定 对象(通过 GroupVersionKind 指定)的 Kubernetes API Watch 接口来提供此对象的创建、 更新、删除事件。 ① Source 通过 Watch API 提供 Kubernetes 指定对象的事件流。 ② 建议开发者使用 Controller-runtime 中已有的 Source 实现,而不是自己实现此接口。 (8) EventHandler:handler.EventHandler 是 Controller.Watch 的 参 数, 用 于 将 事 件对应的 reconcile.Requests 加入队列。例如,从 Source 中接收到一个 Pod 的创建事 件,eventhandler.EnqueueHandler 会 根 据 Pod 的 Name 与 Namespace 生 成 reconcile. 
Requests 后,加入队列。 ① EventHandlers 处理事件的方式是将一个或多个 reconcile.Requests 加入队列。 ② 在 EventHandler 的处理中,事件所属的对象的类型(比如 Pod 的创建事件属于 Pod 对象),可能与 reconcile.Requests 所加入的对象类型相同。 ③ 事件所属的对象的类型也可能与 reconcile.Requests 所加入的对象类型不同。例如 将 Pod 的事件映射为所属的 ReplicaSet 的 reconcile.Requests。 ④ EventHandler 可能会将一个事件映射为多个 reconcile.Requests 并加入队列,多个 reconcile.Requests 可能属于一个对象类型,也可能涉及多个对象类型。例如,由于集群扩 展导致的 Node 事件。 ⑤ 在大多数情况下,建议开发者使用 Controller-runtime 中已有的 EventHandler 来 实现,而不是自己实现此接口。 (9) Predicate:predicate.Predicate 是 Controller.Watch 的参数,是用于过滤事件的 过滤器,过滤器可以复用或者组合。 ① Predicate 接口以事件作为输入,以布尔值作为输出,当返回 True 时,表示需要将 事件加入队列。 ② Predicate 是可选的。 ③ 建议开发者使用 Controller-runtime 中已有的 Predicate 实现,但可以使用其他 Predicate 进行过滤。 ![image-20220826141310531](../images/image-20220826141310531.png) Controller-runtime 核心流程如下: * Source 通过 Kubernetes APIServer 监听指定资源对象 * EventHandler 根据资源对象变化事件,将 reconcile.Request 加入队列 * 从队列中获取 reconcile.Request,并调用 Reconciler 进行同步 ![image-20220826142338724](../images/image-20220826142338724.png) ### 2. Controller-runtime 底层原理 #### 2.1 manager相关结构体介绍 Manager的方法 ``` type Manager interface { cluster.Cluster //cluster.Cluster 提供了一系列方法,以获取与集群相关的对象。 Add(Runnable) error //添加controller Elected() <-chan struct{} // 选举相关, 返回一个 Channel 结构,用于判断选举状态。当未配 置选举或当选 Leader 时,Channel 将被关闭。 AddMetricsExtraHandler(path string, handler http.Handler) error // metrics相关 AddHealthzCheck(name string, check healthz.Checker) error // 健康检查相关 AddReadyzCheck(name string, check healthz.Checker) error // 是否就绪 Start(ctx context.Context) error // 启动所有的controller GetWebhookServer() *webhook.Server GetLogger() logr.Logger GetControllerOptions() v1alpha1.ControllerConfigurationSpec } ``` Manager启动时Options介绍。这里介绍几个关键的。 (1) Scheme 结构。一般先通过 k8s.io/apimachinery/pkg/runtime 中的 NewScheme() 方法获取 Kubernetes 的 Scheme,然后再将 CRD 注册到 Scheme (2) MapperProvider 是一个函数对象,其定义为 func(c *rest.Config) (meta.RESTMapper,error),用于定义 Manager 如何获取 RESTMapper。默认通过 k8s.io/client-go 中的 DiscoveryClient 请求获取 Kube-APIServer。 (3) Logger 用于定义 Manager 的日志输出对象,默认使用 pkg/internal/log 包下的 全局参数 RuntimeLog。 (4)SyncPeriod 参数用于指定 Informer 重新同步并处理资源的时间间隔,默认为 10 小时。此参数也决定了 Controller 重新同步的时间间隔,每个 Controller 的时间间隔以此 参数为基准有 10% 的抖动,以避免多个 Controller 同时进行重新同步。 (5) Namespace 参数用于限制 Manager.Cache 只监听指定 Namespace 的资源,默认 情况下无限制。 (6) EventBroadcaster 参数用于提供 Manager,以获取 EventRecorder,当前已不 推荐使用,因为当Manager或Controller的生命周期短于EventBroadcaster的生命周期时, 可能会导致 goroutine 泄露。 ``` // Options are the arguments for creating a new Manager. type Options struct { // Scheme is the scheme used to resolve runtime.Objects to GroupVersionKinds / Resources // Defaults to the kubernetes/client-go scheme.Scheme, but it's almost always better // idea to pass your own scheme in. See the documentation in pkg/scheme for more information. Scheme *runtime.Scheme // MapperProvider provides the rest mapper used to map go types to Kubernetes APIs MapperProvider func(c *rest.Config) (meta.RESTMapper, error) SyncPeriod *time.Duration 。。。 } ```
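针对上面第 (1) 点的 Scheme 注册,给出一个最小示意(其中 zouxappv1 为假设的 CRD API 包名,仅作演示):

```go
// runtime:        k8s.io/apimachinery/pkg/runtime
// clientgoscheme: k8s.io/client-go/kubernetes/scheme
// ctrl:           sigs.k8s.io/controller-runtime
scheme := runtime.NewScheme()
_ = clientgoscheme.AddToScheme(scheme) // 先注册内置资源
_ = zouxappv1.AddToScheme(scheme)      // 再注册 CRD 类型(假设的 API 包)

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme: scheme,
})
if err != nil {
	panic(err)
}
```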
#### 2.2 controller相关结构体介绍 **接口** ``` type Controller interface { // 匿名接口,定义了 Reconcile(context.Context,Request) (Result,error) reconcile.Reconciler // Watch() 方法会从 source.Source 中 获 取 Event, 并 根 据 参 数 Eventhandler 来 决 定 如 何 入 队, 根 据 参 数 Predicates 进行 Event 过滤,Preficates 可能有多个,只有所有的 Preficates 都返回True 时,才会将 Event 发送给 Eventhandler 处理。 Watch(src source.Source, eventhandler handler.EventHandler, predicates ...predicate.Predicate) error // Controller 的启动方法,实现了 Controller 接 口的对象,也实现了 Runnable,因此,该方法可以被 Manager 管理。 Start(ctx context.Context) error // 获取 Controller 内的 Logger,用于日志输出。 GetLogger() logr.Logger } ``` **结构体实现** Controller 的实现在 pkg/internal/controller/controller.go 下,为结构体 Controller, Controller 结构体中包括的主要成员如下。 (1) Name string:必须设置,用于标识 Controller,会在 Controller 的日志输 出中进行关联。 (2) MaxConcurrentReconciles int:定义允许 reconcile.Reconciler 同时运行的最多个 数,默认为 1。 (3) Do reconcile.Reconciler:定义了 Reconcile() 方法,包含了 Controller 同步的业务 逻辑。Reconcile() 能在任意时刻被调用,接收一个对象的 Name 与 Namespace,并同步集 群当前实际状态至该对象被设置的期望状态。 (4) MakeQueue func() workqueue.RateLimitingInterface:用 于 在 Controller 启动时,创建工作队列。由于标准的 Kubernetes 工作队列创建后会立即启动,因此, 如果在 Controller 启动前就创建队列,在重复调用 controller.New() 方法创建 Controller 的情况下,就会导致 Goroutine 泄露。 (5) Queue workqueue.RateLimitingInterface:使用上面方法创建的工作队列。 (6) SetFields func(i interface{}) error:用 于 从 Manager 中 获 取 Controller 依 赖 的 方 法, 依 赖 包 括 Sourcess、EventHandlers 和 Predicates 等。 此 方 法 存 储 的 是 controllerManager.SetFields() 方法。 (7) Started Bool:用于表示 Controller 是否已经启动。 (8) CacheSyncTimeout time.Duration:定义了 Cache 完成同步的等待时长,超过时 长会被认为是同步失败。默认时长为 2 分钟。 (9) startWatches [ ]watchDescription:定 义 了 一 组 Watch 操 作 的 属 性, 会 在 Controller 启动时,根据属性进行 Watch 操作。watchDescription 的定义见代码清单 3-30,watchDescription 包 括 Event 的 源 source.Source、Event 的 入 队 方 法 handler. EventHandler 以及 Event 的过滤方法 predicate.Predicate。 ``` // Controller implements controller.Controller. type Controller struct { // Name is used to uniquely identify a Controller in tracing, logging and monitoring. Name is required. Name string // MaxConcurrentReconciles is the maximum number of concurrent Reconciles which can be run. Defaults to 1. MaxConcurrentReconciles int // Reconciler is a function that can be called at any time with the Name / Namespace of an object and // ensures that the state of the system matches the state specified in the object. // Defaults to the DefaultReconcileFunc. Do reconcile.Reconciler // MakeQueue constructs the queue for this controller once the controller is ready to start. // This exists because the standard Kubernetes workqueues start themselves immediately, which // leads to goroutine leaks if something calls controller.New repeatedly. MakeQueue func() workqueue.RateLimitingInterface // Queue is an listeningQueue that listens for events from Informers and adds object keys to // the Queue for processing Queue workqueue.RateLimitingInterface // SetFields is used to inject dependencies into other objects such as Sources, EventHandlers and Predicates // Deprecated: the caller should handle injected fields itself. SetFields func(i interface{}) error // mu is used to synchronize Controller setup mu sync.Mutex // Started is true if the Controller has been Started Started bool // ctx is the context that was passed to Start() and used when starting watches. 
// // According to the docs, contexts should not be stored in a struct: https://golang.org/pkg/context, // while we usually always strive to follow best practices, we consider this a legacy case and it should // undergo a major refactoring and redesign to allow for context to not be stored in a struct. ctx context.Context // CacheSyncTimeout refers to the time limit set on waiting for cache to sync // Defaults to 2 minutes if not set. CacheSyncTimeout time.Duration // startWatches maintains a list of sources, handlers, and predicates to start when the controller is started. startWatches []watchDescription // Log is used to log messages to users during reconciliation, or for example when a watch is started. Log logr.Logger // RecoverPanic indicates whether the panic caused by reconcile should be recovered. RecoverPanic bool } ``` #### 2.3 controller启动流程 controller跟随manager.start而启动。然后根据下面的流程运行。在c.Do.Reconcile函数中调用了我们实现的Reconcile函数进行真正的控制器逻辑处理。 ![image-20220826145939603](../images/image-20220826145939603.png) #### 2.4 manager是如何启动controller的 ##### 2.4.1 第一步-manager的初始化 一般在main函数就调用ctrl.NewManager函数进行初始化。ctrl.NewManager函数有2个参数,第一个参数就是k8s集群的*rest.Config, 第二个就是Options。就是manger结构体介绍的参数,比如可以自定义SyncPeriod等等。 ``` mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{ Scheme: scheme, MetricsBindAddress: metricsAddr, Port: 9443, HealthProbeBindAddress: probeAddr, LeaderElection: enableLeaderElection, LeaderElectionID: "ec7e1f70.github.com", // LeaderElectionReleaseOnCancel defines if the leader should step down voluntarily // when the Manager ends. This requires the binary to immediately end when the // Manager is stopped, otherwise, this setting is unsafe. Setting this significantly // speeds up voluntary leader transitions as the new leader don't have to wait // LeaseDuration time first. // // In the default scaffold provided, the program ends immediately after // the manager stops, so would be fine to enable this option. However, // if you are doing or is intended to do any operation such as perform cleanups // after the manager stops then its usage might be unsafe. // LeaderElectionReleaseOnCancel: true, }) ```
ctrl.NewManager实际就是初始化这个结构体,有了这个结构体就可以和k8s集群打交道了。 ``` return &controllerManager{ stopProcedureEngaged: pointer.Int64(0), cluster: cluster, runnables: runnables, errChan: errChan, recorderProvider: recorderProvider, resourceLock: resourceLock, metricsListener: metricsListener, metricsExtraHandlers: metricsExtraHandlers, controllerOptions: options.Controller, logger: options.Logger, elected: make(chan struct{}), port: options.Port, host: options.Host, certDir: options.CertDir, webhookServer: options.WebhookServer, leaseDuration: *options.LeaseDuration, renewDeadline: *options.RenewDeadline, retryPeriod: *options.RetryPeriod, healthProbeListener: healthProbeListener, readinessEndpointName: options.ReadinessEndpointName, livenessEndpointName: options.LivenessEndpointName, gracefulShutdownTimeout: *options.GracefulShutdownTimeout, internalProceduresStop: make(chan struct{}), leaderElectionStopped: make(chan struct{}), leaderElectionReleaseOnCancel: options.LeaderElectionReleaseOnCancel, }, nil ``` ##### 2.4.2 第二步-将controller绑定到manager 这一步需要调用SetupWithManager函数,这个是每个controller自己实现的。最简单就是使用通用的方法。 ``` // SetupWithManager sets up the controller with the Manager. func (r *PodCountReconciler) SetupWithManager(mgr ctrl.Manager) error { return ctrl.NewControllerManagedBy(mgr). For(&zouxappv1.PodCount{}). Complete(r) } ``` 详细地说,创建 Controller 基本分为 3 步。 **第一步**,通过 ControllerManagedBy(m manager.Manager) *Builder 方法实例化一个 Builder 对象,其中传入的 Manager 提供创建 Controller 所需的依赖。 这步骤的意思是,我定义了一个builder,绑定了manager ``` // Builder builds a Controller. type Builder struct { forInput ForInput ownsInput []OwnsInput watchesInput []WatchesInput mgr manager.Manager globalPredicates []predicate.Predicate ctrl controller.Controller ctrlOptions controller.Options name string } // ControllerManagedBy returns a new controller builder that will be started by the provided Manager. func ControllerManagedBy(m manager.Manager) *Builder { return &Builder{mgr: m} } ``` **第二步**,使用 For(object client.Object,opts ...ForOption)方法设置需要监听的资源 类型。 实际就是完善Builder的forInput结构体。 **注意:**这里就相对于调用了Watches(&source.Kind{Type: apiType}, &handler.EnqueueRequestForObject{}). 如果想一个controller监听多个对象,或者想实现自己的监听逻辑,比如不想监听删除操作,执行监听特定的update操作。就需要自己NewController来实现了。 ``` // For defines the type of Object being *reconciled*, and configures the ControllerManagedBy to respond to create / delete / // update events by *reconciling the object*. // This is the equivalent of calling // Watches(&source.Kind{Type: apiType}, &handler.EnqueueRequestForObject{}). func (blder *Builder) For(object client.Object, opts ...ForOption) *Builder { if blder.forInput.object != nil { blder.forInput.err = fmt.Errorf("For(...) should only be called once, could not assign multiple objects for reconciliation") return blder } input := ForInput{object: object} for _, opt := range opts { opt.ApplyToFor(&input) } blder.forInput = input return blder } ``` **第三步**,使用Complete函数将controller绑定到manager ``` // Complete builds the Application Controller. 
func (blder *Builder) Complete(r reconcile.Reconciler) error { _, err := blder.Build(r) return err } ``` **总结:** controller-runtime实际是通过builder这个对象,将mgr和controller绑定。 ##### 2.4.3 第三步-启动manager.start controllerManager.start会依次启动serveMetrics,serveHealthProbes,Webhooks,Caches,startLeaderElectionRunnables 这里就是关注如何启动每个controller的。manager.start->startLeaderElectionRunnables->cm.runnables.LeaderElection.Start -> go r.reconcile() -> fou循环go routinue启动每个controller ``` if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil { setupLog.Error(err, "problem running manager") os.Exit(1) } // Start starts the manager and waits indefinitely. // There is only two ways to have start return: // An error has occurred during in one of the internal operations, // such as leader election, cache start, webhooks, and so on. // Or, the context is cancelled. func (cm *controllerManager) Start(ctx context.Context) (err error) { cm.Lock() if cm.started { cm.Unlock() return errors.New("manager already started") } var ready bool defer func() { // Only unlock the manager if we haven't reached // the internal readiness condition. if !ready { cm.Unlock() } }() // Initialize the internal context. cm.internalCtx, cm.internalCancel = context.WithCancel(ctx) // This chan indicates that stop is complete, in other words all runnables have returned or timeout on stop request stopComplete := make(chan struct{}) defer close(stopComplete) // This must be deferred after closing stopComplete, otherwise we deadlock. defer func() { // https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/gettyimages-459889618-1533579787.jpg stopErr := cm.engageStopProcedure(stopComplete) if stopErr != nil { if err != nil { // Utilerrors.Aggregate allows to use errors.Is for all contained errors // whereas fmt.Errorf allows wrapping at most one error which means the // other one can not be found anymore. err = kerrors.NewAggregate([]error{err, stopErr}) } else { err = stopErr } } }() // Add the cluster runnable. if err := cm.add(cm.cluster); err != nil { return fmt.Errorf("failed to add cluster to runnables: %w", err) } // Metrics should be served whether the controller is leader or not. // (If we don't serve metrics for non-leaders, prometheus will still scrape // the pod but will get a connection refused). if cm.metricsListener != nil { cm.serveMetrics() } // Serve health probes. if cm.healthProbeListener != nil { cm.serveHealthProbes() } // First start any webhook servers, which includes conversion, validation, and defaulting // webhooks that are registered. // // WARNING: Webhooks MUST start before any cache is populated, otherwise there is a race condition // between conversion webhooks and the cache sync (usually initial list) which causes the webhooks // to never start because no cache can be populated. if err := cm.runnables.Webhooks.Start(cm.internalCtx); err != nil { if err != wait.ErrWaitTimeout { return err } } // Start and wait for caches. if err := cm.runnables.Caches.Start(cm.internalCtx); err != nil { if err != wait.ErrWaitTimeout { return err } } // Start the non-leaderelection Runnables after the cache has synced. if err := cm.runnables.Others.Start(cm.internalCtx); err != nil { if err != wait.ErrWaitTimeout { return err } } // Start the leader election and all required runnables. 
{ ctx, cancel := context.WithCancel(context.Background()) cm.leaderElectionCancel = cancel go func() { if cm.resourceLock != nil { if err := cm.startLeaderElection(ctx); err != nil { cm.errChan <- err } } else { // 启动每个controller // Treat not having leader election enabled the same as being elected. if err := cm.startLeaderElectionRunnables(); err != nil { cm.errChan <- err } close(cm.elected) } }() } ready = true cm.Unlock() select { case <-ctx.Done(): // We are done return nil case err := <-cm.errChan: // Error starting or running a runnable return err } } ```
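controller 被 manager 启动后,workqueue 里的每个 reconcile.Request 最终都会交给我们实现的 Reconcile 函数处理。下面是一个最小示意(沿用 2.4.2 中假设的 PodCountReconciler,并假设其内嵌了 client.Client):

```go
// 仅为示意:真实的业务逻辑按 CRD 的期望状态来写
func (r *PodCountReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var pc zouxappv1.PodCount
	if err := r.Get(ctx, req.NamespacedName, &pc); err != nil {
		// 对象已被删除等 NotFound 场景直接忽略,不再重试
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// ... 在这里对比期望状态与集群实际状态,并做出修正 ...
	return ctrl.Result{}, nil
}
```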
最终Controller就像内置的控制器一样,通过processNextWorkItem函数一个个处理。主要这里还可以通过**MaxConcurrentReconciles**提高并发。 ``` // Start implements controller.Controller. func (c *Controller) Start(ctx context.Context) error { // use an IIFE to get proper lock handling // but lock outside to get proper handling of the queue shutdown c.mu.Lock() if c.Started { return errors.New("controller was started more than once. This is likely to be caused by being added to a manager multiple times") } c.initMetrics() // Set the internal context. c.ctx = ctx c.Queue = c.MakeQueue() go func() { <-ctx.Done() c.Queue.ShutDown() }() wg := &sync.WaitGroup{} err := func() error { defer c.mu.Unlock() // TODO(pwittrock): Reconsider HandleCrash defer utilruntime.HandleCrash() // NB(directxman12): launch the sources *before* trying to wait for the // caches to sync so that they have a chance to register their intendeded // caches. for _, watch := range c.startWatches { c.Log.Info("Starting EventSource", "source", fmt.Sprintf("%s", watch.src)) if err := watch.src.Start(ctx, watch.handler, c.Queue, watch.predicates...); err != nil { return err } } // Start the SharedIndexInformer factories to begin populating the SharedIndexInformer caches c.Log.Info("Starting Controller") for _, watch := range c.startWatches { syncingSource, ok := watch.src.(source.SyncingSource) if !ok { continue } if err := func() error { // use a context with timeout for launching sources and syncing caches. sourceStartCtx, cancel := context.WithTimeout(ctx, c.CacheSyncTimeout) defer cancel() // WaitForSync waits for a definitive timeout, and returns if there // is an error or a timeout if err := syncingSource.WaitForSync(sourceStartCtx); err != nil { err := fmt.Errorf("failed to wait for %s caches to sync: %w", c.Name, err) c.Log.Error(err, "Could not wait for Cache to sync") return err } return nil }(); err != nil { return err } } // All the watches have been started, we can reset the local slice. // // We should never hold watches more than necessary, each watch source can hold a backing cache, // which won't be garbage collected if we hold a reference to it. c.startWatches = nil // Launch workers to process resources c.Log.Info("Starting workers", "worker count", c.MaxConcurrentReconciles) wg.Add(c.MaxConcurrentReconciles) for i := 0; i < c.MaxConcurrentReconciles; i++ { go func() { defer wg.Done() // Run a worker thread that just dequeues items, processes them, and marks them done. // It enforces that the reconcileHandler is never invoked concurrently with the same object. for c.processNextWorkItem(ctx) { } }() } c.Started = true return nil }() if err != nil { return err } <-ctx.Done() c.Log.Info("Shutdown signal received, waiting for all workers to finish") wg.Wait() c.Log.Info("All workers finished") return nil } ``` #### 2.5 runtime cache ##### 2.5.1 cache是什么 Cache 接口定义了如下两个接口: (1)client.Reader:用于从 Cache 中获取及列举 Kubernetes 集群的资源。 (2)Informers:可为不同的 GVK 创建或获取对应的 Informer,并将 Index 添加到对 应的 Informer 中。 Kubernetes 是典型的 Server-Client 的架构,APIServer 作为集群统一的操作入口,任 何对资源所做的操作(包括增删改查)都必须经过 APIServer。为了减轻 APIServer 的压力, Controller-runtime 抽象出一个 Cache 层,Client 端对 APIServer 数据的读取和监听操作都 将通过 Cache 层来进行。 ``` // Cache knows how to load Kubernetes objects, fetch informers to request // to receive events for Kubernetes objects (at a low-level), // and add indices to fields on the objects stored in the cache. type Cache interface { // Cache acts as a client to objects stored in the cache. client.Reader // Cache loads informers and adds field indices. Informers } ```
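结合 2.5.1 中 Informers 接口“将 Index 添加到对应的 Informer 中”的说明,下面给出一个最小示意(字段名 spec.nodeName 与节点名 node-1 仅为演示,mgr 为上文创建的 Manager):

```go
// corev1: k8s.io/api/core/v1;client: sigs.k8s.io/controller-runtime/pkg/client
// 为缓存中的 Pod informer 注册一个按 spec.nodeName 的字段索引
_ = mgr.GetFieldIndexer().IndexField(context.Background(), &corev1.Pod{}, "spec.nodeName",
	func(obj client.Object) []string {
		pod := obj.(*corev1.Pod)
		return []string{pod.Spec.NodeName}
	})

// 之后通过 mgr.GetClient() 的读操作都会走这份 Cache,可以直接按索引过滤
var pods corev1.PodList
_ = mgr.GetClient().List(context.Background(), &pods,
	client.InNamespace("default"), client.MatchingFields{"spec.nodeName": "node-1"})
```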
##### 2.5.2 cache初始化逻辑 在new Manager的时候就初始化了缓存,具体的步骤是 New-> cluster.New -> ``` // New returns a new Manager for creating Controllers. func New(config *rest.Config, options Options) (Manager, error) { // Set default values for options fields options = setOptionsDefaults(options) cluster, err := cluster.New(config, func(clusterOptions *cluster.Options) { clusterOptions.Scheme = options.Scheme clusterOptions.MapperProvider = options.MapperProvider clusterOptions.Logger = options.Logger clusterOptions.SyncPeriod = options.SyncPeriod clusterOptions.Namespace = options.Namespace clusterOptions.NewCache = options.NewCache clusterOptions.NewClient = options.NewClient clusterOptions.ClientDisableCacheFor = options.ClientDisableCacheFor clusterOptions.DryRunClient = options.DryRunClient clusterOptions.EventBroadcaster = options.EventBroadcaster //nolint:staticcheck }) options.NewCache初始化cache // Create the cache for the cached read client and registering informers cache, err := options.NewCache(config, cache.Options{Scheme: options.Scheme, Mapper: mapper, Resync: options.SyncPeriod, Namespace: options.Namespace}) if err != nil { return nil, err } // Allow newCache to be mocked if options.NewCache == nil { options.NewCache = cache.New } ``` 在 Controller Manager 的初始化启动过程中,将会构建 Cache 层,以供 Manager 使 用。在用户没有指定 Cache 初始化函数的前提下,将使用 Controller-runtime 默认提供的 Cache 初始化函数,默认 Cache 初始化的流程如下: ``` // New initializes and returns a new Cache. func New(config *rest.Config, opts Options) (Cache, error) { opts, err := defaultOpts(config, opts) if err != nil { return nil, err } selectorsByGVK, err := convertToSelectorsByGVK(opts.SelectorsByObject, opts.DefaultSelector, opts.Scheme) if err != nil { return nil, err } disableDeepCopyByGVK, err := convertToDisableDeepCopyByGVK(opts.UnsafeDisableDeepCopyByObject, opts.Scheme) if err != nil { return nil, err } im := internal.NewInformersMap(config, opts.Scheme, opts.Mapper, *opts.Resync, opts.Namespace, selectorsByGVK, disableDeepCopyByGVK) return &informerCache{InformersMap: im}, nil } ``` (1) 设 置 默 认 参 数:若 Scheme 为 空, 则 设 置 为 scheme.Scheme ;若 Mapper 为 空, 则 通 过 apiutil.NewDiscoveryRESTMapper 基 于 Discovery 的 信 息 构 建 出 一 个 RESTMapper,用于管理所有 Object 的信息;若同步时间为空,则将 Informer 的同步时 间设置为 10 小时。 (2)初始化 InformersMap,为 3 种不同类型的 Object(structured、unstructured、 metadata-only)分别构建 InformersMap。 (3)初始化 specificInformersMap:该接口通过 Object 与 GVK 的组合信息创建并缓 存 Informers。 (4)定义 List-Watch 函数:为 3 种不同类型的 Object 实现 List-Watch 函数,通过 该函数可对 GVK 进行 List 和 Watch 操作。 通过 Cache 的初始化流程,我们可以看出 Cache 主要创建了 InformersMap,Scheme 中的每个 GVK 都会创建对应的 Informer,再通过 informersByGVK 的 Map,实现 GVK 到 Informer的映射;每个Informer都会通过List-Watch函数对相应的GVK进行List和Watch操作。 ![image-20220829162503586](../images/image-20220829162503586.png) Cache 启动的核心是启动创建的所有 Informer ``` // Start calls Run on each of the informers and sets started to true. Blocks on the context. func (m *InformersMap) Start(ctx context.Context) error { go m.structured.Start(ctx) go m.unstructured.Start(ctx) go m.metadata.Start(ctx) <-ctx.Done() return nil } // Start calls Run on each of the informers and sets started to true. Blocks on the context. // It doesn't return start because it can't return an error, and it's not a runnable directly. 
func (ip *specificInformersMap) Start(ctx context.Context) { func() { ip.mu.Lock() defer ip.mu.Unlock() // Set the stop channel so it can be passed to informers that are added later ip.stop = ctx.Done() // Start each informer for _, informer := range ip.informersByGVK { go informer.Informer.Run(ctx.Done()) } // Set started to true so we immediately start any informers added later. ip.started = true close(ip.startWait) }() <-ctx.Done() } ``` Informer 的启动流程主要包含以下 3 个步骤: (1)初始化 Delta FIFO 队列。 (2)创建内部 Controller:配置 Delta FIFO 队列和事件的处理函数。 (3)启动 Controller:创建 Reflector,负责监听 APIServer 上指定的 GVK,将 Add、 Update、Delete 变更事件写入 Delta FIFO 队列中,作为变更事件的生产者;Controller 中 的事件处理函数 HandleDeltas() 会消费这些变更事件,负责将更新写入本地 Indexer,同 时将这些 Add、Update、Delete 事件分发给之前注册的监听器。 ### 3.总结 controller-runtime其实就是利用client-go informer那套,底层是创建shareIndexInformer。 controller-runtime通过屏蔽底层细节,让crd operator的实现非常简单。梳理一下,整理的工作流程如下所示: ![image-20220826161016681](../images/image-20220826161016681.png) ### 4. 参考 云原生应用开发:Operator原理与实践 ================================================ FILE: k8s/client-go/2-clientGo提供的四种客户端.md ================================================ Table of Contents ================= * [0. 四种客户端简介](#0-四种客户端简介) * [1.discovery](#1discovery) * [1.1 ServerGroups](#11-servergroups) * [1.2 ServerGroupsAndResources](#12-servergroupsandresources) * [1.3 缓存](#13-缓存) * [1.4 实例展示](#14-实例展示) * [2.restClient客户端](#2restclient客户端) * [3.clientSet客户端](#3clientset客户端) * [2.1 Clientset的定义](#21-clientset的定义) * [4.DynamicClient客户端](#4dynamicclient客户端) ### 0. 四种客户端简介 client-go的客户端对象有4个,作用各有不同: - RESTClient: 是对HTTP Request进行了封装,实现了RESTful风格的API。其他客户端都是在RESTClient基础上的实现。可与用于k8s内置资源和CRD资源 - ClientSet:是对k8s内置资源对象的客户端的集合,默认情况下,不能操作CRD资源,但是通过client-gen代码生成的话,也是可以操作CRD资源的。 - DynamicClient:不仅能对K8S内置资源进行处理,还可以对CRD资源进行处理,不需要client-gen生成代码即可实现。 - DiscoveryClient:用于发现kube-apiserver所支持的资源组、资源版本、资源信息(即Group、Version、Resources)。 ![client](../images/client.png) RESTClient是最基础的客户端。RESTClient对HTTP Request进行了封装,实现了RESTful风格的API。ClientSet、DynamicClient及DiscoveryClient客户端都是基于RESTClient实现的。 ClientSet在RESTClient的基础上封装了对Resource和Version的管理方法。每一个Resource可以理解为一个客户端,而ClientSet则是多个客户端的集合,每一个Resource和Version都以函数的方式暴露给开发者。ClientSet只能够处理Kubernetes内置资源,它是通过client-gen代码生成器自动生成的。 DynamicClient与ClientSet最大的不同之处是,ClientSet仅能访问Kubernetes自带的资源(即Client集合内的资源),不能直接访问CRD自定义资源。DynamicClient能够处理Kubernetes中的所有资源对象,包括Kubernetes内置资源与CRD自定义资源。 DiscoveryClient发现客户端,用于发现kube-apiserver所支持的资源组、资源版本、资源信息(即Group、Versions、Resources)。以上4种客户端:RESTClient、ClientSet、DynamicClient、DiscoveryClient都可以通过kubeconfig配置信息连接到指定的KubernetesAPI Server。 **总结下**:RESTCLient、ClientSet和DynamicClient都可以对K8S内置资源和CRD资源进行操作。只是clientSet需要生成代码才能操作CRD资源。 而clientSet 和dynamicClient不同在于,dynamicClient可以操作任意的对象,clientset初始化是只能指定一种对象操作。
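在进入各个客户端的细节之前,先用一小段代码直观感受 clientset 与 dynamicClient 的差别(仅为示意,config 的获取方式见上一篇,错误处理省略):

```go
// clientset:编译期就确定了资源类型,返回强类型对象
clientset, _ := kubernetes.NewForConfig(config)
deploys, _ := clientset.AppsV1().Deployments("default").List(metav1.ListOptions{})
fmt.Println("typed:", len(deploys.Items))

// dynamicClient:通过 GVR 在运行期指定任意资源(包括 CRD),返回 unstructured 对象
dyn, _ := dynamic.NewForConfig(config)
gvr := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}
list, _ := dyn.Resource(gvr).Namespace("default").List(metav1.ListOptions{})
fmt.Println("unstructured:", len(list.Items))
```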
### 1.discovery discovery包的主要作用就是提供当前k8s集群支持哪些资源以及版本信息。 Kubernetes API Server暴露出/api和/apis接口。DiscoveryClient通过RESTClient分别请求/api和/apis接口,从而获取Kubernetes API Server所支持的资源组、资源版信息。这个是通过ServerGroups函数实现的 有了group, version信息后,但是还是不够,因为还没有具体到资源。 ServerGroupsAndResources 就获得了所有的资源信息(所有的GVR资源信息),而在Resource资源的定义中,会定义好该资源支持哪些操作:list, delelte ,get等等。 所以kubectl中就使用discovery做了资源的校验。获取所有资源的版本信息,以及支持的操作。就可以判断客户端当前操作是否合理。 #### 1.1 ServerGroups staging/src/k8s.io/client-go/discovery/discovery_client.go ``` // ServerGroups returns the supported groups, with information like supported versions and the // preferred version. func (d *DiscoveryClient) ServerGroups() (apiGroupList *metav1.APIGroupList, err error) { // Get the groupVersions exposed at /api v := &metav1.APIVersions{} // 先请求 https://192.168.0.4:6443/api,获得core下面的组 err = d.restClient.Get().AbsPath(d.LegacyPrefix).Do().Into(v) apiGroup := metav1.APIGroup{} if err == nil && len(v.Versions) != 0 { apiGroup = apiVersionsToAPIGroup(v) } if err != nil && !errors.IsNotFound(err) && !errors.IsForbidden(err) { return nil, err } // Get the groupVersions exposed at /apis apiGroupList = &metav1.APIGroupList{} // 再请求https://192.168.0.4:6443/api ,获得其他的组 err = d.restClient.Get().AbsPath("/apis").Do().Into(apiGroupList) if err != nil && !errors.IsNotFound(err) && !errors.IsForbidden(err) { return nil, err } // to be compatible with a v1.0 server, if it's a 403 or 404, ignore and return whatever we got from /api if err != nil && (errors.IsNotFound(err) || errors.IsForbidden(err)) { apiGroupList = &metav1.APIGroupList{} } // prepend the group retrieved from /api to the list if not empty if len(v.Versions) != 0 { apiGroupList.Groups = append([]metav1.APIGroup{apiGroup}, apiGroupList.Groups...) } return apiGroupList, nil } ```
apiGroupList 就是获取所有的 组,每个组所有的version信息 ``` // APIGroupList is a list of APIGroup, to allow clients to discover the API at // /apis. type APIGroupList struct { TypeMeta `json:",inline"` // groups is a list of APIGroup. Groups []APIGroup `json:"groups" protobuf:"bytes,1,rep,name=groups"` } // +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object // APIGroup contains the name, the supported versions, and the preferred version // of a group. type APIGroup struct { TypeMeta `json:",inline"` // name is the name of the group. Name string `json:"name" protobuf:"bytes,1,opt,name=name"` // versions are the versions supported in this group. Versions []GroupVersionForDiscovery `json:"versions" protobuf:"bytes,2,rep,name=versions"` // preferredVersion is the version preferred by the API server, which // probably is the storage version. // +optional PreferredVersion GroupVersionForDiscovery `json:"preferredVersion,omitempty" protobuf:"bytes,3,opt,name=preferredVersion"` // a map of client CIDR to server address that is serving this group. // This is to help clients reach servers in the most network-efficient way possible. // Clients can use the appropriate server address as per the CIDR that they match. // In case of multiple matches, clients should use the longest matching CIDR. // The server returns only those CIDRs that it thinks that the client can match. // For example: the master will return an internal IP CIDR only, if the client reaches the server using an internal IP. // Server looks at X-Forwarded-For header or X-Real-Ip header or request.RemoteAddr (in that order) to get the client IP. // +optional ServerAddressByClientCIDRs []ServerAddressByClientCIDR `json:"serverAddressByClientCIDRs,omitempty" protobuf:"bytes,4,rep,name=serverAddressByClientCIDRs"` } ``` 直接访问 /api /apis就能或者 gruop, version信息。 ``` root@k8s-master:~# curl https://192.168.0.4:6443/api --cert /opt/kubernetes/ssl/server.pem --key /opt/kubernetes/ssl/server-key.pem --cacert /opt/kubernetes/ssl/ca.pem{ "kind": "APIVersions", "versions": [ //这里省略了 gruop=core,其实core也是我们后面的称号,可以认为没有gruop的概念。 "v1" ], "serverAddressByClientCIDRs": [ { "clientCIDR": "0.0.0.0/0", "serverAddress": "192.168.0.4:6443" } ] } root@k8s-master:~# curl https://192.168.0.4:6443/apis --cert /opt/kubernetes/ssl/server.pem --key /opt/kubernetes/ssl/server-key.pem --cacert /opt/kubernetes/ssl/ca.pem { "kind": "APIGroupList", "apiVersion": "v1", "groups": [ { "name": "apiregistration.k8s.io", "versions": [ { "groupVersion": "apiregistration.k8s.io/v1", "version": "v1" }, { "groupVersion": "apiregistration.k8s.io/v1beta1", "version": "v1beta1" } ], "preferredVersion": { "groupVersion": "apiregistration.k8s.io/v1", "version": "v1" } }, ... } ``` #### 1.2 ServerGroupsAndResources ``` func ServerGroupsAndResources(d DiscoveryInterface) ([]*metav1.APIGroup, []*metav1.APIResourceList, error) { ... groupVersionResources, failedGroups := fetchGroupVersionResources(d, sgs) ... } // fetchServerResourcesForGroupVersions uses the discovery client to fetch the resources for the specified groups in parallel. func fetchGroupVersionResources(d DiscoveryInterface, apiGroups *metav1.APIGroupList) (map[schema.GroupVersion]*metav1.APIResourceList, map[schema.GroupVersion]error) { for _, apiGroup := range apiGroups.Groups { for _, version := range apiGroup.Versions { apiResourceList, err := d.ServerResourcesForGroupVersion(groupVersion.String()) } // ServerResourcesForGroupVersion returns the supported resources for a group and version. 
func (d *DiscoveryClient) ServerResourcesForGroupVersion(groupVersion string) (resources *metav1.APIResourceList, err error) { url := url.URL{} if len(groupVersion) == 0 { return nil, fmt.Errorf("groupVersion shouldn't be empty") } // 如果是core v1,直接访问 curl https://192.168.0.4:6443/api/v1, 获得所有的资源 if len(d.LegacyPrefix) > 0 && groupVersion == "v1" { url.Path = d.LegacyPrefix + "/" + groupVersion } else { url.Path = "/apis/" + groupVersion } resources = &metav1.APIResourceList{ GroupVersion: groupVersion, } err = d.restClient.Get().AbsPath(url.String()).Do().Into(resources) if err != nil { // ignore 403 or 404 error to be compatible with an v1.0 server. if groupVersion == "v1" && (errors.IsNotFound(err) || errors.IsForbidden(err)) { return resources, nil } return nil, err } return resources, nil } ``` 实践: ``` root@k8s-master:~# curl https://192.168.0.4:6443/api/v1 --cert /opt/kubernetes/ssl/server.pem --key /opt/kubernetes/ssl/server-key.pem --cacert /opt/kubernetes/ssl/ca.pem { //省略了很多输出 "kind": "APIResourceList", "groupVersion": "v1", "resources": [ { "name": "bindings", "singularName": "", "namespaced": true, "kind": "Binding", "verbs": [ "create" ] { "name": "pods", "singularName": "", "namespaced": true, "kind": "Pod", "verbs": [ "create", "delete", "deletecollection", "get", "list", "patch", "update", "watch" ], "shortNames": [ "po" ], "categories": [ "all" ], "storageVersionHash": "xPOwRZ+Yhw8=" } ``` #### 1.3 缓存 DiscoveryClient可以将资源相关信息存储于本地,默认存储位置为~/.kube/cache和~/.kube/http-cache。缓存可以减轻client-go对KubernetesAPI Server的访问压力。默认每10分钟与Kubernetes API Server同步一次,同步周期较长,因为资源组、源版本、资源信息一般很少变动。本地缓存的DiscoveryClient如图5-4所示。DiscoveryClient第一次获取资源组、资源版本、资源信息时,首先会查询本地缓存,如果数据不存在(没有命中)则请求Kubernetes API Server接口(回源),Cache将Kubernetes API Server响应的数据存储在本地一份并返回给DiscoveryClient。当下一次DiscoveryClient再次获取资源信息时,会将数据直接从本地缓存返回(命中)给DiscoveryClient。本地缓存的默认存储周期为10分钟。代码示例如下: staging/src/k8s.io/client-go/discovery/cached/disk/cached_discovery.go ``` func (d *CachedDiscoveryClient) getCachedFile(filename string) ([]byte, error) { // after invalidation ignore cache files not created by this process d.mutex.Lock() _, ourFile := d.ourFiles[filename] if d.invalidated && !ourFile { d.mutex.Unlock() return nil, errors.New("cache invalidated") } d.mutex.Unlock() file, err := os.Open(filename) if err != nil { return nil, err } defer file.Close() fileInfo, err := file.Stat() if err != nil { return nil, err } if time.Now().After(fileInfo.ModTime().Add(d.ttl)) { return nil, errors.New("cache expired") } // the cache is present and its valid. Try to read and use it. 
cachedBytes, err := ioutil.ReadAll(file) if err != nil { return nil, err } d.mutex.Lock() defer d.mutex.Unlock() d.fresh = d.fresh && ourFile return cachedBytes, nil } ``` #### 1.4 实例展示 ``` package main import ( "fmt" "k8s.io/apimachinery/pkg/runtime/schema" "k8s.io/client-go/discovery" "k8s.io/client-go/tools/clientcmd" ) func main() { // 加载kubeconfig文件,生成config对象 config, err := clientcmd.BuildConfigFromFlags("", "D:\\coding\\config") if err != nil { panic(err) } // discovery.NewDiscoveryClientForConfigg函数通过config实例化discoveryClient对象 discoveryClient, err := discovery.NewDiscoveryClientForConfig(config) if err != nil { panic(err) } // discoveryClient.ServerGroupsAndResources 返回API Server所支持的资源组、资源版本、资源信息 _, APIResourceList, err := discoveryClient.ServerGroupsAndResources() if err != nil { panic(err) } // 输出所有资源信息 for _, list := range APIResourceList { gv, err := schema.ParseGroupVersion(list.GroupVersion) if err != nil { panic(err) } for _, resource := range list.APIResources { fmt.Printf("NAME: %v, GROUP: %v, VERSION: %v \n", resource.Name, gv.Group, gv.Version) } } } // 测试 go run .\discoveryClient-example.go NAME: bindings, GROUP: , VERSION: v1 NAME: componentstatuses, GROUP: , VERSION: v1 NAME: configmaps, GROUP: , VERSION: v1 NAME: endpoints, GROUP: , VERSION: v1 NAME: events, GROUP: , VERSION: v1 NAME: limitranges, GROUP: , VERSION: v1 NAME: namespaces, GROUP: , VERSION: v1 NAME: namespaces/finalize, GROUP: , VERSION: v1 NAME: namespaces/status, GROUP: , VERSION: v1 NAME: nodes, GROUP: , VERSION: v1 NAME: nodes/proxy, GROUP: , VERSION: v1 NAME: nodes/status, GROUP: , VERSION: v1 NAME: persistentvolumeclaims, GROUP: , VERSION: v1 NAME: persistentvolumeclaims/status, GROUP: , VERSION: v1 NAME: persistentvolumes, GROUP: , VERSION: v1 NAME: persistentvolumes/status, GROUP: , VERSION: v1 NAME: pods, GROUP: , VERSION: v1 NAME: pods/attach, GROUP: , VERSION: v1 NAME: pods/binding, GROUP: , VERSION: v1 NAME: pods/eviction, GROUP: , VERSION: v1 NAME: pods/exec, GROUP: , VERSION: v1 NAME: pods/log, GROUP: , VERSION: v1 NAME: pods/portforward, GROUP: , VERSION: v1 NAME: pods/proxy, GROUP: , VERSION: v1 NAME: pods/status, GROUP: , VERSION: v1 NAME: podtemplates, GROUP: , VERSION: v1 NAME: replicationcontrollers, GROUP: , VERSION: v1 NAME: replicationcontrollers/scale, GROUP: , VERSION: v1 NAME: replicationcontrollers/status, GROUP: , VERSION: v1 NAME: resourcequotas, GROUP: , VERSION: v1 NAME: resourcequotas/status, GROUP: , VERSION: v1 NAME: secrets, GROUP: , VERSION: v1 NAME: serviceaccounts, GROUP: , VERSION: v1 NAME: services, GROUP: , VERSION: v1 NAME: services/proxy, GROUP: , VERSION: v1 NAME: services/status, GROUP: , VERSION: v1 NAME: apiservices, GROUP: apiregistration.k8s.io, VERSION: v1 NAME: apiservices/status, GROUP: apiregistration.k8s.io, VERSION: v1 NAME: apiservices, GROUP: apiregistration.k8s.io, VERSION: v1beta1 NAME: apiservices/status, GROUP: apiregistration.k8s.io, VERSION: v1beta1 NAME: ingresses, GROUP: extensions, VERSION: v1beta1 NAME: ingresses/status, GROUP: extensions, VERSION: v1beta1 NAME: controllerrevisions, GROUP: apps, VERSION: v1 NAME: daemonsets, GROUP: apps, VERSION: v1 NAME: daemonsets/status, GROUP: apps, VERSION: v1 NAME: deployments, GROUP: apps, VERSION: v1 NAME: deployments/scale, GROUP: apps, VERSION: v1 NAME: deployments/status, GROUP: apps, VERSION: v1 NAME: replicasets, GROUP: apps, VERSION: v1 NAME: replicasets/scale, GROUP: apps, VERSION: v1 NAME: replicasets/status, GROUP: apps, VERSION: v1 NAME: statefulsets, GROUP: apps, VERSION: v1 NAME: 
statefulsets/scale, GROUP: apps, VERSION: v1 NAME: statefulsets/status, GROUP: apps, VERSION: v1 NAME: events, GROUP: events.k8s.io, VERSION: v1beta1 NAME: tokenreviews, GROUP: authentication.k8s.io, VERSION: v1 NAME: tokenreviews, GROUP: authentication.k8s.io, VERSION: v1beta1 NAME: localsubjectacce***eviews, GROUP: authorization.k8s.io, VERSION: v1 NAME: selfsubjectacce***eviews, GROUP: authorization.k8s.io, VERSION: v1 NAME: selfsubjectrulesreviews, GROUP: authorization.k8s.io, VERSION: v1 NAME: subjectacce***eviews, GROUP: authorization.k8s.io, VERSION: v1 NAME: localsubjectacce***eviews, GROUP: authorization.k8s.io, VERSION: v1beta1 NAME: selfsubjectacce***eviews, GROUP: authorization.k8s.io, VERSION: v1beta1 NAME: selfsubjectrulesreviews, GROUP: authorization.k8s.io, VERSION: v1beta1 NAME: subjectacce***eviews, GROUP: authorization.k8s.io, VERSION: v1beta1 NAME: horizontalpodautoscalers, GROUP: autoscaling, VERSION: v1 NAME: horizontalpodautoscalers/status, GROUP: autoscaling, VERSION: v1 NAME: horizontalpodautoscalers, GROUP: autoscaling, VERSION: v2beta1 NAME: horizontalpodautoscalers/status, GROUP: autoscaling, VERSION: v2beta1 NAME: horizontalpodautoscalers, GROUP: autoscaling, VERSION: v2beta2 NAME: horizontalpodautoscalers/status, GROUP: autoscaling, VERSION: v2beta2 NAME: jobs, GROUP: batch, VERSION: v1 NAME: jobs/status, GROUP: batch, VERSION: v1 NAME: cronjobs, GROUP: batch, VERSION: v1beta1 NAME: cronjobs/status, GROUP: batch, VERSION: v1beta1 NAME: certificatesigningrequests, GROUP: certificates.k8s.io, VERSION: v1beta1 NAME: certificatesigningrequests/approval, GROUP: certificates.k8s.io, VERSION: v1beta1 NAME: certificatesigningrequests/status, GROUP: certificates.k8s.io, VERSION: v1beta1 NAME: networkpolicies, GROUP: networking.k8s.io, VERSION: v1 NAME: ingressclasses, GROUP: networking.k8s.io, VERSION: v1beta1 NAME: ingresses, GROUP: networking.k8s.io, VERSION: v1beta1 NAME: ingresses/status, GROUP: networking.k8s.io, VERSION: v1beta1 NAME: poddisruptionbudgets, GROUP: policy, VERSION: v1beta1 NAME: poddisruptionbudgets/status, GROUP: policy, VERSION: v1beta1 NAME: podsecuritypolicies, GROUP: policy, VERSION: v1beta1 NAME: clusterrolebindings, GROUP: rbac.authorization.k8s.io, VERSION: v1 NAME: clusterroles, GROUP: rbac.authorization.k8s.io, VERSION: v1 NAME: rolebindings, GROUP: rbac.authorization.k8s.io, VERSION: v1 NAME: roles, GROUP: rbac.authorization.k8s.io, VERSION: v1 NAME: clusterrolebindings, GROUP: rbac.authorization.k8s.io, VERSION: v1beta1 NAME: clusterroles, GROUP: rbac.authorization.k8s.io, VERSION: v1beta1 NAME: rolebindings, GROUP: rbac.authorization.k8s.io, VERSION: v1beta1 NAME: roles, GROUP: rbac.authorization.k8s.io, VERSION: v1beta1 NAME: csidrivers, GROUP: storage.k8s.io, VERSION: v1 NAME: csinodes, GROUP: storage.k8s.io, VERSION: v1 NAME: storageclasses, GROUP: storage.k8s.io, VERSION: v1 NAME: volumeattachments, GROUP: storage.k8s.io, VERSION: v1 NAME: volumeattachments/status, GROUP: storage.k8s.io, VERSION: v1 NAME: csidrivers, GROUP: storage.k8s.io, VERSION: v1beta1 NAME: csinodes, GROUP: storage.k8s.io, VERSION: v1beta1 NAME: storageclasses, GROUP: storage.k8s.io, VERSION: v1beta1 NAME: volumeattachments, GROUP: storage.k8s.io, VERSION: v1beta1 NAME: mutatingwebhookconfigurations, GROUP: admissionregistration.k8s.io, VERSION: v1 NAME: validatingwebhookconfigurations, GROUP: admissionregistration.k8s.io, VERSION: v1 NAME: mutatingwebhookconfigurations, GROUP: admissionregistration.k8s.io, VERSION: v1beta1 NAME: 
validatingwebhookconfigurations, GROUP: admissionregistration.k8s.io, VERSION: v1beta1 NAME: customresourcedefinitions, GROUP: apiextensions.k8s.io, VERSION: v1 NAME: customresourcedefinitions/status, GROUP: apiextensions.k8s.io, VERSION: v1 NAME: customresourcedefinitions, GROUP: apiextensions.k8s.io, VERSION: v1beta1 NAME: customresourcedefinitions/status, GROUP: apiextensions.k8s.io, VERSION: v1beta1 NAME: priorityclasses, GROUP: scheduling.k8s.io, VERSION: v1 NAME: priorityclasses, GROUP: scheduling.k8s.io, VERSION: v1beta1 NAME: leases, GROUP: coordination.k8s.io, VERSION: v1 NAME: leases, GROUP: coordination.k8s.io, VERSION: v1beta1 NAME: runtimeclasses, GROUP: node.k8s.io, VERSION: v1beta1 NAME: endpointslices, GROUP: discovery.k8s.io, VERSION: v1beta1 ``` ### 2.restClient客户端 rest.RESTClientFor函数通过kubeconfig配置信息实例化RESTClient对象,RESTClient对象构建HTTP请求参数,例如Get函数设置请求方法为get操作,它还支持Post、Put、Delete、Patch,list, watch等请求方法。 rest由于是三个client的父类,这里介绍详细一点。 ``` rest目录如下, 添加了每个文件的功能。代码就不一一展示 │ BUILD │ client.go 初始化restClient,从初始化的过程中,可以看出来使用了令牌桶限速。同时实现了Get,put等方法,就是设置http请求的verb字段 │ client_test.go │ config.go 处理kubeconfig的一些函数 │ config_test.go │ OWNERS │ plugin.go 插件,从代码中看,目前只有auth插件 │ plugin_test.go │ request.go 处理发送http请求相关的函数, get, list等等都在这 │ request_test.go │ transport.go 还是处理http请求相关的函数,http中的transport │ urlbackoff.go 处理backoff │ urlbackoff_test.go │ url_utils.go 处理url,定义了defaultUrl │ url_utils_test.go │ zz_generated.deepcopy.go └─watch BUILD decoder.go 对watch事件对象解码 decoder_test.go encoder.go 对watch事件对象编码 encoder_test.go ``` restClient并没有直接调用create,get等资源的接口。它需要自己确定url,访问资源。如下的例子: ``` package main import ( "fmt" corev1 "k8s.io/api/core/v1" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" "k8s.io/client-go/kubernetes/scheme" "k8s.io/client-go/rest" "k8s.io/client-go/tools/clientcmd" ) func main() { // 加载kubeconfig文件,生成config对象 config, err := clientcmd.BuildConfigFromFlags("", "D:\\coding\\config") if err != nil { panic(err) } // 配置API路径和请求的资源组/资源版本信息 config.APIPath = "api" config.GroupVersion = &corev1.SchemeGroupVersion config.NegotiatedSerializer = scheme.Codecs // 通过rest.RESTClientFor()生成RESTClient对象。 RESTClientFor通过令牌桶算法,有限制的说法。 restClient, err := rest.RESTClientFor(config) if err != nil { panic(err) } // 通过RESTClient构建请求参数,查询default空间下所有pod资源 result := &corev1.PodList{} err = restClient.Get(). Namespace("default"). Resource("pods"). VersionedParams(&metav1.ListOptions{Limit: 500}, scheme.ParameterCodec). Do(). Into(result) if err != nil { panic(err) } for _, d := range result.Items { fmt.Printf("NAMESPACE:%v \t NAME: %v \t STATUS: %v\n", d.Namespace, d.Name, d.Status.Phase) } } // 测试 go run .\restClient-example.go NAMESPACE:default NAME: nginx-deployment-6b474476c4-lpld7 STATUS: Running NAMESPACE:default NAME: nginx-deployment-6b474476c4-t6xl4 STATUS: Running ``` 以这个例子为例:一般的使用就是 restClient.Get().XX.XX.Do().Into(result)。最终会回到Do 和 into函数 前面的XX例如VersionedParams函数将一些查询选项(如limit、TimeoutSeconds等)添加到请求参数中。通过Do函数执行该请求,并且获得结构。 inTO就是进行decode,然后赋值给result对象。 ``` // Do formats and executes the request. Returns a Result object for easy response // processing. // // Error type: // * If the server responds with a status: *errors.StatusError or *errors.UnexpectedObjectError // * http.Client.Do errors are returned directly. 
func (r *Request) Do() Result { if err := r.tryThrottle(); err != nil { return Result{err: err} } var result Result err := r.request(func(req *http.Request, resp *http.Response) { result = r.transformResponse(resp, req) }) if err != nil { return Result{err: err} } return result } // Into stores the result into obj, if possible. If obj is nil it is ignored. // If the returned object is of type Status and has .Status != StatusSuccess, the // additional information in Status will be used to enrich the error. func (r Result) Into(obj runtime.Object) error { if r.err != nil { // Check whether the result has a Status object in the body and prefer that. return r.Error() } if r.decoder == nil { return fmt.Errorf("serializer for %s doesn't exist", r.contentType) } if len(r.body) == 0 { return fmt.Errorf("0-length response with status code: %d and content type: %s", r.statusCode, r.contentType) } out, _, err := r.decoder.Decode(r.body, nil, obj) if err != nil || out == obj { return err } // if a different object is returned, see if it is Status and avoid double decoding // the object. switch t := out.(type) { case *metav1.Status: // any status besides StatusSuccess is considered an error. if t.Status != metav1.StatusSuccess { return errors.FromObject(t) } } return nil } ```
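为了更直观地理解这种链式调用,下面给出一个最小示意(假设沿用上文示例中已经初始化好的 restClient、metav1、scheme 等变量):链式调用其实只是逐步填充 Request 对象的字段,最终由 Do() 发送请求,可以先通过 URL() 看一下拼出来的请求地址。

```go
// 仅作示意:打印链式调用最终拼出的请求 URL(restClient 为上文示例中初始化好的 RESTClient)
req := restClient.Get().
	Namespace("default").
	Resource("pods").
	VersionedParams(&metav1.ListOptions{Limit: 500}, scheme.ParameterCodec)
fmt.Println(req.URL().String())
// 大致会输出类似:https://<apiserver>/api/v1/namespaces/default/pods?limit=500
```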
### 3.clientSet客户端

RESTClient是一种最基础的客户端,使用时需要指定Resource和Version等信息,编写代码时需要提前知道Resource所在的Group和对应的Version信息。相比RESTClient,ClientSet使用起来更加便捷,一般情况下,开发者对Kubernetes进行二次开发时通常使用ClientSet。

ClientSet对应的是 client-go/kubernetes 这个目录,其核心目录和文件如下:

```
│  BUILD
│  clientset.go   定义和初始化clientset相关函数
│  typed目录       里面定义了所有内置资源的get,list等等
│  scheme
```

#### 3.1 Clientset的定义

```
// Clientset contains the clients for groups. Each group has exactly one
// version included in a Clientset.
type Clientset struct {
	*discovery.DiscoveryClient
	admissionregistrationV1      *admissionregistrationv1.AdmissionregistrationV1Client
	admissionregistrationV1beta1 *admissionregistrationv1beta1.AdmissionregistrationV1beta1Client
	appsV1                       *appsv1.AppsV1Client
	...
	coreV1                       *corev1.CoreV1Client
	...
}

CoreV1Client 其实就是一个rest client接口

// CoreV1Client is used to interact with features provided by the group.
type CoreV1Client struct {
	restClient rest.Interface
}

只不过封装了很多额外的函数

func (c *CoreV1Client) Pods(namespace string) PodInterface {
	return newPods(c, namespace)
}
```
staging/src/k8s.io/client-go/kubernetes/typed/core/v1/pod.go 到typed目录下具体的一个资源对象文件看看, 这里以get为例。可以看出来其实就是封装了restClient的写法而已。 ``` // Get takes name of the pod, and returns the corresponding pod object, and an error if there is any. func (c *pods) Get(name string, options metav1.GetOptions) (result *v1.Pod, err error) { result = &v1.Pod{} err = c.client.Get(). Namespace(c.ns). Resource("pods"). Name(name). VersionedParams(&options, scheme.ParameterCodec). Do(). Into(result) return } ``` 这样的好处就是每次使用的时候,简化一点。 如下的例子可见,clientSet通过 NewForConfig 实现一个客户端。用起来也方便很多。 ``` package main import ( "fmt" apiv1 "k8s.io/api/core/v1" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" "k8s.io/client-go/kubernetes" "k8s.io/client-go/tools/clientcmd" ) func main() { // 加载kubeconfig文件,生成config对象 config, err := clientcmd.BuildConfigFromFlags("", "D:\\coding\\config") if err != nil { panic(err) } // kubernetes.NewForConfig通过config实例化ClientSet对象 clientset, err := kubernetes.NewForConfig(config) if err != nil { panic(err) } //请求core核心资源组v1资源版本下的Pods资源对象 podClient := clientset.CoreV1().Pods(apiv1.NamespaceDefault) // 设置选项 list, err := podClient.List(metav1.ListOptions{Limit: 500}) if err != nil { panic(err) } for _, d := range list.Items { fmt.Printf("NAMESPACE: %v \t NAME:%v \t STATUS: %+v\n", d.Namespace, d.Name, d.Status.Phase) } } // 测试 go run .\clientSet-example.go NAMESPACE: default NAME:nginx-deployment-6b474476c4-lpld7 STATUS: Running NAMESPACE: default NAME:nginx-deployment-6b474476c4-t6xl4 STATUS: Running ```
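ClientSet 不只封装了 Get/List,对应资源的 Create、Update、Delete 等写操作同样是对 RESTClient 的封装。下面是一个最小示意(沿用上文示例中的 clientset、apiv1、metav1 变量,Pod 的字段取值只是演示用的假设):

```go
// 仅作示意:用 typed client 创建一个 Pod(clientset 为上文示例中初始化好的对象)
newPod := &apiv1.Pod{
	ObjectMeta: metav1.ObjectMeta{Name: "demo-pod"},
	Spec: apiv1.PodSpec{
		Containers: []apiv1.Container{
			{Name: "nginx", Image: "nginx:latest"},
		},
	},
}
created, err := clientset.CoreV1().Pods(apiv1.NamespaceDefault).Create(newPod)
if err != nil {
	panic(err)
}
fmt.Printf("created pod: %v\n", created.Name)
```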
### 4.DynamicClient客户端 DynamicClient是一种动态客户端,它可以对任意Kubernetes资源进行RESTful操作,包括CRD自定义资源。DynamicClient与ClientSet操作类似,同样封装了RESTClient,同样提供了Create、Update、Delete、Get、List、Watch、Patch等方法。DynamicClient与ClientSet最大的不同之处是,ClientSet仅能访问Kubernetes自带的资源(即客户端集合内的资源),不能直接访问CRD自定义资源。ClientSet需要预先实现每种Resource和Version的操作,其内部的数据都是结构化数据(即已知数据结构)。而DynamicClient内部实现了Unstructured,用于处理非结构化数据结构(即无法提前预知数据结构),这也是DynamicClient能够处理CRD自定义资源的关键。 dynamic目录结构如下: ``` │ BUILD │ client_test.go │ interface.go │ scheme.go │ simple.go 感觉叫dynamicClient.go更好,就是定义和初始化dynamic文件。然后定义update,get函数的等实现 │ ├─dynamicinformer │ BUILD │ informer.go 定义dynamicinformer类型的Informer,其他内置资源在informer目录中都定义了 │ informer_test.go │ interface.go │ ├─dynamiclister │ BUILD │ interface.go │ lister.go 定义dynamicinformer类型的lister,其他内置资源在lister目录中都定义了 │ lister_test.go │ shim.go │ └─fake BUILD simple.go simple_test.go ``` staging/src/k8s.io/client-go/dynamic/simple.go 以Get为例,看看是如何实现的。其实和clientset是一样的。 ``` func (c *dynamicResourceClient) Get(name string, opts metav1.GetOptions, subresources ...string) (*unstructured.Unstructured, error) { if len(name) == 0 { return nil, fmt.Errorf("name is required") } // 拼凑好rest url result := c.client.client.Get().AbsPath(append(c.makeURLSegments(name), subresources...)...).SpecificallyVersionedParams(&opts, dynamicParameterCodec, versionV1).Do() if err := result.Error(); err != nil { return nil, err } retBytes, err := result.Raw() if err != nil { return nil, err } // 都是使用unstructured.Unstructured接收返回的结果 uncastObj, err := runtime.Decode(unstructured.UnstructuredJSONScheme, retBytes) if err != nil { return nil, err } return uncastObj.(*unstructured.Unstructured), nil } ``` informer 和 list再下一节单独介绍 **注意:** * DynamicClient获得的数据都是一个object类型。存的时候是 unstructured * DynamicClient不是类型安全的,因此在访问CRD自定义资源时需要特别注意。例如,在操作指针不当的情况下可能会导致程序崩溃。 * DynamicClient如果要使用informer,必须是NewFilteredDynamicSharedInformerFactory ```ruby f := dynamicinformer.NewFilteredDynamicSharedInformerFactory(dc, 0, v1.NamespaceAll, nil) ``` ``` package main import ( "fmt" apiv1 "k8s.io/api/core/v1" corev1 "k8s.io/api/core/v1" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" "k8s.io/apimachinery/pkg/runtime" "k8s.io/apimachinery/pkg/runtime/schema" "k8s.io/client-go/dynamic" "k8s.io/client-go/tools/clientcmd" ) func main() { // 加载kubeconfig文件,生成config对象 config, err := clientcmd.BuildConfigFromFlags("", "D:\\coding\\config") if err != nil { panic(err) } // dynamic.NewForConfig函数通过config实例化dynamicClient对象 dynamicClient, err := dynamic.NewForConfig(config) if err != nil { panic(err) } // 通过schema.GroupVersionResource设置请求的资源版本和资源组,设置命名空间和请求参数,得到unstructured.UnstructuredList指针类型的PodList gvr := schema.GroupVersionResource{Version: "v1", Resource: "pods"} unstructObj, err := dynamicClient.Resource(gvr).Namespace(apiv1.NamespaceDefault).List(metav1.ListOptions{Limit: 500}) if err != nil { panic(err) } // 通过runtime.DefaultUnstructuredConverter函数将unstructured.UnstructuredList转为PodList类型 podList := &corev1.PodList{} err = runtime.DefaultUnstructuredConverter.FromUnstructured(unstructObj.UnstructuredContent(), podList) if err != nil { panic(err) } for _, d := range podList.Items { fmt.Printf("NAMESPACE: %v NAME:%v \t STATUS: %+v\n", d.Namespace, d.Name, d.Status.Phase) } } // 测试 go run .\dynamicClient-example.go NAMESPACE: default NAME:nginx-deployment-6b474476c4-lpld7 STATUS: Running NAMESPACE: default NAME:nginx-deployment-6b474476c4-t6xl4 STATUS: Running ```
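DynamicClient 最大的价值在于可以用同一套接口访问 CRD 自定义资源。下面是一个最小示意,其中 group/version/resource 和对象名都是假设的示例 CRD,实际使用时替换成自己的 CRD 即可;dynamicClient 沿用上文示例中的变量,另外需要额外 import `k8s.io/apimachinery/pkg/apis/meta/v1/unstructured`:

```go
// 仅作示意:用 dynamicClient 读取一个假设的 CRD 资源 foos.example.com
crdGVR := schema.GroupVersionResource{Group: "example.com", Version: "v1", Resource: "foos"}
obj, err := dynamicClient.Resource(crdGVR).Namespace("default").Get("my-foo", metav1.GetOptions{})
if err != nil {
	panic(err)
}
// 返回的是 *unstructured.Unstructured,可以按 map 的方式取字段
spec, found, err := unstructured.NestedMap(obj.Object, "spec")
if err == nil && found {
	fmt.Printf("spec: %v\n", spec)
}
```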
================================================ FILE: k8s/client-go/3. apiserver中的list-watch机制.md ================================================ Table of Contents ================= * [1. 背景](#1-背景) * [2. list watch机制](#2-list-watch机制) * [2.1 如何实现实时性](#21-如何实现实时性) * [2.2 如何实现顺序性](#22-如何实现顺序性) * [2.3 如何实现消息可靠性](#23-如何实现消息可靠性) * [2.4 如何解决性能问题](#24-如何解决性能问题) * [3.总结](#3总结) ### 1. 背景 client-go实际只是一个客户端,list-watch我们经常听到。但实际上是apisever的实现。在apisever注册资源对象的create, update, delete等等hanlder时,也注册了List-watch的实现。 所以在研究client-go是如果处理list watch之前,先了解一个apiserver的list watch机制 ### 2. list watch机制 `List-watch`是`K8S`统一的异步消息处理机制,保证了消息的实时性,可靠性,顺序性,性能等等,为声明式风格的`API`奠定了良好的基础,它是优雅的通信方式,是`K8S 架构`的精髓。 #### 2.1 如何实现实时性 一般客户端和服务器端的同步,无非就是两种大类:一种是客户端轮训服务器端。第二种就是服务器端主动发起通知。 k8s采用的是第二种,主动发起通知。这里具体就是使用了watch机制。 list, watch其实都是特殊的get接口。 get default命名空间所有的pod url如下: curl http://xxx:port/api/v1/namespaces/default/pods watch default命名空间所有的pod url如下: curl http://XXX/api/v1/namespaces/default/pods?watch=true 就是多了一个watch=true的参数 ``` root:/# curl http://XXX/api/v1/namespaces/default/pods?watch=true {"type":"ADDED","object":{"kind":"Pod","apiVersion":"v1","metadata":{"name":"zx-vpa-786d4b8bb5-xv5zw","generateName":"zx-vpa-786d4b8bb5-","namespace":"default","selfLink":"/api/v1/namespaces/default/pods/zx-vpa-786d4b8bb5-xv5zw","uid":"639944b7-3495-4fbb-a21d-cbc7f4d6f7a5","resourceVersion":"157197390","creationTimestamp":"2021-11-12T10:59:39Z","labels":{"app":"zx-vpa-test","pod-template-hash":"786d4b8bb5"},"annotations":{"v2-fixed-ip":"","v2-subnet":"faf7c8b0-55c3-42c7-ba27-ad90290a9cd9","v2-tenant":"","v2-vpc":"6af350be-c456-44bc-909d-4b92c48b3b54","vpaObservedContainers":"zx-vpa, zx-vpa2","vpaUpdates":"Pod resources updated by hamster-vpa: container 0: memory request, cpu request, memory limit, cpu limit; container 1: cpu request, memory request, cpu limit, memory 
limit"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"zx-vpa-786d4b8bb5","uid":"8199639c-40fc-4dc5-81c3-d3faff7f6b4c","controller":true,"blockOwnerDeletion":true}]},"spec":{"volumes":[{"name":"default-token-dbxf8","secret":{"secretName":"default-token-dbxf8","defaultMode":420}}],"containers":[{"name":"zx-vpa","image":"dockerhub.nie.netease.com/fanqihong/ubuntu:stress","command":["sleep","36000"],"resources":{"limits":{"cpu":"12m","memory":"131072k"},"requests":{"cpu":"12m","memory":"131072k"}},"volumeMounts":[{"name":"default-token-dbxf8","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"},{"name":"zx-vpa2","image":"ncr.nie.netease.com/zouxiang/testcpu:v1","command":["sleep","36000"],"resources":{"limits":{"cpu":"12m","memory":"131072k"},"requests":{"cpu":"12m","memory":"131072k"}},"volumeMounts":[{"name":"default-token-dbxf8","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"}],"restartPolicy":"Always","terminationGracePeriodSeconds":5,"dnsPolicy":"ClusterFirst","serviceAccountName":"default","serviceAccount":"default","nodeName":"7.34.19.14","hostNetwork":true,"securityContext":{},"schedulerName":"default-scheduler","enableServiceLinks":true},"status":{"phase":"Running","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2021-11-12T10:59:39Z"},{"type":"Ready","status":"True","lastProbeTime":null,"lastTransitionTime":"2021-11-14T02:59:47Z"},{"type":"ContainersReady","status":"True","lastProbeTime":null,"lastTransitionTime":"2021-11-14T02:59:47Z"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2021-11-12T10:59:39Z"}],"hostIP":"7.34.19.14","podIP":"7.34.19.14","podIPs":[{"ip":"7.34.19.14"}],"startTime":"2021-11-12T10:59:39Z","containerStatuses":[{"name":"zx-vpa","state":{"running":{"startedAt":"2021-11-14T02:59:46Z"}},"lastState":{"terminated":{"exitCode":0,"reason":"Completed","startedAt":"2021-11-13T16:59:45Z","finishedAt":"2021-11-14T02:59:45Z","containerID":"docker://87a70d2061b7fb37b0f97be3a4f9d44b345fbd54be3dcc4d8a61879dd5c6a127"}},"ready":true,"restartCount":4,"image":"dockerhub.nie.netease.com/fanqihong/ubuntu:stress","imageID":"docker-pullable://dockerhub.nie.netease.com/fanqihong/ubuntu@sha256:ac49d16f9686c2acd351d436ed7154311e4dba50ed8c18b6abaa578dde696440","containerID":"docker://bc586f53f363e9afb08c7a214eef06c8c1202f72439fc972d4c7d6177cfb8e63","started":true},{"name":"zx-vpa2","state":{"running":{"startedAt":"2021-11-14T02:59:47Z"}},"lastState":{"terminated":{"exitCode":0,"reason":"Completed","startedAt":"2021-11-13T16:59:46Z","finishedAt":"2021-11-14T02:59:46Z","containerID":"docker://37d8dd54be6d27ed9f055049e700f12fa4aa30ec29f2fd16fd5176218b2acce9"}},"ready":true,"restartCount":4,"image":"ncr.nie.netease.com/zouxiang/testcpu:v1","imageID":"docker-pullable://ncr.nie.netease.com/zouxiang/testcpu@sha256:4560824247d61f92c0d4b62224fdb3efc47560339ff05c92f73d6c731eba2717","containerID":"docker://9ccb7968bee2c155c472e03d56a5987c9cf7e6833a4cb125084ceb19158474ed","started":true}],"qosClass":"Guaranteed"}}} 
{"type":"ADDED","object":{"kind":"Pod","apiVersion":"v1","metadata":{"name":"zx-vpa-786d4b8bb5-mw6mr","generateName":"zx-vpa-786d4b8bb5-","namespace":"default","selfLink":"/api/v1/namespaces/default/pods/zx-vpa-786d4b8bb5-mw6mr","uid":"4e7f3a44-7483-434d-a917-52b37c0eae33","resourceVersion":"157192079","creationTimestamp":"2021-11-12T10:52:37Z","labels":{"app":"zx-vpa-test","pod-template-hash":"786d4b8bb5"},"annotations":{"v2-fixed-ip":"","v2-subnet":"faf7c8b0-55c3-42c7-ba27-ad90290a9cd9","v2-tenant":"","v2-vpc":"6af350be-c456-44bc-909d-4b92c48b3b54","vpaObservedContainers":"zx-vpa, zx-vpa2","vpaUpdates":"Pod resources updated by hamster-vpa: container 0: cpu request, memory request, cpu limit, memory limit; container 1: memory request, cpu request, cpu limit, memory limit"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"zx-vpa-786d4b8bb5","uid":"8199639c-40fc-4dc5-81c3-d3faff7f6b4c","controller":true,"blockOwnerDeletion":true}]},"spec":{"volumes":[{"name":"default-token-dbxf8","secret":{"secretName":"default-token-dbxf8","defaultMode":420}}],"containers":[{"name":"zx-vpa","image":"dockerhub.nie.netease.com/fanqihong/ubuntu:stress","command":["sleep","36000"],"resources":{"limits":{"cpu":"12m","memory":"131072k"},"requests":{"cpu":"12m","memory":"131072k"}},"volumeMounts":[{"name":"default-token-dbxf8","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"},{"name":"zx-vpa2","image":"ncr.nie.netease.com/zouxiang/testcpu:v1","command":["sleep","36000"],"resources":{"limits":{"cpu":"12m","memory":"131072k"},"requests":{"cpu":"12m","memory":"131072k"}},"volumeMounts":[{"name":"default-token-dbxf8","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"}],"restartPolicy":"Always","terminationGracePeriodSeconds":5,"dnsPolicy":"ClusterFirst","serviceAccountName":"default","serviceAccount":"default","nodeName":"10.90.67.175","hostNetwork":true,"securityContext":{},"schedulerName":"default-scheduler","enableServiceLinks":true},"status":{"phase":"Running","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2021-11-12T10:52:37Z"},{"type":"Ready","status":"True","lastProbeTime":null,"lastTransitionTime":"2021-11-14T02:52:54Z"},{"type":"ContainersReady","status":"True","lastProbeTime":null,"lastTransitionTime":"2021-11-14T02:52:54Z"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2021-11-12T10:52:37Z"}],"hostIP":"10.90.67.175","podIP":"10.90.67.175","podIPs":[{"ip":"10.90.67.175"}],"startTime":"2021-11-12T10:52:37Z","containerStatuses":[{"name":"zx-vpa","state":{"running":{"startedAt":"2021-11-14T02:52:50Z"}},"lastState":{"terminated":{"exitCode":0,"reason":"Completed","startedAt":"2021-11-13T16:52:47Z","finishedAt":"2021-11-14T02:52:47Z","containerID":"docker://ddf625ee9c90ba70ba5f1d27caa4d61ded938143a724dccbfada898271ac7fd0"}},"ready":true,"restartCount":4,"image":"dockerhub.nie.netease.com/fanqihong/ubuntu:stress","imageID":"docker-pullable://dockerhub.nie.netease.com/fanqihong/ubuntu@sha256:ac49d16f9686c2acd351d436ed7154311e4dba50ed8c18b6abaa578dde696440","containerID":"docker://854091543fc0ba88d3dc4a839f8014d21ecbaecf4e40221d2ce9d6a343ddbe29","started":true},{"name":"zx-vpa2","state":{"running":{"startedAt":"2021-11-14T02:52:53
Z"}},"lastState":{"terminated":{"exitCode":0,"reason":"Completed","startedAt":"2021-11-13T16:52:50Z","finishedAt":"2021-11-14T02:52:50Z","containerID":"docker://0241d05357e1f6b8eec73810341bddf14479a168aaa7958ee580855eb2f4300f"}},"ready":true,"restartCount":4,"image":"ncr.nie.netease.com/zouxiang/testcpu:v1","imageID":"docker-pullable://ncr.nie.netease.com/zouxiang/testcpu@sha256:4560824247d61f92c0d4b62224fdb3efc47560339ff05c92f73d6c731eba2717","containerID":"docker://ad0d09158c0db8b10d2c282f2749c1c1e97fdb707e10392b9033aa619b162450","started":true}],"qosClass":"Guaranteed"}}} // 一旦有对象改变就会收到事件。type有三种,modifyed, added ,deleted {"type":"MODIFIED","object":{"kind":"Pod","apiVersion":"v1","metadata":{"name":"zx-vpa-786d4b8bb5-xv5zw","generateName":"zx-vpa-786d4b8bb5-","namespace":"default","selfLink":"/api/v1/namespaces/default/pods/zx-vpa-786d4b8bb5-xv5zw","uid":"639944b7-3495-4fbb-a21d-cbc7f4d6f7a5","resourceVersion":"157401529","creationTimestamp":"2021-11-12T10:59:39Z","labels":{"app":"zx-vpa-test","pod-template-hash":"786d4b8bb5"},"annotations":{"v2-fixed-ip":"","v2-subnet":"faf7c8b0-55c3-42c7-ba27-ad90290a9cd9","v2-tenant":"","v2-vpc":"6af350be-c456-44bc-909d-4b92c48b3b54","vpaObservedContainers":"zx-vpa, zx-vpa2","vpaUpdates":"11111111111Pod resources updated by hamster-vpa: container 0: memory request, cpu request, memory limit, cpu limit; container 1: cpu request, memory request, cpu limit, memory limit"},"ownerReferences":[{"apiVersion":"apps/v1","kind":"ReplicaSet","name":"zx-vpa-786d4b8bb5","uid":"8199639c-40fc-4dc5-81c3-d3faff7f6b4c","controller":true,"blockOwnerDeletion":true}]},"spec":{"volumes":[{"name":"default-token-dbxf8","secret":{"secretName":"default-token-dbxf8","defaultMode":420}}],"containers":[{"name":"zx-vpa","image":"dockerhub.nie.netease.com/fanqihong/ubuntu:stress","command":["sleep","36000"],"resources":{"limits":{"cpu":"12m","memory":"131072k"},"requests":{"cpu":"12m","memory":"131072k"}},"volumeMounts":[{"name":"default-token-dbxf8","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"},{"name":"zx-vpa2","image":"ncr.nie.netease.com/zouxiang/testcpu:v1","command":["sleep","36000"],"resources":{"limits":{"cpu":"12m","memory":"131072k"},"requests":{"cpu":"12m","memory":"131072k"}},"volumeMounts":[{"name":"default-token-dbxf8","readOnly":true,"mountPath":"/var/run/secrets/kubernetes.io/serviceaccount"}],"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File","imagePullPolicy":"IfNotPresent"}],"restartPolicy":"Always","terminationGracePeriodSeconds":5,"dnsPolicy":"ClusterFirst","serviceAccountName":"default","serviceAccount":"default","nodeName":"7.34.19.14","hostNetwork":true,"securityContext":{},"schedulerName":"default-scheduler","enableServiceLinks":true},"status":{"phase":"Running","conditions":[{"type":"Initialized","status":"True","lastProbeTime":null,"lastTransitionTime":"2021-11-12T10:59:39Z"},{"type":"Ready","status":"True","lastProbeTime":null,"lastTransitionTime":"2021-11-14T02:59:47Z"},{"type":"ContainersReady","status":"True","lastProbeTime":null,"lastTransitionTime":"2021-11-14T02:59:47Z"},{"type":"PodScheduled","status":"True","lastProbeTime":null,"lastTransitionTime":"2021-11-12T10:59:39Z"}],"hostIP":"7.34.19.14","podIP":"7.34.19.14","podIPs":[{"ip":"7.34.19.14"}],"startTime":"2021-11-12T10:59:39Z","containerStatuses":[{"name":"zx-vpa","state":{"running":{"startedAt":"2021-11-14T02:59:46Z"
}},"lastState":{"terminated":{"exitCode":0,"reason":"Completed","startedAt":"2021-11-13T16:59:45Z","finishedAt":"2021-11-14T02:59:45Z","containerID":"docker://87a70d2061b7fb37b0f97be3a4f9d44b345fbd54be3dcc4d8a61879dd5c6a127"}},"ready":true,"restartCount":4,"image":"dockerhub.nie.netease.com/fanqihong/ubuntu:stress","imageID":"docker-pullable://dockerhub.nie.netease.com/fanqihong/ubuntu@sha256:ac49d16f9686c2acd351d436ed7154311e4dba50ed8c18b6abaa578dde696440","containerID":"docker://bc586f53f363e9afb08c7a214eef06c8c1202f72439fc972d4c7d6177cfb8e63","started":true},{"name":"zx-vpa2","state":{"running":{"startedAt":"2021-11-14T02:59:47Z"}},"lastState":{"terminated":{"exitCode":0,"reason":"Completed","startedAt":"2021-11-13T16:59:46Z","finishedAt":"2021-11-14T02:59:46Z","containerID":"docker://37d8dd54be6d27ed9f055049e700f12fa4aa30ec29f2fd16fd5176218b2acce9"}},"ready":true,"restartCount":4,"image":"ncr.nie.netease.com/zouxiang/testcpu:v1","imageID":"docker-pullable://ncr.nie.netease.com/zouxiang/testcpu@sha256:4560824247d61f92c0d4b62224fdb3efc47560339ff05c92f73d6c731eba2717","containerID":"docker://9ccb7968bee2c155c472e03d56a5987c9cf7e6833a4cb125084ceb19158474ed","started":true}],"qosClass":"Guaranteed"}}} //删除一个pod,发现会进入 MODIFIED (设置deletiontimestamp)-> ADDED "status":{"phase":"Pending","qosClass":"Guaranteed"}}} 新pod pending MODIFIED podScheduled MODIFIED ContainerCreating MODIFIED.. 到pod running DELETED 删除旧Pod curl http://7.34.19.44:58201/api/v1/watch/namespaces/default/pods 看起来也是一样的效果 ``` 通过上面的实践可以发现: (1)watch其实就是一种特殊的get (2)可以看到删除操作后,对象的整个变化过程 (3)watch每次都会返回type,和**完整**的对象信息 #### 2.2 如何实现顺序性 `K8S`在每个资源的事件中都带一个`resourceVersion`的标签,这个标签是递增的数字,所以当客户端并发处理同一个资源的事件时,它就可以对比`resourceVersion`来保证最终的状态和最新的事件所期望的状态保持一致。 #### 2.3 如何实现消息可靠性 `list`和`watch`一起保证了消息的可靠性,避免因消息丢失而造成状态不一致场景。具体而言,`list API`可以查询当前的资源及其对应的状态(即期望的状态),客户端通过拿`期望的状态`和`实际的状态`进行对比,纠正状态不一致的资源。`Watch API`和`apiserver`保持一个`长链接`,接收资源的`状态变更事件`并做相应处理。如果仅调用`watch API`,若某个时间点连接中断,就有可能导致消息丢失,所以需要通过`list API`解决`消息丢失`的问题。从另一个角度出发,我们可以认为`list API`获取全量数据,`watch API`获取增量数据。虽然仅仅通过轮询`list API`,也能达到同步资源状态的效果,但是存在开销大,实时性不足的问题。 #### 2.4 如何解决性能问题 (1)list-watch机制的结合就已经在apiserver做了性能优化。(是不是可以watch的时候,只传递更新了的字段,而不是全量数据) (2)client-go的 tool.cache做了客户端的性能优化问题 ### 3.总结 本节主要从apiserver端探究了以下list-watch。接下来从client-go端源码看看具体是如何实现的 ================================================ FILE: k8s/client-go/4. client informer机制简介.md ================================================ Table of Contents ================= * [1. informer机制简介](#1-informer机制简介) * [1.2. informer机制 example介绍](#12-informer机制-example介绍) * [2. informer](#2-informer) * [2.1 shared informer](#21-shared-informer) * [2.2 shared informer是如何实现的](#22-shared-informer是如何实现的) * [2.3 informer和reflector的关系](#23-informer和reflector的关系) * [3. Reflector](#3-reflector) * [4. listAndwatcher](#4-listandwatcher) * [4.1 list](#41-list) * [4.2 watcher](#42-watcher) * [5. DeltaFIFO](#5-deltafifo) * [5.1 生产者](#51-生产者) * [5.2 消费者](#52-消费者) * [5.3 Resync](#53-resync) * [6.Indexer](#6indexer) * [6.1 Indexer索引器](#61-indexer索引器) * [7. 总结](#7-总结) * [8.附录](#8附录) * [9.参考](#9参考) ### 1. 
informer机制简介

在Kubernetes系统中,组件之间通过HTTP协议进行通信,在不依赖任何中间件的情况下,消息的实时性、可靠性、顺序性就是通过list-watch机制来保证的。

作为客户端,client-go也实现了一套对应的list-watch逻辑,用来处理对象的变化,这套机制在client-go中就是informer机制。Kubernetes的其他组件(kcm, kubelet等等)都是通过client-go的Informer机制与Kubernetes API Server进行通信的。

Informer机制运行原理如图:

![informer](../images/informer.png)

大体流程如下:

(1)new一个informer,创建时指定了 listAndwatcher(用于从apiserver获取数据)

(2)informer.Run的时候,会new一个Reflector对象。Reflector包含了listAndwatcher,接下来的工作基本都由Reflector完成

(3)Reflector对list-watch来的数据进行处理,这里使用DeltaFIFO队列把watch到的数据一个个地交给HandleDeltas函数处理

(4)具体的处理逻辑分为两部分:第一部分,通过操作cache.indexer,更新本地缓存+索引;第二部分,将watch到的数据分发给Informer上自定义的处理函数

本节先总结informer机制的大概流程,然后简单介绍流程中出现的几个概念,后面的章节再一个一个详细研究。
#### 1.2. informer机制 example介绍

直接阅读Informer机制代码会比较晦涩,通过Informers Example代码示例来理解Informer,印象会更深刻。Informers Example代码示例如下:

```
package main

import (
	"log"
	"time"

	v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	stopCh := make(chan struct{})
	defer close(stopCh)

	sharedInformers := informers.NewSharedInformerFactory(clientset, time.Minute)
	informer := sharedInformers.Core().V1().Pods().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			mObj := obj.(v1.Object)
			log.Printf("New Pod Added to Store: %s", mObj.GetName())
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			oObj := oldObj.(v1.Object)
			nObj := newObj.(v1.Object)
			log.Printf("%s Pod Updated to %s", oObj.GetName(), nObj.GetName())
		},
		DeleteFunc: func(obj interface{}) {
			mObj := obj.(v1.Object)
			log.Printf("Pod Deleted from Store: %s", mObj.GetName())
		},
	})
	informer.Run(stopCh)
}
```

首先通过kubernetes.NewForConfig创建clientset对象,Informer需要通过ClientSet与Kubernetes API Server进行交互。另外,创建stopCh对象,该对象用于在程序进程退出之前通知Informer提前退出,因为Informer是一个持久运行的goroutine。informers.NewSharedInformerFactory函数实例化了SharedInformer对象,它接收两个参数:第1个参数clientset是用于与Kubernetes API Server交互的客户端,第2个参数time.Minute用于设置多久进行一次resync(重新同步),resync会周期性地执行List操作,将所有的资源存放在Informer Store中,如果该参数为0,则禁用resync功能。

在Informers Example代码示例中,通过sharedInformers.Core().V1().Pods().Informer()可以得到具体Pod资源的informer对象。通过informer.AddEventHandler函数可以为Pod资源添加资源事件回调方法,支持3种资源事件回调方法,分别介绍如下。

● AddFunc:当创建Pod资源对象时触发的事件回调方法。

● UpdateFunc:当更新Pod资源对象时触发的事件回调方法。

● DeleteFunc:当删除Pod资源对象时触发的事件回调方法。

在正常的情况下,Kubernetes的其他组件在使用Informer机制时触发资源事件回调方法,会将资源对象推送到WorkQueue或其他队列中(**实际过程中大都是这样的**,参见下面的示意代码);在Informers Example代码示例中,我们直接输出触发的资源事件。最后通过informer.Run函数运行当前的Informer,内部为Pod资源类型创建Informer。
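上面提到,实际的控制器通常不会在回调里直接处理对象,而是把对象的 key 放进 WorkQueue,由 worker 再慢慢消费。下面是一个最小示意(informer 沿用上文示例中的变量,另外需要额外 import `k8s.io/client-go/util/workqueue`,只展示入队部分):

```go
// 仅作示意:回调中只把对象 key 入队,真正的业务逻辑放在 worker 中处理
queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc: func(obj interface{}) {
		if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key)
		}
	},
	UpdateFunc: func(oldObj, newObj interface{}) {
		if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
			queue.Add(key)
		}
	},
	DeleteFunc: func(obj interface{}) {
		// 删除事件用 DeletionHandlingMetaNamespaceKeyFunc,可以处理 DeletedFinalStateUnknown
		if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
			queue.Add(key)
		}
	},
})
```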
### 2. informer

每一个Kubernetes资源都实现了Informer机制。每一个Informer上都会实现Informer和Lister方法,例如PodInformer,代码示例如下:

```
// PodInformer provides access to a shared informer and lister for
// Pods.
type PodInformer interface {
	Informer() cache.SharedIndexInformer
	Lister() v1.PodLister
}
```

使用不同资源的Informer,代码示例如下:

```
podInformer := sharedInformers.Core().V1().Pods().Informer()
nodeInformer := sharedInformers.Core().V1().Nodes().Informer()
```

定义不同资源的Informer,允许监控不同资源的资源事件,例如,监听Node资源对象,当Kubernetes集群中有新的节点(Node)加入时,client-go能够及时收到资源对象的变更信息。
#### 2.1 shared informer 可以认为 informer都是 shared informer Informer也被称为Shared Informer,它是可以共享使用的。在用client-go编写代码程序时,若同一资源的Informer被实例化了多次,每个Informer使用一个Reflector,那么会运行过多相同的ListAndWatch,太多重复的序列化和反序列化操作会导致Kubernetes API Server负载过重。Shared Informer可以使同一类资源Informer共享一个Reflector,这样可以节约很多资源。通过map数据结构实现共享的Informer机制。SharedInformer定义了一个map数据结构,用于存放所有Informer的字段,代码示例如下: ``` type sharedInformerFactory struct { client kubernetes.Interface namespace string tweakListOptions internalinterfaces.TweakListOptionsFunc lock sync.Mutex defaultResync time.Duration customResync map[reflect.Type]time.Duration informers map[reflect.Type]cache.SharedIndexInformer // startedInformers is used for tracking which informers have been started. // This allows Start() to be called multiple times safely. startedInformers map[reflect.Type]bool } ``` informers字段中存储了资源类型和对应于SharedIndexInformer的映射关系。InformerFor函数添加了不同资源的Informer,在添加过程中如果已经存在同类型的资源Informer,则返回当前Informer,不再继续添加。最后通过Shared Informer的Start方法使f.informers中的每个informer通过goroutine持久运行。 同一个factory定义的shareInformer可以复用复用。 #### 2.2 shared informer是如何实现的 从结构体可以看出来:有一个字段 Store,这里就是保存从apiserver同步过来的数据。 还有一个函数Run(),这个函数会调用controller.Run --> Reflector.Run->ListAndWatch() 而ListAndWatch()就是从apiserver获取数据。 ``` // SharedInformer has a shared data cache and is capable of distributing notifications for changes // to the cache to multiple listeners who registered via AddEventHandler. If you use this, there is // one behavior change compared to a standard Informer. When you receive a notification, the cache // will be AT LEAST as fresh as the notification, but it MAY be more fresh. You should NOT depend // on the contents of the cache exactly matching the notification you've received in handler // functions. If there was a create, followed by a delete, the cache may NOT have your item. This // has advantages over the broadcaster since it allows us to share a common cache across many // controllers. Extending the broadcaster would have required us keep duplicate caches for each // watch. type SharedInformer interface { // AddEventHandler adds an event handler to the shared informer using the shared informer's resync // period. Events to a single handler are delivered sequentially, but there is no coordination // between different handlers. AddEventHandler(handler ResourceEventHandler) // AddEventHandlerWithResyncPeriod adds an event handler to the shared informer using the // specified resync period. Events to a single handler are delivered sequentially, but there is // no coordination between different handlers. AddEventHandlerWithResyncPeriod(handler ResourceEventHandler, resyncPeriod time.Duration) // GetStore returns the Store. GetStore() Store // GetController gives back a synthetic interface that "votes" to start the informer GetController() Controller // Run starts the shared informer, which will be stopped when stopCh is closed. Run(stopCh <-chan struct{}) // HasSynced returns true if the shared informer's store has synced. HasSynced() bool // LastSyncResourceVersion is the resource version observed when last synced with the underlying // store. The value returned is not synchronized with access to the underlying store and is not // thread-safe. LastSyncResourceVersion() string } ``` EventHandler:这是一个回调函数,当一个`Informer`/`SharedInformer`要分发一个对象到控制器时,会调用此函数。例如:将对象的`Key`放在`WorkQueue`中并等待后续的处理。 这里先简单介绍整体的逻辑。后面再详细介绍。
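"共享"这一点可以用一个很小的实验验证:对同一个 factory、同一种资源多次调用 Informer(),拿到的其实是同一个实例(下面的 sharedInformers 假设沿用 1.2 示例中的 factory 变量):

```go
// 仅作示意:同类型资源的 informer 在同一个 factory 内只会被创建一次
i1 := sharedInformers.Core().V1().Pods().Informer()
i2 := sharedInformers.Core().V1().Pods().Informer()
fmt.Println(i1 == i2) // true:复用同一个 SharedIndexInformer,也就共享同一个 Reflector
```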
#### 2.3 informer和reflector的关系 再使用informer的时候,一般都是: (1)new 一个sharedInformerFactory对象 (2)根据sharedInformerFactory生成一个informer (3)定义informer的 addFunc, deleteFunc, updateFunc函数 (4)informer.Run(stopCh) 运行起来 ``` // 在informer的Run函数中调用了controller.Run func (s *sharedIndexInformer) Run(stopCh <-chan struct{}) { func() { s.startedLock.Lock() defer s.startedLock.Unlock() s.controller = New(cfg) s.controller.(*controller).clock = s.clock s.started = true }() s.controller.Run(stopCh) } // run函数生成了一个reflector对象 // Run begins processing items, and will continue until a value is sent down stopCh. // It's an error to call Run more than once. // Run blocks; call via go. func (c *controller) Run(stopCh <-chan struct{}) { .... r := NewReflector( c.config.ListerWatcher, c.config.ObjectType, c.config.Queue, c.config.FullResyncPeriod, ) ... } ``` **可以看出来:** (1)一个informer对应一个reflector (2)reflector是informer run的时候才生成的,并且Informer list-watch都是由reflector完成的. informer只管定义add, update, del处理事件即可 ### 3. Reflector Informer可以对Kubernetes API Server的资源执行监控(Watch)操作,资源类型可以是Kubernetes内置资源,也可以是CRD自定义资源,其中最核心的功能是Reflector。Reflector用于监控指定资源的Kubernetes资源,当监控的资源发生变化时,触发相应的变更事件,例如Added(资源添加)事件、Updated(资源更新)事件、Deleted(资源删除)事件,并将其资源对象存放到本地缓存DeltaFIFO中。通过NewReflector实例化Reflector对象,实例化过程中须传入ListerWatcher数据接口对象,它拥有List和Watch方法,用于获取及监控资源列表。只要实现了List和Watch方法的对象都可以称为ListerWatcher。Reflector对象通过Run函数启动监控并处理监控事件。而在Reflector源码实现中,其中最主要的是ListAndWatch函数,它负责获取资源列表(List)和监控(Watch)指定的Kubernetes API Server资源。 ``` // Reflector watches a specified resource and causes all changes to be reflected in the given store. type Reflector struct { // name identifies this reflector. By default it will be a file:line if possible. name string // The name of the type we expect to place in the store. The name // will be the stringification of expectedGVK if provided, and the // stringification of expectedType otherwise. It is for display // only, and should not be used for parsing or comparison. expectedTypeName string // The type of object we expect to place in the store. expectedType reflect.Type // The GVK of the object we expect to place in the store if unstructured. expectedGVK *schema.GroupVersionKind // The destination to sync up with the watch source store Store // store对象 // listerWatcher is used to perform lists and watches. listerWatcher ListerWatcher // listwatcher对象 // period controls timing between one watch ending and // the beginning of the next one. period time.Duration resyncPeriod time.Duration ShouldResync func() bool // clock allows tests to manipulate time clock clock.Clock // lastSyncResourceVersion is the resource version token last // observed when doing a sync with the underlying store // it is thread safe, but not synchronized with the underlying store lastSyncResourceVersion string // lastSyncResourceVersionMutex guards read/write access to lastSyncResourceVersion lastSyncResourceVersionMutex sync.RWMutex // WatchListPageSize is the requested chunk size of initial and resync watch lists. // Defaults to pager.PageSize. WatchListPageSize int64 } ``` **reflector包含了listwatch对象**
### 4. listAndwatcher #### 4.1 list ``` // ListAndWatch first lists all items and get the resource version at the moment of call, // and then use the resource version to watch. // It returns error if ListAndWatch didn't even try to initialize watch. func (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error { klog.V(3).Infof("Listing and watching %v from %s", r.expectedTypeName, r.name) var resourceVersion string // Explicitly set "0" as resource version - it's fine for the List() // to be served from cache and potentially be delayed relative to // etcd contents. Reflector framework will catch up via Watch() eventually. // resourceVersion=0 表示 list所有资源 options := metav1.ListOptions{ResourceVersion: "0"} // if err := func() error { initTrace := trace.New("Reflector ListAndWatch", trace.Field{"name", r.name}) defer initTrace.LogIfLong(10 * time.Second) var list runtime.Object var err error listCh := make(chan struct{}, 1) panicCh := make(chan interface{}, 1) go func() { defer func() { if r := recover(); r != nil { panicCh <- r } }() // 判断是否chunks一段一段的list // Attempt to gather list in chunks, if supported by listerWatcher, if not, the first // list request will return the full response. pager := pager.New(pager.SimplePageFunc(func(opts metav1.ListOptions) (runtime.Object, error) { return r.listerWatcher.List(opts) })) if r.WatchListPageSize != 0 { pager.PageSize = r.WatchListPageSize } // Pager falls back to full list if paginated list calls fail due to an "Expired" error. list, err = pager.List(context.Background(), options) close(listCh) }() select { case <-stopCh: return nil case r := <-panicCh: panic(r) case <-listCh: } if err != nil { return fmt.Errorf("%s: Failed to list %v: %v", r.name, r.expectedTypeName, err) } initTrace.Step("Objects listed") listMetaInterface, err := meta.ListAccessor(list) if err != nil { return fmt.Errorf("%s: Unable to understand list result %#v: %v", r.name, list, err) } resourceVersion = listMetaInterface.GetResourceVersion() initTrace.Step("Resource version extracted") items, err := meta.ExtractList(list) if err != nil { return fmt.Errorf("%s: Unable to understand list result %#v (%v)", r.name, list, err) } initTrace.Step("Objects extracted") if err := r.syncWith(items, resourceVersion); err != nil { return fmt.Errorf("%s: Unable to sync list result: %v", r.name, err) } initTrace.Step("SyncWith done") r.setLastSyncResourceVersion(resourceVersion) initTrace.Step("Resource version updated") return nil }(); err != nil { return err } resyncerrc := make(chan error, 1) cancelCh := make(chan struct{}) defer close(cancelCh) go func() { resyncCh, cleanup := r.resyncChan() defer func() { cleanup() // Call the last one written into cleanup }() for { select { case <-resyncCh: case <-stopCh: return case <-cancelCh: return } if r.ShouldResync == nil || r.ShouldResync() { klog.V(4).Infof("%s: forcing resync", r.name) if err := r.store.Resync(); err != nil { resyncerrc <- err return } } cleanup() resyncCh, cleanup = r.resyncChan() } }() for { // give the stopCh a chance to stop the loop, even in case of continue statements further down on errors select { case <-stopCh: return nil default: } timeoutSeconds := int64(minWatchTimeout.Seconds() * (rand.Float64() + 1.0)) options = metav1.ListOptions{ ResourceVersion: resourceVersion, // We want to avoid situations of hanging watchers. Stop any wachers that do not // receive any events within the timeout window. 
TimeoutSeconds: &timeoutSeconds, // To reduce load on kube-apiserver on watch restarts, you may enable watch bookmarks. // Reflector doesn't assume bookmarks are returned at all (if the server do not support // watch bookmarks, it will ignore this field). AllowWatchBookmarks: true, } w, err := r.listerWatcher.Watch(options) if err != nil { switch err { case io.EOF: // watch closed normally case io.ErrUnexpectedEOF: klog.V(1).Infof("%s: Watch for %v closed with unexpected EOF: %v", r.name, r.expectedTypeName, err) default: utilruntime.HandleError(fmt.Errorf("%s: Failed to watch %v: %v", r.name, r.expectedTypeName, err)) } // If this is "connection refused" error, it means that most likely apiserver is not responsive. // It doesn't make sense to re-list all objects because most likely we will be able to restart // watch where we ended. // If that's the case wait and resend watch request. if utilnet.IsConnectionRefused(err) { time.Sleep(time.Second) continue } return nil } if err := r.watchHandler(w, &resourceVersion, resyncerrc, stopCh); err != nil { if err != errorStopRequested { switch { case apierrs.IsResourceExpired(err): klog.V(4).Infof("%s: watch of %v ended with: %v", r.name, r.expectedTypeName, err) default: klog.Warningf("%s: watch of %v ended with: %v", r.name, r.expectedTypeName, err) } } return nil } } } ``` ListAndWatch List在程序第一次运行时获取该资源下所有的对象数据并将其存储至DeltaFIFO中。以Informers Example代码示例为例,在其中,我们获取的是所有Pod的资源数据。ListAndWatch List流程图如下所示。 ![list-and-watcher](../images/list-and-watcher.png) (1)r.listerWatcher.List用于获取资源下的所有对象的数据,例如,获取所有Pod的资源数据。获取资源数据是由options的ResourceVersion(资源版本号)参数控制的,如果ResourceVersion为0,则表示获取所有Pod的资源数据;如果ResourceVersion非0,则表示根据资源版本号继续获取,功能有些类似于文件传输过程中的“断点续传”,当传输过程中遇到网络故障导致中断,下次再连接时,会根据资源版本号继续传输未完成的部分。可以使本地缓存中的数据与Etcd集群中的数据保持一致。 (2)listMetaInterface.GetResourceVersion用于获取资源版本号,ResourceVersion (资源版本号)非常重要,Kubernetes中所有的资源都拥有该字段,它标识当前资源对象的版本号。每次修改当前资源对象时,Kubernetes API Server都会更改ResourceVersion,使得client-go执行Watch操作时可以根据ResourceVersion来确定当前资源对象是否发生变化。更多关于ResourceVersion资源版本号的内容,请参考6.5.2节“ResourceVersion资源版本号”。 (3)meta.ExtractList用于将资源数据转换成资源对象列表,将runtime.Object对象转换成[]runtime.Object对象。因为r.listerWatcher.List获取的是资源下的所有对象的数据,例如所有的Pod资源数据,所以它是一个资源列表。 (4) r.syncWith用于将资源对象列表中的资源对象和资源版本号存储至DeltaFIFO中,并会替换已存在的对象。 (5)r.setLastSyncResourceVersion用于设置最新的资源版本号。 r.listerWatcher.List函数实际调用了Pod Informer下的ListFunc函数(NewFilteredListWatchFromClient),它通过ClientSet客户端与Kubernetes API Server交互并获取Pod资源列表数据.
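结合 Pod Informer 来看,r.listerWatcher 背后的 List/Watch 大致就是对 ClientSet 的一层包装,形式大致如下(仅为示意代码,client、namespace 为假设的变量,并省略了 tweakListOptions 等细节):

```go
// 仅作示意:Pod informer 内部构造的 ListerWatcher,List/Watch 最终还是通过 ClientSet 访问 apiserver
lw := &cache.ListWatch{
	ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
		return client.CoreV1().Pods(namespace).List(options)
	},
	WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
		return client.CoreV1().Pods(namespace).Watch(options)
	},
}
```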
#### 4.2 watcher Watch(监控)操作通过HTTP协议与Kubernetes API Server建立长连接,接收Kubernetes API Server发来的资源变更事件。Watch操作的实现机制使用HTTP协议的分块传输编码(Chunked Transfer Encoding)。当client-go调用Kubernetes API Server时,Kubernetes API Server在Response的HTTPHeader中设置Transfer-Encoding的值为chunked,表示采用分块传输编码,客户端收到该信息后,便与服务端进行连接,并等待下一个数据块(即资源的事件信息)。 ListAndWatch Watch代码示例如下: ```go for { timeoutSeconds := int64(minWatchTimeout.Seconds() * (rand.Float64() + 1.0)) // 列出要watcher的资源和timeout时间 options = metav1.ListOptions{ ResourceVersion: resourceVersion, // We want to avoid situations of hanging watchers. Stop any wachers that do not // receive any events within the timeout window. TimeoutSeconds: &timeoutSeconds, } r.metrics.numberOfWatches.Inc() // 这个就是reflector提到的watch函数 w, err := r.listerWatcher.Watch(options) // 用于处理资源的变更事件。 if err := r.watchHandler(w, &resourceVersion, resyncerrc, stopCh); err != nil { if err != errorStopRequested { glog.Warningf("%s: watch of %v ended with: %v", r.name, r.expectedType, err) } return nil } } ```
r.watchHandler用于处理资源的变更事件。当触发Added(资源添加)事件、Updated(资源更新)事件、Deleted(资源删除)事件时,将对应的资源对象更新到本地缓存DeltaFIFO中并更新ResourceVersion资源版本号。r.watchHandler的核心逻辑如下(外层是一个 for-select 循环,不断从 w.ResultChan() 中取出事件):

```
// watchHandler watches w and keeps *resourceVersion up to date.
func (r *Reflector) watchHandler(w watch.Interface, resourceVersion *string, errc chan error, stopCh <-chan struct{}) error {
	start := r.clock.Now()
	eventCount := 0

	defer w.Stop()

loop:
	for {
		select {
		case <-stopCh:
			return errorStopRequested
		case err := <-errc:
			return err
		case event, ok := <-w.ResultChan():
			if !ok {
				break loop
			}
			if event.Type == watch.Error {
				return apierrs.FromObject(event.Object)
			}
			// 取出事件对象的元数据,主要是为了拿到它的 resourceVersion
			meta, err := meta.Accessor(event.Object)
			if err != nil {
				utilruntime.HandleError(fmt.Errorf("%s: unable to understand watch event %#v", r.name, event))
				continue
			}
			newResourceVersion := meta.GetResourceVersion()
			switch event.Type {
			case watch.Added:
				err := r.store.Add(event.Object)
				if err != nil {
					utilruntime.HandleError(fmt.Errorf("%s: unable to add watch event object (%#v) to store: %v", r.name, event.Object, err))
				}
			case watch.Modified:
				err := r.store.Update(event.Object)
				if err != nil {
					utilruntime.HandleError(fmt.Errorf("%s: unable to update watch event object (%#v) to store: %v", r.name, event.Object, err))
				}
			case watch.Deleted:
				// TODO: Will any consumers need access to the "last known
				// state", which is passed in event.Object? If so, may need
				// to change this.
				err := r.store.Delete(event.Object)
				if err != nil {
					utilruntime.HandleError(fmt.Errorf("%s: unable to delete watch event object (%#v) from store: %v", r.name, event.Object, err))
				}
			default:
				utilruntime.HandleError(fmt.Errorf("%s: unable to understand watch event %#v", r.name, event))
			}
			// 处理完一个事件后,将本地记录的 resourceVersion 更新为该事件对象携带的版本号(并非简单地+1)
			*resourceVersion = newResourceVersion
			r.setLastSyncResourceVersion(newResourceVersion)
			eventCount++
		}
	}

	watchDuration := r.clock.Now().Sub(start)
	if watchDuration < 1*time.Second && eventCount == 0 {
		r.metrics.numberOfShortWatches.Inc()
		return fmt.Errorf("very short watch: %s: Unexpected watch close - watch lasted less than a second and no items received", r.name)
	}
	glog.V(4).Infof("%s: Watch close - %v total %v items received", r.name, r.expectedType, eventCount)
	return nil
}
```
### 5. DeltaFIFO DeltaFIFO可以分开理解,FIFO是一个先进先出的队列,它拥有队列操作的基本方法,例如Add、Update、Delete、List、Pop、Close等,而Delta是一个资源对象存储,它可以保存资源对象的操作类型,例如Added(添加)操作类型、Updated(更新)操作类型、Deleted(删除)操作类型、Sync(同步)操作类型等。DeltaFIFO结构代码示例如下: ``` type DeltaFIFO struct { // lock/cond protects access to 'items' and 'queue'. lock sync.RWMutex cond sync.Cond // We depend on the property that items in the set are in // the queue and vice versa, and that all Deltas in this // map have at least one Delta. items map[string]Deltas queue []string // populated is true if the first batch of items inserted by Replace() has been populated // or Delete/Add/Update was called first. populated bool // initialPopulationCount is the number of items inserted by the first call of Replace() initialPopulationCount int // keyFunc is used to make the key used for queued item // insertion and retrieval, and should be deterministic. keyFunc KeyFunc // knownObjects list keys that are "known", for the // purpose of figuring out which items have been deleted // when Replace() or Delete() is called. knownObjects KeyListerGetter // Indication the queue is closed. // Used to indicate a queue is closed so a control loop can exit when a queue is empty. // Currently, not used to gate any of CRED operations. closed bool closedLock sync.Mutex } ``` DeltaFIFO与其他队列最大的不同之处是,它会保留所有关于资源对象(obj)的操作类型,队列中会存在拥有不同操作类型的同一个资源对象,消费者在处理该资源对象时能够了解该资源对象所发生的事情。queue字段存储资源对象的key,该key通过KeyOf函数计算得到。items字段通过map数据结构的方式存储,value存储的是对象的Deltas数组。DeltaFIFO存储结构如下图所示。 ![delta](../images/delta.png) DeltaFIFO本质上是一个先进先出的队列,有数据的生产者和消费者,其中生产者是Reflector调用的Add方法,消费者是Controller调用的Pop方法。 #### 5.1 生产者 ``` // Add inserts an item, and puts it in the queue. The item is only enqueued // if it doesn't already exist in the set. func (f *DeltaFIFO) Add(obj interface{}) error { f.lock.Lock() defer f.lock.Unlock() f.populated = true return f.queueActionLocked(Added, obj) } // Update is just like Add, but makes an Updated Delta. func (f *DeltaFIFO) Update(obj interface{}) error { f.lock.Lock() defer f.lock.Unlock() f.populated = true return f.queueActionLocked(Updated, obj) } // Delete is just like Add, but makes an Deleted Delta. If the item does not // already exist, it will be ignored. (It may have already been deleted by a // Replace (re-list), for example. func (f *DeltaFIFO) Delete(obj interface{}) error { id, err := f.KeyOf(obj) if err != nil { return KeyError{obj, err} } f.lock.Lock() defer f.lock.Unlock() f.populated = true if f.knownObjects == nil { if _, exists := f.items[id]; !exists { // Presumably, this was deleted when a relist happened. // Don't provide a second report of the same deletion. return nil } } else { // We only want to skip the "deletion" action if the object doesn't // exist in knownObjects and it doesn't have corresponding item in items. // Note that even if there is a "deletion" action in items, we can ignore it, // because it will be deduped automatically in "queueActionLocked" _, exists, err := f.knownObjects.GetByKey(id) _, itemsExist := f.items[id] if err == nil && !exists && !itemsExist { // Presumably, this was deleted when a relist happened. // Don't provide a second report of the same deletion. return nil } } return f.queueActionLocked(Deleted, obj) } ``` DeltaFIFO队列中的资源对象在Added(资源添加)事件、Updated(资源更新)事件、Deleted(资源删除)事件中都调用了queueActionLocked函数,它是DeltaFIFO实现的关键,代码示例如下: ``` // queueActionLocked appends to the delta list for the object. // Caller must lock first. 
func (f *DeltaFIFO) queueActionLocked(actionType DeltaType, obj interface{}) error { id, err := f.KeyOf(obj) if err != nil { return KeyError{obj, err} } // If object is supposed to be deleted (last event is Deleted), // then we should ignore Sync events, because it would result in // recreation of this object. if actionType == Sync && f.willObjectBeDeletedLocked(id) { return nil } newDeltas := append(f.items[id], Delta{actionType, obj}) newDeltas = dedupDeltas(newDeltas) _, exists := f.items[id] if len(newDeltas) > 0 { if !exists { f.queue = append(f.queue, id) } f.items[id] = newDeltas f.cond.Broadcast() } else if exists { // We need to remove this from our map (extra items // in the queue are ignored if they are not in the // map). delete(f.items, id) } return nil } ``` queueActionLocked代码执行流程如下。 (1)通过f.KeyOf函数计算出资源对象的key。 (2)如果操作类型为Sync,则标识该数据来源于Indexer(本地存储)。如果Indexer中的资源对象已经被删除,则直接返回。 (3)将actionType和资源对象构造成Delta,添加到items中,并通过dedupDeltas函数进行去重操作。 (4)更新构造后的Delta并通过cond.Broadcast通知所有消费者解除阻塞。
#### 5.2 消费者

Pop方法作为消费者方法使用,从DeltaFIFO的头部取出最早进入队列中的资源对象数据。Pop方法须传入process函数(**在informer场景下,这里传入的process函数就是下文的HandleDeltas**),用于接收并处理对象的回调方法,代码示例如下:

```
// Pop blocks until an item is added to the queue, and then returns it. If
// multiple items are ready, they are returned in the order in which they were
// added/updated. The item is removed from the queue (and the store) before it
// is returned, so if you don't successfully process it, you need to add it back
// with AddIfNotPresent().
// process function is called under lock, so it is safe update data structures
// in it that need to be in sync with the queue (e.g. knownKeys). The PopProcessFunc
// may return an instance of ErrRequeue with a nested error to indicate the current
// item should be requeued (equivalent to calling AddIfNotPresent under the lock).
//
// Pop returns a 'Deltas', which has a complete list of all the things
// that happened to the object (deltas) while it was sitting in the queue.
func (f *DeltaFIFO) Pop(process PopProcessFunc) (interface{}, error) {
	f.lock.Lock()
	defer f.lock.Unlock()
	for {
		for len(f.queue) == 0 {
			// When the queue is empty, invocation of Pop() is blocked until new item is enqueued.
			// When Close() is called, the f.closed is set and the condition is broadcasted.
			// Which causes this loop to continue and return from the Pop().
			if f.IsClosed() {
				return nil, FIFOClosedError
			}
			f.cond.Wait()
		}
		id := f.queue[0]
		f.queue = f.queue[1:]
		item, ok := f.items[id]
		if f.initialPopulationCount > 0 {
			f.initialPopulationCount--
		}
		if !ok {
			// Item may have been deleted subsequently.
			continue
		}
		// 从队列中删除
		delete(f.items, id)
		// 然后调用 process 处理,这里的 item 还是该对象积累下来的 Deltas 列表,例如 objkey1 -> {"add",obj1; "update",obj1}
		err := process(item)
		if e, ok := err.(ErrRequeue); ok {
			f.addIfNotPresent(id, item)
			err = e.Err
		}
		// Don't need to copyDeltas here, because we're transferring
		// ownership to the caller.
		return item, err
	}
}
```
当队列中没有数据时,通过f.cond.wait阻塞等待数据,只有收到cond.Broadcast时才说明有数据被添加,解除当前阻塞状态。如果队列中不为空,取出f.queue的头部数据,将该对象传入process回调函数,由上层消费者进行处理。如果process回调函数处理出错,则将该对象重新存入队列。Controller的processLoop方法负责从DeltaFIFO队列中取出数据传递给process回调函数。process回调函数代码示例如下: ```go func (s *sharedIndexInformer) HandleDeltas(obj interface{}) error { s.blockDeltas.Lock() defer s.blockDeltas.Unlock() // from oldest to newest for _, d := range obj.(Deltas) { switch d.Type { case Sync, Added, Updated: isSync := d.Type == Sync s.cacheMutationDetector.AddObject(d.Object) if old, exists, err := s.indexer.Get(d.Object); err == nil && exists { if err := s.indexer.Update(d.Object); err != nil { return err } s.processor.distribute(updateNotification{oldObj: old, newObj: d.Object}, isSync) } else { if err := s.indexer.Add(d.Object); err != nil { return err } s.processor.distribute(addNotification{newObj: d.Object}, isSync) } case Deleted: if err := s.indexer.Delete(d.Object); err != nil { return err } s.processor.distribute(deleteNotification{oldObj: d.Object}, false) } } return nil } ``` HandleDeltas函数作为process回调函数,当资源对象的操作类型为Added、Updated、Deleted时,将该资源对象存储至Indexer(它是并发安全的存储),并通过distribute函数将资源对象分发至SharedInformer。还记得Informers Example代码示例吗?在Informers Example代码示例中,我们通过informer.AddEventHandler函数添加了对资源事件进行处理的函数,distribute函数则将资源对象分发到该事件处理函数中。
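为了更直观地感受这条"生产-消费"链路,可以脱离 informer 单独使用 DeltaFIFO 跑一个最小示意(下面的对象和 key 都是演示用的假设值):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// 仅作示意:同一个对象的多次变更会被合并进同一个 Deltas 列表,由 Pop 一次性交给 process 处理
	fifo := cache.NewDeltaFIFO(cache.MetaNamespaceKeyFunc, nil)

	pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Namespace: "default", Name: "demo"}}
	fifo.Add(pod)
	fifo.Update(pod)

	fifo.Pop(func(obj interface{}) error {
		for _, d := range obj.(cache.Deltas) {
			fmt.Printf("type=%v, name=%v\n", d.Type, d.Object.(*v1.Pod).Name)
		}
		return nil
	})
	// 预期输出:
	// type=Added, name=demo
	// type=Updated, name=demo
}
```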
#### 5.3 Resync

Resync机制会将Indexer本地存储中的资源对象同步到DeltaFIFO中,并将这些资源对象设置为Sync的操作类型。Resync函数在Reflector中定时执行,它的执行周期由NewReflector函数传入的resyncPeriod参数设定。Resync→syncKeyLocked代码示例如下:

```
func (f *DeltaFIFO) syncKeyLocked(key string) error {
	obj, exists, err := f.knownObjects.GetByKey(key)
	if err != nil {
		glog.Errorf("Unexpected error %v during lookup of key %v, unable to queue object for sync", err, key)
		return nil
	} else if !exists {
		glog.Infof("Key %v does not exist in known objects store, unable to queue object for sync", key)
		return nil
	}
	// If we are doing Resync() and there is already an event queued for that object,
	// we ignore the Resync for it. This is to avoid the race, in which the resync
	// comes with the previous value of object (since queueing an event for the object
	// doesn't trigger changing the underlying store .
	id, err := f.KeyOf(obj)
	if err != nil {
		return KeyError{obj, err}
	}
	if len(f.items[id]) > 0 {
		return nil
	}
	if err := f.queueActionLocked(Sync, obj); err != nil {
		return fmt.Errorf("couldn't queue object: %v", err)
	}
	return nil
}
```

f.knownObjects是Indexer本地存储对象,通过该对象可以获取client-go目前存储的所有资源对象,Indexer对象在NewDeltaFIFO函数实例化DeltaFIFO对象时传入。

### 6.Indexer

Indexer是client-go用来存储资源对象并自带索引功能的本地存储,Reflector从DeltaFIFO中消费出来的资源对象会存储至Indexer,Indexer中的数据与Etcd集群中的数据保持完全一致。client-go可以很方便地从本地存储中读取相应的资源对象数据,而无须每次都从远程Etcd集群中读取,这样可以减轻Kubernetes API Server和Etcd集群的压力。

在介绍Indexer之前,先介绍一下ThreadSafeMap。ThreadSafeMap是实现并发安全的存储。作为存储,它拥有存储相关的增、删、改、查操作方法,例如Add、Update、Delete、List、Get、Replace、Resync等。Indexer在ThreadSafeMap的基础上进行了封装,它继承了与ThreadSafeMap相关的操作方法并实现了IndexFunc等功能,例如Index、IndexKeys、GetIndexers等方法,这些方法为ThreadSafeMap提供了索引功能。Indexer存储结构如下图所示。

![index](../images/index.png)

ThreadSafeMap是一个内存中的存储,其中的数据并不会写入本地磁盘中,每次的增、删、改、查操作都会加锁,以保证数据的一致性。ThreadSafeMap将资源对象数据存储于一个map数据结构中,ThreadSafeMap结构代码示例如下:

```
// threadSafeMap implements ThreadSafeStore
type threadSafeMap struct {
	lock  sync.RWMutex
	items map[string]interface{}

	// indexers maps a name to an IndexFunc
	indexers Indexers
	// indices maps a name to an Index
	indices Indices
}
```

items字段中存储的是资源对象数据,其中items的key通过keyFunc函数计算得到,默认使用MetaNamespaceKeyFunc函数,该函数根据资源对象计算出 `<namespace>/<name>` 格式的key;如果资源对象的namespace为空,则以name作为key,而items的value用于存储资源对象。
#### 6.1 Indexer索引器 在每次增、删、改ThreadSafeMap数据时,都会通过updateIndices或deleteFromIndices函数变更Indexer。Indexer被设计为可以自定义索引函数,这符合Kubernetes高扩展性的特点。Indexer有4个非常重要的数据结构,分别是Indices、Index、Indexers及IndexFunc。直接阅读相关代码会比较晦涩,通过Indexer Example代码示例来理解Indexer,印象会更深刻。Indexer Example代码示例如下: ``` package main import ( "fmt" "k8s.io/api/core/v1" metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" "k8s.io/client-go/tools/cache" "strings" ) func UsersIndexFunc(obj interface{}) ([]string, error) { pod := obj.(*v1.Pod) userString := pod.Annotations["users"] return strings.Split(userString, ","), nil } func main() { index := cache.NewIndexer(cache.MetaNamespaceKeyFunc,cache.Indexers{"byUser": UsersIndexFunc}) pod1 := &v1.Pod{ObjectMeta:metav1.ObjectMeta{Name:"one",Annotations: map[string]string{"users": "ernie,bert"}}} pod2 := &v1.Pod{ObjectMeta:metav1.ObjectMeta{Name:"two",Annotations: map[string]string{"users": "oscar,bert"}}} pod3 := &v1.Pod{ObjectMeta:metav1.ObjectMeta{Name:"tre",Annotations: map[string]string{"users": "ernie,elmo"}}} index.Add(pod1) index.Add(pod2) index.Add(pod3) erniePods, err := index.ByIndex("byUser", "ernie") if err != nil { panic(err) } for _, erniePod := range erniePods { fmt.Println(erniePod.(*v1.Pod).Name) } } ## 输出 one tre ``` 首先定义一个索引器函数UsersIndexFunc,在该函数中,我们定义查询出所有Pod资源下Annotations字段的key为users的Pod。cache.NewIndexer函数实例化了Indexer对象,该函数接收两个参数:第1个参数是KeyFunc,它用于计算资源对象的key,计算默认使用cache.MetaNamespaceKeyFunc函数;第2个参数是cache.Indexers,用于定义索引器,其中key为索引器的名称(即byUser),value为索引器。通过index.Add函数添加3个Pod资源对象。最后通过index.ByIndex函数查询byUser索引器下匹配ernie字段的Pod列表。Indexer Example代码示例最终检索出名称为one和tre的Pod。现在再来理解Indexer的4个重要的数据结构就非常容易了,它们分别是Indexers、IndexFunc、Indices、Index,数据结构如下: ``` // Index maps the indexed value to a set of keys in the store that match on that value type Index map[string]sets.String // Indexers maps a name to a IndexFunc type Indexers map[string]IndexFunc // Indices maps a name to an Index type Indices map[string]Index // IndexFunc knows how to provide an indexed value for an object. type IndexFunc func(obj interface{}) ([]string, error) ``` Indexer数据结构说明如下。 ● Indexers:存储索引器,key为索引器名称,value为索引器的实现函数。 ● IndexFunc:索引器函数,定义为接收一个资源对象,返回检索结果列表。 ● Indices:存储缓存器,key为缓存器名称(在Indexer Example代码示例中,缓存器命名与索引器命名相对应),value为缓存数据。 ● Index:存储缓存数据,其结构为K/V。
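结合上面的示例,可以把 Add 了 3 个 Pod 之后 byUser 这个索引器对应的存储画出来,更容易看清这 4 个数据结构之间的关系(key 由 MetaNamespaceKeyFunc 计算,因为示例中的 Pod 没有 namespace,所以 key 就是 name):

```
Indices = {
    "byUser": Index{                 // 索引器名称 -> Index
        "ernie": {"one", "tre"},     // 索引值 -> 对象 key 的集合
        "bert":  {"one", "two"},
        "oscar": {"two"},
        "elmo":  {"tre"},
    },
}
```

因此 index.ByIndex("byUser", "ernie") 检索出的就是 key 为 one 和 tre 的两个 Pod。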
### 7. 总结 目前通过整体的介绍已经大概理清楚client-go informer的大致过程: (1)定义好 informerFactory, 然后初始化一个Informer (2)定义好add, update, del处理函数 (3)informer.run运行 (4)informer.run初始化了一个reflector,里面实现了list-watch (5)reflector里面使用了deltaFIFO队列对list watch的数据进行处理。 一方面:通过该队列的数据 使得本地cache和etcd数据一致 (indexer里面的数据) 另一方面:之前定义好的add ,update ,del就是这些数据的消费者 当然这个只是大概的运作过程。接下来将详细研究具体每个过程是如何实现的。 ### 8.附录 以下的例子对pod的监听。可以看出来步骤为: (1)生成clientset客户端 (2)New一个 listandwatcher对象,这里是pod (3)实例化一个informer,在这个informer中,指定ADD,UPDATE,DELETE的处理函数。 ``` // creates the clientset clientset, err := kubernetes.NewForConfig(cfg) if err != nil { glog.Errorf("can not creates the clientset: %v\n", err) return nil, err } // create the pod watcher, set the func of list and watch podListWatcher := cache.NewListWatchFromClient( clientset.Core().RESTClient(), "pods", v1.NamespaceAll, fields.Everything(), ) indexer, informer := cache.NewIndexerInformer( podListWatcher, &v1.Pod{}, 0, cache.ResourceEventHandlerFuncs{ AddFunc: func(obj interface{}) {}, UpdateFunc: func(old interface{}, new interface{}) { pusher.PushBlackHole(old, new, opt) }, DeleteFunc: func(obj interface{}) {}, }, cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc}, ) ```
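附录中的代码只是构造了 indexer 和 informer,还需要把 informer Run 起来、等缓存同步完成之后,indexer 里才会有数据。下面是一个最小示意(stopCh 为演示用的假设变量,错误返回方式沿用上文 return nil, err 的写法):

```go
// 仅作示意:运行 informer 并等待缓存同步
stopCh := make(chan struct{})
go informer.Run(stopCh)

if !cache.WaitForCacheSync(stopCh, informer.HasSynced) {
	close(stopCh)
	return nil, fmt.Errorf("timed out waiting for caches to sync")
}

// 同步完成后,可以直接从 indexer(本地缓存)里取数据,而不用再访问 apiserver
for _, obj := range indexer.List() {
	fmt.Printf("cached pod: %v\n", obj.(*v1.Pod).Name)
}
```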
### 9.参考

https://zhuanlan.zhihu.com/p/228534306

https://houmin.cc/posts/1f0eb2ff/



================================================
FILE: k8s/client-go/5. SharedInformerFactory机制.md
================================================

Table of Contents
=================

* [1.章节介绍](#1章节介绍)
* [2. SharedInformerFactory](#2-sharedinformerfactory)
  * [2.1 SharedInformerFactory实例介绍](#21-sharedinformerfactory实例介绍)
  * [2.2 sharedInformerFactory结构体](#22-sharedinformerfactory结构体)
  * [2.3 sharedInformerFactory成员函数](#23-sharedinformerfactory成员函数)
  * [2.4 总结](#24-总结)
* [3. podInformer](#3-podinformer)
  * [3.1 PodInformer结构体](#31-podinformer结构体)
  * [3.2 PodInformer成员函数](#32-podinformer成员函数)
* [4.总结](#4总结)

### 1.章节介绍

本章首先介绍SharedInformerFactory,了解其组成和作用。

然后以PodInformer为例,了解一个具体资源的Informer需要实现哪些函数。

本节并没有涉及图中具体的informer机制,只是从大的入口入手,看看SharedInformerFactory到底是什么。

![informer](../images/informer.png)
### 2. SharedInformerFactory SharedInformerFactory封装了NewSharedIndexInformer方法。字如其名,SharedInformerFactory使用的是工厂模式来生成各类的Informer。无论是k8s控制器,还是自定义控制器, SharedInformerFactory都是非常重要的一环。所以首先分析SharedInformerFactory。这里以一个实例入手分析SharedInformerFactory。 #### 2.1 SharedInformerFactory实例介绍 ``` package main import ( "fmt" clientset "k8s.io/client-go/kubernetes" "k8s.io/client-go/rest" "k8s.io/client-go/informers" "k8s.io/client-go/tools/cache" "k8s.io/api/core/v1" "k8s.io/apimachinery/pkg/labels" "time" ) func main() { config := &rest.Config{ Host: "http://172.21.0.16:8080", } client := clientset.NewForConfigOrDie(config) // 生成SharedInformerFactory factory := informers.NewSharedInformerFactory(client, 5 * time.Second) // 生成PodInformer podInformer := factory.Core().V1().Pods() // 获得一个cache.SharedIndexInformer 单例模式 sharedInformer := podInformer.Informer() //注册add, update, del处理事件 sharedInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{ AddFunc: func(obj interface{}) {fmt.Printf("add: %v\n", obj.(*v1.Pod).Name)}, UpdateFunc: func(oldObj, newObj interface{}) {fmt.Printf("update: %v\n", newObj.(*v1.Pod).Name)}, DeleteFunc: func(obj interface{}){fmt.Printf("delete: %v\n", obj.(*v1.Pod).Name)}, }) stopCh := make(chan struct{}) // 第一种方式 // 可以这样启动 也可以按照下面的方式启动 // go sharedInformer.Run(stopCh) // time.Sleep(2 * time.Second) // 第二种方式,这种方式是启动factory下面所有的informer factory.Start(stopCh) factory.WaitForCacheSync(stopCh) pods, _ := podInformer.Lister().Pods("default").List(labels.Everything()) for _, p := range pods { fmt.Printf("list pods: %v\n", p.Name) } <- stopCh } ``` #### 2.2 sharedInformerFactory结构体 ``` type sharedInformerFactory struct { // client客户端 client kubernetes.Interface // sharedInformerFactory是没有namespaces限制的。不过可以设置namespaces限制该factory后面的informer都是指定namespaces的 namespace string // TweakListOptionsFunc其实就是ListOptions,这个是针对所有Informer List生效的 (WithTweakListOptions可以看出来) tweakListOptions internalinterfaces.TweakListOptionsFunc lock sync.Mutex // 这个是list默认定期同步的时间间隔 defaultResync time.Duration // 每种informer还可以自定义 customResync map[reflect.Type]time.Duration // 属于该factory下面的所有的informer informers map[reflect.Type]cache.SharedIndexInformer // startedInformers is used for tracking which informers have been started. // This allows Start() to be called multiple times safely. // 判断informer是否已经 Run起来了 startedInformers map[reflect.Type]bool } ```
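结构体中的namespace、tweakListOptions、customResync等字段都是通过Option注入的。下面给出一个带选项初始化的示意(假设使用client-go informers包的NewSharedInformerFactoryWithOptions,各参数值仅为举例,client为2.1实例中构造的客户端,省略import):

```
factory := informers.NewSharedInformerFactoryWithOptions(
	client,
	30*time.Second, // defaultResync: 默认的resync周期
	// 限定该factory下所有informer只关注default命名空间
	informers.WithNamespace("default"),
	// 对所有informer的List/Watch追加统一的ListOptions过滤条件
	informers.WithTweakListOptions(func(options *metav1.ListOptions) {
		options.LabelSelector = "app=nginx"
	}),
	// 针对Pod类型单独设置resync周期(customResync)
	informers.WithCustomResyncConfig(map[metav1.Object]time.Duration{
		&v1.Pod{}: 10 * time.Second,
	}),
)
```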
#### 2.3 sharedInformerFactory成员函数 ``` 定义customResync // WithCustomResyncConfig sets a custom resync period for the specified informer types. func WithCustomResyncConfig(resyncConfig map[v1.Object]time.Duration) SharedInformerOption 定义tweakListOptions // WithTweakListOptions sets a custom filter on all listers of the configured SharedInformerFactory. func WithTweakListOptions(tweakListOptions internalinterfaces.TweakListOptionsFunc) SharedInformerOption 定义namespaces // WithNamespace limits the SharedInformerFactory to the specified namespace. func WithNamespace(namespace string) SharedInformerOption // start所有的informer // Start initializes all requested informers. func (f *sharedInformerFactory) Start(stopCh <-chan struct{}) { f.lock.Lock() defer f.lock.Unlock() for informerType, informer := range f.informers { if !f.startedInformers[informerType] { go informer.Run(stopCh) f.startedInformers[informerType] = true } } } // WaitForCacheSync让所有的informers同步cache。一般informer.run函数中都有一个这样的语句。先等cache同步。这个的含义就是等list完了的数据,全部转换到cache中去。 // Wait for all involved caches to be synced, before processing items from the queue is started if !cache.WaitForCacheSync(stopCh, ctrl.Informer.HasSynced) { runtime.HandleError(fmt.Errorf("Timed out waiting for caches to sync")) return } // WaitForCacheSync waits for all started informers' cache were synced. func (f *sharedInformerFactory) WaitForCacheSync(stopCh <-chan struct{}) map[reflect.Type]bool { informers := func() map[reflect.Type]cache.SharedIndexInformer { f.lock.Lock() defer f.lock.Unlock() informers := map[reflect.Type]cache.SharedIndexInformer{} for informerType, informer := range f.informers { if f.startedInformers[informerType] { informers[informerType] = informer } } return informers }() res := map[reflect.Type]bool{} for informType, informer := range informers { res[informType] = cache.WaitForCacheSync(stopCh, informer.HasSynced) } return res } // InternalInformerFor returns the SharedIndexInformer for obj using an internal // client. func (f *sharedInformerFactory) InformerFor(obj runtime.Object, newFunc internalinterfaces.NewInformerFunc) cache.SharedIndexInformer { f.lock.Lock() defer f.lock.Unlock() informerType := reflect.TypeOf(obj) informer, exists := f.informers[informerType] // 如果存在同类的,直接返回,不会再new一个。这里的type就是 pod/deploy if exists { return informer } resyncPeriod, exists := f.customResync[informerType] if !exists { resyncPeriod = f.defaultResync } informer = newFunc(f.client, resyncPeriod) f.informers[informerType] = informer return informer } // SharedInformerFactory provides shared informers for resources in all known // API group versions. type SharedInformerFactory interface { internalinterfaces.SharedInformerFactory ForResource(resource schema.GroupVersionResource) (GenericInformer, error) WaitForCacheSync(stopCh <-chan struct{}) map[reflect.Type]bool // 提供k8s内置资源的定义接口,从这里可以看出来 Admissionregistration() admissionregistration.Interface Apps() apps.Interface Autoscaling() autoscaling.Interface Batch() batch.Interface Certificates() certificates.Interface Coordination() coordination.Interface Core() core.Interface Events() events.Interface Extensions() extensions.Interface Networking() networking.Interface Policy() policy.Interface Rbac() rbac.Interface Scheduling() scheduling.Interface Settings() settings.Interface Storage() storage.Interface } // 例如core组下面的资源,f.Core().v1.pods() 就是这个 func (f *sharedInformerFactory) Core() core.Interface { return core.New(f, f.namespace, f.tweakListOptions) } ```
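InformerFor中"同类型只new一次"的共享逻辑,可以用下面的小例子验证(仅为示意,client的构造方式见2.1的实例):

```
factory := informers.NewSharedInformerFactory(client, 0)

// 对同一种资源类型(Pod)多次获取, 拿到的是同一个SharedIndexInformer实例
informer1 := factory.Core().V1().Pods().Informer()
informer2 := factory.Core().V1().Pods().Informer()
fmt.Println(informer1 == informer2) // true: 同类型共享一个informer

// 不同资源类型(Deployment)则会再new一个
deployInformer := factory.Apps().V1().Deployments().Informer()
fmt.Println(informer1 == deployInformer) // false
```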
#### 2.4 总结 通过对sharedInformerFactory的成员和函数介绍,了解到: (1)factory就是提供了一个构造informer的入口,里面包含了一堆Informer (2)同一中资源类型共用一个Infomer。这样的话就可以节省不必要的资源。例如kcm中,rs可以需要监听pod资源,gc也需要监听Pod资源,通过factory机制就可以使用同一个 (3)但是监听同一种类型的资源,但是不同的listOption看起来也是不行,例如一个Informer监听running的pod,一个Informer监听error的Pod, 是需要多个factory。 ### 3. podInformer 从上诉可以看出来,sharedInformerFactory只是一个入口。接下来以podInformer为例,看看一个具体的资源Informer需要实现哪些功能。 #### 3.1 PodInformer结构体 ``` // PodInformer provides access to a shared informer and lister for // Pods. // 只需要实现Informer,Lister函数 type PodInformer interface { Informer() cache.SharedIndexInformer Lister() v1.PodLister } type podInformer struct { factory internalinterfaces.SharedInformerFactory // 是哪一个factory生成的informer tweakListOptions internalinterfaces.TweakListOptionsFunc // 有哪些filter namespace string // 命名空间 } ``` #### 3.2 PodInformer成员函数 从函数定义可以看出来,informer其实就是 cache.SharedIndexInformer New SharedIndexInformer的时候指定了ListWatch函数。 listFunc: client.CoreV1().Pods(namespace).List(options) WatchFunc: client.CoreV1().Pods(namespace).Watch(options) 所以从结构体上推测: (1) informer最终都是 cache.SharedIndexInformer。但是 cache.SharedIndexInformer需要先定义好list, watch函数 (2)cache.SharedIndexInformer里面的index就是存储+查询。根据定义好的list, watch更新index的数据 接下来继续看看cache.SharedIndexInformer是如何实现的。 ``` // NewPodInformer constructs a new informer for Pod type. // Always prefer using an informer factory to get a shared informer instead of getting an independent // one. This reduces memory footprint and number of connections to the server. func NewPodInformer(client kubernetes.Interface, namespace string, resyncPeriod time.Duration, indexers cache.Indexers) cache.SharedIndexInformer { return NewFilteredPodInformer(client, namespace, resyncPeriod, indexers, nil) } // NewFilteredPodInformer constructs a new informer for Pod type. // Always prefer using an informer factory to get a shared informer instead of getting an independent // one. This reduces memory footprint and number of connections to the server. 
func NewFilteredPodInformer(client kubernetes.Interface, namespace string, resyncPeriod time.Duration, indexers cache.Indexers, tweakListOptions internalinterfaces.TweakListOptionsFunc) cache.SharedIndexInformer { return cache.NewSharedIndexInformer( &cache.ListWatch{ ListFunc: func(options metav1.ListOptions) (runtime.Object, error) { if tweakListOptions != nil { tweakListOptions(&options) } return client.CoreV1().Pods(namespace).List(options) }, WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) { if tweakListOptions != nil { tweakListOptions(&options) } return client.CoreV1().Pods(namespace).Watch(options) }, }, &corev1.Pod{}, resyncPeriod, indexers, ) } // 默认只有namespaces这个indexer func (f *podInformer) defaultInformer(client kubernetes.Interface, resyncPeriod time.Duration) cache.SharedIndexInformer { return NewFilteredPodInformer(client, f.namespace, resyncPeriod, cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc}, f.tweakListOptions) } func (f *podInformer) Informer() cache.SharedIndexInformer { return f.factory.InformerFor(&corev1.Pod{}, f.defaultInformer) } // 返回Lister数据, 这里是从index里面获取,而不是从apiserver中获取 func (f *podInformer) Lister() v1.PodLister { return v1.NewPodLister(f.Informer().GetIndexer()) } cache中的index定义 k8s.io/client-go/tools/cache/index.go // Indexer is a storage interface that lets you list objects using multiple indexing functions type Indexer interface { Store // Retrieve list of objects that match on the named indexing function Index(indexName string, obj interface{}) ([]interface{}, error) // IndexKeys returns the set of keys that match on the named indexing function. IndexKeys(indexName, indexKey string) ([]string, error) // ListIndexFuncValues returns the list of generated values of an Index func ListIndexFuncValues(indexName string) []string // ByIndex lists object that match on the named indexing function with the exact key ByIndex(indexName, indexKey string) ([]interface{}, error) // GetIndexer return the indexers GetIndexers() Indexers // AddIndexers adds more indexers to this store. If you call this after you already have data // in the store, the results are undefined. AddIndexers(newIndexers Indexers) error } ``` ### 4.总结 (1)factory就是提供了一个构造informer的入口,里面包含了一堆Informer (2)同一中资源类型共用一个Infomer。这样的话就可以节省不必要的资源。例如kcm中,rs可以需要监听pod资源,gc也需要监听Pod资源,通过factory机制就可以使用同一个 (3)但是监听同一种类型的资源,但是不同的listOption看起来也是不行,例如一个Informer监听running的pod,一个Informer监听error的Pod, 是需要多个factory。 (4)当前factory并没有利用到图中表示Informer机制。最终是cache.SharedIndexInformer 包含了所有的参数,实现了上诉图中的Informer机制。下一节开始介绍cache.SharedIndexInformer ================================================ FILE: k8s/client-go/6. informer机制之cache.indexer机制.md ================================================ Table of Contents ================= * [1. 背景](#1-背景) * [2. Indexer结构说明](#2-indexer结构说明) * [3 store结构说明](#3-store结构说明) * [4. cache](#4-cache) * [4.1 cache结构说明](#41-cache结构说明) * [4.2 ThreadSafeStore结构说明](#42-threadsafestore结构说明) * [4.3 举例说明](#43-举例说明) * [4.4 Cache总结](#44-cache总结) * [5. cache.index在informer中的应用](#5-cacheindex在informer中的应用) ### 1. 背景 tool/cache.indexer是informer中提供本地缓存,并且带有丰富索引的机制。 index是索引的实现。类似于数据库的索引一样,index可以加快查找速度。 本节就是弄清楚cache中indexer是如何实现的 本节研究的内容位置整个informer机制的红色圈起来区域 ![informer-indexer](../images/informer-indexer.png) 如何存储+如何索引
### 2. Indexer结构说明 Indexer是一个接口,包含两个部分: (1)Store。从Store定义来看,Store是真正保存数据的结构体。Store本身也是一个接口,具体的存储需要实现这些接口。 (2)Index,IndexKeys,ListIndexFuncValues,ByIndex,GetIndexers,AddIndexers 等和操作索引有关的函数 ``` // IndexFunc knows how to provide an indexed value for an object. type IndexFunc func(obj interface{}) ([]string, error) // Index maps the indexed value to a set of keys in the store that match on that value type Index map[string]sets.String // Indexers maps a name to a IndexFunc type Indexers map[string]IndexFunc // Indices maps a name to an Index type Indices map[string]Index // Indexer接口是为了添加或者查询索引用的。当前可能一下子看注释很迷惑,先看看后面的例子就清楚了 type Indexer interface { Store // 通过indexName获得索引函数,然后obj(pod)对象作为函数输入,输出所有检索值。然后找出来所有包含检索值的对象(pod) // 举例pod1 通过byuser这个函数,检索出来有ernie,bert两个检索值 // 然后Index("byuser",pod1) 会输出pod1, pod2(包含bert),pod3(包含ernie) // Retrieve list of objects that match on the named indexing function Index(indexName string, obj interface{}) ([]interface{}, error) // 通过索引函数的名字(byUser)+具体的值(bert),获得pod的名字(ns/podName) // IndexKeys returns the set of keys that match on the named indexing function. IndexKeys(indexName, indexKey string) ([]string, error) // 通过索引函数的名字(byUser), 获得所有的索引值。这里输入byuser, 输出:ernie, bert, elmo, oscar // ListIndexFuncValues returns the list of generated values of an Index func ListIndexFuncValues(indexName string) []string // 通过索引函数的名字(byUser)+具体的值(bert),获得pod对象 // ByIndex lists object that match on the named indexing function with the exact key ByIndex(indexName, indexKey string) ([]interface{}, error) // 返回所有的索引函数 // GetIndexer return the indexers GetIndexers() Indexers // AddIndexers adds more indexers to this store. If you call this after you already have data // in the store, the results are undefined. // 添加 索引函数。每个索引函数都有一个唯一的名字,那就是 indexName AddIndexers(newIndexers Indexers) error } ```
Store是一个存储的接口,后面结合具体存储实现再讲。这里先讲一下 IndexFunc, Index, Indexers, Indices 之间的关系。

IndexFunc:索引函数。输入一个对象,输出该对象在这个索引函数下匹配到的字段(索引值)列表。

Index:索引表。map结构,key是索引值,value是命中该索引值的对象key集合(对象key由初始化Indexer时指定的KeyFunc计算,默认是 ns/name 表示一个对象)。

Indexers:索引函数表。map结构,索引函数可以有多个,所以每个索引函数需要起一个名字来标识。map的key是索引函数的名称,value是对应的索引函数。

Indices:Index的复数形式,也是map结构。每个索引函数名对应一个Index,每个Index里又有很多索引值,每个索引值对应很多实际的对象key。单个Index只能回答"某个索引值对应哪些对象";Indices则可以先通过索引函数名找到Index,再查到每个索引值对应的对象。
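为了更直观,可以把这几种结构用一个假想的字面量写出来(仅为示意,对应后面4.3小节的byUser例子,省略import):

```
// 一个名为byUser的索引函数(IndexFunc)
indexers := cache.Indexers{
	"byUser": func(obj interface{}) ([]string, error) {
		pod := obj.(*v1.Pod)
		return strings.Split(pod.Annotations["users"], ","), nil
	},
}

// Indices: 索引函数名 -> Index;  Index: 索引值 -> 对象key集合
indices := cache.Indices{
	"byUser": cache.Index{
		"ernie": sets.NewString("one", "tre"),
		"bert":  sets.NewString("one", "two"),
	},
}
_ = indexers
_ = indices
```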
### 3 store结构说明 store可以认为只是一个父类,它只是一个接口,说明了要想实现存储,必须要实现这些函数。 ``` // Store is a generic object storage interface. Reflector knows how to watch a server // and update a store. A generic store is provided, which allows Reflector to be used // as a local caching system, and an LRU store, which allows Reflector to work like a // queue of items yet to be processed. // // Store makes no assumptions about stored object identity; it is the responsibility // of a Store implementation to provide a mechanism to correctly key objects and to // define the contract for obtaining objects by some arbitrary key type. type Store interface { Add(obj interface{}) error //往存储增加,更新,删除元素 Update(obj interface{}) error Delete(obj interface{}) error List() []interface{} ListKeys() []string Get(obj interface{}) (item interface{}, exists bool, err error) GetByKey(key string) (item interface{}, exists bool, err error) // Replace will delete the contents of the store, using instead the // given list. Store takes ownership of the list, you should not reference // it after calling this function. Replace([]interface{}, string) error Resync() error } ``` ### 4. cache #### 4.1 cache结构说明 cache结构体本身只有 cacheStorage + keyFunc两个元素。 ``` // cache responsibilities are limited to: // 1. Computing keys for objects via keyFunc // 2. Invoking methods of a ThreadSafeStorage interface type cache struct { // cacheStorage bears the burden of thread safety for the cache cacheStorage ThreadSafeStore // keyFunc is used to make the key for objects stored in and retrieved from items, and // should be deterministic. keyFunc KeyFunc } ``` cacheStorage是真正的存储结构。 keyFunc 就是如何通过一个 String 定位到一个对象(例如pod) 查看k8s.io/client-go/tools/cache/store.go 中的函数定义。 可以发现 cache即实现了 indexer的所有函数,又实现了store的所有函数。但是cache结构的所有方法都是调用了成员变量cacheStorage的方法。如下: ``` // Add inserts an item into the cache. func (c *cache) Add(obj interface{}) error { key, err := c.keyFunc(obj) if err != nil { return KeyError{obj, err} } c.cacheStorage.Add(key, obj) return nil } ``` 所以`ThreadSafeStore`才是真正实现了 缓存+索引 功能的结构体。 #### 4.2 ThreadSafeStore结构说明 ThreadSafeStore本身就是一个接口,定义了 store + indexer的所有函数。threadSafeMap是真正的实现类。 在thread_safe_store.go文件一看就非常清楚 k8s.io/client-go/tools/cache/thread_safe_store.go ``` // threadSafeMap implements ThreadSafeStore type threadSafeMap struct { lock sync.RWMutex items map[string]interface{} //真正的存储,存储所有的元数据 // indexers maps a name to an IndexFunc indexers Indexers // indices maps a name to an Index indices Indices } ``` #### 4.3 举例说明 threadSafeMap的实现都非常简单。看看代码就明白了。但是结合上面对Indexer的文字描述太过于枯燥,所以这里以一个例子说明cache.indexer是如何实现 `存储+索引` 的。该例子来源于 k8s.io/client-go/tools/cache/index_test.go 具体如下: ``` // 1. 先定义一个IndexFunc // testUsersIndexFunc 就是上面提到的索引函数 // 从函数的实现可以看出来。这个就是想根据 pod Annotations中users的名字做索引 func testUsersIndexFunc(obj interface{}) ([]string, error) { pod := obj.(*v1.Pod) usersString := pod.Annotations["users"] return strings.Split(usersString, ","), nil } // 2. 初始化一个NewIndexer // NewIndexer必须指定一个func,这个func的作用就是KeyFunc, 能用一个string代表 pod对象。这里就是MetaNamespaceKeyFunc,用ns/name来表示一个pod // 同时还指定一个Indexers。这个表示,当前Indexers只有一个索引函数testUsersIndexFunc,索引函数名为byUser index := NewIndexer(MetaNamespaceKeyFunc, Indexers{"byUser": testUsersIndexFunc}) 查看NewIndexer的定义可以发现,就是生成了cache结构体 // NewIndexer returns an Indexer implemented simply with a map and a lock. 
func NewIndexer(keyFunc KeyFunc, indexers Indexers) Indexer { return &cache{ cacheStorage: NewThreadSafeStore(indexers, Indices{}), keyFunc: keyFunc, } } // 3.定义三个pod // pod1 -> ernie,bert // pod2 -> bert,oscar // pod3 -> ernie,elmo pod1 := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "one", Annotations: map[string]string{"users": "ernie,bert"}}} pod2 := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "two", Annotations: map[string]string{"users": "bert,oscar"}}} pod3 := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "tre", Annotations: map[string]string{"users": "ernie,elmo"}}} // 4.将三个pod放入pod index.Add(pod1) index.Add(pod2) index.Add(pod3) 到这里先暂停一下,看看上面提到的IndexFunc,Index,Indexers,Indices都有哪些内容 IndexFunc:testUsersIndexFunc threadSafeMap.Indexers: { "byUser": testUsersIndexFunc } threadSafeMap.Indices: { "byUser": { "ernie": ["one","tre"], "bert": ["one","two"], "oscar": ["two"], "elmo": ["tre"], } } Index:就是上面的Indices的一个个数据,就是byUser。因为只有一个索引函数 "byUser": { "ernie": ["one","tre"], "bert": ["one","two"], "oscar": ["two"], "elmo": ["tre"], } threadSafeMap.items { "one" : pod1, "two" : pod2, "tre" : pod3 } 其中。pod1,pod2,pod3都是一个个pod结构的对象。 所以可以看到 threadSafeMap 通过 items实现了存储,Indices + Indexers实现了索引 // 增加一个元素,处理操作items外,还要更新Indices func (c *threadSafeMap) Add(key string, obj interface{}) { c.lock.Lock() defer c.lock.Unlock() oldObject := c.items[key] c.items[key] = obj c.updateIndices(oldObject, obj, key) } 接下来再看看 threadSafeMap 是如何实现 索引的各个函数的。代码不在贴了,直接写输出。 // indexName就是索引函数名,obj(pod)就是pod对象。 // 该函数的功能是, 通过索引函数名,找到索引函数,在将pod作为索引函数的输入,得到所有的检索值。然后再找出来所有包含检索值的对象列表 // Index("byuser",pod1) 会输出[pod1, pod2,pod3] // 原因:pod1通过byuser这个函数,检索出来有ernie,bert两个检索值 // pod1,pod2,pod3都包含ernie,bert之一,所有都符合条件 Index(indexName string, obj interface{}) ([]interface{}, error) // 该函数的功能是:通过索引函数名+索引值,得到所有的对象的名字 // 举例:IndexKeys("byUser", "bert")的输出是: ["one","two"] // IndexKeys returns the set of keys that match on the named indexing function. IndexKeys(indexName, indexKey string) ([]string, error) // 该函数的功能是:根据索引函数名,得到所有的索引值 // 举例:ListIndexFuncValues("byuser") 输出为:ernie, bert, elmo, oscar // ListIndexFuncValues returns the list of generated values of an Index func ListIndexFuncValues(indexName string) []string // 该函数的功能是:通过索引函数名+索引值,得到所有的对象 // 举例:IndexKeys("byUser", "bert")的输出是: [pod1,pod2] // IndexKeys得到的是对象的名字(key) // ByIndex lists object that match on the named indexing function with the exact key ByIndex(indexName, indexKey string) ([]interface{}, error) // 返回所有的索引函数 // GetIndexer return the indexers GetIndexers() Indexers // AddIndexers adds more indexers to this store. If you call this after you already have data // in the store, the results are undefined. // 添加 索引函数。每个索引函数都有一个唯一的名字,那就是 indexName AddIndexers(newIndexers Indexers) error ```
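上面各个索引方法的输入输出,也可以直接写一小段代码验证(仅为示意,沿用4.3例子中的 index 与 pod1/pod2/pod3):

```
// IndexKeys: 索引函数名 + 索引值 -> 对象key列表
keys, _ := index.IndexKeys("byUser", "bert")
fmt.Println(keys) // [one two]

// ListIndexFuncValues: 索引函数名 -> 所有索引值(顺序不保证)
values := index.ListIndexFuncValues("byUser")
fmt.Println(values) // [ernie bert oscar elmo]

// ByIndex: 索引函数名 + 索引值 -> 对象本身
objs, _ := index.ByIndex("byUser", "ernie")
for _, obj := range objs {
	fmt.Println(obj.(*v1.Pod).Name) // one, tre
}

// Index: 索引函数名 + 对象 -> 与该对象至少共享一个索引值的所有对象
pods, _ := index.Index("byUser", pod1)
fmt.Println(len(pods)) // 3: pod1的索引值是ernie,bert, 三个pod都至少命中其一
```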
#### 4.4 Cache总结

(1)cache提供了 存储+索引 的功能,最终是通过threadSafeMap实现的

(2)threadSafeMap中items实现了存储,Indexers + Indices 实现了索引

(3)add, del, update元素时,除了更新items这个map,还要同步更新Indices中的索引数据

(4)吐槽一下,Indexers、Indices、Index这些名字感觉没起好,乍一看莫名其妙
### 5. cache.index在informer中的应用 以podinformer为例介绍cache这一套在informer中的应用。本节只是介绍podinformer是如何生成cache的。具体cache的更新,结合list watcher再做说明。
k8s.io/client-go/informers/core/v1/pod.go (1)defaultInformer传入的是cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc} indexers就是一个map。因为索引函数有很多,所以就需要一个名字来区分不同的索引函数。 比如MetaNamespaceIndexFunc,就是根据对象的namespace来做索引 ``` func (f *podInformer) defaultInformer(client kubernetes.Interface, resyncPeriod time.Duration) cache.SharedIndexInformer { return NewFilteredPodInformer(client, f.namespace, resyncPeriod, cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc}, f.tweakListOptions) } key是一个string const ( NamespaceIndex string = "namespace" ) type IndexFunc func(obj interface{}) ([]string, error) // MetaNamespaceIndexFunc is a default index function that indexes based on an object's namespace func MetaNamespaceIndexFunc(obj interface{}) ([]string, error) { meta, err := meta.Accessor(obj) if err != nil { return []string{""}, fmt.Errorf("object has no meta: %v", err) } return []string{meta.GetNamespace()}, nil } ``` (2)cache.Indexers是一个参数,传入到了SharedIndexInformer的实例化 ``` func NewFilteredPodInformer(client kubernetes.Interface, namespace string, resyncPeriod time.Duration, indexers cache.Indexers, tweakListOptions internalinterfaces.TweakListOptionsFunc) cache.SharedIndexInformer { return cache.NewSharedIndexInformer( &cache.ListWatch{ ListFunc: func(options metav1.ListOptions) (runtime.Object, error) { if tweakListOptions != nil { tweakListOptions(&options) } return client.CoreV1().Pods(namespace).List(options) //直接调用apiserver的list接口 }, WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) { if tweakListOptions != nil { tweakListOptions(&options) } return client.CoreV1().Pods(namespace).Watch(options) // 直接调用apiserver的watch接口 }, }, &corev1.Pod{}, //说明是pod对象 resyncPeriod, indexers, //指定indexer ) } ``` 从这里可以看出来,cache只管做缓存+索引。数据来源都定义好了,不用管。
(3) 实例化时调用了 NewIndexer(DeletionHandlingMetaNamespaceKeyFunc, indexers) ``` func NewSharedIndexInformer(lw ListerWatcher, objType runtime.Object, defaultEventHandlerResyncPeriod time.Duration, indexers Indexers) SharedIndexInformer { realClock := &clock.RealClock{} sharedIndexInformer := &sharedIndexInformer{ processor: &sharedProcessor{clock: realClock}, indexer: NewIndexer(DeletionHandlingMetaNamespaceKeyFunc, indexers), listerWatcher: lw, objectType: objType, resyncCheckPeriod: defaultEventHandlerResyncPeriod, defaultEventHandlerResyncPeriod: defaultEventHandlerResyncPeriod, cacheMutationDetector: NewCacheMutationDetector(fmt.Sprintf("%T", objType)), clock: realClock, } return sharedIndexInformer } // NewIndexer returns an Indexer implemented simply with a map and a lock. func NewIndexer(keyFunc KeyFunc, indexers Indexers) Indexer { return &cache{ cacheStorage: NewThreadSafeStore(indexers, Indices{}), keyFunc: keyFunc, } } DeletionHandlingMetaNamespaceKeyFunc 最终调用了 MetaNamespaceKeyFunc 所以 ns/podname 就能代表一个 Pod实例 // MetaNamespaceKeyFunc is a convenient default KeyFunc which knows how to make // keys for API objects which implement meta.Interface. // The key uses the format / unless is empty, then // it's just . // // TODO: replace key-as-string with a key-as-struct so that this // packing/unpacking won't be necessary. func MetaNamespaceKeyFunc(obj interface{}) (string, error) { if key, ok := obj.(ExplicitKey); ok { return string(key), nil } meta, err := meta.Accessor(obj) if err != nil { return "", fmt.Errorf("object has no meta: %v", err) } if len(meta.GetNamespace()) > 0 { return meta.GetNamespace() + "/" + meta.GetName(), nil } return meta.GetName(), nil } ``` (4)informer.indexer 最终就是一个熟悉的cache结构体 ``` // // cache responsibilities are limited to: // 1. Computing keys for objects via keyFunc // 2. Invoking methods of a ThreadSafeStorage interface type cache struct { // cacheStorage bears the burden of thread safety for the cache cacheStorage ThreadSafeStore // keyFunc is used to make the key for objects stored in and retrieved from items, and // should be deterministic. keyFunc KeyFunc } ```
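有了上面(1)中的NamespaceIndex默认索引,控制器里就可以按namespace直接查询本地缓存。下面是一个示意(假设podInformer是一个已经Run起来并完成同步的cache.SharedIndexInformer):

```
// 通过informer拿到底层的indexer(也就是上面的cache结构)
indexer := podInformer.GetIndexer()

// 按namespace索引, 直接从本地缓存取出default命名空间的所有Pod
objs, err := indexer.ByIndex(cache.NamespaceIndex, "default")
if err != nil {
	panic(err)
}
for _, obj := range objs {
	fmt.Printf("pod in default: %s\n", obj.(*corev1.Pod).Name)
}
```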
总结: 到这里就可以看出来一个Informer是如何定义本地存储+索引的。至于整个系统如何运转,看后面的Informer分析。 ================================================ FILE: k8s/client-go/7. informer机制详解.md ================================================ Table of Contents ================= * [1.章节介绍](#1章节介绍) * [2. cache.SharedIndexInformer结构介绍](#2-cachesharedindexinformer结构介绍) * [3. sharedIndexInformer.Run](#3-sharedindexinformerrun) * [3.1 NewDeltaFIFO](#31-newdeltafifo) * [3.1.1 DeltaFIFO的定位](#311-deltafifo的定位) * [3.1.2 DeltaFIFO结构介绍](#312--deltafifo结构介绍) * [3.1.3 举例说明deltaFifo核心结构](#313-举例说明deltafifo核心结构) * [3.2 sharedIndexInformer生产数据](#32-sharedindexinformer生产数据) * [3.2.1 controller结构](#321-controller结构) * [3.2.2 controller.run](#322-controllerrun) * [3.2.3 Reflector实例](#323-reflector实例) * [3.2.4 Reflector.run](#324-reflectorrun) * [3.2.5 ListAndWatch](#325-listandwatch) * [知识补充](#知识补充) * [源码分析](#源码分析) * [3.2.6 c.processLoop](#326-cprocessloop) * [HandleDeltas函数](#handledeltas函数) * [理解listeners和syncingListeners的区别](#理解listeners和syncinglisteners的区别) * [3.3 s.processor.run消费数据](#33-sprocessorrun消费数据) * [processorListener结构](#processorlistener结构) * [pod and run](#pod-and-run) * [4 参考](#4-参考) ### 1.章节介绍 从上一章节可以知道。利用informer机制可以非常简单地实现一个资源对象的控制器,具体步骤为 (1)new SharedInformerFactory实例,然后指定indexer,listWatch参数,就可以生成一个 cache.SharedIndexInformer 对象 (2)cache.SharedIndexInformer 实际是完成了下图中的informer机制 ![informer.png](../images/informer.png) 这一章节开始从SharedIndexInformer入手研究informer机制。 ### 2. cache.SharedIndexInformer结构介绍 ``` type sharedIndexInformer struct { indexer Indexer // 本地的缓存+索引机制,上一篇文章详解介绍了 controller Controller // 控制器,启动reflector, 这个controller包含reflector:根据用户定义的ListWatch方法获取对象并更新增量队列DeltaFIFO processor *sharedProcessor // 注册了add,update,del事件的listener集合 cacheMutationDetector CacheMutationDetector // 突变检测器 // This block is tracked to handle late initialization of the controller listerWatcher ListerWatcher // 定义了list, watch函数, 看podinformer那里就可以知道,是直接调用了client往apiserver发送了请求 objectType runtime.Object // 定义要List watch的对象类型。如果是Podinfomer,就是要传入core.v1.pod // resyncCheckPeriod is how often we want the reflector's resync timer to fire so it can call // shouldResync to check if any of our listeners need a resync. resyncCheckPeriod time.Duration // 给自己的controller的reflector每隔多少s<尝试>调用listener的shouldResync方法 // defaultEventHandlerResyncPeriod is the default resync period for any handlers added via // AddEventHandler (i.e. they don't specify one and just want to use the shared informer's default // value). defaultEventHandlerResyncPeriod time.Duration // 通过AddEventHandler注册的handler的默认同步值 // clock allows for testability clock clock.Clock started, stopped bool startedLock sync.Mutex // blockDeltas gives a way to stop all event distribution so that a late event handler // can safely join the shared informer. 
	blockDeltas sync.Mutex
}
```

SharedIndexInformer主要包括以下对象:

(1)indexer:图中右下角的indexer。上一节已经分析了具体的实现。

(2)Controller:图中左边的Controller,负责启动reflector,跑list-watch那一套机制。接下来重点分析。

(3)processor:图中最下面的listeners,所有往informer注册了ResourceEventHandler的都是一个listener。因为是共享informer,同一个informer可能被多处获取并注册了多个ResourceEventHandler。一般情况下,一个Informer对应一个listener。

```
type sharedProcessor struct {
	listenersStarted bool
	listenersLock    sync.RWMutex
	listeners        []*processorListener // 记录了informer添加的所有listener
	syncingListeners []*processorListener // 记录了informer中哪些listener处于sync状态。由resyncCheckPeriod参数控制。每隔resyncCheckPeriod秒,listener都需要重新同步一下,同步就是将listener变成syncingListeners。
	clock            clock.Clock
	wg               wait.Group
}
```

ResourceEventHandler结构体如下,它定义了Informer的add, update, del处理事件。

```
type ResourceEventHandler interface {
	OnAdd(obj interface{})
	OnUpdate(oldObj, newObj interface{})
	OnDelete(obj interface{})
}
```

(4)CacheMutationDetector:突变检测器,用来检测内存中对象是否发生了突变。测试的时候用,默认不开启,这里先不深入了解。
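listener是随着AddEventHandler一起注册进来的;如果希望某个handler有自己的resync周期(对应后文listeners与syncingListeners的区分),可以用AddEventHandlerWithResyncPeriod。下面是一个示意(假设sharedInformer是一个cache.SharedIndexInformer,例如上一篇SharedInformerFactory实例中podInformer.Informer()返回的对象):

```
// 同一个shared informer上注册两个handler, 对应processor中的两个listener

// 第一个使用informer默认的resync周期
sharedInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc: func(obj interface{}) { fmt.Println("listener-1 add") },
})

// 第二个单独指定30s的resync周期; 传0则表示该listener不需要resync
sharedInformer.AddEventHandlerWithResyncPeriod(cache.ResourceEventHandlerFuncs{
	UpdateFunc: func(oldObj, newObj interface{}) { fmt.Println("listener-2 update") },
}, 30*time.Second)
```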
### 3. sharedIndexInformer.Run k8s.io/client-go/tools/cache/shared_informer.go 在使用informer的时候,定义好sharedIndexInformer后,就直接运行了sharedIndexInformer.Run函数开始了整个Informer机制。 整个informer的运转逻辑就是: (1)deltaFIFO接收listAndWatch的全量/增量数据,然后通过pop函数发送到HandleDeltas函数中 (生产) (2)HandleDeltas将一个一个的事件发送到自定义的handlers 和 更新indexer缓存 (消费) 现在就沿着 Run这个函数入手,看看具体是如何实现的。sharedIndexInformer.Run主要逻辑如下: 1. new一个 deltafifo对象,并且指定对象的keyfun为 MetaNamespaceKeyFunc,就是用 ns/name 来当对象的key 2. 生成config,利用config 生成一个controller 3. 运行用户自定义handler的处理逻辑,s.processor.run (开启消费) 4. 运行controller.run,实现整体的运作逻辑 (开启生产) ``` func (s *sharedIndexInformer) Run(stopCh <-chan struct{}) { defer utilruntime.HandleCrash() // 1. new一个 deltafifo对象,并且指定对象的keyfun为 MetaNamespaceKeyFunc,就是用 ns/name 来当对象的key fifo := NewDeltaFIFO(MetaNamespaceKeyFunc, s.indexer) // 2. 生成config cfg := &Config{ Queue: fifo, ListerWatcher: s.listerWatcher, ObjectType: s.objectType, FullResyncPeriod: s.resyncCheckPeriod, // 同步周期 RetryOnError: false, ShouldResync: s.processor.shouldResync, // 这是个函数,用于判断自定义的handler是否需要同步 Process: s.HandleDeltas, // listwatch来了数据,如何处理的函数 } func() { s.startedLock.Lock() defer s.startedLock.Unlock() // 3. 利用config 生成一个controller s.controller = New(cfg) s.controller.(*controller).clock = s.clock s.started = true }() // Separate stop channel because Processor should be stopped strictly after controller processorStopCh := make(chan struct{}) var wg wait.Group defer wg.Wait() // Wait for Processor to stop defer close(processorStopCh) // Tell Processor to stop // 内存突变检测,忽略 wg.StartWithChannel(processorStopCh, s.cacheMutationDetector.Run) // 4. 运行用户自定义handler的处理逻辑 wg.StartWithChannel(processorStopCh, s.processor.run) defer func() { s.startedLock.Lock() defer s.startedLock.Unlock() s.stopped = true // Don't want any new listeners }() // 5.运行controller s.controller.Run(stopCh) } ``` #### 3.1 NewDeltaFIFO ##### 3.1.1 DeltaFIFO的定位 在apisever中的list-watch机制介绍中,就可以知道。直接使用list,watch api就可以获得全量和增量数据。 如果让我写一个最简单的client-go客户端,我实现的方式是: (1)定义一个本地存储cache,list的时候将数据放到cache中 (2)然后watch的时候就更新cache数据,然后再将对象发送到自定义的add, update, del handler函数中 需要cache的原因:本地缓存一份etcd数据,这样控制器需要访问数据的话,直接从本地拿。
以上可以实现一个很简陋的客户端,但是还远远达不到informer机制的要求。 informer机制为啥需要DeltaFIFO? (1)为啥需要FIFO队列? 很容易理解,FIFO是保障有序,不有序就会导致数据错乱。 队列是为了缓冲,如果更新的数据太多,informer机制可能就扛不住了 (2)为啥需要delta? FIFO队列的元素总共就两个去向。第一用于同步本地缓存。第二用于发送给自定义的add, update, del handler函数。 假设某个极短的时间内,某一个对象做了大量的update,最后被删除了。这样的话,FIFO队列其实是堆积了很多数据。 一个一个发送给handler函数没有问题,因为用户就想知道这个过程。但是如果是一个一个的更新本地缓存,最后又delete了,那前面的update就浪费了。 所以这个时候DeltaFIFO队列出现了。它解决了这个问题。 ##### 3.1.2 DeltaFIFO结构介绍 DeltaFIFO可以认为是一个特殊的FIFO队列。Delta就是k8s系统中对象的变化(增、删、改、同步)的一个标记。 增、删、改肯定是需要的,因为就算我们自己实现一个队列也需要当前是做了什么操作。 同步是重新List apiserver的时候需要的 ``` // 有着四种类型 // Change type definition const ( Added DeltaType = "Added" Updated DeltaType = "Updated" Deleted DeltaType = "Deleted" // The other types are obvious. You'll get Sync deltas when: // * A watch expires/errors out and a new list/watch cycle is started. // * You've turned on periodic syncs. // (Anything that trigger's DeltaFIFO's Replace() method.) Sync DeltaType = "Sync" ) // Delta is the type stored by a DeltaFIFO. It tells you what change // happened, and the object's state after* that change. // // [*] Unless the change is a deletion, and then you'll get the final // state of the object before it was deleted. type Delta struct { Type DeltaType Object interface{} //k8s中的对象 } // Deltas is a list of one or more 'Delta's to an individual object. // The oldest delta is at index 0, the newest delta is the last one. type Deltas []Delta type DeltaFIFO struct { // lock/cond protects access to 'items' and 'queue'. lock sync.RWMutex cond sync.Cond // We depend on the property that items in the set are in // the queue and vice versa, and that all Deltas in this // map have at least one Delta. items map[string]Deltas queue []string // populated和initialPopulationCount 是用来判断 process是否同步的 // populated is true if the first batch of items inserted by Replace() has been populated // or Delete/Add/Update was called first. populated bool //队列的元素开始被消费 // initialPopulationCount is the number of items inserted by the first call of Replace() initialPopulationCount int // keyFunc is used to make the key used for queued item // insertion and retrieval, and should be deterministic. keyFunc KeyFunc // knownObjects list keys that are "known", for the // purpose of figuring out which items have been deleted // when Replace() or Delete() is called. knownObjects KeyListerGetter // Indication the queue is closed. // Used to indicate a queue is closed so a control loop can exit when a queue is empty. // Currently, not used to gate any of CRED operations. 
closed bool closedLock sync.Mutex } ``` DeltaFIFO最关键的是, items, queue, 和knownObjects。 items: 对象的变化过程列表 Queue: 表示对象的key。 knownObjects:从下面的初始化可以看出来,就是 cache.indexer ``` fifo := NewDeltaFIFO(MetaNamespaceKeyFunc, s.indexer) func NewDeltaFIFO(keyFunc KeyFunc, knownObjects KeyListerGetter) *DeltaFIFO { f := &DeltaFIFO{ items: map[string]Deltas{}, queue: []string{}, keyFunc: keyFunc, knownObjects: knownObjects, } f.cond.L = &f.lock return f } ``` ##### 3.1.3 举例说明deltaFifo核心结构 假设监听了 default命名空间的所有Pod,最开始该命名空间没有Pod,然后监听了一会后,创建了三个Pod, 分别为: ``` pod1 := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "one", Annotations: map[string]string{"users": "ernie,bert"}}} pod2 := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "two", Annotations: map[string]string{"users": "bert,oscar"}}} pod3 := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "tre", Annotations: map[string]string{"users": "ernie,elmo"}}} ``` 那么watch函数依次会产生如下的事件: pod1-1:表示pod1对应的第一个阶段 (pending状态) pod1-2:表示pod1对应的第二个阶段 (scheduled状态) pod1-3:表示pod1对应的第三个阶段 (running状态) ``` ADD: pod1-1(省略模式,其实是整个pod的元数据,{ObjectMeta: metav1.ObjectMeta{Name: "one", Annotations: map[string]string{"users": "ernie,bert"}}}) ADD: pod2-1 MODIFIED: pod1-2 ADD: pod3-1 MODIFIED: pod2-2 MODIFIED: pod3-2 MODIFIED: pod1-3 MODIFIED: pod3-3 MODIFIED: pod2-3 ``` 这个时候 deltaFIFO结构体的对象为: deltaFIFO { ​ queue: ["one", "two", "tree"], ​ Items: { ​ "one": [{"add", pod1-1}, {"update", pod1-2}, {"update", pod1-3}], ​ "two": [{"add", pod2-1}, {"update", pod2-2}, {"update", pod2-3}], ​ "tre": [{"add", pod3-1}, {"update", pod3-2}, {"update", pod3-3}], ​ } } 这样的好处就是: (1)每次是以一个对象为单位进行发送,比如这里一次就将 "one": [{"add", pod1-1}, {"update", pod1-2}, {"update", pod1-3}] 三个事件发送给了 handler方 (2)indexer可以知道当前对象的最终状态。比如 "one": [{"add", pod1-1}, {"update", pod1-2}, {"update", pod1-3}], 这个,能跳过pod1-1, pod1-2状态,直接将pod1-3状态更新到缓存中去。
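DeltaFIFO"按对象聚合变化"的行为,可以用下面的小例子直观感受一下(仅为示意,knownObjects传nil,按本文对应版本的client-go调用NewDeltaFIFO):

```
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// knownObjects传nil, 这里只观察items/queue的聚合行为
	fifo := cache.NewDeltaFIFO(cache.MetaNamespaceKeyFunc, nil)

	pod1 := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "one"}}
	fifo.Add(pod1)    // items["one"] = [{Added, pod1}]
	fifo.Update(pod1) // items["one"] = [{Added, pod1}, {Updated, pod1}]

	// Pop一次取出的是同一个对象的整组Deltas
	fifo.Pop(func(obj interface{}) error {
		for _, d := range obj.(cache.Deltas) {
			fmt.Printf("%s: %s\n", d.Type, d.Object.(*v1.Pod).Name)
		}
		return nil
	})
	// 输出:
	// Added: one
	// Updated: one
}
```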
#### 3.2 sharedIndexInformer生产数据 都知道数据产生方来着 apisever的listAndWatch。接下来看看是如何使用的。这里直接从 controller.run入手。 ##### 3.2.1 controller结构 controller结构本身非常简单,主要就是一个config,然后根据config实现的一些生产数据相关的函数 ``` // New makes a new Controller from the given Config. func New(c *Config) Controller { ctlr := &controller{ config: *c, clock: &clock.RealClock{}, } return ctlr } // Config contains all the settings for a Controller. type Config struct { // The queue for your objects - has to be a DeltaFIFO due to // assumptions in the implementation. Your Process() function // should accept the output of this Queue's Pop() method. // 弄一个数据缓存 Queue // 从aipserver接收数据 ListerWatcher // Something that can process your objects. // 如何处理接收到的数据 Process ProcessFunc // The type of your objects. // 数据是什么类型,Pod? deploy? ObjectType runtime.Object FullResyncPeriod time.Duration // 是否需要同步 ShouldResync ShouldResyncFunc //是否错误重试 RetryOnError bool } ``` ##### 3.2.2 controller.run 1. 实例化 NewReflector 2. 通过List-watch获得生产数据 3. 处理生产数据,不断执行processLoop,这个方法其实就是从DeltaFIFO pop出对象,再调用reflector的Process(其实是shareIndexInformer的HandleDeltas方法)处理 ``` func (c *controller) Run(stopCh <-chan struct{}) { defer utilruntime.HandleCrash() go func() { <-stopCh c.config.Queue.Close() }() // 1.实例化 NewReflector r := NewReflector( c.config.ListerWatcher, c.config.ObjectType, c.config.Queue, c.config.FullResyncPeriod, ) r.ShouldResync = c.config.ShouldResync r.clock = c.clock c.reflectorMutex.Lock() c.reflector = r c.reflectorMutex.Unlock() var wg wait.Group defer wg.Wait() // 2. 通过List-watch获得生产数据 wg.StartWithChannel(stopCh, r.Run) // 3. 处理生产数据 // 不断执行processLoop,这个方法其实就是从DeltaFIFO pop出对象,再调用reflector的Process(其实是shareIndexInformer的HandleDeltas方法)处理 wait.Until(c.processLoop, time.Second, stopCh) } ``` ##### 3.2.3 Reflector实例 Reflector核心结构,可以看出来基本都是从config基础下来的。 ``` // Reflector watches a specified resource and causes all changes to be reflected in the given store. type Reflector struct { // name identifies this reflector. By default it will be a file:line if possible. name string // The name of the type we expect to place in the store. The name // will be the stringification of expectedGVK if provided, and the // stringification of expectedType otherwise. It is for display // only, and should not be used for parsing or comparison. expectedTypeName string // The type of object we expect to place in the store. expectedType reflect.Type // The GVK of the object we expect to place in the store if unstructured. expectedGVK *schema.GroupVersionKind // The destination to sync up with the watch source store Store //获得数据存放哪里,就是deltaFIFO队列 // listerWatcher is used to perform lists and watches. listerWatcher ListerWatcher // period controls timing between one watch ending and // the beginning of the next one. period time.Duration resyncPeriod time.Duration ShouldResync func() bool // clock allows tests to manipulate time clock clock.Clock // lastSyncResourceVersion is the resource version token last // observed when doing a sync with the underlying store // it is thread safe, but not synchronized with the underlying store lastSyncResourceVersion string // lastSyncResourceVersionMutex guards read/write access to lastSyncResourceVersion lastSyncResourceVersionMutex sync.RWMutex // WatchListPageSize is the requested chunk size of initial and resync watch lists. // Defaults to pager.PageSize. WatchListPageSize int64 } ```
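Reflector也可以脱离informer单独使用:给它一个ListWatch和一个目标Store,它就会把apiserver的数据同步进这个Store。下面是一个示意(假设clientset已经构造好,资源与参数仅为举例,省略import):

```
// ListWatch: 和podInformer里一样, 直接调用apiserver的list/watch接口
lw := cache.NewListWatchFromClient(
	clientset.CoreV1().RESTClient(), "pods", metav1.NamespaceAll, fields.Everything(),
)

// 目标Store: 这里用一个普通的本地缓存(informer里用的则是DeltaFIFO)
store := cache.NewStore(cache.MetaNamespaceKeyFunc)

// resyncPeriod为30s; Run内部会以period(1s)为间隔反复执行ListAndWatch
reflector := cache.NewReflector(lw, &v1.Pod{}, store, 30*time.Second)

stopCh := make(chan struct{})
go reflector.Run(stopCh)
```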
##### 3.2.4 Reflector.run 就是上面的r.un。就做一件事。运行listAndWatch函数。 注意:ListAndWatch函数是1s运行一次哟。 所以relist并不是listAndWatch干的。ListAndWatch只是进行一轮list 和 watch(正常情况会一直保持watch) 当ListAndWatch因为异常/错误或者其他原因退出了,Reflector会自动再次执行listAndWatch ``` // Run starts a watch and handles watch events. Will restart the watch if it is closed. // Run will exit when stopCh is closed. func (r *Reflector) Run(stopCh <-chan struct{}) { klog.V(3).Infof("Starting reflector %v (%s) from %s", r.expectedTypeName, r.resyncPeriod, r.name) wait.Until(func() { if err := r.ListAndWatch(stopCh); err != nil { utilruntime.HandleError(err) } }, r.period, stopCh) } NewReflector定义了period是1s // NewReflector creates a new Reflector object which will keep the given store up to // date with the server's contents for the given resource. Reflector promises to // only put things in the store that have the type of expectedType, unless expectedType // is nil. If resyncPeriod is non-zero, then lists will be executed after every // resyncPeriod, so that you can use reflectors to periodically process everything as // well as incrementally processing the things that change. func NewReflector(lw ListerWatcher, expectedType interface{}, store Store, resyncPeriod time.Duration) *Reflector { return NewNamedReflector(naming.GetNameFromCallsite(internalPackages...), lw, expectedType, store, resyncPeriod) } // NewNamedReflector same as NewReflector, but with a specified name for logging func NewNamedReflector(name string, lw ListerWatcher, expectedType interface{}, store Store, resyncPeriod time.Duration) *Reflector { r := &Reflector{ name: name, listerWatcher: lw, store: store, period: time.Second, // period是1s resyncPeriod: resyncPeriod, clock: &clock.RealClock{}, } r.setExpectedType(expectedType) return r } ``` ##### 3.2.5 ListAndWatch ###### 知识补充 listAndWatch核心思路就是:将apiserver list/watch到的数据发送到deltaFIFO队列中去。 在看代码之前,先通过curl kube-apiserver来看看 list-watch的特性。 (1)podList可以认为是一个新的对象,它也是有资源版本的说法 (2)list默认是用来chunk(分段传输)的,chunk的介绍和好处 https://zh.wikipedia.org/wiki/%E5%88%86%E5%9D%97%E4%BC%A0%E8%BE%93%E7%BC%96%E7%A0%81 (3)v1.19 及以上版本的 API 服务器支持 `resourceVersionMatch` 参数,用以确定如何对 LIST 调用应用 resourceVersion 值。 强烈建议在为 LIST 调用设置了 `resourceVersion` 时也设置 `resourceVersionMatch`。 如果 `resourceVersion` 未设置,则 `resourceVersionMatch` 是不允许设置的。 为了向后兼容,客户端必须能够容忍服务器在某些场景下忽略 `resourceVersionMatch` 的行为: - 当设置 `resourceVersionMatch=NotOlderThan` 且指定了 `limit` 时,客户端必须能够 处理 HTTP 410 "Gone" 响应。例如,客户端可以使用更新一点的 `resourceVersion` 来重试,或者回退到 `resourceVersion=""` (即允许返回任何版本)。 - 当设置了 `resourceVersionMatch=Exact` 且未指定 `limit` 时,客户端必须验证 响应数据中 `ListMeta` 的 `resourceVersion` 与所请求的 `resourceVersion` 匹配, 并处理二者可能不匹配的情况。例如,客户端可以重试设置了 `limit` 的请求。 除非你对一致性有着非常强烈的需求,使用 `resourceVersionMatch=NotOlderThan` 同时为 `resourceVersion` 设定一个已知值是优选的交互方式,因为与不设置 `resourceVersion` 和 `resourceVersionMatch` 相比,这种配置可以取得更好的 集群性能和可扩缩性。后者需要提供带票选能力的读操作。 参考:https://kubernetes.io/zh/docs/reference/using-api/api-concepts/ | resourceVersionMatch 参数 | 分页参数 | resourceVersion 未设置 | resourceVersion="0" | resourceVersion="<非零值>" | | ------------------------------------- | --------------------------- | ----------------------- | ------------------------------------- | ------------------------------------- | | resourceVersionMatch 未设置 | limit 未设置 | 最新版本 | 任意版本 | 不老于指定版本 | | resourceVersionMatch 未设置 | limit=, continue 未设置 | 最新版本 | 任意版本 | 精确匹配 | | resourceVersionMatch 未设置 | limit=, continue= | 从 token 开始、精确匹配 | 非法请求,视为从 token 开始、精确匹配 | 非法请求,返回 HTTP `400 Bad Request` | | resourceVersionMatch=Exact [1] | limit 未设置 | 非法请求 | 非法请求 | 精确匹配 | | 
resourceVersionMatch=Exact [1] | limit=, continue 未设置 | 非法请求 | 非法请求 | 精确匹配 | | resourceVersionMatch=NotOlderThan [1] | limit 未设置 | 非法请求 | 任意版本 | 不老于指定版本 | | resourceVersionMatch=NotOlderThan [1] | limit=, continue 未设置 | 非法请求 | 任意版本 | 不老于指定版本 | ``` // curl http://7.34.19.44:58201/api/v1/namespaces/test-test/pods -i HTTP/1.1 200 OK Audit-Id: 4ff9e833-e3e0-4001-9e1a-d83c9a9b1937 Cache-Control: no-cache, private Content-Type: application/json Date: Sat, 20 Nov 2021 02:10:48 GMT Transfer-Encoding: chunked { "kind": "PodList", "apiVersion": "v1", "metadata": { "selfLink": "/api/v1/namespaces/test-test/pods", "resourceVersion": "163916927" }, "items": [ root@cld-kmaster1-1051:/home/zouxiang# curl http://7.34.19.44:58201/api/v1/namespaces/test-test/pods?limit=1 -i HTTP/1.1 200 OK Audit-Id: 17d0d42f-a122-4c5a-9659-70224a22522a Cache-Control: no-cache, private Content-Type: application/json Date: Sat, 20 Nov 2021 02:09:32 GMT Transfer-Encoding: chunked //chunked传输 { "kind": "PodList", "apiVersion": "v1", "metadata": { "selfLink": "/api/v1/namespaces/test-test/pods", "resourceVersion": "163915936", // 注意这continue "continue": "eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTYzOTE1OTM2LCJzdGFydCI6ImFwcC1pc3Rpb3ZlcnNpb24tdGVzdC01NDZkZmZmNTYtNnQ2MnBcdTAwMDAifQ", "remainingItemCount": 23 //表示当前还有23个没有展示处理 }, "items": [ { "metadata": { "name": "app-istioversion-test-546dfff56-6t62p", "generateName": "app-istioversion-test-546dfff56-", // 加上这个continue参数,会把剩下的23个展示出来。 curl http://7.34.19.44:58201/api/v1/namespaces/test-test/pods?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTYzOTE1OTM2LCJzdGFydCI6ImFwcC1pc3Rpb3ZlcnNpb24tdGVzdC01NDZkZmZmNTYtNnQ2MnBcdTAwMDAifQ ```
watch很简单,就是一条chunked的长连接:

```
root@cld-kmaster1-1051:/home/zouxiang# curl http://7.34.19.44:58201/api/v1/namespaces/default/pods?watch=true -i
HTTP/1.1 200 OK
Cache-Control: no-cache, private
Content-Type: application/json
Date: Sat, 20 Nov 2021 01:32:06 GMT
Transfer-Encoding: chunked
```
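对应上面的curl,用client-go直接发起watch的写法大致如下(仅为示意;按本文对应的老版本client-go,List/Watch不带context,clientset假设已构造好):

```
// 先list拿到一个resourceVersion, 再从这个版本开始watch
podList, err := clientset.CoreV1().Pods("default").List(metav1.ListOptions{})
if err != nil {
	panic(err)
}

w, err := clientset.CoreV1().Pods("default").Watch(metav1.ListOptions{
	ResourceVersion: podList.ResourceVersion,
})
if err != nil {
	panic(err)
}
defer w.Stop()

// ResultChan上就是这条chunked长连接推送下来的事件流
for event := range w.ResultChan() {
	pod := event.Object.(*v1.Pod)
	fmt.Printf("%s: %s\n", event.Type, pod.Name)
}
```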
###### 源码分析 ``` // ListAndWatch first lists all items and get the resource version at the moment of call, // and then use the resource version to watch. // It returns error if ListAndWatch didn't even try to initialize watch. func (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error { klog.V(3).Infof("Listing and watching %v from %s", r.expectedTypeName, r.name) var resourceVersion string // Explicitly set "0" as resource version - it's fine for the List() // to be served from cache and potentially be delayed relative to // etcd contents. Reflector framework will catch up via Watch() eventually. // 以版本号ResourceVersion=0开始首次list options := metav1.ListOptions{ResourceVersion: "0"} if err := func() error { initTrace := trace.New("Reflector ListAndWatch", trace.Field{"name", r.name}) defer initTrace.LogIfLong(10 * time.Second) var list runtime.Object var err error listCh := make(chan struct{}, 1) panicCh := make(chan interface{}, 1) go func() { defer func() { if r := recover(); r != nil { panicCh <- r } }() // Attempt to gather list in chunks, if supported by listerWatcher, if not, the first // list request will return the full response. // 开始list数据,分页 pager := pager.New(pager.SimplePageFunc(func(opts metav1.ListOptions) (runtime.Object, error) { return r.listerWatcher.List(opts) })) if r.WatchListPageSize != 0 { pager.PageSize = r.WatchListPageSize } // Pager falls back to full list if paginated list calls fail due to an "Expired" error. // 获取list的数据 list, err = pager.List(context.Background(), options) close(listCh) }() select { case <-stopCh: return nil case r := <-panicCh: panic(r) case <-listCh: } if err != nil { return fmt.Errorf("%s: Failed to list %v: %v", r.name, r.expectedTypeName, err) } initTrace.Step("Objects listed") listMetaInterface, err := meta.ListAccessor(list) if err != nil { return fmt.Errorf("%s: Unable to understand list result %#v: %v", r.name, list, err) } resourceVersion = listMetaInterface.GetResourceVersion() initTrace.Step("Resource version extracted") // 提取list items, err := meta.ExtractList(list) if err != nil { return fmt.Errorf("%s: Unable to understand list result %#v (%v)", r.name, list, err) } initTrace.Step("Objects extracted") // 提取list的数据 if err := r.syncWith(items, resourceVersion); err != nil { return fmt.Errorf("%s: Unable to sync list result: %v", r.name, err) } initTrace.Step("SyncWith done") // 设置下一次list的resourceVersion r.setLastSyncResourceVersion(resourceVersion) initTrace.Step("Resource version updated") return nil }(); err != nil { return err } resyncerrc := make(chan error, 1) cancelCh := make(chan struct{}) defer close(cancelCh) go func() { resyncCh, cleanup := r.resyncChan() defer func() { cleanup() // Call the last one written into cleanup }() for { select { case <-resyncCh: case <-stopCh: return case <-cancelCh: return } if r.ShouldResync == nil || r.ShouldResync() { klog.V(4).Infof("%s: forcing resync", r.name) // 进行deltaFIFo的同步 if err := r.store.Resync(); err != nil { resyncerrc <- err return } } cleanup() resyncCh, cleanup = r.resyncChan() } }() for { // give the stopCh a chance to stop the loop, even in case of continue statements further down on errors select { case <-stopCh: return nil default: } timeoutSeconds := int64(minWatchTimeout.Seconds() * (rand.Float64() + 1.0)) options = metav1.ListOptions{ ResourceVersion: resourceVersion, // We want to avoid situations of hanging watchers. Stop any wachers that do not // receive any events within the timeout window. 
TimeoutSeconds: &timeoutSeconds, // To reduce load on kube-apiserver on watch restarts, you may enable watch bookmarks. // Reflector doesn't assume bookmarks are returned at all (if the server do not support // watch bookmarks, it will ignore this field). AllowWatchBookmarks: true, } // 开始Watch w, err := r.listerWatcher.Watch(options) if err != nil { switch err { case io.EOF: // watch closed normally case io.ErrUnexpectedEOF: klog.V(1).Infof("%s: Watch for %v closed with unexpected EOF: %v", r.name, r.expectedTypeName, err) default: utilruntime.HandleError(fmt.Errorf("%s: Failed to watch %v: %v", r.name, r.expectedTypeName, err)) } // If this is "connection refused" error, it means that most likely apiserver is not responsive. // It doesn't make sense to re-list all objects because most likely we will be able to restart // watch where we ended. // If that's the case wait and resend watch request. if utilnet.IsConnectionRefused(err) { time.Sleep(time.Second) continue } return nil } // 处理watch的事件 if err := r.watchHandler(w, &resourceVersion, resyncerrc, stopCh); err != nil { if err != errorStopRequested { switch { case apierrs.IsResourceExpired(err): klog.V(4).Infof("%s: watch of %v ended with: %v", r.name, r.expectedTypeName, err) default: klog.Warningf("%s: watch of %v ended with: %v", r.name, r.expectedTypeName, err) } } return nil } } } ``` 结合知识补充大概的流程很清楚。回答以下几个问题 (1)list操作为什么需要resoureversion? A: list机制本来就有resoureversion,resoureversion不同的值有不同的含义。每次list的时候记录了resoureversion,可以保证数据最少是上一次list后的(实际基本都是最新的) (2)为什么list会分页? 如果设置了limit就会分页 (3)如果提取list的数据 先是通过 items, err := meta.ExtractList(list) ,将list数据保持到items数组中 然后通过syncWith将List数据保持到 deltafIfo队列中去 syncWith的逻辑如下: (1)遍历所有list的数据,通过 queueActionLocked(Sync, item)将所有的数据,以(sync, item)的方式追加到 deltafifo的items里面 (2)遍历所有fIfo queue的数据,判断是否存下 fifo有,但是最新list没有的数据。如果存在这种情况。说明fifo漏到了delete请求,所以封装一个(delete, DeletedFinalStateUnknown) 到deltafifo的items里面。 为什么是DeletedFinalStateUnknown呢? 因为Replace方法可能是reflector发生re-list的时候再次调用,这个时候就会出现knownObjects中存在的对象不在Replace list的情况(比 如watch的delete事件丢失了),这个时候是把这些对象筛选出来,封装成DeletedFinalStateUnknown对象以Delete type类型再次加入 到deltaFIFO中,这样最终从detaFIFO处理这个DeletedFinalStateUnknown 增量时就可以更新本地缓存并且触发reconcile。 因为这个对 象最终的结构确实找不到了,所以只能用knownObjects里面的记录来封装delta,所以叫做FinalStateUnknown。 ``` // syncWith replaces the store's items with the given list. func (r *Reflector) syncWith(items []runtime.Object, resourceVersion string) error { found := make([]interface{}, 0, len(items)) for _, item := range items { found = append(found, item) } return r.store.Replace(found, resourceVersion) } // Replace will delete the contents of 'f', using instead the given map. // 'f' takes ownership of the map, you should not reference the map again // after calling this function. f's queue is reset, too; upon return, it // will contain the items in the map, in no particular order. func (f *DeltaFIFO) Replace(list []interface{}, resourceVersion string) error { f.lock.Lock() defer f.lock.Unlock() keys := make(sets.String, len(list)) // 第一次遍历list到的数据 for _, item := range list { key, err := f.KeyOf(item) if err != nil { return KeyError{item, err} } keys.Insert(key) // 2.将数据同步到fifo队列中去。这个就是往fifi的items加入元素。可以看出来,list的都是同步的数据 // items的delta有四种类型:add, update, del, sync (这里都是sync) if err := f.queueActionLocked(Sync, item); err != nil { return fmt.Errorf("couldn't enqueue object: %v", err) } } // 这个不存在,因为f.knownObjects=deltafifo if f.knownObjects == nil { // Do deletion detection against our own list. } // Detect deletions not already in the queue. 
knownKeys := f.knownObjects.ListKeys() queuedDeletions := 0 // 第二次遍历fifo中队列的数据 for _, k := range knownKeys { // 如果fifo中的数据,List也有,那就不用管,因为上面的for循环已经处理了 if keys.Has(k) { continue } // 如果fifo中的数据,list没有,那就是该数据已经删除了,但是由于某些原因,缓存没有收到,所以要删除这个队形 deletedObj, exists, err := f.knownObjects.GetByKey(k) if err != nil { deletedObj = nil klog.Errorf("Unexpected error %v during lookup of key %v, placing DeleteFinalStateUnknown marker without object", err, k) } else if !exists { deletedObj = nil klog.Infof("Key %v does not exist in known objects store, placing DeleteFinalStateUnknown marker without object", k) } queuedDeletions++ // 发送的是delete的delta,主要这里是DeletedFinalStateUnknown 因为Replace方法可能是reflector发生re-list的时候再次调用,这个时候就会出现knownObjects中存在的对象不在Replace list的情况(比如watch的delete事件丢失了),这个时候是把这些对象筛选出来,封装成DeletedFinalStateUnknown对象以Delete type类型再次加入到deltaFIFO中,这样最终从detaFIFO处理这个DeletedFinalStateUnknown 增量时就可以更新本地缓存并且触发reconcile。 因为这个对象最终的结构确实找不到了,所以只能用knownObjects里面的记录来封装delta,所以叫做FinalStateUnknown。 if err := f.queueActionLocked(Deleted, DeletedFinalStateUnknown{k, deletedObj}); err != nil { return err } } if !f.populated { f.populated = true f.initialPopulationCount = len(list) + queuedDeletions } return nil } ``` ##### 3.2.6 c.processLoop list, watch将apiserver获取的数据最终都保存到了 deltafifo队列中去 processLoop将数据进行了分发处理。 processLoop就是将一个个元素拿出来, ``` func (c *controller) processLoop() { for { // for循环的方式将fifo队列中的元素发送到 PopProcessFunc函数中去 obj, err := c.config.Queue.Pop(PopProcessFunc(c.config.Process)) // 在new config的时候指定了process= cfg :=HandleDeltas 函数 } if err != nil { if err == ErrFIFOClosed { return } if c.config.RetryOnError { // This is the safe way to re-enqueue. c.config.Queue.AddIfNotPresent(obj) } } } } func (f *DeltaFIFO) Pop(process PopProcessFunc) (interface{}, error) { f.lock.Lock() defer f.lock.Unlock() for { // 1.队列为空,判断是否关闭,如果没有关闭就等待,否则返回 for len(f.queue) == 0 { // When the queue is empty, invocation of Pop() is blocked until new item is enqueued. // When Close() is called, the f.closed is set and the condition is broadcasted. // Which causes this loop to continue and return from the Pop(). if f.IsClosed() { return nil, ErrFIFOClosed } f.cond.Wait() } // 2.取出来第一个元素, 注意是 queue里面的一个元素,对应的是Items里面的一个 map key-value对 id := f.queue[0] f.queue = f.queue[1:] if f.initialPopulationCount > 0 { f.initialPopulationCount-- } item, ok := f.items[id] if !ok { // Item may have been deleted subsequently. continue } delete(f.items, id) // 3.调用process进行处理 err := process(item) if e, ok := err.(ErrRequeue); ok { f.addIfNotPresent(id, item) err = e.Err } // Don't need to copyDeltas here, because we're transferring // ownership to the caller. 
return item, err } } ``` ###### HandleDeltas函数 终于出现了HandleDeltas, 如图中HandleDeltas功能所示: HandleDeltas就是干两件事情: (1)更新Indexer (这里很奇怪,没有一次性更新Indexer到位,就是如果Deltas最后一个是del事件,还是会先update后再删除) (2)将事件进行distribute发送 ``` func (s *sharedIndexInformer) HandleDeltas(obj interface{}) error { s.blockDeltas.Lock() defer s.blockDeltas.Unlock() // from oldest to newest for _, d := range obj.(Deltas) { switch d.Type { // 同步就是relist的时候,fifo replace函数发出来的事件 case Sync, Added, Updated: isSync := d.Type == Sync s.cacheMutationDetector.AddObject(d.Object) if old, exists, err := s.indexer.Get(d.Object); err == nil && exists { if err := s.indexer.Update(d.Object); err != nil { return err } s.processor.distribute(updateNotification{oldObj: old, newObj: d.Object}, isSync) } else { if err := s.indexer.Add(d.Object); err != nil { return err } s.processor.distribute(addNotification{newObj: d.Object}, isSync) } case Deleted: if err := s.indexer.Delete(d.Object); err != nil { return err } s.processor.distribute(deleteNotification{oldObj: d.Object}, false) } } return nil } ```
distribute就很简单,将事件进行发送,这里有一个很简单的逻辑: 就是注册resourceHandler的时候,可以指定是否需要同步。比如我New一个informer,然后指定不同步。 这个时候我对应的resourceHandler就不是syncingListeners. ###### 理解listeners和syncingListeners的区别 processor可以支持listener的维度配置是否需要resync:一个informer可以配置多个EventHandler,而一个EventHandler对应processor中的一个listener,每个listener可以配置需不需要resync,如果某个listener需要resync,那么添加到deltaFIFO的Sync增量最终也只会回到对应的listener reflector中会定时判断每一个listener是否需要进行resync,判断的依据是看配置EventHandler的时候指定的resyncPeriod,0代表该listener不需要resync,否则就每隔resyncPeriod看看是否到时间了 - listeners:记录了informer添加的所有listener - syncingListeners:记录了informer中哪些listener处于sync状态 syncingListeners是listeners的子集,syncingListeners记录那些开启了resync且时间已经到达了的listener,把它们放在一个独立的slice是避免下面分析的distribute方法中把obj增加到了还不需要resync的listener中 ``` func (p *sharedProcessor) distribute(obj interface{}, sync bool) { p.listenersLock.RLock() defer p.listenersLock.RUnlock() if sync { for _, listener := range p.syncingListeners { listener.add(obj) } } else { for _, listener := range p.listeners { listener.add(obj) } } } add 就是往 addch chan发送数据 虽然p.addCh是一个无缓冲的channel,但是因为listener中存在ring buffer,所以这里并不会一直阻塞 func (p *processorListener) add(notification interface{}) { p.addCh <- notification } ``` #### 3.3 s.processor.run消费数据 sharedIndexInformer.Run指定了controller.run进行数据生产:就是将List, watch到的数据,以delta的方式保存到了deltafifo中 然后HandleDeltas 通过 distribute 函数将 delta变量发送到每一个 listener中去。 接下来分析s.processor.run是如何消费数据的。 s.processor.run的逻辑很清楚。启动每一个listener,run and pop。 ``` func (p *sharedProcessor) run(stopCh <-chan struct{}) { func() { p.listenersLock.RLock() defer p.listenersLock.RUnlock() for _, listener := range p.listeners { p.wg.Start(listener.run) p.wg.Start(listener.pop) } p.listenersStarted = true }() <-stopCh p.listenersLock.RLock() defer p.listenersLock.RUnlock() for _, listener := range p.listeners { close(listener.addCh) // Tell .pop() to stop. .pop() will tell .run() to stop } p.wg.Wait() // Wait for all .pop() and .run() to stop } ``` ###### processorListener结构 ``` type processorListener struct { nextCh chan interface{} // 发送给handler的对象 addCh chan interface{} // distribute发送下来的对象 handler ResourceEventHandler //定义informer时候的 add, update, del函数 // pendingNotifications is an unbounded ring buffer that holds all notifications not yet distributed. // There is one per listener, but a failing/stalled listener will have infinite pendingNotifications // added until we OOM. // TODO: This is no worse than before, since reflectors were backed by unbounded DeltaFIFOs, but // we should try to do something better. pendingNotifications buffer.RingGrowing // 缓存器,避免distribute发送的太快或者 hanlder处理的太慢 // requestedResyncPeriod is how frequently the listener wants a full resync from the shared informer requestedResyncPeriod time.Duration // 同步周期 // resyncPeriod is how frequently the listener wants a full resync from the shared informer. This // value may differ from requestedResyncPeriod if the shared informer adjusts it to align with the // informer's overall resync check period. 
resyncPeriod time.Duration // nextResync is the earliest time the listener should get a full resync nextResync time.Time // resyncLock guards access to resyncPeriod and nextResync resyncLock sync.Mutex } ``` ###### pop and run pop就是将addCh 的对象发送到 nextCh。如果nextCh暂时消费不过来,就先缓存在pendingNotifications中 run就是将nextCh的对象发送到 handler中去处理。 ``` func (p *processorListener) pop() { defer utilruntime.HandleCrash() defer close(p.nextCh) // Tell .run() to stop var nextCh chan<- interface{} var notification interface{} for { select { case nextCh <- notification: // Notification dispatched var ok bool notification, ok = p.pendingNotifications.ReadOne() if !ok { // Nothing to pop nextCh = nil // Disable this select case } case notificationToAdd, ok := <-p.addCh: if !ok { return } if notification == nil { // No notification to pop (and pendingNotifications is empty) // Optimize the case - skip adding to pendingNotifications notification = notificationToAdd nextCh = p.nextCh } else { // There is already a notification waiting to be dispatched p.pendingNotifications.WriteOne(notificationToAdd) } } } } func (p *processorListener) run() { // this call blocks until the channel is closed. When a panic happens during the notification // we will catch it, **the offending item will be skipped!**, and after a short delay (one second) // the next notification will be attempted. This is usually better than the alternative of never // delivering again. stopCh := make(chan struct{}) wait.Until(func() { // this gives us a few quick retries before a long pause and then a few more quick retries err := wait.ExponentialBackoff(retry.DefaultRetry, func() (bool, error) { for next := range p.nextCh { switch notification := next.(type) { case updateNotification: p.handler.OnUpdate(notification.oldObj, notification.newObj) case addNotification: p.handler.OnAdd(notification.newObj) case deleteNotification: p.handler.OnDelete(notification.oldObj) default: utilruntime.HandleError(fmt.Errorf("unrecognized notification: %T", next)) } } // the only way to get here is if the p.nextCh is empty and closed return true, nil }) // the only way to get here is if the p.nextCh is empty and closed if err == nil { close(stopCh) } }, 1*time.Minute, stopCh) } ``` ### 4. 总结 (1)使用SharedInformerFactory机制可以共享informer (2)Informer的核心就是下面的reflector机制,运转流程为: * 通过kube-apiserver的listAndWatch,监听到etcd的资源变化 * 内部通过deltaFIFO队列更好的分发处理这些资源变化 * deltaFIFO除了原封不动的继承kube-apiserver 的add/update/delete事件(这个是数据库元素的变化)外,还会增加一个sync动作。这个是重新list的时候,FIFO通过replace函数加的。 * 核心的处理函数是HandleDeltas函数,它对这些资源变化进行处理分发,核心逻辑如下: * informer本身会自带indexer, 不管你使不使用,这是一个本地的缓存 * 对于一个资源来说,HandleDeltas会首先更新本地的indexer缓存。然后再将资源变化发给每个listener。注意: (1)kube-apiserver 的add/update/delete事件,不一定是listener看到的事件。比如一个apiserver update事件,如果indexer没有数据,那么下发给listener的时候就是一个add事件 (2)informer可以指定resyncPeriod,表示indexer的数据会按这个周期定期重新同步一遍全量数据,这些就是sync事件。sync事件只会下发给需要sync的listener(syncingListeners)。 ![informer.png](../images/informer.png) ### 5.参考 https://jimmysong.io/kubernetes-handbook/develop/client-go-informer-sourcecode-analyse.html ================================================ FILE: k8s/client-go/8. client-go的workqueue详解.md ================================================ Table of Contents ================= * [1. 章节介绍](#1-章节介绍) * [2. workerqueue介绍](#2-workerqueue介绍) * [2.1 queue](#21-queue) * [2.1.1 queue接口](#211-queue接口) * [add](#add) * [get](#get) * [done](#done) * [2.2 DelayingQueue-延迟队列](#22-delayingqueue-延迟队列) * [2.2.1 waitFor](#221-waitfor) * [2.2. 
2 NewNamedDelayingQueue](#22-2-newnameddelayingqueue) * [2.2.3 waitingLoop](#223-waitingloop) * [2.2.4](#224) * [2.2.5 总结](#225-总结) * [2.3 RateLimitingQueue-限速队列](#23-ratelimitingqueue-限速队列) * [2.3.1 RateLimiting结构体](#231-ratelimiting结构体) * [2.3.2 限速器类型](#232-限速器类型) * [BucketRateLimiter](#bucketratelimiter) * [ItemExponentialFailureRateLimiter](#itemexponentialfailureratelimiter) * [ItemFastSlowRateLimiter](#itemfastslowratelimiter) * [MaxOfRateLimiter](#maxofratelimiter) * [WithMaxWaitRateLimiter](#withmaxwaitratelimiter) * [3.总结](#3总结) * [4. 参考文档](#4-参考文档) ### 1. 章节介绍 在介绍完Informer机制后,可以发现如果想自定义控制器非常简单,我们直接注册handler就行。但是绝大部分k8s原生控制器中,handler并没有直接处理。而是统一遵守一套: Add , update, Del -> queue -> run -> runWorker -> syncHandler 处理的模式。 例如 namespaces控制器中: ``` // 1.先是定义了一个限速队列 queue: workqueue.NewNamedRateLimitingQueue(nsControllerRateLimiter(), "namespace"), // 2.然后add, update都是入队列 // configure the namespace informer event handlers namespaceInformer.Informer().AddEventHandlerWithResyncPeriod( cache.ResourceEventHandlerFuncs{ AddFunc: func(obj interface{}) { namespace := obj.(*v1.Namespace) namespaceController.enqueueNamespace(namespace) }, UpdateFunc: func(oldObj, newObj interface{}) { namespace := newObj.(*v1.Namespace) namespaceController.enqueueNamespace(namespace) }, }, resyncPeriod, ) // 3.然后controller.run,启动多个协程 // Run starts observing the system with the specified number of workers. func (nm *NamespaceController) Run(workers int, stopCh <-chan struct{}) { for i := 0; i < workers; i++ { go wait.Until(nm.worker, time.Second, stopCh) } <-stopCh } // 4. worker处理一个个数据 func (nm *NamespaceController) worker() { // 得到对象 key, quit := nm.queue.Get() // 处理完对象 defer nm.queue.Done(key) err := nm.syncNamespaceFromKey(key.(string)) if err == nil { // no error, forget this entry and return nm.queue.Forget(key) return false } } ``` 可以看出来这一套的一个好处: (1)利用了Indexer本地缓存机制,queue里面只包括 key就行。数据indexer都有 (2)workqueue除了一个缓冲机制外,还有着错误重试的机制 因此这一节分析一下,client-go提供了哪些workqueue ### 2. workerqueue介绍 client-go 的 `util/workqueue` 包里主要有三个队列,分别是普通队列,延时队列,限速队列,后一个队列以前一个队列的实现为基础,层层添加新功能,我们按照 Queue、DelayingQueue、RateLimitingQueue 的顺序层层拨开来看限速队列是如何实现的。 #### 2.1 queue ##### 2.1.1 queue接口 ``` type Interface interface { Add(item interface{}) // 添加一个元素 Len() int // 元素个数 Get() (item interface{}, shutdown bool) // 获取一个元素,第二个返回值和 channel 类似,标记队列是否关闭了 Done(item interface{}) // 标记一个元素已经处理完 ShutDown() // 关闭队列 ShuttingDown() bool // 是否正在关闭 } type Type struct { queue []t // 定义元素的处理顺序,里面所有元素都应该在 dirty set 中有,而不能出现在 processing set 中 dirty set // 标记所有需要被处理的元素 processing set // 当前正在被处理的元素,当处理完后需要检查该元素是否在 dirty set 中,如果有则添加到 queue 里 cond *sync.Cond // 条件锁 shuttingDown bool // 是否正在关闭 metrics queueMetrics unfinishedWorkUpdatePeriod time.Duration clock clock.Clock } ``` 这个 Queue 的工作逻辑大致是这样,里面的三个属性 queue、dirty、processing 都保存 items,但是含义有所不同: - queue:这是一个 []t 类型,也就是一个切片,因为其有序,所以这里当作一个列表来存储 item 的处理顺序。 - dirty:这是一个 set 类型,也就是一个集合,这个集合存储的是所有需要处理的 item,这些 item 也会保存在 queue 中,但是 set 里是无序的,set 的特性是唯一。可以认为dirty就是queue的不同实现, queue是为了有序,set是为了保证元素唯一。 - processing:这也是一个 set,存放的是当前正在处理的 item,也就是说这个 item 来自 queue 出队的元素,同时这个元素会被从 dirty 中删除。 目前看这些还有些懵,直接看看queue的核心函数。 ###### add 从这里就可以看出来,queue函数进行了过滤。比如我更新了pod1三次。 ``` pod1 := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "one", Annotations: map[string]string{"users": "ernie,bert"}}} ``` informer的distrube函数会发送三个更新事件,queue也会收到三个更新事件,但是queue里面只会有一个 one(pod1的key)。 为什么只需要保留一个就行? 
因为indexer已经更新了,indexer的数据是最新的。所以从这里也可以看出来,使用这一套逻辑,就没有update ,add, delete等区别了。 如果我想统计一下,每个Pod变化了多少次,那就不能使用 workqueue了,必须在handler那里直接实现。 ``` // Add marks item as needing processing. func (q *Type) Add(item interface{}) { q.cond.L.Lock() defer q.cond.L.Unlock() if q.shuttingDown { return } // dirty set 中已经有了该 item,则返回 if q.dirty.has(item) { return } q.metrics.add(item) q.dirty.insert(item) // 如果正在处理,也直接返回 if q.processing.has(item) { return } // 否则就扔进queue队列 q.queue = append(q.queue, item) q.cond.Signal() } ``` ###### get get会将元素从queue队列去列,表示这个元素,正在处理中。 dirty和queue保持一致,也会删除这个元素。 ``` // get是从 queue队列中取出一个元素(queue中删除,dirty中删除) // 并且标记它正在处理, func (q *Type) Get() (item interface{}, shutdown bool) { q.cond.L.Lock() defer q.cond.L.Unlock() for len(q.queue) == 0 && !q.shuttingDown { q.cond.Wait() } if len(q.queue) == 0 { // We must be shutting down. return nil, true } item, q.queue = q.queue[0], q.queue[1:] q.metrics.get(item) q.processing.insert(item) q.dirty.delete(item) return item, false } ``` ###### done done表明这个元素被处理完了,从processing队列删除。这里加了一个判断,如果dirty中还存在,还要将其加入 queue 为什么需要这个判断呢? 原因在于有一种请求是 itemA 正在处理,但是还没done,这个时候又来了一次 itemA。 这个时候add 逻辑中,是直接返回的,不会添加itemA到queue的。所以这里要重新添加一次 ``` // Done marks item as done processing, and if it has been marked as dirty again // while it was being processed, it will be re-added to the queue for // re-processing. func (q *Type) Done(item interface{}) { q.cond.L.Lock() defer q.cond.L.Unlock() q.metrics.done(item) q.processing.delete(item) // 判断dirty是否有该元素 if q.dirty.has(item) { q.queue = append(q.queue, item) q.cond.Signal() } } ```
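把上面 Add/Get/Done 的行为串起来,下面是一个最小的使用示意(基于 k8s.io/client-go/util/workqueue,key 用字符串代替,只演示去重和 Get/Done 的配合):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.New()
	defer q.ShutDown()

	// 同一个 key 重复 Add,只会在队列里保留一份(dirty set 去重)
	q.Add("default/pod1")
	q.Add("default/pod1")
	q.Add("default/pod2")
	fmt.Println("len:", q.Len()) // 输出 2

	for q.Len() > 0 {
		key, shutdown := q.Get()
		if shutdown {
			return
		}
		// 实际控制器里一般是拿着 key 去 indexer/lister 查最新对象再处理
		fmt.Println("processing:", key)
		// 处理完必须 Done,否则该 key 一直停留在 processing 集合里
		q.Done(key)
	}
}
```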
#### 2.2 DelayingQueue-延迟队列 ``` // delayingType wraps an Interface and provides delayed re-enquing type delayingType struct { Interface //上面的通用队列 clock clock.Clock // 时钟,用于获取时间 stopCh chan struct{} // 延时就意味着异步,就要有另一个协程处理,所以需要退出信号 stopOnce sync.Once // 用来确保 ShutDown() 方法只执行一次 heartbeat clock.Ticker // 定时器,在没有任何数据操作时可以定时的唤醒处理协程 waitingForAddCh chan *waitFor // 所有延迟添加的元素封装成waitFor放到chan中 metrics retryMetrics } type DelayingInterface interface { Interface // AddAfter adds an item to the workqueue after the indicated duration has passed AddAfter(item interface{}, duration time.Duration) } ``` ##### 2.2.1 waitFor ``` type waitFor struct { data t // 准备添加到队列中的数据 readyAt time.Time // 应该被加入队列的时间 index int // 在 heap 中的索引 } ``` waitForPriorityQueue是一个数组,实现了最小堆,对比的就是延迟的时间。 ``` type waitForPriorityQueue []*waitFor // heap需要实现的接口,告知队列长度 func (pq waitForPriorityQueue) Len() int { return len(pq) } // heap需要实现的接口,告知第i个元素是否比第j个元素小 func (pq waitForPriorityQueue) Less(i, j int) bool { return pq[i].readyAt.Before(pq[j].readyAt) // 此处对比的就是时间,所以排序按照时间排序 } // heap需要实现的接口,实现第i和第j个元素换 func (pq waitForPriorityQueue) Swap(i, j int) { // 这种语法好牛逼,有没有,C/C++程序猿没法理解~ pq[i], pq[j] = pq[j], pq[i] pq[i].index = i // 因为heap没有所以,所以需要自己记录索引,这也是为什么waitFor定义索引参数的原因 pq[j].index = j } // heap需要实现的接口,用于向队列中添加数据 func (pq *waitForPriorityQueue) Push(x interface{}) { n := len(*pq) item := x.(*waitFor) item.index = n // 记录索引值 *pq = append(*pq, item) // 放到了数组尾部 } // heap需要实现的接口,用于从队列中弹出最后一个数据 func (pq *waitForPriorityQueue) Pop() interface{} { n := len(*pq) item := (*pq)[n-1] item.index = -1 *pq = (*pq)[0:(n - 1)] // 缩小数组,去掉了最后一个元素 return item } // 返回第一个元素 func (pq waitForPriorityQueue) Peek() interface{} { return pq[0] } ``` 到这里就可以大概猜出来延迟队列的实现了。 就是所有添加的元素,有一个延迟时间,根据延迟时间构造一个最小堆。然后每次时间一到,从堆里面拿出来当前应该加入队列的时间。
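如果对 container/heap 不太熟,可以用下面这个裁剪过的小例子感受一下"按 readyAt 排序的最小堆"(与 waitForPriorityQueue 思路相同,只保留了演示需要的部分):

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

type item struct {
	data    string
	readyAt time.Time
}

// 按 readyAt 从小到大排序的最小堆
type pq []*item

func (q pq) Len() int            { return len(q) }
func (q pq) Less(i, j int) bool  { return q[i].readyAt.Before(q[j].readyAt) }
func (q pq) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *pq) Push(x interface{}) { *q = append(*q, x.(*item)) }
func (q *pq) Pop() interface{} {
	old := *q
	n := len(old)
	it := old[n-1]
	*q = old[:n-1]
	return it
}

func main() {
	now := time.Now()
	q := &pq{}
	heap.Init(q)
	heap.Push(q, &item{"c", now.Add(3 * time.Second)})
	heap.Push(q, &item{"a", now.Add(1 * time.Second)})
	heap.Push(q, &item{"b", now.Add(2 * time.Second)})

	// 依次弹出 a、b、c:readyAt 最早的元素永远在堆顶
	for q.Len() > 0 {
		fmt.Println(heap.Pop(q).(*item).data)
	}
}
```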
##### 2.2. 2 NewNamedDelayingQueue ```go // 这里可以传递一个名字 func NewNamedDelayingQueue(name string) DelayingInterface { return NewDelayingQueueWithCustomClock(clock.RealClock{}, name) } // 上面一个函数只是调用当前函数,附带一个名字,这里加了一个指定 clock 的能力 func NewDelayingQueueWithCustomClock(clock clock.Clock, name string) DelayingInterface { return newDelayingQueue(clock, NewNamed(name), name) // 注意这里的 NewNamed() 函数 } func newDelayingQueue(clock clock.Clock, q Interface, name string) *delayingType { ret := &delayingType{ Interface: q, clock: clock, heartbeat: clock.NewTicker(maxWait), // 10s 一次心跳 stopCh: make(chan struct{}), waitingForAddCh: make(chan *waitFor, 1000), metrics: newRetryMetrics(name), } go ret.waitingLoop() // 核心就是运行 waitingLoop return ret } ``` ##### 2.2.3 waitingLoop ``` func (q *delayingType) waitingLoop() { defer utilruntime.HandleCrash() // 队列里没有 item 时实现等待用的 never := make(<-chan time.Time) var nextReadyAtTimer clock.Timer // 构造一个优先级队列 waitingForQueue := &waitForPriorityQueue{} heap.Init(waitingForQueue) // 这一行其实是多余的,等下提个 pr 给它删掉 // 这个 map 用来处理重复添加逻辑的,下面会讲到 waitingEntryByData := map[t]*waitFor{} // 无限循环 for { // 这个地方 Interface 是多余的,等下也提个 pr 把它删掉吧 if q.Interface.ShuttingDown() { return } now := q.clock.Now() // 队列里有 item 就开始循环 for waitingForQueue.Len() > 0 { // 获取第一个 item entry := waitingForQueue.Peek().(*waitFor) // 时间还没到,先不处理 if entry.readyAt.After(now) { break } // 时间到了,pop 出第一个元素;注意 waitingForQueue.Pop() 是最后一个 item,heap.Pop() 是第一个元素 entry = heap.Pop(waitingForQueue).(*waitFor) // 将数据加到延时队列里 q.Add(entry.data) // map 里删除已经加到延时队列的 item delete(waitingEntryByData, entry.data) } // 如果队列中有 item,就用第一个 item 的等待时间初始化计时器,如果为空则一直等待 nextReadyAt := never if waitingForQueue.Len() > 0 { if nextReadyAtTimer != nil { nextReadyAtTimer.Stop() } entry := waitingForQueue.Peek().(*waitFor) nextReadyAtTimer = q.clock.NewTimer(entry.readyAt.Sub(now)) nextReadyAt = nextReadyAtTimer.C() } select { case <-q.stopCh: return case <-q.heartbeat.C(): // 心跳时间是 10s,到了就继续下一轮循环 case <-nextReadyAt: // 第一个 item 的等到时间到了,继续下一轮循环 case waitEntry := <-q.waitingForAddCh: // waitingForAddCh 收到新的 item // 如果时间没到,就加到优先级队列里,如果时间到了,就直接加到延时队列里 if waitEntry.readyAt.After(q.clock.Now()) { insert(waitingForQueue, waitingEntryByData, waitEntry) } else { q.Add(waitEntry.data) } // 下面的逻辑就是将 waitingForAddCh 中的数据处理完 drained := false for !drained { select { case waitEntry := <-q.waitingForAddCh: if waitEntry.readyAt.After(q.clock.Now()) { insert(waitingForQueue, waitingEntryByData, waitEntry) } else { q.Add(waitEntry.data) } default: drained = true } } } } } ``` ##### 2.2.4 这个方法的作用是在指定的延时到达之后,在 work queue 中添加一个元素,源码如下: ``` func (q *delayingType) AddAfter(item interface{}, duration time.Duration) { if q.ShuttingDown() { // 已经在关闭中就直接返回 return } q.metrics.retry() if duration <= 0 { // 如果时间到了,就直接添加 q.Add(item) return } select { case <-q.stopCh: // 构造 waitFor{},丢到 waitingForAddCh case q.waitingForAddCh <- &waitFor{data: item, readyAt: q.clock.Now().Add(duration)}: } } 其实就是一个往堆加入元素的过程 func insert(q *waitForPriorityQueue, knownEntries map[t]*waitFor, entry *waitFor) { // 这里的主要逻辑是看一个 entry 是否存在,如果已经存在,新的 entry 的 ready 时间更短,就更新时间 existing, exists := knownEntries[entry.data] if exists { if existing.readyAt.After(entry.readyAt) { existing.readyAt = entry.readyAt // 如果存在就只更新时间 heap.Fix(q, existing.index) } return } // 如果不存在就丢到 q 里,同时在 map 里记录一下,用于查重 heap.Push(q, entry) knownEntries[entry.data] = entry } ```
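结合上面的 AddAfter 和 waitingLoop,延迟队列的用法大致如下(一个假设的小例子,只演示 AddAfter 的语义):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.NewDelayingQueue()
	defer q.ShutDown()

	start := time.Now()
	// 2 秒之后,waitingLoop 才会把这个 key 真正 Add 到底层的 queue 里
	q.AddAfter("default/pod1", 2*time.Second)

	// Get 会一直阻塞,直到元素真正入队
	key, _ := q.Get()
	fmt.Printf("got %v after %v\n", key, time.Since(start)) // 大约 2s
	q.Done(key)
}
```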
##### 2.2.5 总结 (1)延迟队列的核心就是,根据加入队列的时间,构造一个最小堆,然后再到时间点后,将其加入queue中 (2)上诉判断是否到时间点,不仅仅是一个for循环,还利用了心跳,channel机制 (3)当某个对象处理的时候失败了,可以利用延迟队列的思想,等一会再重试,因为马上重试肯定是失败的 #### 2.3 RateLimitingQueue-限速队列 ##### 2.3.1 RateLimiting结构体 ``` type RateLimitingInterface interface { DelayingInterface //延迟队列 AddRateLimited(item interface{}) //已限速方式,往队列添加一个元素 // 标记介绍重试 Forget(item interface{}) // 重试了几次 NumRequeues(item interface{}) int } // rateLimitingType wraps an Interface and provides rateLimited re-enquing type rateLimitingType struct { DelayingInterface rateLimiter RateLimiter //多了一个限速器 } ``` ##### 2.3.2 限速器类型 可以看出来,限速队列和 延迟队列是一模一样的。 延迟队列是自己决定 某个元素延迟多久。 而限速队列是 有限速器决定 某个元素延迟多久。 ``` type RateLimiter interface { // 输入一个对象,判断延迟多久 When(item interface{}) time.Duration // 标记介绍重试 Forget(item interface{}) // 重试了几次 NumRequeues(item interface{}) int } ``` 这个接口有五个实现,分别为: 1. *BucketRateLimiter* 2. *ItemExponentialFailureRateLimiter* 3. *ItemFastSlowRateLimiter* 4. *MaxOfRateLimiter* 5. *WithMaxWaitRateLimiter* ###### BucketRateLimiter 这个限速器可说的不多,用了 golang 标准库的 `golang.org/x/time/rate.Limiter` 实现。BucketRateLimiter 实例化的时候比如传递一个 `rate.NewLimiter(rate.Limit(10), 100)` 进去,表示令牌桶里最多有 100 个令牌,每秒发放 10 个令牌。 所有元素都是一样的,来几次都是一样,所以NumRequeues,Forget都没有意义。 ``` type BucketRateLimiter struct { *rate.Limiter } var _ RateLimiter = &BucketRateLimiter{} func (r *BucketRateLimiter) When(item interface{}) time.Duration { return r.Limiter.Reserve().Delay() // 过多久后给当前 item 发放一个令牌 } func (r *BucketRateLimiter) NumRequeues(item interface{}) int { return 0 } // func (r *BucketRateLimiter) Forget(item interface{}) { } ``` ###### ItemExponentialFailureRateLimiter Exponential 是指数的意思,从这个限速器的名字大概能猜到是失败次数越多,限速越长而且是指数级增长的一种限速器。 结构体定义如下,属性含义基本可以望文生义 ``` func (r *ItemExponentialFailureRateLimiter) When(item interface{}) time.Duration { r.failuresLock.Lock() defer r.failuresLock.Unlock() exp := r.failures[item] r.failures[item] = r.failures[item] + 1 // 失败次数加一 // 每调用一次,exp 也就加了1,对应到这里时 2^n 指数爆炸 backoff := float64(r.baseDelay.Nanoseconds()) * math.Pow(2, float64(exp)) if backoff > math.MaxInt64 { // 如果超过了最大整型,就返回最大延时,不然后面时间转换溢出了 return r.maxDelay } calculated := time.Duration(backoff) if calculated > r.maxDelay { // 如果超过最大延时,则返回最大延时 return r.maxDelay } return calculated } func (r *ItemExponentialFailureRateLimiter) NumRequeues(item interface{}) int { r.failuresLock.Lock() defer r.failuresLock.Unlock() return r.failures[item] } func (r *ItemExponentialFailureRateLimiter) Forget(item interface{}) { r.failuresLock.Lock() defer r.failuresLock.Unlock() delete(r.failures, item) } ``` ###### ItemFastSlowRateLimiter 快慢限速器,也就是先快后慢,定义一个阈值,超过了就慢慢重试。先看类型定义: ``` type ItemFastSlowRateLimiter struct { failuresLock sync.Mutex failures map[interface{}]int maxFastAttempts int // 快速重试的次数 fastDelay time.Duration // 快重试间隔 slowDelay time.Duration // 慢重试间隔 } func (r *ItemFastSlowRateLimiter) When(item interface{}) time.Duration { r.failuresLock.Lock() defer r.failuresLock.Unlock() r.failures[item] = r.failures[item] + 1 // 标识重试次数 + 1 if r.failures[item] <= r.maxFastAttempts { // 如果快重试次数没有用完,则返回 fastDelay return r.fastDelay } return r.slowDelay // 反之返回 slowDelay } func (r *ItemFastSlowRateLimiter) NumRequeues(item interface{}) int { r.failuresLock.Lock() defer r.failuresLock.Unlock() return r.failures[item] } func (r *ItemFastSlowRateLimiter) Forget(item interface{}) { r.failuresLock.Lock() defer r.failuresLock.Unlock() delete(r.failures, item) } ``` ###### MaxOfRateLimiter 组合限速器,内部放多个限速器,然后返回限速最慢的一个延时: ``` type MaxOfRateLimiter struct { limiters []RateLimiter } func (r 
*MaxOfRateLimiter) When(item interface{}) time.Duration { ret := time.Duration(0) for _, limiter := range r.limiters { curr := limiter.When(item) if curr > ret { ret = curr } } return ret } ```
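client-go 默认的 DefaultControllerRateLimiter() 就是用 MaxOfRateLimiter 把"单个元素指数退避"和"整体令牌桶"组合起来的。下面是一个示意用法(syncHandler 是假设的处理函数),也是控制器里最常见的 AddRateLimited/Forget 重试写法:

```go
package main

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// syncHandler 是一个假设的处理函数
func syncHandler(key string) error { return nil }

func main() {
	// 与 workqueue.DefaultControllerRateLimiter() 等价的组合方式
	limiter := workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
	q := workqueue.NewRateLimitingQueue(limiter)
	defer q.ShutDown()

	q.Add("default/pod1")
	key, _ := q.Get()

	if err := syncHandler(key.(string)); err != nil {
		// 失败:按限速器算出的延迟重新入队,失败次数越多延迟越长
		q.AddRateLimited(key)
	} else {
		// 成功:清掉该 key 的失败计数
		q.Forget(key)
	}
	q.Done(key)
}
```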
###### WithMaxWaitRateLimiter 这个限速器也很简单,就是在其他限速器上包装一个最大延迟的属性,如果到了最大延时,则直接返回。这样就能避免延迟时间不可控,万一一个对象失败了多次,那以后的时间会越来越大。 ``` type WithMaxWaitRateLimiter struct { limiter RateLimiter // 其他限速器 maxDelay time.Duration // 最大延时 } func NewWithMaxWaitRateLimiter(limiter RateLimiter, maxDelay time.Duration) RateLimiter { return &WithMaxWaitRateLimiter{limiter: limiter, maxDelay: maxDelay} } func (w WithMaxWaitRateLimiter) When(item interface{}) time.Duration { delay := w.limiter.When(item) if delay > w.maxDelay { return w.maxDelay // 已经超过了最大延时,直接返回最大延时 } return delay } ``` ### 3.总结 (1)workerqueue使用于只关注结果的处理方式。 比如统计一个Pod update了多少次这种关乎 过程的 处理。不能用,因为workerqueue进行了合并 (2)workerqueue实现了很多限速机制,可以更加情况酌情使用 ### 4. 参考文档 https://blog.csdn.net/weixin_42663840/article/details/81482553 https://www.danielhu.cn/post/k8s/client-go-workqueue/ ================================================ FILE: k8s/client-go/9.从0到1使用kubebuilder创建crd.md ================================================ - [0. 下载kubebuilder](#0---kubebuilder) - [1. 创建目录](#1-----) - [2. 初始化项目](#2------) - [3. 创建api和controller](#3---api-controller) - [4. 实现自己的crd和控制器逻辑](#4------crd------) - [5. make manifests, 创建crd的相关yaml](#5-make-manifests----crd---yaml) - [6. 在集群中部署crd](#6-------crd) - [7. 部署controller](#7---controller) **简介** 从0到1,手把手教会如何使用kubebuilder创建crd, 并且定制自己的控制器。 代码:https://github.com/zoux86/operator-example ### 0. 下载kubebuilder ```bash # download kubebuilder and install locally. curl -L -o kubebuilder https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH) chmod +x kubebuilder && mv kubebuilder /usr/local/bin/ ``` ### 1. 创建目录 ~/go/src 是我的go src目录 github.com/zoux86/operator-example是想自定义的crd项目 ``` // 可以看出来go mod init 指定的字符串就是mod文件里面的module目录 ~/go/src/github.com/zoux86/operator-example# go mod init github.com/zoux86/operator-example go: creating new go.mod: module github.com/zoux86/operator-example ~/go/src/github.com/zoux86/operator-example # ls go.mod ~/go/src/github.com/zoux86/operator-example # cat go.mod module github.com/zoux86/operator-example go 1.18 ``` ### 2. 初始化项目 执行kubebuilder init这一条命令就行了 ``` ~/go/src/github.com/zoux86/operator-example # ~/kubebuilder init --domain github.com --license apache2 --owner "zoux86" Writing kustomize manifests for you to edit... Writing scaffold for you to edit... Get controller runtime: $ go get sigs.k8s.io/controller-runtime@v0.11.2 go: downloading sigs.k8s.io/controller-runtime v0.11.2 go: downloading k8s.io/apimachinery v0.23.5 go: downloading k8s.io/client-go v0.23.5 go: downloading k8s.io/utils v0.0.0-20211116205334-6203023598ed go: downloading k8s.io/component-base v0.23.5 go: downloading k8s.io/api v0.23.5 go: downloading k8s.io/apiextensions-apiserver v0.23.5 go: downloading sigs.k8s.io/json v0.0.0-20211020170558-c049b76a60c6 go: downloading golang.org/x/net v0.0.0-20211209124913-491a49abca63 go: downloading golang.org/x/oauth2 v0.0.0-20210819190943-2bc19b11175f Update dependencies: $ go mod tidy go: downloading github.com/Azure/go-autorest/autorest v0.11.18 go: downloading github.com/Azure/go-autorest/autorest/adal v0.9.13 go: downloading github.com/Azure/go-autorest/tracing v0.6.0 go: downloading github.com/Azure/go-autorest/autorest/mocks v0.4.1 go: downloading github.com/Azure/go-autorest/autorest/date v0.3.0 go: downloading github.com/Azure/go-autorest/logger v0.2.1 go: downloading golang.org/x/crypto v0.0.0-20210817164053-32db794688a5 Next: define a resource with: $ kubebuilder create api ```
**查看文件目录** k8s apis通常有三个组件`Resource, Controller, Manager`,它们分别定义/实现在以下的三个package当中: - **cmd/...**:主流程程序`Manager`入口,负责初始化依赖包、启停`Controller`。用户通常不需要编辑此包,可以依赖脚手架。通过`kubebuilder init`自动创建生成。 - **pkg/apis/...**:包含API资源的定义。编辑`*_types.go`文件来修改资源定义。每个资源的定义文件存在于`pkg/apis///_types.go`中。通过`kubebuilder create api`自动创建生成。 - **pkg/controller/...**:包含Controller的实现。编辑`*_controller.go`实现Controller。通过`kubebuilder create api`自动创建生成。 ``` ~/go/src/github.com/zoux86/operator-example  tree . ├── Dockerfile ├── Makefile ├── PROJECT ├── README.md ├── config │   ├── default │   │   ├── kustomization.yaml │   │   ├── manager_auth_proxy_patch.yaml │   │   └── manager_config_patch.yaml │   ├── manager │   │   ├── controller_manager_config.yaml │   │   ├── kustomization.yaml │   │   └── manager.yaml │   ├── prometheus │   │   ├── kustomization.yaml │   │   └── monitor.yaml │   └── rbac │   ├── auth_proxy_client_clusterrole.yaml │   ├── auth_proxy_role.yaml │   ├── auth_proxy_role_binding.yaml │   ├── auth_proxy_service.yaml │   ├── kustomization.yaml │   ├── leader_election_role.yaml │   ├── leader_election_role_binding.yaml │   ├── role_binding.yaml │   └── service_account.yaml ├── go.mod ├── go.sum ├── hack │   └── boilerplate.go.txt └── main.go ```
### 3. 创建api和controller 其实从create api 后的输出我们可以看出来:我们修改逻辑后就可以部署了 ``` ~/go/src/github.com/zoux86/operator-example # ~/kubebuilder create api --group zouxapp --kind PodCount --version v1 Create Resource [y/n] y Create Controller [y/n] y Writing kustomize manifests for you to edit... // 先修改这2个文件 Writing scaffold for you to edit... api/v1/podcount_types.go controllers/podcount_controller.go Update dependencies: $ go mod tidy Running make: $ make generate mkdir -p /Users/game-netease/go/src/github.com/zoux86/operator-example/bin GOBIN=/Users/game-netease/go/src/github.com/zoux86/operator-example/bin go install sigs.k8s.io/controller-tools/cmd/controller-gen@v0.8.0 go: downloading sigs.k8s.io/controller-tools v0.8.0 go: downloading github.com/spf13/cobra v1.2.1 go: downloading golang.org/x/tools v0.1.6-0.20210820212750-d4cc65f0b2ff go: downloading github.com/fatih/color v1.12.0 go: downloading k8s.io/api v0.23.0 go: downloading k8s.io/apimachinery v0.23.0 go: downloading github.com/gobuffalo/flect v0.2.3 go: downloading k8s.io/apiextensions-apiserver v0.23.0 go: downloading github.com/mattn/go-colorable v0.1.8 go: downloading github.com/mattn/go-isatty v0.0.12 go: downloading golang.org/x/sys v0.0.0-20210831042530-f4d43177bf5e go: downloading golang.org/x/mod v0.4.2 /Users/game-netease/go/src/github.com/zoux86/operator-example/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..." Next: implement your new API and generate the manifests (e.g. CRDs,CRs) with: $ make manifests ```
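create api 之后,api/v1/podcount_types.go 里默认只有一个示例字段 Foo。第 4 节的 Reconcile 会用到 Spec.Count 和 Status.Count,所以需要先把字段改成 count,下面是一个示意的写法(只展示需要补充的两个结构体,字段名和 json tag 以实际生成的文件为准):

```go
package v1

// PodCountSpec defines the desired state of PodCount
type PodCountSpec struct {
	// 期望的计数,controller 会把它同步到 status.count
	Count int `json:"count,omitempty"`
}

// PodCountStatus defines the observed state of PodCount
type PodCountStatus struct {
	Count int `json:"count,omitempty"`
}
```

改完之后再执行 make generate / make manifests,让 zz_generated.deepcopy.go 和 CRD 的 yaml 跟着更新。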
执行create api后,生成以下文件: ``` api/v1/groupversion_info.go api/v1/podcount_types.go // 需要修改这个文件中crd的定义 api/v1/zz_generated.deepcopy.go config/crd/kustomization.yaml config/crd/kustomizeconfig.yaml config/crd/patches/cainjection_in_podcounts.yaml config/crd/patches/webhook_in_podcounts.yaml config/rbac/podcount_editor_role.yaml config/rbac/podcount_viewer_role.yaml config/samples/zouxapp_v1_podcount.yaml controllers/podcount_controller.go // 需要修改这个文件的controller运行逻辑 controllers/suite_test.go go.mod main.go ``` ### 4. 实现自己的crd和控制器逻辑 根据实际情况而定,这里的控制器逻辑很简单,就是创建同步podCount的spec.count到status里面。 ``` func (r *PodCountReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { rlog := log.FromContext(ctx) rlog.Info("start to reconciling podCount %s", req.Name) podCount := &zouxappv1.PodCount{} err := r.Client.Get(ctx, req.NamespacedName, podCount) if err != nil { rlog.Error(err, fmt.Sprintf("get podcount %s/%s err during reconcile.", req.Namespace, req.Name)) return ctrl.Result{}, nil } podCountCopy := podCount.DeepCopy() if podCount.Spec.Count <= 0 { podCountCopy.Status.Count = 0 } else { podCountCopy.Status.Count = podCount.Spec.Count } err = r.Client.Status().Update(ctx, podCountCopy) if err != nil { rlog.Error(err, fmt.Sprintf("update crd podcount status error %s/%s during reconcile.", req.Namespace, req.Name)) } //r.Status().Update(ctx, podCountCopy, metav1.UpdateOptions{}) // TODO(user): your logic here return ctrl.Result{}, err } ``` ### 5. make manifests, 创建crd的相关yaml ``` ~/go/src/github.com/zoux86/operator-example # make manifests /Users/game-netease/go/src/github.com/zoux86/operator-example/bin/controller-gen rbac:roleName=manager-role crd webhook paths="./..." output:crd:artifacts:config=config/crd/bases ``` 执行make manifests之后,我们会得到2个文件。 ``` config/crd/bases/ config/rbac/role.yaml # cat config/rbac/role.yaml --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: creationTimestamp: null name: manager-role rules: - apiGroups: - zouxapp.github.com resources: - podcounts verbs: - create - delete - get - list - patch - update - watch - apiGroups: - zouxapp.github.com resources: - podcounts/finalizers verbs: - update - apiGroups: - zouxapp.github.com resources: - podcounts/status verbs: - get - patch - update # cat config/crd/bases/zouxapp.github.com_podcounts.yaml --- apiVersion: apiextensions.k8s.io/v1 kind: CustomResourceDefinition metadata: annotations: controller-gen.kubebuilder.io/version: v0.8.0 creationTimestamp: null name: podcounts.zouxapp.github.com spec: group: zouxapp.github.com names: kind: PodCount listKind: PodCountList plural: podcounts singular: podcount scope: Namespaced versions: - name: v1 schema: openAPIV3Schema: description: PodCount is the Schema for the podcounts API properties: apiVersion: description: 'APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources' type: string kind: description: 'Kind is a string value representing the REST resource this object represents. Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. 
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds' type: string metadata: type: object spec: description: PodCountSpec defines the desired state of PodCount properties: count: description: Foo is an example field of PodCount. Edit podcount_types.go to remove/update type: integer type: object status: description: PodCountStatus defines the observed state of PodCount properties: count: description: 'INSERT ADDITIONAL STATUS FIELD - define observed state of cluster Important: Run "make" to regenerate code after modifying this file' type: integer type: object type: object served: true storage: true subresources: status: {} status: acceptedNames: kind: "" plural: "" conditions: [] storedVersions: [] ``` ### 6. 在集群中部署crd ``` ~/go/src/github.com/zoux86/operator-example #kubectl --kubeconfig=kubeconfig get node create -f config/crd/bases customresourcedefinition.apiextensions.k8s.io/podcounts.zouxapp.github.com created ~/go/src/github.com/zoux86/operator-example # kubectl --kubeconfig=kubeconfig create -f config/samples/zouxapp_v1_podcount.yaml podcount.zouxapp.github.com/podcount-sample created ```
上集群验证,可以看到创建成功了,但是可以看出来没有status.count,这个因为集群还没部署控制器 ``` root# kubectl get crd | grep podc podcounts.zouxapp.github.com 2022-08-25T06:57:09Z root # kubectl get podcounts.zouxapp.github.com NAME AGE podcount-sample 11s root # kubectl get podcounts.zouxapp.github.com -oyaml apiVersion: v1 items: - apiVersion: zouxapp.github.com/v1 kind: PodCount metadata: creationTimestamp: "2022-08-25T07:01:16Z" generation: 1 name: podcount-sample namespace: default resourceVersion: "467368378" selfLink: /apis/zouxapp.github.com/v1/namespaces/default/podcounts/podcount-sample uid: a8b42a4c-1ebd-430a-890f-b0238f4ad125 spec: count: 3 kind: List metadata: resourceVersion: "" selfLink: "" ``` ### 7. 部署controller 之前 CRD 并不会完成任何工作,只是在 ETCD 中创建了一条记录。所以我们需要部署写的controller。 运行CRD controller ``` ~/go/src/github.com/zoux86/operator-example ## go run ./main.go I0825 15:35:12.827589 63628 request.go:665] Waited for 1.000074041s due to client-side throttling, not priority and fairness, request: GET:https://xxx/apis/apiextensions.k8s.io/v1?timeout=32s 1.6614129137223601e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"} 1.6614129137230568e+09 INFO setup starting manager 1.661412913723448e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"} 1.661412913723448e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"} 1.661412913723506e+09 INFO controller.podcount Starting EventSource {"reconciler group": "zouxapp.github.com", "reconciler kind": "PodCount", "source": "kind source: *v1.PodCount"} 1.661412913723542e+09 INFO controller.podcount Starting Controller {"reconciler group": "zouxapp.github.com", "reconciler kind": "PodCount"} 1.661412913825421e+09 INFO controller.podcount Starting workers {"reconciler group": "zouxapp.github.com", "reconciler kind": "PodCount", "worker count": 1} I0825 15:35:13.825542 63628 podcount_controller.go:50] start to reconciling podCount podcount-sample I0825 15:35:13.868618 63628 podcount_controller.go:50] start to reconciling podCount podcount-sample ``` **查看发现生效了** ``` root# kubectl get podcounts.zouxapp.github.com -oyaml apiVersion: v1 items: - apiVersion: zouxapp.github.com/v1 kind: PodCount metadata: creationTimestamp: "2022-08-25T07:01:16Z" generation: 1 name: podcount-sample namespace: default resourceVersion: "467385745" selfLink: /apis/zouxapp.github.com/v1/namespaces/default/podcounts/podcount-sample uid: a8b42a4c-1ebd-430a-890f-b0238f4ad125 spec: count: 3 status: count: 3 kind: List metadata: resourceVersion: "" selfLink: "" ``` ================================================ FILE: k8s/cni/0.章节介绍.md ================================================ 本章节主要了解cni的相关知识,章节安排如下: (1)网路基础知识介绍 (2)容器云用到的网络知识 (3)kubelet中的cni介绍 (4)flannel原理分析 (5)cacilo原理分析 (6)如何订制cni ================================================ FILE: k8s/cni/1. 网络基础知识.md ================================================ * [1\. 网络基础知识](#1-网络基础知识) * [1\.1 基础概念](#11-基础概念) * [1\.2 一个宿主是如何处理数据包的](#12--一个宿主是如何处理数据包的) * [2\. 物理层工作原理](#2-物理层工作原理) * [3\. 链路层工作原理](#3-链路层工作原理) * [3\.1 这个包是发给谁的?谁应该接收?](#31--这个包是发给谁的谁应该接收) * [3\.2 大家都在发,会不会产生混乱?有没有谁先发、谁后发的规则?](#32-大家都在发会不会产生混乱有没有谁先发谁后发的规则) * [3\.3 如果发送的时候出现了错误,怎么办?](#33--如果发送的时候出现了错误怎么办) * [4\. 网络层工作原理](#4-网络层工作原理) * [4\.1 ping的流程](#41-ping的流程) * [4\.2 不同子网之间的ip访问](#42-不同子网之间的ip访问) * [4\.2\.1\. 路由器是如何工作的](#421-路由器是如何工作的) * [4\.2\.2 不同子网的ip通信流程](#422-不同子网的ip通信流程) * [4\.2\.3 路由是如何设置的](#423-路由是如何设置的) * [5\. 
网卡介绍](#5-网卡介绍) * [5\.1 查看网卡](#51--查看网卡) * [5\.1\.1 ifconfig介绍](#511--ifconfig介绍) * [5\.1\.2 其他方式查看网卡](#512-其他方式查看网卡) * [5\.2 虚拟网卡](#52-虚拟网卡) * [5\.2\.1 虚拟网卡介绍](#521-虚拟网卡介绍) * [5\.2\.2 云计算中的网络计算\-虚拟网卡](#522-云计算中的网络计算-虚拟网卡) 本节主要重新学习一下非常基础的网络知识,为后面cni的学习做基础。 参考: * 网络是怎样连接的-[日]户根勤 * 趣谈网络协议 ### 1. 网络基础知识 #### 1.1 基础概念 数据包(packet):IP 协议传送数据的单位; 帧(frame):链路层传送数据的单位; 节点(node):实现了 IP 协议的设备; 路由器(router):可以转发不是发给自己的 IP 包的设备; 主机(host):不是路由器的节点; 链路(link):一种通信机制或介质,节点可以通过它在链路层通信。比如以太网、PPP 链接,也包括隧道; 接口(interface):节点与链路的连接(可以理解为抽象的“网卡”); 链路层地址(link-layer address):接口的链路层标识符(如以太网的 mac 地址) #### 1.2 一个宿主是如何处理数据包的 * 网卡收到包之后会判断mac是不是自己的,如果是自己的会触发硬中断、软中断通知cpu收包(链路层) * 之后数据包进入内核网络协议栈,做四层处理,iptables、nat之类的(网络层) * 然后送到对应的socket缓冲区(传输层) * 最后送到用户空间进程(应用层) ![image-20220323175725888](../images/wangluoxieyi.png) 网络协议栈:是操作系统中对网络相关做处理的逻辑。解封包、iptables、route、netns、vxlan、tunnel、等等都是这里面的一块逻辑(hook) 所以不同的network namespaces会有自己不同的网络协议栈,比如有不同的路由规则等等。这样就达到了隔离的作用。
### 2. 物理层工作原理 **发送过程**:网卡驱动从 IP 模块获取包之后,会将其复制到网卡内的缓冲区中,然后向 MAC 模块发出发送包的命令。接下来就轮到 MAC 模块进行工作了。 首先,MAC 模块会将包从缓冲区中取出,并在开头加上报头和起始帧分界符,在末尾加上用于检测错误的帧校验序列。 报头是一串像 10101010…这样 1 和 0 交替出现的比特序列,长度为 56 比特。当这些 1010 的比特序列被转换成电信号后,会形成高低电平交替的波形,然后通过光缆或者网线传输出去。
**集线器**,也叫做**Hub**。这种设备有多个口,可以将多台电脑连接起来。但是和交换机不同,集线器没有大脑,它完全在物理层工作。它会将自己收到的每一个字节,都复制到其他端口上去。这是第一层物理层联通的方案。 ### 3. 链路层工作原理 有了物理层的基础,不同主机之间就可以直接发送数据了。但是还有几个问题需要解决:
链路层工作的设备是交换机。交换机是在局域网工作的,本身不需要ip 交换机的作用就是根据 mac地址进行端口转发。交换机有学习功能,举例如下: 如果机器 1 只知道机器 4 的 IP 地址,当它想要访问机器 4,把包发出去的时候,它必须要知道机器 4 的 MAC 地址。 于是机器 1 发起广播,机器 2 收到这个广播,但是这不是找它的,所以没它什么事。交换机 A 一开始是不知道任何拓扑信息的,在它收到这个广播后,采取的策略是,除了广播包来的方向外,它还要转发给其他所有的网口。于是机器 3 也收到广播信息了,但是这和它也没什么关系。 当然,交换机 B 也是能够收到广播信息的,但是这时候它也是不知道任何拓扑信息的,因而也是进行广播的策略,将包转发到局域网三。这个时候,机器 4 和机器 5 都收到了广播信息。机器 4 主动响应说,这是找我的,这是我的 MAC 地址。于是一个 ARP 请求就成功完成了。 ![image-20220324150807288](../images/jiaohuanji.png) 这里可以会有一个问题,就是可能局域网的机器太多,交换机数量也多,然后就会出现回路。这个时候可能就会出现广播风暴。解决办法就是通过STP 协议解决 ### 4. 网络层工作原理 #### 4.1 ping的流程 ping 是基于 ICMP 协议工作的。**ICMP**全称**Internet Control Message Protocol**,就是**互联网控制报文协议**。 假定主机 A 的 IP 地址是 192.168.1.1,主机 B 的 IP 地址是 192.168.1.2,它们都在同一个子网。那当你在主机 A 上运行“ping 192.168.1.2”后,会发生什么呢? (1)ping 命令执行的时候,源主机首先会构建一个 ICMP 请求数据包,ICMP 数据包内包含多个字段。最重要的是两个,第一个是**类型字段**,对于请求数据包而言该字段为 8;另外一个是**顺序号**,主要用于区分连续 ping 的时候发出的多个数据包。每发出一个请求数据包,顺序号会自动加 1。为了能够计算往返时间 RTT,它会在报文的数据部分插入发送时间。 (2)然后,由 ICMP 协议将这个数据包连同地址 192.168.1.2 一起交给 IP 层。IP 层将以 192.168.1.2 作为目的地址,本机 IP 地址作为源地址,加上一些其他控制信息,构建一个 IP 数据包。 (3)接下来,需要加入 MAC 头。如果在本节 ARP 映射表中查找出 IP 地址 192.168.1.2 所对应的 MAC 地址,则可以直接使用;如果没有,则需要发送 ARP 协议查询 MAC 地址,获得 MAC 地址后,由数据链路层构建一个数据帧,目的地址是 IP 层传过来的 MAC 地址,源地址则是本机的 MAC 地址;还要附加上一些控制信息,依据以太网的介质访问规则,将它们传送出去。 (4)主机 B 收到这个数据帧后,先检查它的目的 MAC 地址,并和本机的 MAC 地址对比,如符合,则接收,否则就丢弃。接收后检查该数据帧,将 IP 数据包从帧中提取出来,交给本机的 IP 层。同样,IP 层检查后,将有用的信息提取后交给 ICMP 协议。 (5)主机 B 会构建一个 ICMP 应答包,应答数据包的类型字段为 0,顺序号为接收到的请求数据包中的顺序号,然后再发送出去给主机 A。 (6)在规定的时候间内,源主机如果没有接到 ICMP 的应答包,则说明目标主机不可达;如果接收到了 ICMP 应答包,则说明目标主机可达。此时,源主机会检查,用当前时刻减去该数据包最初从源主机上发出的时刻,就是 ICMP 数据包的时间延迟。 ![image-20220324153428137](../images/jiaohuanji-1.png)
#### 4.2 不同子网之间的ip访问 ##### 4.2.1. 路由器是如何工作的 **路由器是一台设备,它有多个网口或者网卡,分别连着不同的局域网。每个网口的 IP 地址都和它所连局域网的 IP 地址在同一个网段,每个网口都是它所连的那个局域网的网关。** 其实就是路由器有多个端口,每个端口配置了ip:端口A配置了子网A的ip,端口B配置了子网B的ip,所以子网A、B通过路由器就可以通信了。
Gateway 的地址一定是和源 IP 地址是一个网段的。往往不是第一个,就是第二个。 例如 192.168.1.0/24 这个网段,Gateway 往往会是 192.168.1.1/24 或者 192.168.1.2/24。 网关主要是用来连接两种不同的网络,同时,网关它还能够同时与两边的主机之间进行通信。但是两边的主机是不能够直接进行通信,是必须要经过网关才能进行通信。网关的工作是在应用层当中。简单来说,网关它就是为了管理不同网段的IP,我们一般在交换机上做VLAN的时候,就需要在默认的VLAN接口之下做一个IP,而这个IP它就是我们所说的网关。
##### 4.2.2 不同子网的ip通信流程 ![mac-header](../images/mac.png) **mac头部如上所示**:在 MAC 头里面,先是目标 MAC 地址,然后是源 MAC 地址,然后有一个协议类型,用来说明里面是 IP 协议。IP 头里面的版本号,目前主流的还是 IPv4,服务类型 TOS 在第三节讲 ip addr 命令的时候讲过,TTL 在第 7 节讲 ICMP 协议的时候讲过。另外,还有 8 位标识协议。这里到了下一层的协议,也就是,是 TCP 还是 UDP。最重要的就是源 IP 和目标 IP。先是源 IP 地址,然后是目标 IP 地址。 在任何一台机器上,当要访问另一个 IP 地址的时候,都会先判断,这个目标 IP 地址,和当前机器的 IP 地址,是否在同一个网段。怎么判断同一个网段呢?需要 CIDR 和子网掩码,这个在第三节的时候也讲过了。 **如果是同一个网段**,例如,你访问你旁边的兄弟的电脑,那就没网关什么事情,直接将源地址和目标地址放入 IP 头中,然后通过 ARP 获得 MAC 地址,将源 MAC 和目的 MAC 放入 MAC 头中,发出去就可以了。 **如果不是同一网段**,例如,你要访问你们校园网里面的 BBS,该怎么办?这就需要发往默认网关 Gateway。Gateway 的地址一定是和源 IP 地址是一个网段的。往往不是第一个,就是第二个。例如 192.168.1.0/24 这个网段,Gateway 往往会是 192.168.1.1/24 或者 192.168.1.2/24。 **举例说明:** ![image-20220324162338881](../images/luyou.png) 服务器A属于子网: 192.168.1.101/24 服务器B属于子网:192.168.4.101/24 A服务器需要访问B服务器。访问的过程如下: (1)服务器A配置了mac包信息 - 源 MAC:服务器 A 的 MAC - 目标 MAC:192.168.1.1 **这个网口的 MAC ** //注意这里是吓一跳的mac地址,而不是目标地址的mac地址 - 源 IP:192.168.1.101 - 目标 IP:192.168.4.101 这里为什么会知道mac地址,是因为服务器A会通过自己的路由设置,判断这个包下一跳是路由器 192.168.1.1 。由于192.168.1.1 和服务器A是一个子网。所以是可以知道mac地址,并且可以通过mac地址将包发送给路由器的。 (2)包到达 192.168.1.1 这个网口,发现 MAC 一致,将包收进来,开始思考往哪里转发。 在路由器 A 中配置了静态路由之后,要想访问 192.168.4.0/24,要从 192.168.56.1 这个口出去,下一跳为 192.168.56.2。 这个时候mac地址变程了192.168.56.2的mac (3)包到达 192.168.56.2 这个网口,发现 MAC 一致,将包收进来,开始思考往哪里转发。 在路由器 B 中配置了静态路由,要想访问 192.168.4.0/24,要从 192.168.4.1 这个口出去,没有下一跳了。因为我右手这个网卡,就是这个网段的,我是最后一跳了。 于是,路由器 B 思考的时候,匹配上了这条路由,要从 192.168.4.1 这个口发出去,发给 192.168.4.101。那 192.168.4.101 的 MAC 地址是多少呢?路由器 B 发送 ARP 获取 192.168.4.101 的 MAC 地址,然后发送包。 通过这个过程可以看出,每到一个新的局域网,MAC 都是要变的,但是 IP 地址都不变。在 IP 头里面,不会保存任何网关的 IP 地址。**所谓的下一跳是,某个 IP 要将这个 IP 地址转换为 MAC 放入 MAC 头。**
有的时候是不同的私有网络之间互相访问,两边的子网网段可能是重叠(一致)的,这个时候路由器/网关会做 NAT 地址转换。
##### 4.2.3 路由是如何设置的 (1)静态路由:通过 route、ip route 等命令手动配置路由表 (2)动态路由:通过距离矢量、链路状态等路由算法动态学习和设置
### 5. 网卡介绍 网卡的作用是负责接收网络上的数据包,通过和自己本身的物理地址(MAC)相比较,决定是否为应该由本机接收的信息,解包后将数据通过主板上的总线传输给本地计算机;另一方面,将本地计算机上的数据打包后送出到网络。 网卡是一块被设计用来允许计算机在计算机网络上进行通讯的计算机硬件。 由于其拥有MAC地址,因此属于OSI模型的第2层。
#### 5.1 查看网卡 ##### 5.1.1 ifconfig介绍 ``` root@onlinegame:/home/zouxiang# ifconfig br-15db8aed13ee: flags=4099 mtu 1500 inet 172.18.0.1 netmask 255.255.0.0 broadcast 172.18.255.255 // 桥接 ether 02:42:e7:49:61:ba txqueuelen 0 (Ethernet) RX packets 24 bytes 2202 (2.1 KiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 29 bytes 1965 (1.9 KiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 br-837c9a286528: flags=4099 mtu 1500 inet 172.19.0.1 netmask 255.255.0.0 broadcast 172.19.255.255 ether 02:42:6a:d2:3e:4b txqueuelen 0 (Ethernet) RX packets 0 bytes 0 (0.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 0 bytes 0 (0.0 B) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 docker0: flags=4099 mtu 1500 inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255 ether 02:42:17:8d:1d:41 txqueuelen 0 (Ethernet) RX packets 180476 bytes 12421319 (11.8 MiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 289194 bytes 417833816 (398.4 MiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 eth0: flags=4163 mtu 1400 inet 10.212.31.96 netmask 255.255.255.0 broadcast 10.212.31.255 ether 52:54:00:2d:09:10 txqueuelen 1000 (Ethernet) RX packets 31059650 bytes 6131166764 (5.7 GiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 29660001 bytes 6106589785 (5.6 GiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 lo: flags=73 mtu 65536 // loop设备 inet 127.0.0.1 netmask 255.0.0.0 loop txqueuelen 1 (Local Loopback) RX packets 10494509 bytes 790297025 (753.6 MiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 10494509 bytes 790297025 (753.6 MiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 ``` eth0 表示第一块网卡, 其中 ether表示网卡的mac地址,可以看到目前这个网卡的物理地址(MAC地址)是 52:54:00:2d:09:10 inet addr 用来表示网卡的IP地址,此网卡的 IP地址是 192.168.120.204,广播地址, Bcast:192.168.120.255,掩码地址Mask:255.255.255.0 lo 是表示主机的回环地址,这个一般是用来测试一个网络程序,但又不想让局域网或外网的用户能够查看,只能在此台主机上运行和查看所用的网络接口。比如把 HTTPD服务器的指定到回坏地址,在浏览器输入 127.0.0.1 就能看到你所架WEB网站了。但只是您能看得到,局域网的其它主机或用户无从知道。 第一行:连接类型:Ethernet(以太网)HWaddr(硬件mac地址) 第二行:网卡的IP地址、子网、掩码 第三行:UP(代表网卡开启状态)RUNNING(代表网卡的网线被接上)MULTICAST(支持组播)MTU:1500(最大传输单元):1500字节 第四、五行:接收、发送数据包情况统计 RX packets: errors:0 dropped:0 overruns:0 frame:0 接受包数量/出错数量/丢失数量… TX packets: errors:0 dropped:0 overruns:0 carrier:0 发送包数量/出错数量/丢失数量… **loop设备** lo 是loop设备的意思,地址是127.0.0.1即本机回送地址,一般网站服务本地测试的时候时候这个ip进行本地测试 第七行:接收、发送数据字节数统计信息。 **桥接** 真实主机中安装的虚拟主机,需要和外界主机进行通讯的时候,数据需要通过真实主机的网卡进行传输,但是虚拟主机内核无法对真实主机的网卡进行控制,一般情况下需要将虚拟主机先将数据包发送给真实主机的内核,再由真实主机内核将该数据通过真实物理网卡发送出去,该过程成为NAT(网络地址转换),虽然可以实现该功能,但是数据传数度较慢。 怎么办呢? 
linux内核支持网络接口的桥接,什么意思?就是说可以由真实主机的内核虚拟出来一个接口br0,同时这个也是一个对外的虚拟网卡设备,通过该接口可以将虚拟主机网卡和真实主机网卡直接连接起来,进行正常的数据通讯,提升数据传输效率。该过程就是桥接。(目前只支持以太网接口,linux内核是通过一个虚拟的网桥设备来实现虚拟桥接接口的,这个虚拟设备可以绑定若干个以太网接口设备,从而将它们桥接起来) ##### 5.1.2 其他方式查看网卡 ``` root@# ip link 1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 24578: usb0: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 3a:68:dd:49:76:07 brd ff:ff:ff:ff:ff:ff 2: eth1: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 04:3f:72:ed:d5:8a brd ff:ff:ff:ff:ff:ff 3: eth0: mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 10000 link/ether 04:3f:72:ed:d5:8b brd ff:ff:ff:ff:ff:ff 4: eth3: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 04:3f:72:ed:d5:9a brd ff:ff:ff:ff:ff:ff 5: eth2: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 04:3f:72:ed:d5:9b brd ff:ff:ff:ff:ff:ff 6: eth4: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 04:3f:72:ed:d5:be brd ff:ff:ff:ff:ff:ff 7: eth5: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 04:3f:72:ed:d5:bf brd ff:ff:ff:ff:ff:ff 9: ovs-system: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 16:00:1c:04:75:4c brd ff:ff:ff:ff:ff:ff 10: acc-int: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 82:ca:f2:02:d4:4b brd ff:ff:ff:ff:ff:ff 12: docker0: mtu 1500 qdisc noqueue state UP mode DEFAULT group default link/ether 02:42:78:6b:b0:54 brd ff:ff:ff:ff:ff:ff 17: vxlan_sys_4789: mtu 65000 qdisc noqueue master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000 link/ether 32:35:2a:28:fd:ab brd ff:ff:ff:ff:ff:ff 71699: qvo_d888ac@if71700: mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000 link/ether fe:54:00:dd:11:4f brd ff:ff:ff:ff:ff:ff link-netns ns_network 71703: qvo_8f5819@if71704: mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000 link/ether fe:54:00:4e:d0:7b brd ff:ff:ff:ff:ff:ff link-netns ns_network 71451: qvo_11a63f@if71452: mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000 link/ether fe:54:00:10:56:1e brd ff:ff:ff:ff:ff:ff link-netns ns_network 71731: qvo_0d237c@if71732: mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000 link/ether fe:54:00:fb:d9:ff brd ff:ff:ff:ff:ff:ff link-netns ns_network 71514: veth14feb1d@if71513: mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default link/ether 8e:f8:a2:35:53:ba brd ff:ff:ff:ff:ff:ff link-netnsid 7 71783: qvo_87db25@if71784: mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000 link/ether fe:54:00:36:a2:4a brd ff:ff:ff:ff:ff:ff link-netns ns_network 71607: qvo_759d55@if71608: mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000 link/ether fe:54:00:41:28:b0 brd ff:ff:ff:ff:ff:ff link-netns ns_network 71611: qvo_988854@if71612: mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000 link/ether fe:54:00:f1:56:1e brd ff:ff:ff:ff:ff:ff link-netns ns_network 71387: qvo_9b9a37@if71388: mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000 link/ether fe:54:00:02:cc:4f brd ff:ff:ff:ff:ff:ff link-netns ns_network // 查看所有的网卡,这个和上面的ip link是一样的 root# cd /sys/class/net root# ls acc-int 
docker0 eth0 eth1 eth2 eth3 eth4 eth5 lo ovs-system qvo_0d237c qvo_11a63f qvo_759d55 qvo_87db25 qvo_8f5819 qvo_988854 qvo_9b9a37 qvo_d888ac usb0 veth14feb1d vxlan_sys_4789 ``` #### 5.2 虚拟网卡 ##### 5.2.1 虚拟网卡介绍 虚拟网卡简单来说就是通过软件模拟出来的电脑网卡。在虚拟化中经常用到。 ``` // 查看/sys/devices/virtual/net/这个目录,可以判断出哪些是虚拟网卡 root# ls /sys/devices/virtual/net/ acc-int docker0 lo ovs-system qvo_0d237c qvo_11a63f qvo_16bb20 qvo_759d55 qvo_8f5819 qvo_988854 qvo_9b9a37 qvo_d888ac veth14feb1d vxlan_sys_4789 root@cld-dnode1-1051:/sys/class/net# ```
虚拟网卡的实际工作原理就是: 协议栈处理完的会从网卡送出,这些可能是虚拟网卡,虚拟网卡最终会通过IO将数据送到物理网卡(NIC),然后发送出去。 虚拟网卡和物理网卡的连接方式有很多种。比如桥接(通过brideg连接虚拟网卡和物理网卡)。 不同namespaces经常通过veth-pair连接。 例如,在docker内部其实就是veth-pair,一个虚拟网卡在容器内部,一个在宿主,然后进行通信。 veth-pair就是一堆虚拟网卡设备。往一个网卡发送数据,另一个网卡就能收到。 ``` bash-5.1$ route Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface default 7.53.64.65 0.0.0.0 UG 0 0 0 eth0 7.53.64.64 * 255.255.255.192 U 0 0 0 eth0 bash-5.1$ bash-5.1$ ifconfig eth0 Link encap:Ethernet HWaddr 52:54:00:BA:9F:2D inet addr:7.53.64.112 Bcast:0.0.0.0 Mask:255.255.255.192 UP BROADCAST RUNNING MULTICAST MTU:1400 Metric:1 RX packets:132698569 errors:0 dropped:0 overruns:0 frame:0 TX packets:129715622 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:11982615680 (11.1 GiB) TX bytes:13782715446 (12.8 GiB) lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 UP LOOPBACK RUNNING MTU:65536 Metric:1 RX packets:4567760 errors:0 dropped:0 overruns:0 frame:0 TX packets:4567760 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:332750268 (317.3 MiB) TX bytes:332750268 (317.3 MiB) ``` ##### 5.2.2 云计算中的网络计算-虚拟网卡 虚拟网卡的作用有很多,当前云计算技术中就离不开虚拟网卡。 云计算中的网络有以下的点需要实现: (1)**共享**:尽管每个虚拟机都会有一个或者多个虚拟网卡,但是物理机上可能只有有限的网卡。那这么多虚拟网卡如何共享同一个出口? **通过网桥解决共享问题** ![image-20220325154103687](../images/wangka.png) (2)**隔离**:分两个方面,一个是安全隔离,两个虚拟机可能属于两个用户,那怎么保证一个用户的数据不被另一个用户窃听?一个是流量隔离,两个虚拟机,如果有一个疯狂下片,会不会导致另外一个上不了网? 有一个命令**vconfig**,可以基于物理网卡 eth0 创建带 VLAN 的虚拟网卡,所有从这个虚拟网卡出去的包,都带这个 VLAN,如果这样,跨物理机的互通和隔离就可以通过这个网卡来实现。 不同的用户由于网桥不通,不能相互通信,一旦出了网桥,由于 VLAN 不同,也不会将包转发到另一个网桥上。另外,出了物理机,也是带着 VLAN ID 的。只要物理交换机也是支持 VLAN 的,到达另一台物理机的时候,VLAN ID 依然在,它只会将包转发给相同 VLAN 的网卡和网桥,所以跨物理机,不同的 VLAN 也不会相互通信。 ![image-20220325155034439](../images/wangka-2.png) (3) **互通**:分两个方面,一个是如果同一台机器上的两个虚拟机,属于同一个用户的话,这两个如何相互通信?另一个是如果不同物理机上的两个虚拟机,属于同一个用户的话,这两个如何相互通信? 如上 (4)**灵活**:虚拟机和物理不同,会经常创建、删除,从一个机器漂移到另一台机器,有的互通、有的不通等等,灵活性比物理网络要好得多,需要能够灵活配置。 通过OpenvSwitch 配置
虚拟网卡的介绍:https://keenjin.github.io/2019/06/virtual-net/ ================================================ FILE: k8s/cni/2. docker 4种 网络模式.md ================================================ * [1\. 介绍](#1-介绍) * [2 bridge模式](#2-bridge模式) * [3 host模式](#3-host模式) * [4\. none模式](#4-none模式) * [5 container模式](#5-container模式) ### 1. 介绍 docker run创建Docker容器时,可以用–net选项指定容器的网络模式,Docker有以下4种网络模式: (1)bridge模式:使用–net =bridge指定,默认设置; (2)host模式:使用–net =host指定; (3)none模式:使用–net =none指定; (4)container模式:使用–net =container:NAMEorID指定。 ### 2 bridge模式 bridge模式是Docker默认的网络设置,此模式会为每一个容器分配Network Namespace、设置IP等,并将并将一个主机上的Docker容器 连接到一个虚拟网桥上。当Docker server启动时,会在主机上创建一个名为docker0的虚拟网桥,此主机上启动的Docker容器会连接到 这个虚拟网桥上。虚拟网桥的工作方式和物理交换机类似,这样主机上的所有容器就通过交换机连在了一个二层网络中。接下来就要为容 器分配IP了,Docker会从RFC1918所定义的私有IP网段中,选择一个和宿主机不同的IP地址和子网分配给docker0,连接到docker0的容 器就从这个子网中选择一个未占用的IP使用。如一般Docker会使用172.17.0.0/16这个网段,并将172.17.42.1/16分配给docker0网桥(在 主机上使用ifconfig命令是可以看到docker0的,可以认为它是网桥的管理端口,在宿主机上作为一块虚拟网卡使用) 可以看到容器内部,有eth0, 并且可以ping 通外网 ``` root@k8s-node:~# docker run -it -u root curlimages/curl:7.75.0 sh / # ping www.baidu.com PING www.baidu.com (183.232.231.172): 56 data bytes 64 bytes from 183.232.231.172: seq=0 ttl=55 time=1.892 ms 64 bytes from 183.232.231.172: seq=1 ttl=55 time=1.833 ms 64 bytes from 183.232.231.172: seq=2 ttl=55 time=1.834 ms ^Z[1]+ Stopped ping www.baidu.com / # ip addr 1: lo: mtu 65536 qdisc noqueue state UNKNOWN qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever 53: eth0@if54: mtu 1500 qdisc noqueue state UP link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0 valid_lft forever preferred_lft forever ```
### 3 host模式 如果启动容器的时候使用host模式,那么这个容器将不会获得一个独立的Network Namespace,而是和宿主机共用一个Network Namespace。容器将不会虚拟出自己的网卡,配置自己的IP等,而是使用宿主机的IP和端口。 使用host模式启动容器后可以发现,使用ip addr查看网络环境时,看到的都是宿主机上的信息。这种方式创建出来的容器,可以看到host上的所有网络设备。就是继承了宿主的网络 ``` root@k8s-node:~# docker run -it -u root --net=host curlimages/curl:7.75.0 sh / # ip addr 1: lo: mtu 65536 qdisc noqueue state UNKNOWN qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0: mtu 1500 qdisc mq state UP qlen 1000 link/ether fa:28:00:0d:3c:2f brd ff:ff:ff:ff:ff:ff inet 172.16.16.5/20 brd 172.16.31.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::f828:ff:fe0d:3c2f/64 scope link valid_lft forever preferred_lft forever 3: docker0: mtu 1500 qdisc noqueue state DOWN link/ether 02:42:f5:b6:cc:ca brd ff:ff:ff:ff:ff:ff inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0 valid_lft forever preferred_lft forever inet6 fe80::42:f5ff:feb6:ccca/64 scope link valid_lft forever preferred_lft forever 4: flannel.1: mtu 1450 qdisc noqueue state UNKNOWN link/ether b6:fa:84:04:82:55 brd ff:ff:ff:ff:ff:ff inet 10.244.1.0/32 brd 10.244.1.0 scope global flannel.1 valid_lft forever preferred_lft forever inet6 fe80::b4fa:84ff:fe04:8255/64 scope link valid_lft forever preferred_lft forever 5: cni0: mtu 1450 qdisc noqueue state UP qlen 1000 link/ether a2:34:ac:2b:00:a3 brd ff:ff:ff:ff:ff:ff inet 10.244.1.1/24 brd 10.244.1.255 scope global cni0 valid_lft forever preferred_lft forever inet6 fe80::a034:acff:fe2b:a3/64 scope link valid_lft forever preferred_lft forever 42: veth0@veth1: mtu 1500 qdisc noop state DOWN qlen 1000 link/ether 62:bb:3c:a3:ac:31 brd ff:ff:ff:ff:ff:ff 43: veth1@veth0: mtu 1500 qdisc noop state DOWN qlen 1000 link/ether 32:9d:76:89:b4:7a brd ff:ff:ff:ff:ff:ff 44: veth2@veth3: mtu 1500 qdisc noop state DOWN qlen 1000 link/ether 4e:11:eb:21:3a:16 brd ff:ff:ff:ff:ff:ff 45: veth3@veth2: mtu 1500 qdisc noop state DOWN qlen 1000 link/ether 1a:74:4a:dd:98:2d brd ff:ff:ff:ff:ff:ff 46: veth4@veth5: mtu 1500 qdisc noop state DOWN qlen 1000 link/ether 1e:f5:74:3f:ae:00 brd ff:ff:ff:ff:ff:ff 47: veth5@veth4: mtu 1500 qdisc noop state DOWN qlen 1000 link/ether 36:a6:e6:d8:49:53 brd ff:ff:ff:ff:ff:ff 48: veth8625ade0@docker0: mtu 1450 qdisc noqueue master cni0 state UP link/ether f6:57:1a:a5:65:f7 brd ff:ff:ff:ff:ff:ff inet6 fe80::f457:1aff:fea5:65f7/64 scope link valid_lft forever preferred_lft forever ``` ### 4. 
none模式 在none模式下,Docker容器拥有自己的Network Namespace,但是,并不为Docker容器进行任何网络配置。也就是说,这个Docker容器没有网卡、IP、路由等信息。需要我们自己为Docker容器添加网卡、配置IP等。 ``` root@k8s-node:~# docker run -it -u root --net=none curlimages/curl:7.75.0 sh / # ip addr 1: lo: mtu 65536 qdisc noqueue state UNKNOWN qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever / # ``` ### 5 container模式 这个模式指定新创建的容器和已经存在的一个容器共享一个Network Namespace,而不是和宿主机共享。新创建的容器不会创建自己的网卡,配置自己的IP,而是和一个指定的容器共享IP、端口范围等。同样,两个容器除了网络方面,其他的如文件系统、进程列表等还是隔离的。两个容器的进程可以通过lo网卡设备通信。 ``` d66875e6adc3是一个bridge的容器 root@k8s-node:~# docker run -it -u root --net=container:d66875e6adc3 curlimages/curl:7.75.0 sh / # ip addr 1: lo: mtu 65536 qdisc noqueue state UNKNOWN qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever 3: eth0@if48: mtu 1450 qdisc noqueue state UP link/ether 32:db:65:58:d2:29 brd ff:ff:ff:ff:ff:ff inet 10.244.1.10/24 brd 10.244.1.255 scope global eth0 valid_lft forever preferred_lft forever / # exit 622ee25b7390 是一个hostNetwork的容器 root@k8s-node:~# docker run -it -u root --net=container:622ee25b7390 curlimages/curl:7.75.0 sh / # ip addr 1: lo: mtu 65536 qdisc noqueue state UNKNOWN qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0: mtu 1500 qdisc mq state UP qlen 1000 link/ether fa:28:00:0d:3c:2f brd ff:ff:ff:ff:ff:ff inet 172.16.16.5/20 brd 172.16.31.255 scope global eth0 valid_lft forever preferred_lft forever inet6 fe80::f828:ff:fe0d:3c2f/64 scope link valid_lft forever preferred_lft forever 3: docker0: mtu 1500 qdisc noqueue state DOWN link/ether 02:42:f5:b6:cc:ca brd ff:ff:ff:ff:ff:ff inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0 valid_lft forever preferred_lft forever inet6 fe80::42:f5ff:feb6:ccca/64 scope link valid_lft forever preferred_lft forever 4: flannel.1: mtu 1450 qdisc noqueue state UNKNOWN link/ether b6:fa:84:04:82:55 brd ff:ff:ff:ff:ff:ff inet 10.244.1.0/32 brd 10.244.1.0 scope global flannel.1 valid_lft forever preferred_lft forever inet6 fe80::b4fa:84ff:fe04:8255/64 scope link valid_lft forever preferred_lft forever 5: cni0: mtu 1450 qdisc noqueue state UP qlen 1000 link/ether a2:34:ac:2b:00:a3 brd ff:ff:ff:ff:ff:ff inet 10.244.1.1/24 brd 10.244.1.255 scope global cni0 valid_lft forever preferred_lft forever inet6 fe80::a034:acff:fe2b:a3/64 scope link valid_lft forever preferred_lft forever 42: veth0@veth1: mtu 1500 qdisc noop state DOWN qlen 1000 link/ether 62:bb:3c:a3:ac:31 brd ff:ff:ff:ff:ff:ff 43: veth1@veth0: mtu 1500 qdisc noop state DOWN qlen 1000 link/ether 32:9d:76:89:b4:7a brd ff:ff:ff:ff:ff:ff 44: veth2@veth3: mtu 1500 qdisc noop state DOWN qlen 1000 link/ether 4e:11:eb:21:3a:16 brd ff:ff:ff:ff:ff:ff 45: veth3@veth2: mtu 1500 qdisc noop state DOWN qlen 1000 link/ether 1a:74:4a:dd:98:2d brd ff:ff:ff:ff:ff:ff 46: veth4@veth5: mtu 1500 qdisc noop state DOWN qlen 1000 link/ether 1e:f5:74:3f:ae:00 brd ff:ff:ff:ff:ff:ff 47: veth5@veth4: mtu 1500 qdisc noop state DOWN qlen 1000 link/ether 36:a6:e6:d8:49:53 brd ff:ff:ff:ff:ff:ff 48: veth8625ade0@docker0: mtu 1450 qdisc noqueue master cni0 state UP link/ether f6:57:1a:a5:65:f7 brd ff:ff:ff:ff:ff:ff inet6 fe80::f457:1aff:fea5:65f7/64 scope link valid_lft forever preferred_lft forever ``` ================================================ FILE: k8s/cni/3. 
docker容器网络的底层实现.md ================================================ * [1\. 背景](#1-背景) * [2\. 如何理解network namespaces](#2-如何理解network-namespaces) * [3\. 不同namespaces之间是如何通信的](#3-不同namespaces之间是如何通信的) * [3\.1 创建network namespace](#31-创建network-namespace) * [3\.2 两个networknamespaces之间的通信](#32-两个networknamespaces之间的通信) * [4\. 多个namespaces之间的通信](#4-多个namespaces之间的通信) * [4\.1 创建3个namespaces](#41-创建3个namespaces) * [4\.2 创建bridge](#42-创建bridge) * [4\.3 创建 veth pair](#43-创建-veth-pair) * [4\.4 将 veth pair 的一头挂到 namespace 中,一头挂到 bridge 上,并设 IP 地址](#44-将-veth-pair-的一头挂到-namespace-中一头挂到-bridge-上并设-ip-地址) * [4\.5 验证多Namespaces互通](#45-验证多namespaces互通) * [5\. 补充](#5-补充) * [5\.1 如何查看容器内和 宿主的 veth pair对](#51-如何查看容器内和-宿主的-veth-pair对) ### 1. 背景 上文提到的docker 4种网络模式,核心就是network namespaces的不同,比如可以共享宿主的network(hostnetwork)。容器网络模式的核心就是: 通过network namespaces隔离各个容器,然后通过设置veth pair, bridge等虚拟网络设备来实现容器网络与宿主机网络的通信,最终都是通过宿主机的物理网卡进行传输。 本节就来说明docker网络到底是如何实现的。
### 2. 如何理解network namespaces 摘抄自:容器实战高手课-极客时间 对于 Network Namespace,我们从字面上去理解的话,可以知道它是在一台 Linux 节点上对网络的隔离,不过它具体到底隔离了哪部分的网络资源呢? 我们还是先来看看操作手册,在Linux Programmer’s Manual里对 [Network Namespace](https://man7.org/linux/man-pages/man7/network_namespaces.7.html) 有一个段简短的描述,在里面就列出了最主要的几部分资源,它们都是通过 Network Namespace 隔离的。 我把这些资源给你做了一个梳理: * 第一种,网络设备,这里指的是 lo,eth0 等网络设备。你可以通过 ip link命令看到它们。 * 第二种是 IPv4 和 IPv6 协议栈。从这里我们可以知道,IP 层以及上面的 TCP 和 UDP 协议栈也是每个 Namespace 独立工作的。所以 IP、TCP、UDP 的很多协议,它们的相关参数也是每个 Namespace 独立的,这些参数大多数都在 /proc/sys/net/ 目录下面,同时也包括了 TCP 和 UDP 的 port 资源。 * 第三种,IP 路由表,这个资源也是比较好理解的,你可以在不同的 Network Namespace 运行 ip route 命令,就能看到不同的路由表了。 * 第四种是防火墙规则,其实这里说的就是 iptables 规则了,每个 Namespace 里都可以独立配置 iptables 规则。 * 最后一种是网络的状态信息,这些信息你可以从 /proc/net 和 /sys/class/net 里得到,这里的状态基本上包括了前面 4 种资源的的状态信息。
再结合前面笔记,关于协议栈部分的介绍。network namesapces 隔离了 网络设备,网络参数,协议栈配置等等信息。这样network namespaces里面的包发送自然会受到限制,从而达到了隔离的作用。 ### 3. 不同namespaces之间是如何通信的 namespaces的隔离很简单:一个新的namespace啥都没有,只有一个lo,只能访问自己 通信就需要网络设备了,这里介绍一下docker常用的veth pair对和 bridge。 #### 3.1 创建network namespace 参考:https://www.cnblogs.com/bakari/p/10443484.html (1)创建namespaces (2)每个 namespace 在创建的时候会自动创建一个回环接口 lo ,默认不启用,可以通过 ip link set lo up 启用 ``` // 1.创建namespaces root@k8s-node:~#ip netns add netns1 root@k8s-node:~# ip netns ls netns1 root@k8s-node:~# ip netns exec netns1 ip addr 1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 // 2.每个 namespace 在创建的时候会自动创建一个回环接口 lo ,默认不启用,可以通过 ip link set lo up 启用。 root@k8s-node:~# ip netns exec netns1 bash root@k8s-node:~# ping www.baidu.com ping: www.baidu.com: Temporary failure in name resolution root@k8s-node:~# ping 127.0.0.1 connect: Network is unreachable root@k8s-node:~# ip link set lo up root@k8s-node:~# ping 127.0.0.1 PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data. 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.021 ms 64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.016 ms 64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.018 ms 64 bytes from 127.0.0.1: icmp_seq=4 ttl=64 time=0.027 ms ^Z [1]+ Stopped ping 127.0.0.1 ``` #### 3.2 两个networknamespaces之间的通信 (1)再创建一个namespaces ``` root@k8s-node:~# ip netns add netns0 root@k8s-node:~# ip netns ls netns0 netns1 ``` (2)生成一堆veth pair ``` root@k8s-node:~# ip link add type veth root@k8s-node:~# ip link 42: veth0@veth1: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 62:bb:3c:a3:ac:31 brd ff:ff:ff:ff:ff:ff 43: veth1@veth0: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 32:9d:76:89:b4:7a brd ff:ff:ff:ff:ff:ff ``` (3)给 veth pair 配上 ip 地址 ``` // 进入netns0将veth0设备启动 root@k8s-node:~# ip netns exec netns0 ip link set veth0 up // 查看已经开启了,UP root@k8s-node:~# ip netns exec netns0 ip addr 1: lo: mtu 65536 qdisc noop state DOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 42: veth0@if43: mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default qlen 1000 link/ether 62:bb:3c:a3:ac:31 brd ff:ff:ff:ff:ff:ff link-netns netns1 // 进入netns1将veth1设备启动 root@k8s-node:~# ip netns exec netns1 ip link set veth1 up root@k8s-node:~# ip netns exec netns1 ip addr 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 43: veth1@if42: mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether 32:9d:76:89:b4:7a brd ff:ff:ff:ff:ff:ff link-netns netns0 inet6 fe80::309d:76ff:fe89:b47a/64 scope link valid_lft forever preferred_lft forever ``` (4) 给veth pair网卡配置Ip ``` veth0 对应 netns0,对应Ip段10.1.1.1/24 root@k8s-node:~# ip netns exec netns0 ip addr add 10.1.1.1/24 dev veth0 veth1 对应 netns1,对应Ip段10.1.1.2/24 root@k8s-node:~# ip netns exec netns1 ip addr add 10.1.1.2/24 dev veth1 netns0 ping veth1对应的ip段可通 root@k8s-node:~# ip netns exec netns0 ping 10.1.1.2 PING 10.1.1.2 (10.1.1.2) 56(84) bytes of data. 64 bytes from 10.1.1.2: icmp_seq=1 ttl=64 time=0.038 ms 64 bytes from 10.1.1.2: icmp_seq=2 ttl=64 time=0.022 ms 64 bytes from 10.1.1.2: icmp_seq=3 ttl=64 time=0.023 ms ^Z [1]+ Stopped ip netns exec netns0 ping 10.1.1.2 ``` ### 4. 
多个namespaces之间的通信 参考:https://www.cnblogs.com/bakari/p/10443484.html 2 个 namespace 之间通信可以借助 `veth pair` ,多个 namespace 之间的通信则可以使用 bridge 来转接,不然每两个 namespace 都去配 `veth pair` 将会是一件麻烦的事。下面就看看如何使用 bridge 来转接。 拓扑图如下: ![image-20220327163813484](../images/docker-net-1.png) #### 4.1 创建3个namespaces ``` root@k8s-node:~# ip netns add net0 root@k8s-node:~# ip netns add net1 root@k8s-node:~# ip netns add net2 ``` #### 4.2 创建bridge ``` root@k8s-node:~# ip link add br0 type bridge // 开启这个网络设备 root@k8s-node:~# ip link set dev br0 up root@k8s-node:~# ip link 57: br0: mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 link/ether 5e:3c:aa:99:dc:09 brd ff:ff:ff:ff:ff:ff ``` #### 4.3 创建 veth pair ``` //(1)创建 3 个 veth pair # ip link add type veth # ip link add type veth # ip link add type veth veth是递增的,所以这三对是 23, 45, 67这三对 root@k8s-node:~# ip link 44: veth2@veth3: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 4e:11:eb:21:3a:16 brd ff:ff:ff:ff:ff:ff 45: veth3@veth2: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 1a:74:4a:dd:98:2d brd ff:ff:ff:ff:ff:ff 46: veth4@veth5: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 1e:f5:74:3f:ae:00 brd ff:ff:ff:ff:ff:ff 47: veth5@veth4: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 36:a6:e6:d8:49:53 brd ff:ff:ff:ff:ff:ff 48: veth8625ade0@if3: mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default link/ether f6:57:1a:a5:65:f7 brd ff:ff:ff:ff:ff:ff link-netnsid 0 55: veth6@veth7: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 5a:37:5e:74:01:0f brd ff:ff:ff:ff:ff:ff 56: veth7@veth6: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether fe:9c:e3:75:6c:23 brd ff:ff:ff:ff:ff:ff ``` #### 4.4 将 veth pair 的一头挂到 namespace 中,一头挂到 bridge 上,并设 IP 地址 ``` // 配置第一个ns 和 bridge // 将veth2 挂到 net0 这个命名空间下 root@k8s-node:~# ip link set dev veth2 netns net0 // 将namespaces ip link看到的veth2 改名为eth0 root@k8s-node:~# ip netns exec net0 ip link set dev veth2 name eth0 // 设置ip 10.0.1.1/24 root@k8s-node:~# ip netns exec net0 ip addr add 10.0.1.1/24 dev eth0 // 开启网络设备eth0,其实就是veth2 root@k8s-node:~# ip netns exec net0 ip link set dev eth0 up // 将veth3 挂着bro网桥上 root@k8s-node:~# ip link set dev veth3 master br0 // 开启bridge的网络设备 veth3 root@k8s-node:~# ip link set dev veth3 up // 配置第 2 个 net1 # ip link set dev veth4 netns net1 # ip netns exec net1 ip link set dev veth4 name eth0 # ip netns exec net1 ip addr add 10.0.1.2/24 dev eth0 # ip netns exec net1 ip link set dev eth0 up # # ip link set dev veth5 master br0 # ip link set dev veth5 up // 配置第 3 个 net2 (这里我配错了一个,重新又生成了1对,所以是veth0,veth1) # ip link set dev veth0 netns net2 # ip netns exec net2 ip link set dev veth0 name eth0 # ip netns exec net2 ip addr add 10.0.1.2/24 dev eth0 # ip netns exec net2 ip link set dev eth0 up # # ip link set dev veth1 master br0 # ip link set dev veth1 up ``` #### 4.5 验证多Namespaces互通 这样之后,竟然通不了,经查阅 [参见](https://segmentfault.com/q/1010000010011053/a-1020000010025650) ,是因为 > 原因是因为系统为bridge开启了iptables功能,导致所有经过br0的数据包都要受iptables里面规则的限制,而docker为了安全性(我的系统安装了 docker),将iptables里面filter表的FORWARD链的默认策略设置成了drop,于是所有不符合docker规则的数据包都不会被forward,导致你这种情况ping不通。 > > 解决办法有两个,二选一: > > 1. 关闭系统bridge的iptables功能,这样数据包转发就不受iptables影响了:echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables > 2. 
为br0添加一条iptables规则,让经过br0的包能被forward:iptables -A FORWARD -i br0 -j ACCEPT
>
> 第一种方法不确定会不会影响docker,建议用第二种方法。

我采用第二种方法解决:`iptables -A FORWARD -i br0 -j ACCEPT`

```
root@k8s-node:~# ip netns exec net0 ping -c 2 10.0.1.2
PING 10.0.1.2 (10.0.1.2) 56(84) bytes of data.
^X
^Z
[2]+  Stopped                 ip netns exec net0 ping -c 2 10.0.1.2
root@k8s-node:~# iptables -A FORWARD -i br0 -j ACCEPT
root@k8s-node:~#
root@k8s-node:~# ip netns exec net0 ping -c 2 10.0.1.2
PING 10.0.1.2 (10.0.1.2) 56(84) bytes of data.
64 bytes from 10.0.1.2: icmp_seq=1 ttl=64 time=0.061 ms
64 bytes from 10.0.1.2: icmp_seq=2 ttl=64 time=0.036 ms

--- 10.0.1.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 12ms
rtt min/avg/max/mdev = 0.036/0.048/0.061/0.014 ms
```
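上面这些 bridge / veth 的操作都是用 ip 命令手工完成的。如果想在程序里做同样的事情(docker 和各类 cni 插件内部就是这么做的),可以借助 Go 的 netlink 库。下面是一个最小化的 Go 草图,仅作示意:假设使用 github.com/vishvananda/netlink 库、以 root 运行,设备名 br0/veth2/veth3 和 IP 都是示例值;把 veth 的一端挪进某个 namespace 还需要 LinkSetNsFd 之类的调用,这里省略。

```
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// 等价于: ip link add br0 type bridge && ip link set dev br0 up
	brAttrs := netlink.NewLinkAttrs()
	brAttrs.Name = "br0"
	br := &netlink.Bridge{LinkAttrs: brAttrs}
	if err := netlink.LinkAdd(br); err != nil {
		log.Fatalf("create bridge: %v", err)
	}
	if err := netlink.LinkSetUp(br); err != nil {
		log.Fatalf("bridge up: %v", err)
	}

	// 等价于: ip link add veth2 type veth peer name veth3
	vethAttrs := netlink.NewLinkAttrs()
	vethAttrs.Name = "veth2"
	if err := netlink.LinkAdd(&netlink.Veth{LinkAttrs: vethAttrs, PeerName: "veth3"}); err != nil {
		log.Fatalf("create veth pair: %v", err)
	}

	// 等价于: ip link set dev veth3 master br0 && ip link set dev veth3 up
	veth3, err := netlink.LinkByName("veth3")
	if err != nil {
		log.Fatalf("find veth3: %v", err)
	}
	if err := netlink.LinkSetMaster(veth3, br); err != nil {
		log.Fatalf("attach veth3 to br0: %v", err)
	}
	if err := netlink.LinkSetUp(veth3); err != nil {
		log.Fatalf("veth3 up: %v", err)
	}

	// 等价于: ip addr add 10.0.1.1/24 dev veth2 && ip link set dev veth2 up
	veth2, err := netlink.LinkByName("veth2")
	if err != nil {
		log.Fatalf("find veth2: %v", err)
	}
	addr, err := netlink.ParseAddr("10.0.1.1/24")
	if err != nil {
		log.Fatalf("parse addr: %v", err)
	}
	if err := netlink.AddrAdd(veth2, addr); err != nil {
		log.Fatalf("addr add: %v", err)
	}
	if err := netlink.LinkSetUp(veth2); err != nil {
		log.Fatalf("veth2 up: %v", err)
	}

	log.Println("br0 + veth2/veth3 配置完成")
}
```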
### 5. 补充 #### 5.1 如何查看容器内和 宿主的 veth pair对 可以参考:https://blog.csdn.net/u011563903/article/details/88593251 ================================================ FILE: k8s/cni/4.k8s pod通信原理介绍.md ================================================ * [1\. 目标](#1-目标) * [2\. 通信原理](#2-通信原理) * [2\.1 同一个Pod内部不同容器之间的通信](#21-同一个pod内部不同容器之间的通信) * [2\.2 同一个节点上不同pod之间的通信原理](#22-同一个节点上不同pod之间的通信原理) * [2\.3 不同节点之间的Pod通信原理](#23-不同节点之间的pod通信原理) * [2\.4 K8S集群内部访问服务的原理](#24-k8s集群内部访问服务的原理) * [2\.4\.1 clusterIp介绍](#241-clusterip介绍) * [2\.4\.2 clusterIp原理说明](#242-clusterip原理说明) * [2\.5 K8S集群外部部访问服务的原理](#25-k8s集群外部部访问服务的原理) * [2\.5\.1 LoadBalancer](#251-loadbalancer) * [2\.5\.2 NodePort](#252-nodeport) * [2\.6 ingress](#26-ingress) ### 1. 目标 了解以下情况下k8s集群内部网络的通信原理 (1)同一个Pod内部不同容器之间的通信原理 (2)同一个节点上不同pod之间的通信原理 (3)不同节点之间的Pod通信原理 (4)K8S集群内部访问服务的原理 (5)K8S集群外部部访问服务的原理
### 2. 通信原理

#### 2.1 同一个Pod内部不同容器之间的通信

Pod中的多个container共享一个网络栈。每个pod都有一个pause容器,也就是sandbox,所有业务容器都加入了这个namespace,所以它们可以直接通过localhost通信。

#### 2.2 同一个节点上不同pod之间的通信原理

这里可以先了解一下基础知识:

* 不同network namespace之间可以通过veth pair来互相通信
* 多个network namespace之间可以通过 bridge 来通信

可以参考:https://www.cnblogs.com/bakari/p/10443484.html

而这个bridge就是docker0。在pod的namespace中,pod的虚拟网络接口为veth0;在宿主机上,物理网络的网络接口为eth0。docker bridge作为veth0的默认网关,用于和宿主机网络的通信。

所有pod的veth0所能分配的IP是一个独立的IP地址范围,来自于创建cluster时kubeadm的--pod-network-cidr参数设定的CIDR,图中看起来是172.17.0.0/24,是一个B类局域网IP地址段;所有宿主机的网络接口eth0所能分配的IP是实际物理网络的设定,一般由实际物理网络中的路由器通过DHCP分配,图中看起来是10.100.0.0/24,是一个A类局域网IP地址段。

![image-20220319175922049](../images/cni-1.png)

#### 2.3 不同节点之间的Pod通信原理

上面2种其实是docker做的工作。在k8s层,我们往往还需要部署flannel或者calico来打通跨节点网络。

下图中docker0的名字被改成了cbr0,意思是custom bridge。由此,如果左侧的pod想访问右侧的pod,则IP包会通过bridge cbr0来到左侧宿主机的eth0,然后查询宿主机上新增的路由信息,继而将IP包送往右侧宿主机的eth0,再送往右侧的bridge cbr0,最后送往右侧的pod。

![image-20220319205533302](../images/cni-2.png)
以flannel为例子, flannel为pod分配Ip,并且设置路由。所以一个pod的请求达到docker后会被flannel接收,然后进行转发。 Flannel 是 CoreOS 团队针对 Kubernetes 设计的一个网络规划实现。简单来说,它的功能有以下几点: 1、使集群中的不同 Node 主机创建的 Docker 容器都具有全集群唯一的虚拟 IP 地址; 2、建立一个覆盖网络(overlay network),这个覆盖网络会将数据包原封不动的传递到目标容器中。覆盖网络是建立在另一个网络之上并由其基础设施支持的虚拟网络。覆盖网络通过将一个分组封装在另一个分组内来将网络服务与底层基础设施分离。在将封装的数据包转发到端点后,将其解封装; 3、创建一个新的虚拟网卡 flannel0 接收 docker 网桥的数据,通过维护路由表,对接收到的数据进行封包和转发(VXLAN); 4、路由信息一般存放到 etcd 中:多个 Node 上的 Flanneld 依赖一个 etcd cluster 来做集中配置服务,etcd 保证了所有 Node 上 Flannel 所看到的配置是一致的。同时每个 Node 上的 Flannel 都可以监听 etcd 上的数据变化,实时感知集群中 Node 的变化; 5、Flannel 首先会在 Node 上创建一个名为 flannel0 的网桥(VXLAN 类型的设备),并且在每个 Node 上运行一个名为 Flanneld 的代理。每个 Node 上的 Flannel 代理会从 etcd 上为当前 Node 申请一个 CIDR 地址块用来给该 Node 上的 Pod 分配地址; 6、Flannel 致力于给 Kubernetes 集群中的 Node 提供一个三层网络,它并不控制 Node 中的容器是如何进行组网的,仅仅关心流量如何在 Node 之间流转。 ![flannel](../images/cni-3.png) #### 2.4 K8S集群内部访问服务的原理 正式业务下,Pod可能会被重启或者因为其他原因重建,而且一个服务有很多pod如何服务,怎么确定是那个服务? 这个时候就有了service的概念。在集群内部访问一般是 headless or clusterIp ##### 2.4.1 clusterIp介绍 以clusterIp为例, 创建完之后查看svc就有一个 CLUSTER-IP。 这样在集群内的容器或节点上都能够访问Service ``` apiVersion: v1 kind: Service metadata: labels: app: nginx name: nginx-clusterip spec: ports: - name: service0 port: 8080 # 访问Service的端口 protocol: TCP # 访问Service的协议,支持TCP和UDP targetPort: 80 # Service访问目标容器的端口,此端口与容器中运行的应用强相关,如本例中nginx镜像默认使用80端口 selector: # 标签选择器,Service通过标签选择Pod,将访问Service的流量转发给Pod,此处选择带有 app:nginx 标签的Pod app: nginx type: ClusterIP # Service的类型,ClusterIP表示在集群内访问 # kubectl get svc NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE nginx-clusterip ClusterIP 10.247.74.52 8080/TCP 14m ```
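为了直观感受"集群内通过 Service 域名/ClusterIP 访问"的效果,下面给一段很短的 Go 草图(假设上面的 nginx-clusterip 创建在 default 命名空间,且这段程序跑在集群内的某个 Pod 里,域名和端口按实际情况调整):

```
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	// 通过 service 的 DNS 名 + port(8080)访问,
	// 流量会被转发到带 app:nginx 标签的 Pod 的 80 端口
	resp, err := http.Get("http://nginx-clusterip.default.svc.cluster.local:8080")
	if err != nil {
		log.Fatalf("request failed: %v", err)
	}
	defer resp.Body.Close()

	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Println("status:", resp.Status)
	fmt.Println("body bytes:", len(body))
}
```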
headless svc是一种特殊的clusterIp类型,它在定义的时候指定了 clusterIP=None。例如:

```
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: nginx-web
  # clusterIP 设置为 None
  clusterIP: None
  selector:
    app: nginx
```

`Headless Service`其实就是没头的`Service`。使用场景如下:

- client感知到svc的所有endpoint,通过查询dns自主选择访问哪个后端
- `Headless Service`对应的每一个`Endpoint`,即每一个`Pod`,都会有对应的`DNS`域名,这样`Pod`之间就可以互相访问。StatefulSet就是使用了headless service

##### 2.4.2 clusterIp原理说明

到了svc这层就需要额外的组件来处理了,社区常见的就是 kube-proxy。这里只是简单说一下原理。

kube-proxy在每个节点上监听svc, ep, pod资源的变化,然后通过iptables规则来控制访问svc的时候,具体访问哪个pod。

iptables可以实现负载均衡:比如通过**--probability** 设置概率来保证负载均衡。

可以参考:https://blog.csdn.net/ksj367043706/article/details/89764546

比如:clusterIp=10.247.74.52 的svc有两个Pod,kube-proxy会在每个节点上设置iptables规则:访问10.247.74.52的流量,先以50%的概率跳转到podA;没有命中这条规则的,再以100%的概率跳转到podB。
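顺带可以用 DNS 验证一下 2.4.1 里普通 ClusterIP 和 headless service 的差别:对 headless service 做域名解析,拿到的不是一个虚拟 IP,而是所有后端 Pod 的 IP。下面是一段 Go 草图(假设上面名为 nginx 的 headless service 建在 default 命名空间,程序在集群内运行):

```
package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// 普通 ClusterIP service 这里只会解析出一个虚拟 IP;
	// headless service 则会返回所有后端 Pod 的 IP,client 可自行挑选访问哪个。
	ips, err := net.LookupHost("nginx.default.svc.cluster.local")
	if err != nil {
		log.Fatalf("lookup failed: %v", err)
	}
	for _, ip := range ips {
		fmt.Println("endpoint:", ip)
	}
}
```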
#### 2.5 K8S集群外部部访问服务的原理 ##### 2.5.1 LoadBalancer 负载均衡( LoadBalancer )可以通过弹性负载均衡从公网访问到工作负载,与弹性IP方式相比提供了高可靠的保障,一般用于系统中需要暴露到公网的服务。 到这里为止其实和k8s的关系不是很大了,一般各个网络服务提供了LoadBalancer,在定义yaml制定subnet-id,vpc等信息申请LoadBalancer-ip,然后对应的节点的网络配置(路由转发等)通过直接访问这个ip即可(后面负载均衡啥的,网络服务部门已经干了) ``` apiVersion: v1 kind: Service metadata: annotations: kubernetes.io/elb.pass-through: "true" kubernetes.io/elb.class: union kubernetes.io/session-affinity-mode: SOURCE_IP kubernetes.io/elb.subnet-id: a9cf6d24-ad43-4f75-94d1-4e0e0464afac kubernetes.io/elb.autocreate: '{"type":"public","bandwidth_name":"cce-bandwidth","bandwidth_chargemode":"bandwidth","bandwidth_size":5,"bandwidth_sharetype":"PER","eip_type":"5_bgp","name":"james"}' labels: app: nginx name: nginx spec: externalTrafficPolicy: Local ports: - name: service0 port: 80 protocol: TCP targetPort: 80 selector: app: nginx type: LoadBalancer ```
##### 2.5.2 NodePort

节点访问 ( NodePort )是指在每个节点的IP上开放一个静态端口,通过静态端口对外暴露服务。节点访问 ( NodePort )会路由到ClusterIP服务,这个ClusterIP服务会自动创建。通过请求 `<节点IP>:<NodePort>`,可以从集群的外部访问一个NodePort服务。

部署yaml如下所示:

```
apiVersion: v1
kind: Service
metadata:
  labels:
    app: nginx
  name: nginx-nodeport
spec:
  ports:
  - name: service
    nodePort: 30000   # 节点端口,取值范围为30000-32767
    port: 8080        # 访问Service的端口
    protocol: TCP     # 访问Service的协议,支持TCP和UDP
    targetPort: 80    # Service访问目标容器的端口,此端口与容器中运行的应用强相关,如本例中nginx镜像默认使用80端口
  selector:           # 标签选择器,Service通过标签选择Pod,将访问Service的流量转发给Pod,此处选择带有 app:nginx 标签的Pod
    app: nginx
  type: NodePort      # Service的类型,NodePort表示通过节点端口访问
```

NodePort的核心实现也非常简单:外部客户端访问节点上的某个port,iptables再把请求转发给对应的Service后端,这样就把外部访问变成了集群内部访问。

### 2.6 ingress

kubernetes提供了Ingress资源对象,Ingress只需要一个NodePort或者一个LB就可以满足暴露多个Service的需求,并且可以做到7层负载均衡。这里只是简单提一下,目前还没有研究ingress,后面再补充。

================================================
FILE: k8s/cni/5. k8s 容器网络接口介绍.md
================================================

* [1\. 背景](#1-背景)
* [2\. Kubelet cni介绍](#2-kubelet-cni介绍)
  * [2\.1 kubelet cni相关启动参数介绍](#21-kubelet-cni相关启动参数介绍)
  * [2\.2 kubelet 调用cni分配ip的流程](#22-kubelet-调用cni分配ip的流程)
    * [2\.2\.1 核心interface](#221-核心interface)
    * [2\.2\.2 kubelet初始化cni](#222-kubelet初始化cni)
    * [2\.2\.3 kubelet 分配ip](#223-kubelet-分配ip)
      * [plugin\.addToNetwork](#pluginaddtonetwork)
      * [cniNet\.AddNetworkList](#cninetaddnetworklist)
      * [ExecPluginWithResult](#execpluginwithresult)
* [3\. 总结](#3-总结)

### 1. 背景

在之前kubelet创建pod流程的分析过程中,kubelet 创建Pod 的第一步,就是创建并启动一个 Infra 容器,用来"hold"住这个 Pod 的 Network Namespace。

kubelet直接调用了SetUpPod这个函数来设置网络。这背后其实是做了很多工作的,这些工作被kubelet抽象成了一个cni接口。

```
err = ds.network.SetUpPod(config.GetMetadata().Namespace, config.GetMetadata().Name, cID, config.Annotations, networkOptions)
```

这样设计的目的就是可插拔,不同的厂商或者使用者,只要实现了cni接口就可以使用自定义的网络模式。

本文就是梳理一下,kubelet中cni是如何定义的,pod创建和删除过程中cni是如何工作的,为后面的自定义cni打一个基础。
### 2. Kubelet cni介绍 #### 2.1 kubelet cni相关启动参数介绍 (1)network-plugin 指定要使用的网络插件类型,可选值cni、kubenet、""。默认为空串,代表Noop,即不配置网络插件(不构建pod网络) **kubenet**: Kubenet 是一个非常基本的、简单的网络插件,仅适用于 Linux。 它本身并不实现更高级的功能,如跨节点网络或网络策略。 它通常与云驱动一起使用,云驱动为节点间或单节点环境中的通信设置路由规则。 Kubenet 创建名为 `cbr0` 的网桥,并为每个 pod 创建了一个 veth 对, 每个 Pod 的主机端都连接到 `cbr0`。 这个 veth 对的 Pod 端会被分配一个 IP 地址,该 IP 地址隶属于节点所被分配的 IP 地址范围内。节点的 IP 地址范围则通过配置或控制器管理器来设置。 `cbr0` 被分配一个 MTU,该 MTU 匹配主机上已启用的正常接口的最小 MTU。 **cni**:通过给 Kubelet 传递 `--network-plugin=cni` 命令行选项可以选择 CNI 插件。 Kubelet 从 `--cni-conf-dir` (默认是 `/etc/cni/net.d`) 读取文件并使用 该文件中的 CNI 配置来设置各个 Pod 的网络。 CNI 配置文件必须与 [CNI 规约](https://github.com/containernetworking/cni/blob/master/SPEC.md#network-configuration) 匹配,并且配置所引用的所有所需的 CNI 插件都应存在于 `--cni-bin-dir`(默认是 `/opt/cni/bin`)下。 如果这个目录中有多个 CNI 配置文件,kubelet 将会使用按文件名的字典顺序排列 的第一个作为配置文件。 除了配置文件指定的 CNI 插件外,Kubernetes 还需要标准的 CNI [`lo`](https://github.com/containernetworking/plugins/blob/master/plugins/main/loopback/loopback.go) 插件,最低版本是0.2.0。 这部分更多信息详见:https://kubernetes.io/zh/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/ (2)--cni-conf-dir:CNI 配置文件所在路径。默认值:/etc/cni/net.d。 (和第一个参数配合使用) (3)--cni-bin-dir:CNI 插件的可执行文件所在路径,kubelet 将在此路径中查找 CNI 插件的可执行文件来执行pod的网络操作。默认值:/opt/cni/bin (和第一个参数配合使用) #### 2.2 kubelet 调用cni分配ip的流程 ##### 2.2.1 核心interface 这里从ds.network.SetUpPod函数开始,SetUpPod主要调用了pm.plugin.SetUpPod。 ``` func (pm *PluginManager) SetUpPod(podNamespace, podName string, id kubecontainer.ContainerID, annotations, options map[string]string) error { defer recordOperation("set_up_pod", time.Now()) fullPodName := kubecontainer.BuildPodFullName(podName, podNamespace) pm.podLock(fullPodName).Lock() defer pm.podUnlock(fullPodName) klog.V(3).Infof("Calling network plugin %s to set up pod %q", pm.plugin.Name(), fullPodName) if err := pm.plugin.SetUpPod(podNamespace, podName, id, annotations, options); err != nil { return fmt.Errorf("networkPlugin %s failed to set up pod %q network: %v", pm.plugin.Name(), fullPodName, err) } return nil } ``` NetworkPlugin interface声明了kubelet网络插件的一些操作方法,不同类型的网络插件只需要实现这些方法即可,其中最关键的就是SetUpPod与TearDownPod方法,作用分别是构建pod网络与销毁pod网络。 ``` // NetworkPlugin is an interface to network plugins for the kubelet type NetworkPlugin interface { // Init initializes the plugin. This will be called exactly once // before any other methods are called. Init(host Host, hairpinMode kubeletconfig.HairpinMode, nonMasqueradeCIDR string, mtu int) error // Called on various events like: // NET_PLUGIN_EVENT_POD_CIDR_CHANGE Event(name string, details map[string]interface{}) // Name returns the plugin's name. This will be used when searching // for a plugin by name, e.g. Name() string // Returns a set of NET_PLUGIN_CAPABILITY_* Capabilities() utilsets.Int // SetUpPod is the method called after the infra container of // the pod has been created but before the other containers of the // pod are launched. 
SetUpPod(namespace string, name string, podSandboxID kubecontainer.ContainerID, annotations, options map[string]string) error // TearDownPod is the method called before a pod's infra container will be deleted TearDownPod(namespace string, name string, podSandboxID kubecontainer.ContainerID) error // GetPodNetworkStatus is the method called to obtain the ipv4 or ipv6 addresses of the container GetPodNetworkStatus(namespace string, name string, podSandboxID kubecontainer.ContainerID) (*PodNetworkStatus, error) // Status returns error if the network plugin is in error state Status() error } ``` 这里我们针对cni进行分析,cniNetworkPlugin struct实现了NetworkPlugin interface,实现了SetUpPod与TearDownPod等方法。 // pkg/kubelet/dockershim/network/cni/cni.go type cniNetworkPlugin struct { network.NoopNetworkPlugin loNetwork *cniNetwork sync.RWMutex defaultNetwork *cniNetwork host network.Host execer utilexec.Interface nsenterPath string confDir string binDirs []string cacheDir string podCidr string } ##### 2.2.2 kubelet初始化cni 这里直接写调用链如下: main (cmd/kubelet/kubelet.go) -> NewKubeletCommand (cmd/kubelet/app/server.go) -> Run (cmd/kubelet/app/server.go) -> run (cmd/kubelet/app/server.go) -> RunKubelet (cmd/kubelet/app/server.go) -> CreateAndInitKubelet(cmd/kubelet/app/server.go) -> kubelet.NewMainKubelet(pkg/kubelet/kubelet.go) -> cni.ProbeNetworkPlugins & network.InitNetworkPlugin(pkg/kubelet/network/plugins.go)
在cri的时候,如果是docker的话,调用dockershim.NewDockerService函数进行初始化: ``` switch containerRuntime { case kubetypes.DockerContainerRuntime: // Create and start the CRI shim running as a grpc server. streamingConfig := getStreamingConfig(kubeCfg, kubeDeps, crOptions) ds, err := dockershim.NewDockerService(kubeDeps.DockerClientConfig, crOptions.PodSandboxImage, streamingConfig, &pluginSettings, runtimeCgroups, kubeCfg.CgroupDriver, crOptions.DockershimRootDirectory, !crOptions.RedirectContainerStreaming, crOptions.NoJsonLogPath) if err != nil { return nil, err } if crOptions.RedirectContainerStreaming { klet.criHandler = ds } ```
这里只关心cni相关的函数: 1. 调用cni.ProbeNetworkPlugins,ProbeNetworkPlugins函数就是根据confDir,binDirs等配置,实例化一个cniNetworkPlugin结构体 2. 调用InitNetworkPlugin初始化,调用的是 cniNetWorkPlugin.Init。 Init函数逻辑为: * 调用platformInit执行nsenter命令,看是否可以进入ns * 启动一个goroutine,每隔5秒,调用一次plugin.syncNetworkConfig,作用就是根据kubelet启动参数配置,去对应的cni conf文件夹下寻找cni配置文件,返回包含cni信息的cniNetwork结构体,赋值给cniNetworkPlugin结构体的defaultNetwork属性,从而达到cni conf以及bin更新后,kubelet也能感知并更新cniNetworkPlugin结构体的效果 3. 将上面步骤中获取到的cniNetworkPlugin结构体,赋值给dockerService struct的network属性,待后续创建pod、删除pod时可以调用cniNetworkPlugin的SetUpPod、TearDownPod方法来构建pod的网络、销毁pod的网络 ``` // NewDockerService creates a new `DockerService` struct. // NOTE: Anything passed to DockerService should be eventually handled in another way when we switch to running the shim as a different process. func NewDockerService(config *ClientConfig, podSandboxImage string, streamingConfig *streaming.Config, pluginSettings *NetworkPluginSettings, cgroupsName string, kubeCgroupDriver string, dockershimRootDir string, startLocalStreamingServer bool, noJsonLogPath string) (DockerService, error) { client := NewDockerClientFromConfig(config) c := libdocker.NewInstrumentedInterface(client) checkpointManager, err := checkpointmanager.NewCheckpointManager(filepath.Join(dockershimRootDir, sandboxCheckpointDir)) if err != nil { return nil, err } ds := &dockerService{ client: c, os: kubecontainer.RealOS{}, podSandboxImage: podSandboxImage, streamingRuntime: &streamingRuntime{ client: client, execHandler: &NativeExecHandler{}, }, containerManager: cm.NewContainerManager(cgroupsName, client), checkpointManager: checkpointManager, startLocalStreamingServer: startLocalStreamingServer, networkReady: make(map[string]bool), containerCleanupInfos: make(map[string]*containerCleanupInfo), noJsonLogPath: noJsonLogPath, } // check docker version compatibility. if err = ds.checkVersionCompatibility(); err != nil { return nil, err } // create streaming server if configured. if streamingConfig != nil { var err error ds.streamingServer, err = streaming.NewServer(*streamingConfig, ds.streamingRuntime) if err != nil { return nil, err } } // Determine the hairpin mode. if err := effectiveHairpinMode(pluginSettings); err != nil { // This is a non-recoverable error. Returning it up the callstack will just // lead to retries of the same failure, so just fail hard. return nil, err } klog.Infof("Hairpin mode set to %q", pluginSettings.HairpinMode) // 1.调用cni.ProbeNetworkPlugins,函数就是根据confDir,binDirs等配置,实例化一个cniNetworkPlugin结构体 // dockershim currently only supports CNI plugins. 
pluginSettings.PluginBinDirs = cni.SplitDirs(pluginSettings.PluginBinDirString) cniPlugins := cni.ProbeNetworkPlugins(pluginSettings.PluginConfDir, pluginSettings.PluginCacheDir, pluginSettings.PluginBinDirs) cniPlugins = append(cniPlugins, kubenet.NewPlugin(pluginSettings.PluginBinDirs, pluginSettings.PluginCacheDir)) netHost := &dockerNetworkHost{ &namespaceGetter{ds}, &portMappingGetter{ds}, } // 2.调用InitNetworkPlugin初始化,调用的是 cniNetWorkPlugin.Init plug, err := network.InitNetworkPlugin(cniPlugins, pluginSettings.PluginName, netHost, pluginSettings.HairpinMode, pluginSettings.NonMasqueradeCIDR, pluginSettings.MTU) if err != nil { return nil, fmt.Errorf("didn't find compatible CNI plugin with given settings %+v: %v", pluginSettings, err) } // 3.将上面步骤中获取到的cniNetworkPlugin结构体,赋值给dockerService struct的network属性,待后续创建pod、删除pod时可以调用cniNetworkPlugin的SetUpPod、TearDownPod方法来构建pod的网络、销毁pod的网络 ds.network = network.NewPluginManager(plug) klog.Infof("Docker cri networking managed by %v", plug.Name()) // NOTE: cgroup driver is only detectable in docker 1.11+ cgroupDriver := defaultCgroupDriver dockerInfo, err := ds.client.Info() klog.Infof("Docker Info: %+v", dockerInfo) if err != nil { klog.Errorf("Failed to execute Info() call to the Docker client: %v", err) klog.Warningf("Falling back to use the default driver: %q", cgroupDriver) } else if len(dockerInfo.CgroupDriver) == 0 { klog.Warningf("No cgroup driver is set in Docker") klog.Warningf("Falling back to use the default driver: %q", cgroupDriver) } else { cgroupDriver = dockerInfo.CgroupDriver } if len(kubeCgroupDriver) != 0 && kubeCgroupDriver != cgroupDriver { return nil, fmt.Errorf("misconfiguration: kubelet cgroup driver: %q is different from docker cgroup driver: %q", kubeCgroupDriver, cgroupDriver) } klog.Infof("Setting cgroupDriver to %s", cgroupDriver) ds.cgroupDriver = cgroupDriver ds.versionCache = cache.NewObjectCache( func() (interface{}, error) { return ds.getDockerVersion() }, versionCacheTTL, ) // Register prometheus metrics. metrics.Register() return ds, nil } // ProbeNetworkPlugins函数就是根据confDir,binDirs等配置,实例化一个cniNetworkPlugin结构体 // ProbeNetworkPlugins : get the network plugin based on cni conf file and bin file func ProbeNetworkPlugins(confDir, cacheDir string, binDirs []string) []network.NetworkPlugin { old := binDirs binDirs = make([]string, 0, len(binDirs)) for _, dir := range old { if dir != "" { binDirs = append(binDirs, dir) } } plugin := &cniNetworkPlugin{ defaultNetwork: nil, loNetwork: getLoNetwork(binDirs), execer: utilexec.New(), confDir: confDir, binDirs: binDirs, cacheDir: cacheDir, } // sync NetworkConfig in best effort during probing. 
plugin.syncNetworkConfig() return []network.NetworkPlugin{plugin} } // 这里只是调用一下nsenter命令,看是否可以进入ns func (plugin *cniNetworkPlugin) platformInit() error { var err error plugin.nsenterPath, err = plugin.execer.LookPath("nsenter") if err != nil { return err } return nil } // 启动一个goroutine,每隔5秒,调用一次plugin.syncNetworkConfig,作用就是根据kubelet启动参数配置,去对应的cni conf文件夹下寻找cni配置文件,返回包含cni信息的cniNetwork结构体,赋值给cniNetworkPlugin结构体的defaultNetwork属性,从而达到cni conf以及bin更新后,kubelet也能感知并更新cniNetworkPlugin结构体的效果。 func (plugin *cniNetworkPlugin) Init(host network.Host, hairpinMode kubeletconfig.HairpinMode, nonMasqueradeCIDR string, mtu int) error { err := plugin.platformInit() if err != nil { return err } plugin.host = host plugin.syncNetworkConfig() // start a goroutine to sync network config from confDir periodically to detect network config updates in every 5 seconds go wait.Forever(plugin.syncNetworkConfig, defaultSyncConfigPeriod) return nil ```
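Init 里"每隔 5 秒同步一次 cni 配置"用的就是 k8s.io/apimachinery 里的 wait.Forever。下面用一小段独立的 Go 程序演示这个模式(目录路径是假设值,扫描逻辑也只是对 syncNetworkConfig 的粗略模拟):

```
package main

import (
	"log"
	"path/filepath"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	confDir := "/etc/cni/net.d" // 假设的 cni 配置目录

	syncNetworkConfig := func() {
		// 粗略模拟 syncNetworkConfig: 扫描配置目录,取字典序第一个配置文件
		files, err := filepath.Glob(filepath.Join(confDir, "*"))
		if err != nil || len(files) == 0 {
			log.Printf("no cni conf found in %s", confDir)
			return
		}
		log.Printf("using cni conf: %s", files[0])
	}

	// wait.Forever 会先执行一次 f,之后每隔 period 再执行一次,直到进程退出
	wait.Forever(syncNetworkConfig, 5*time.Second)
}
```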
##### 2.2.3 kubelet 分配ip 在kubelet创建pod的流程分析中,分配Pod ip的调用链路如下: -> klet.syncPod(pkg/kubelet/kubelet.go) -> kl.containerRuntime.SyncPod(pkg/kubelet/kubelet.go) -> m.createPodSandbox(pkg/kubelet/kuberuntime/kuberuntime_manager.go) -> m.runtimeService.RunPodSandbox (pkg/kubelet/kuberuntime/kuberuntime_sandbox.go) -> ds.network.SetUpPod(pkg/kubelet/dockershim/docker_sandbox.go) -> pm.plugin.SetUpPod(pkg/kubelet/dockershim/network/plugins.go) -> SetUpPod(pkg/kubelet/dockershim/network/cni/cni.go) 这里直接从 SetUpPod分析看看,Kubelet是如何调用cni的。 cniNetworkPlugin.SetUpPod方法作用cni网络插件构建pod网络的调用入口。其主要逻辑为: (1)调用plugin.checkInitialized():检查网络插件是否已经初始化完成,还有Podcidr 是否设置等。 (2)调用plugin.host.GetNetNS():获取容器网络命名空间路径,格式/proc/${容器PID}/ns/net; (3)调用context.WithTimeout():设置调用cni网络插件的超时时间; (4)调用plugin.addToNetwork():如果是linux环境,则调用cni网络插件,给pod构建回环网络; (5)调用plugin.addToNetwork():调用cni网络插件,给pod构建默认网络。 这里核心是plugin.addToNetwork函数,接着往下看 ``` // pkg/kubelet/dockershim/network/cni/cni.go func (plugin *cniNetworkPlugin) SetUpPod(namespace string, name string, id kubecontainer.ContainerID, annotations, options map[string]string) error { if err := plugin.checkInitialized(); err != nil { return err } netnsPath, err := plugin.host.GetNetNS(id.ID) if err != nil { return fmt.Errorf("CNI failed to retrieve network namespace path: %v", err) } // Todo get the timeout from parent ctx cniTimeoutCtx, cancelFunc := context.WithTimeout(context.Background(), network.CNITimeoutSec*time.Second) defer cancelFunc() // Windows doesn't have loNetwork. It comes only with Linux if plugin.loNetwork != nil { if _, err = plugin.addToNetwork(cniTimeoutCtx, plugin.loNetwork, name, namespace, id, netnsPath, annotations, options); err != nil { return err } } _, err = plugin.addToNetwork(cniTimeoutCtx, plugin.getDefaultNetwork(), name, namespace, id, netnsPath, annotations, options) return err } ```
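上面第 (2) 步拿到的 netnsPath 就是 /proc/<容器PID>/ns/net 这种形式。可以用下面这段 Go 小程序直观看一下(PID 是假设值,实际应换成 infra 容器的进程号;读 /proc/1/ns/net 一般需要 root 权限):

```
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	pid := 12345 // 假设的容器进程 PID

	// 容器进程的 network namespace
	containerNS, err := os.Readlink(fmt.Sprintf("/proc/%d/ns/net", pid))
	if err != nil {
		log.Fatalf("read container netns: %v", err)
	}
	// 宿主机(1 号进程)的 network namespace
	hostNS, err := os.Readlink("/proc/1/ns/net")
	if err != nil {
		log.Fatalf("read host netns: %v", err)
	}

	// 两者的 inode 不同,说明容器处在独立的 network namespace 里
	fmt.Println("container:", containerNS) // 形如 net:[4026532xxx]
	fmt.Println("host     :", hostNS)
}
```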
###### plugin.addToNetwork plugin.addToNetwork方法的作用就是调用cni网络插件,给pod构建指定类型的网络,其主要逻辑为: (1)调用plugin.buildCNIRuntimeConf():构建调用cni网络插件的配置,报告 podCIDR, dns capability配置等 (2)调用cniNet.AddNetworkList():调用cni网络插件,进行网络构建 这里核心是AddNetworkList函数,接着往下看 ``` func (plugin *cniNetworkPlugin) addToNetwork(ctx context.Context, network *cniNetwork, podName string, podNamespace string, podSandboxID kubecontainer.ContainerID, podNetnsPath string, annotations, options map[string]string) (cnitypes.Result, error) { rt, err := plugin.buildCNIRuntimeConf(podName, podNamespace, podSandboxID, podNetnsPath, annotations, options) if err != nil { klog.Errorf("Error adding network when building cni runtime conf: %v", err) return nil, err } pdesc := podDesc(podNamespace, podName, podSandboxID) netConf, cniNet := network.NetworkConfig, network.CNIConfig klog.V(4).Infof("Adding %s to network %s/%s netns %q", pdesc, netConf.Plugins[0].Network.Type, netConf.Name, podNetnsPath) res, err := cniNet.AddNetworkList(ctx, netConf, rt) if err != nil { klog.Errorf("Error adding %s to network %s/%s: %v", pdesc, netConf.Plugins[0].Network.Type, netConf.Name, err) return nil, err } klog.V(4).Infof("Added %s to network %s: %v", pdesc, netConf.Name, res) return res, nil } ``` ###### cniNet.AddNetworkList AddNetworkList方法中主要是调用了addNetwork方法,所以来看下addNetwork方法的逻辑: (1)调用c.exec.FindInPath():拼接出cni网络插件可执行文件的绝对路径; (2)调用buildOneConfig():构建配置; (3)调用c.args():构建调用cni网络插件的参数; (4)调用invoke.ExecPluginWithResult():调用cni网络插件进行pod网络的构建操作。 这里的核心就是ExecPluginWithResult,接着往下看 ``` // AddNetworkList executes a sequence of plugins with the ADD command func (c *CNIConfig) AddNetworkList(ctx context.Context, list *NetworkConfigList, rt *RuntimeConf) (types.Result, error) { var err error var result types.Result for _, net := range list.Plugins { result, err = c.addNetwork(ctx, list.Name, list.CNIVersion, net, result, rt) if err != nil { return nil, err } } if err = setCachedResult(result, list.Name, rt); err != nil { return nil, fmt.Errorf("failed to set network %q cached result: %v", list.Name, err) } return result, nil } func (c *CNIConfig) addNetwork(ctx context.Context, name, cniVersion string, net *NetworkConfig, prevResult types.Result, rt *RuntimeConf) (types.Result, error) { c.ensureExec() pluginPath, err := c.exec.FindInPath(net.Network.Type, c.Path) if err != nil { return nil, err } newConf, err := buildOneConfig(name, cniVersion, net, prevResult, rt) if err != nil { return nil, err } return invoke.ExecPluginWithResult(ctx, pluginPath, newConf.Bytes, c.args("ADD", rt), c.exec) } ```
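如果脱离 kubelet/dockershim,直接用 containernetworking 的 libcni 库也可以完成同样的调用。下面是一段 Go 草图,仅作示意:假设 /etc/cni/net.d/10-mynet.conflist、/opt/cni/bin 下的插件以及目标 netns 都已存在,容器 ID、Pod 名等都是示例值:

```
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/containernetworking/cni/libcni"
)

func main() {
	// 读取 cni 配置(这里假设是一个 conflist 文件)
	conf, err := libcni.ConfListFromFile("/etc/cni/net.d/10-mynet.conflist")
	if err != nil {
		log.Fatalf("load conf: %v", err)
	}

	// 第二个参数传 nil 表示使用默认的 exec(即直接执行插件二进制)
	cniConfig := libcni.NewCNIConfig([]string{"/opt/cni/bin"}, nil)

	rt := &libcni.RuntimeConf{
		ContainerID: "example-container-id",  // 假设的容器 ID
		NetNS:       "/var/run/netns/netns1", // 假设的网络命名空间路径
		IfName:      "eth0",
		Args: [][2]string{ // kubelet 就是通过这里把 pod 元数据传给插件的
			{"K8S_POD_NAMESPACE", "default"},
			{"K8S_POD_NAME", "nginx-demo"},
		},
	}

	result, err := cniConfig.AddNetworkList(context.Background(), conf, rt)
	if err != nil {
		log.Fatalf("AddNetworkList: %v", err)
	}
	fmt.Printf("%+v\n", result)
}
```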
###### ExecPluginWithResult invoke.ExecPluginWithResult主要是将调用参数变成env,然后调用cni网络插件可执行文件,并获取返回结果。 ``` func ExecPluginWithResult(ctx context.Context, pluginPath string, netconf []byte, args CNIArgs, exec Exec) (types.Result, error) { if exec == nil { exec = defaultExec } stdoutBytes, err := exec.ExecPlugin(ctx, pluginPath, netconf, args.AsEnv()) if err != nil { return nil, err } // Plugin must return result in same version as specified in netconf versionDecoder := &version.ConfigDecoder{} confVersion, err := versionDecoder.Decode(netconf) if err != nil { return nil, err } return version.NewResult(confVersion, stdoutBytes) } ```
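换句话说,ExecPlugin 最终做的事情大致等价于下面这段 Go 草图:把参数放进 CNI_* 环境变量、把网络配置写进 stdin,然后执行插件二进制并从 stdout 读回结果。这里以标准的 loopback 插件为例,二进制路径、容器 ID、netns 路径都是假设值:

```
package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
	"os/exec"
)

func main() {
	// 网络配置(/etc/cni/net.d 下文件的内容)通过 stdin 传给插件
	netconf := []byte(`{"cniVersion":"0.3.1","name":"lo","type":"loopback"}`)

	cmd := exec.Command("/opt/cni/bin/loopback") // 假设该插件二进制存在
	cmd.Env = append(os.Environ(),
		"CNI_COMMAND=ADD",
		"CNI_CONTAINERID=example-container-id", // 假设的容器 ID
		"CNI_NETNS=/var/run/netns/netns1",      // 假设的网络命名空间路径
		"CNI_IFNAME=lo",
		"CNI_PATH=/opt/cni/bin",
	)
	cmd.Stdin = bytes.NewReader(netconf)

	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		log.Fatalf("exec plugin failed: %v, stderr: %s", err, stderr.String())
	}

	// 插件把执行结果(IP、路由等)以 json 写到 stdout
	fmt.Println(stdout.String())
}
```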
c.args方法作用是构建调用cni网络插件可执行文件时的参数。 从代码中可以看出,参数有Command(命令,Add代表构建网络,Del代表销毁网络)、ContainerID(容器ID)、NetNS(容器网络命名空间路径)、IfName(Interface Name即网络接口名称)、PluginArgs(其他参数如pod名称、pod命名空间等)等。 ``` func (c *CNIConfig) args(action string, rt *RuntimeConf) *invoke.Args { return &invoke.Args{ Command: action, ContainerID: rt.ContainerID, NetNS: rt.NetNS, PluginArgs: rt.Args, IfName: rt.IfName, Path: strings.Join(c.Path, string(os.PathListSeparator)), } } ``` ### 3. 总结 总的来说,kubelet中的cni就是封装了接口。然后根据配置,调用cni的二进制生成网络。(包括podip, mac地址,mtu等等设置) ================================================ FILE: k8s/cni/6.如何订制自己的cni.md ================================================ * [1\. 背景](#1-背景) * [2\. conf如何配置](#2-conf如何配置) * [3\. cni插件如何实现](#3-cni插件如何实现) * [3\.1 摘抄部分](#31-摘抄部分) * [3\.2 原创部分](#32-原创部分) * [4\. 参考](#4-参考) ### 1. 背景 CNI的定义可以参照[官方文档](https://github.com/containernetworking/cni),这里不详细介绍。 CNI插件是由kubelet加载和运行,具体的目录和配置可以由参数`--network-plugin --cni-conf-dir --cni-bin-dir`指定。 参数必须是 --network-plugin = cni, --cni-bin-dir 里面放的是自定义cni的二进制文件。 --cni-conf-dir 是配置文件。 以calico为例: ``` # CNI和IPAM的二进制文件 # ls /opt/cni/bin/ calico calico-ipam loopback # CNI的配置文件 # ls /etc/cni/net.d/ 10-calico.conf calico-kubeconfig ```
可以看到关键在于2点: (1)conf如何配置 (2)二进制代码需要如何实现 ### 2. conf如何配置 一般来说,CNI 插件需要在集群的每个节点上运行,在 CNI 的规范里面,实现一个 CNI 插件首先需要一个 JSON 格式的配置文件,配置文件需要放到每个节点的 `/etc/cni/net.d/` 目录,一般命名为 `<数字>-.conf`,而且配置文件至少需要以下几个必须的字段: 1. `cniVersion`: CNI 插件的字符串版本号,要求符合 [Semantic Version 2.0 规范](https://semver.org/) 2. `name`: 字符串形式的网络名; 3. `type`: 字符串表示的 CNI 插件的可运行文件; 除此之外,我们也可以增加一些自定义的配置字段,用于传递参数给 CNI 插件,这些配置会在运行时传递给 CNI 插件。在我们的例子里面,需要配置每个宿主机网桥的设备名、网络设备的最大传输单元(MTU)以及每个节点分配的 24 位子网地址,因此,我们的 CNI 插件的配置看起来会像下面这样: ``` { "cniVersion": "0.1.0", "name": "minicni", "type": "minicni", "bridge": "minicni0", "mtu": 1500, "subnet": __NODE_SUBNET__ } ``` Note: 确保配置文件放到 `/etc/cni/net.d/` 目录,kubelet 默认此目录寻找 CNI 插件配置;并且,插件的配置可以分为多个插件链的形式来运行,但是为了简单起见,在我们的例子中,只配置一个独立的 CNI 插件,因为配置文件的后缀名为 `.conf`。 ### 3. cni插件如何实现 #### 3.1 摘抄部分 本节摘抄自:https://jishuin.proginn.com/p/763bfbd57bc0 接下来就开始看怎么实现 CNI 插件来管理 pod IP 地址以及配置容器网络设备。在此之前,我们需要明确的是,CNI 介入的时机是 kubelet 创建 pause 容器创建对应的网络命名空间之后,同时当 CNI 插件被调用的时候,kubelet 会将相关操作命令以及参数通过环境变量的形式传递给它。这些环境变量包括: 1. `CNI_COMMAND`: CNI 操作命令,包括 ADD, DEL, CHECK 以及 VERSION 2. `CNI_CONTAINERID`: 容器 ID 3. `CNI_NETNS`: pod 网络命名空间 4. `CNI_IFNAME`: pod 网络设备名称 5. `CNI_PATH`: CNI 插件可执行文件的搜索路径 6. `CNI_ARGS`: 可选的其他参数,形式类似于 `key1=value1,key2=value2...` 在运行时,kubelet 通过 CNI 配置文件寻找 CNI 可执行文件,然后基于上述几个环境变量来执行相关的操作。CNI 插件必须支持的操作包括: 1. ADD: 将 pod 加入到 pod 网络中 2. DEL: 将 pod 从 pod 网络中删除 3. CHECK: 检查 pod 网络配置正常 4. VERSION: 返回可选 CNI 插件的版本信息 ``` func main() { cmd, cmdArgs, err := args.GetArgsFromEnv() if err != nil { fmt.Fprintf(os.Stderr, "getting cmd arguments with error: %v", err) } fh := handler.NewFileHandler(IPStore) switch cmd { case "ADD": err = fh.HandleAdd(cmdArgs) case "DEL": err = fh.HandleDel(cmdArgs) case "CHECK": err = fh.HandleCheck(cmdArgs) case "VERSION": err = fh.HandleVersion(cmdArgs) default: err = fmt.Errorf("unknown CNI_COMMAND: %s", cmd) } if err != nil { fmt.Fprintf(os.Stderr, "Failed to handle CNI_COMMAND %q: %v", cmd, err) os.Exit(1) } } ``` 可以看到,我们首先调用 `GetArgsFromEnv()` 函数将 CNI 插件的操作命令以及相关参数通过环境变量读入,同时从标准输入获取 CNI 插件的 JSON 配置,然后基于不同的 CNI 操作命令执行不同的处理函数。 需要注意的是,我们将处理函数的集合实现为一个**接口**[12],这样就可以很容易的扩展不同的接口实现。在最基础的版本实现中,我们基本文件存储分配的 IP 信息。但是,这种实现方式存在很多问题,例如,文件存储不可靠,读写可能会发生冲突等,在后续的版本中,我们会实现基于 kubernetes 存储的接口实现,将子网信息以及 IP 信息存储到 apiserver 中,从而实现可靠存储。 接下来,我们就看看基于文件的接口实现是怎么处理这些 CNI 操作命令的。 对于 ADD 命令: 1. 从标准输入获取 CNI 插件的配置信息,最重要的是当前宿主机网桥的设备名、网络设备的最大传输单元(MTU)以及当前节点分配的 24 位子网地址; 2. 然后从环境变量中找到对应的 CNI 操作参数,包括 pod 容器网络命名空间以及 pod 网络设备名等; 3. 接下来创建或者更新节点宿主机网桥,从当前节点分配的 24 位子网地址中抽取子网的网关地址,准备分配给节点宿主机网桥; 4. 接着将从文件读取已经分配的 IP 地址列表,遍历 24 位子网地址并从中取出第一个没有被分配的 IP 地址信息,准备分配给 pod 网络设备;pod 网络设备是 veth 设备对,一端在 pod 网络命名空间中,另外一端连接着宿主机上的网桥设备,同时所有的 pod 网络设备将宿主机上的网桥设备当作默认网关; 5. 
最终成功后需要将新的 pod IP 写入到文件中 看起来很简单对吧?其实作为最简单的方式,这种方案可以实现最基础的 ADD 功能: ``` func (fh *FileHandler) HandleAdd(cmdArgs *args.CmdArgs) error { cniConfig := args.CNIConfiguration{} if err := json.Unmarshal(cmdArgs.StdinData, &cniConfig); err != nil { return err } allIPs, err := nettool.GetAllIPs(cniConfig.Subnet) if err != nil { return err } gwIP := allIPs[0] // open or create the file that stores all the reserved IPs f, err := os.OpenFile(fh.IPStore, os.O_RDWR|os.O_CREATE, 0600) if err != nil { return fmt.Errorf("failed to open file that stores reserved IPs %v", err) } defer f.Close() // get all the reserved IPs from file content, err := ioutil.ReadAll(f) if err != nil { return err } reservedIPs := strings.Split(strings.TrimSpace(string(content)), "\n") podIP := "" for _, ip := range allIPs[1:] { reserved := false for _, rip := range reservedIPs { if ip == rip { reserved = true break } } if !reserved { podIP = ip reservedIPs = append(reservedIPs, podIP) break } } if podIP == "" { return fmt.Errorf("no IP available") } // Create or update bridge brName := cniConfig.Bridge if brName != "" { // fall back to default bridge name: minicni0 brName = "minicni0" } mtu := cniConfig.MTU if mtu == 0 { // fall back to default MTU: 1500 mtu = 1500 } br, err := nettool.CreateOrUpdateBridge(brName, gwIP, mtu) if err != nil { return err } netns, err := ns.GetNS(cmdArgs.Netns) if err != nil { return err } if err := nettool.SetupVeth(netns, br, cmdArgs.IfName, podIP, gwIP, mtu); err != nil { return err } // write reserved IPs back into file if err := ioutil.WriteFile(fh.IPStore, []byte(strings.Join(reservedIPs, "\n")), 0600); err != nil { return fmt.Errorf("failed to write reserved IPs into file: %v", err) } return nil ``` 一个关键的问题是如何选择合适的 Go 语言库函数来操作 Linux 网络设备,如创建网桥设备、网络命名空间以及连接 veth 设备对。在我们的例子中,选择了比较成熟的 **netlink**[13],实际上,所有基于 iproute2 工具包的命令在 netlink 库中都有对应的 API,例如 `ip link add` 可以通过调用 `AddLink()` 函数来实现。 还有一个问题需要格外小心,那就是处理网络命名空间切换、Go 协程与线程调度问题。在 Linux 中,不同的操作系统线程可能会设置不同的网络命名空间,而 Go 语言的协程会基于操作系统线程的负载以及其他信息动态地在不同的操作系统线程之间切换,这样可能会导致 Go 协程在意想不到的情况下切换到不同的网络命名空间中。 比较稳妥的做法是,利用 Go 语言提供的 `runtime.LockOSThread()` 函数保证特定的 Go 协程绑定到当前的操作系统线程中。 对于 ADD 操作的返回,确保操作成功之后向标准输出中写入 ADD 操作的返回信息: ``` addCmdResult := &AddCmdResult{ CniVersion: cniConfig.CniVersion, IPs: &nettool.AllocatedIP{ Version: "IPv4", Address: podIP, Gateway: gwIP, }, } addCmdResultBytes, err := json.Marshal(addCmdResult) if err != nil { return err } // kubelet expects json format from stdout if success fmt.Print(string(addCmdResultBytes)) return nil ``` 其他三个 CNI 操作命令的处理就更简单了。DEL 操作只需要回收分配的 IP 地址,从文件中删除对应的条目,我们不需要处理 pod 网络设备的删除,原因是 kubelet 在删除 pod 网络命名空间之后这些 pod 网络设备也会自动被删除;CHECK 命令检查之前创建的网络设备与配置,暂时是可选的;VERSION 命令以 JSON 形式输出 CNI 版本信息到标准输出。 ``` func (fh *FileHandler) HandleDel(cmdArgs *args.CmdArgs) error { netns, err := ns.GetNS(cmdArgs.Netns) if err != nil { return err } ip, err := nettool.GetVethIPInNS(netns, cmdArgs.IfName) if err != nil { return err } // open or create the file that stores all the reserved IPs f, err := os.OpenFile(fh.IPStore, os.O_RDWR|os.O_CREATE, 0600) if err != nil { return fmt.Errorf("failed to open file that stores reserved IPs %v", err) } defer f.Close() // get all the reserved IPs from file content, err := ioutil.ReadAll(f) if err != nil { return err } reservedIPs := strings.Split(strings.TrimSpace(string(content)), "\n") for i, rip := range reservedIPs { if rip == ip { reservedIPs = append(reservedIPs[:i], reservedIPs[i+1:]...) 
			break
		}
	}

	// write reserved IPs back into file
	if err := ioutil.WriteFile(fh.IPStore, []byte(strings.Join(reservedIPs, "\n")), 0600); err != nil {
		return fmt.Errorf("failed to write reserved IPs into file: %v", err)
	}
	return nil
}

func (fh *FileHandler) HandleCheck(cmdArgs *args.CmdArgs) error {
	// to be implemented
	return nil
}

func (fh *FileHandler) HandleVersion(cmdArgs *args.CmdArgs) error {
	versionInfo, err := json.Marshal(fh.VersionInfo)
	if err != nil {
		return err
	}
	fmt.Print(string(versionInfo))
	return nil
}
```

#### 3.2 原创部分

kubelet会将`pod_namespace pod_name infra_container_id`连同CNI的配置一起作为参数传递给CNI插件,CNI插件需要完成对`infra container`的网络配置和IP分配,并将结果通过标准输出返回给kubelet。

而在CNI的二进制中,实际上只需要实现两个方法。

![image-20220401155049438](../images/cni-0401-1.png)

Cni可以获取到Pod的元数据,我们可以在pod Annotation里面携带vpc信息,实现定制化操作。
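结合上面两点,可以写一个极简的 Go 入口草图,看看自定义 CNI 是如何拿到 kubelet 传来的参数和 Pod 元数据的。仅作示意:K8S_POD_NAMESPACE / K8S_POD_NAME 是 kubelet 通过 CNI_ARGS 传入的约定键,最后输出的结果 json 只示意结构,实际字段以 CNI 规范为准:

```
package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"os"
	"strings"
)

func main() {
	cmd := os.Getenv("CNI_COMMAND") // ADD / DEL / CHECK / VERSION
	netns := os.Getenv("CNI_NETNS")
	ifname := os.Getenv("CNI_IFNAME")

	// CNI_ARGS 形如 "IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=nginx-xxx;..."
	podMeta := map[string]string{}
	for _, kv := range strings.Split(os.Getenv("CNI_ARGS"), ";") {
		if parts := strings.SplitN(kv, "=", 2); len(parts) == 2 {
			podMeta[parts[0]] = parts[1]
		}
	}

	// CNI 的 JSON 配置(/etc/cni/net.d/ 下的内容)从 stdin 读入
	raw, err := ioutil.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintf(os.Stderr, "read stdin: %v\n", err)
		os.Exit(1)
	}
	conf := map[string]interface{}{}
	_ = json.Unmarshal(raw, &conf) // 示意用,忽略解析错误

	// 拿到这些信息后,就可以按第 2、3 节的思路去建网桥、veth、分配 IP 了(此处省略)
	fmt.Fprintf(os.Stderr, "cmd=%s netns=%s ifname=%s pod=%s/%s conf-name=%v\n",
		cmd, netns, ifname, podMeta["K8S_POD_NAMESPACE"], podMeta["K8S_POD_NAME"], conf["name"])

	// ADD 成功时必须往 stdout 写符合规范的 json 结果,这里只示意结构
	if cmd == "ADD" {
		fmt.Print(`{"cniVersion":"0.3.1","ips":[{"version":"4","address":"10.244.1.10/24"}]}`)
	}
}
```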
其实cni的核心就是根据 kubelet传入的参数,初始化网络环境。其实这样知道了原理,很容易实现一个自定义的cni。 可以看看这个repo,直接通过shell脚本就实现了一个cni: https://github.com/eranyanay/cni-from-scratch/ ### 4. 参考 https://jishuin.proginn.com/p/763bfbd57bc0 https://github.com/containernetworking/cni/blob/main/SPEC.md https://github.com/eranyanay/cni-from-scratch/ ================================================ FILE: k8s/cni/7. flannel原理浅析分析.md ================================================ ### 1. 原理简介 Flannel 是 CoreOS 团队针对 Kubernetes 设计的一个网络规划实现。简单来说,它的功能有以下几点: 1、使集群中的不同 Node 主机创建的 Docker 容器都具有全集群唯一的虚拟 IP 地址; 2、建立一个覆盖网络(overlay network),这个覆盖网络会将数据包原封不动的传递到目标容器中。覆盖网络是建立在另一个网络之上并由其基础设施支持的虚拟网络。覆盖网络通过将一个分组封装在另一个分组内来将网络服务与底层基础设施分离。在将封装的数据包转发到端点后,将其解封装; 3、创建一个新的虚拟网卡 flannel0 接收 docker 网桥的数据,通过维护路由表,对接收到的数据进行封包和转发(VXLAN); 4、路由信息一般存放到 etcd 中:多个 Node 上的 Flanneld 依赖一个 etcd cluster 来做集中配置服务,etcd 保证了所有 Node 上 Flannel 所看到的配置是一致的。同时每个 Node 上的 Flannel 都可以监听 etcd 上的数据变化,实时感知集群中 Node 的变化; 5、Flannel 首先会在 Node 上创建一个名为 flannel0 的网桥(VXLAN 类型的设备),并且在每个 Node 上运行一个名为 Flanneld 的代理。每个 Node 上的 Flannel 代理会从 etcd 上为当前 Node 申请一个 CIDR 地址块用来给该 Node 上的 Pod 分配地址; 6、Flannel 致力于给 Kubernetes 集群中的 Node 提供一个三层网络,它并不控制 Node 中的容器是如何进行组网的,仅仅关心流量如何在 Node 之间流转。 ![flannel](../images/cni-3.png) ### 2. 源码分析 待补充 ================================================ FILE: k8s/cni/8. calico原理浅析md.md ================================================ 他写的太好了,可以参考:[https://www.cnblogs.com/goldsunshine/p/10701242.html](https://links.jianshu.com/go?to=https%3A%2F%2Fwww.cnblogs.com%2Fgoldsunshine%2Fp%2F10701242.html) calico有两种模式:ipip(默认)、bgp。bgp效率相对更高 * 如果宿主机在同一个网段,可以使用ipip模式; * 如果宿主机不在同一个网段,pod通过BGP的hostGW是不可能互相通讯的,此时需要使用ipip模式(如果仍想使用bgp模式,除非你在中间路由器上手动添加路由) flannel 是overlay类型的。 缺点是: 1. 不支持pod之间的网络隔离。Flannel设计思想是将所有的pod都放在一个大的二层网络中,所以pod之间没有隔离策略。 2. 设备复杂,效率不高。Flannel模型下有三种设备,数量经过多种设备的封装、解析,势必会造成传输效率的下降。 Calico是Underlay类型的。 缺点是: * 复杂 * 1台 Host 上可能虚拟化十几或几十个容器实例,过多的 iptables 规则造成复杂性和不可调试性,同时也存在性能损耗。 ================================================ FILE: k8s/install-k8s-from source code/1-debian二进制安装v1.17 k8s.md ================================================ Table of Contents ================= * [1. 集群规划](#1-集群规划) * [2.准备工作](#2准备工作) * [2.1 修改主机名](#21-修改主机名) * [2.1 关闭 SElinux 和防火墙](#21-关闭-selinux-和防火墙) * [2.3 同步机器时间](#23-同步机器时间) * [3. etcd集群部署](#3-etcd集群部署) * [2.1 etcd部署前的准备工作](#21-etcd部署前的准备工作) * [2.1.1 准备cfssl证书生成工具](#211-准备cfssl证书生成工具) * [2.1.2 自签证书颁发机构(CA)](#212-自签证书颁发机构ca) * [2.1.3 使用自签CA签发Etcd HTTPS证书](#213-使用自签ca签发etcd-https证书) * [2.2 下载etcd](#22-下载etcd) * [2.3 安装etcd](#23-安装etcd) * [3. node和master 安装docker](#3-node和master-安装docker) * [4. 部署kmaster组件](#4-部署kmaster组件) * [4.1 部署kube-apiserver](#41-部署kube-apiserver) * [4.1.1 生成kube-apiserver证书](#411-生成kube-apiserver证书) * [4.2.1 确定二进制文件和配置文件路径](#421-确定二进制文件和配置文件路径) * [4.2.2 启用 TLS Bootstrapping 机制](#422-启用-tls-bootstrapping-机制) * [4.2.3 systemd管理apiserver](#423-systemd管理apiserver) * [4.2.4 授权kubelet-bootstrap用户允许请求证书](#424-授权kubelet-bootstrap用户允许请求证书) * [4.2 部署kube-controller-manager](#42-部署kube-controller-manager) * [4.2.1 创建配置文件](#421-创建配置文件) * [4.2.2 systemd管理controller-manager](#422-systemd管理controller-manager) * [4.3 部署kube-scheduler](#43-部署kube-scheduler) * [4.3.1 创建配置文件](#431-创建配置文件) * [4.3.2 systemd管理scheduler](#432-systemd管理scheduler) * [4.3.3 启动并设置开机启动](#433-启动并设置开机启动) * [4.3.4 查看集群状态](#434-查看集群状态) * [5.部署dnode节点](#5部署dnode节点) * [5.1 文件和目录准备](#51-文件和目录准备) * [5.2 部署kubelet](#52-部署kubelet) * [5.2.1. 
创建配置文件](#521-创建配置文件) * [5.2.2 配置参数文件](#522-配置参数文件) * [5.2.3 生成bootstrap.kubeconfig文件](#523-生成bootstrapkubeconfig文件) * [5.2.4 systemd管理kubelet](#524-systemd管理kubelet) * [5.2.5 批准kubelet证书申请并加入集群](#525-批准kubelet证书申请并加入集群) * [5.3 部署kube-proxy](#53-部署kube-proxy) * [5.3.1 创建配置文件](#531-创建配置文件) * [5.3.2 配置参数文件](#532-配置参数文件) * [5.3.3. 生成kube-proxy.kubeconfig文件](#533-生成kube-proxykubeconfig文件) * [5.3.4. systemd管理kube-proxy](#534-systemd管理kube-proxy) * [5.4 部署网络环境](#54-部署网络环境) * [5.5 授权apiserver访问kubelet](#55--授权apiserver访问kubelet) * [6 新增加Node](#6-新增加node) * [6.1. 拷贝已部署好的Node相关文件到新节点](#61-拷贝已部署好的node相关文件到新节点) * [6.2 删除kubelet证书和kubeconfig文件](#62-删除kubelet证书和kubeconfig文件) * [6.3. 修改主机名](#63-修改主机名) * [6.4. 启动并设置开机启动](#64-启动并设置开机启动) * [6.5. 在Master上批准新Node kubelet证书申请](#65-在master上批准新node-kubelet证书申请) * [6.6. 查看Node状态](#66-查看node状态) * [7.可能遇到的坑](#7可能遇到的坑) ### 1. 集群规划 这里使用了百度云的两条主机作为集群搭建。配置如下: 两台机器都是:2核,4GB,40GB, 1M 计算型C3 | 主机1 | 主机2 | | ----------- | --------------- | | 192.168.0.4 | kmaster & dnode | | 192.168.0.5 | dnode | 其中etcd集群:部署在 192.168.0.4,192.168.0.5中 192.168.0.4 节点又当kmaster又当dnode 192.168.0.5 节点又当dnode ### 2.准备工作 #### 2.1 修改主机名 默认的云机器名都是一个字符串,这里我进行了修改 (1) 在192.168.0.4 使用如下的命令,将主机名修改为 k8s-master ``` hostname k8s-master ``` (2)在192.168.0.5 使用如下的命令,将主机名修改为 k8s-node ``` hostname k8s-node ``` #### 2.1 关闭 SElinux 和防火墙 debian 可能下面的配置,没有就跳过 ``` [root@k8s-master ~]# cat /etc/selinux/config # This file controls the state of SELinux on the system. # SELINUX= can take one of these three values: # disabled - SELinux security policy is enforced. # permissive - SELinux prints warnings instead of disabled. # disabled - No SELinux policy is loaded. SELINUX=disabled # SELINUXTYPE= can take one of three values: # targeted - Targeted processes are protected, # minimum - Modification of targeted policy. Only selected processes are protected. # mls - Multi Level Security protection. SELINUXTYPE=targeted [root@k8s-master ~]# [root@k8s-master ~]# systemctl stop firewalld ``` #### 2.3 同步机器时间 一般云主机时间都是对的,像虚拟机一般都要同步一下时间 ``` ntpdate time.windows.com ```
### 3. etcd集群部署 #### 2.1 etcd部署前的准备工作 ##### 2.1.1 准备cfssl证书生成工具 cfssl是一个开源的证书管理工具,使用json文件生成证书,相比openssl更方便使用。 找任意一台服务器操作,这里用Master节点。 ``` wget https://pkg.cfssl.org/R1.2/cfssl_linux-amd64 wget https://pkg.cfssl.org/R1.2/cfssljson_linux-amd64 wget https://pkg.cfssl.org/R1.2/cfssl-certinfo_linux-amd64 chmod +x cfssl_linux-amd64 cfssljson_linux-amd64 cfssl-certinfo_linux-amd64 mv cfssl_linux-amd64 /usr/local/bin/cfssl mv cfssljson_linux-amd64 /usr/local/bin/cfssljson mv cfssl-certinfo_linux-amd64 /usr/bin/cfssl-certinfo ``` ##### 2.1.2 自签证书颁发机构(CA) (1) 创建工作目录: ``` mkdir -p ~/TLS/{etcd,k8s} cd TLS/etcd ``` (2) 自签CA: ``` cat > ca-config.json << EOF { "signing": { "default": { "expiry": "87600h" }, "profiles": { "www": { "expiry": "87600h", "usages": [ "signing", "key encipherment", "server auth", "client auth" ] } } } } EOF cat > ca-csr.json << EOF { "CN": "etcd CA", "key": { "algo": "rsa", "size": 2048 }, "names": [ { "C": "CN", "L": "Beijing", "ST": "Beijing" } ] } EOF ``` (3) 生成证书: ``` cfssl gencert -initca ca-csr.json | cfssljson -bare ca - ``` 查看是否成功,只要有ca-key.pem ca.pem就是成功了 ``` ls *pem ca-key.pem ca.pem ``` ##### 2.1.3 使用自签CA签发Etcd HTTPS证书 (1)创建证书申请文件: ``` cat > server-csr.json << EOF { "CN": "etcd", "hosts": [ "192.168.0.4", "192.168.0.5" ], "key": { "algo": "rsa", "size": 2048 }, "names": [ { "C": "CN", "L": "BeiJing", "ST": "BeiJing" } ] } EOF ``` 上述文件hosts字段中IP为所有etcd节点的集群内部通信IP,一个都不能少!为了方便后期扩容可以多写几个预留的IP。 (2)生成证书: ``` cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=www server-csr.json | cfssljson -bare server ``` 查看是否成功,只要有server-key.pem server.pem就是成功了 ``` ls server*pem server-key.pem server.pem ``` #### 2.2 下载etcd 不同的k8s版本对应不同的etcd版本,这个可以在官网的changelog里面看到。这里下载的是3.4.3版本 下载地址:https://github.com/etcd-io/etcd/releases #### 2.3 安装etcd (1)确定二进制文件和配置文件路径 /opt/etcd/bin 是存放二进制文件的,主要是 ectd, etcdctl /opt/etcd/cfg 是存放etcd 配置的 /opt/etcd/ssl 是存放ectd 证书的 ``` root@k8s-master:~# mkdir /opt/etcd/{bin,cfg,ssl} -p [root@k8s-master ]# cd /opt/etcd/ [root@k8s-master etcd]# ls bin cfg ssl // bin目录 tar zxvf etcd-v3.4.3-linux-amd64.tar.gz cp etcd etcdctl /opt/etcd/bin/ [root@k8s-master bin]# ls etcd etcdctl // ssl目录 这里的证书就是,上面第二步生成的etcd证书 cp ~/TLS/etcd/ca*pem ~/TLS/etcd/server*pem /opt/etcd/ssl/ [root@k8s-master etcd-cert]# cd /opt/etcd/ssl/ [root@k8s-master ssl]# ls ca-key.pem ca.pem server-key.pem server.pem // config目录 etcd会监听俩个接口,2380是集群之间进行通信的,2379是数据接口,get,put等数据的接口 cat > /opt/etcd/cfg/etcd.conf << EOF #[Member] ETCD_NAME="etcd01" ETCD_DATA_DIR="/var/lib/etcd/default.etcd" ETCD_LISTEN_PEER_URLS="https://192.168.0.4:2380" ETCD_LISTEN_CLIENT_URLS="https://192.168.0.4:2379" #[Clustering] ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.0.4:2380" ETCD_ADVERTISE_CLIENT_URLS="https://192.168.0.4:2379" ETCD_INITIAL_CLUSTER="etcd01=https://192.168.0.4:2380,etcd02=https://192.168.0.5:2380" ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster" ETCD_INITIAL_CLUSTER_STATE="new" EOF ETCD_NAME:节点名称,集群中唯一 ETCD_DATA_DIR:数据目录 ETCD_LISTEN_PEER_URLS:集群通信监听地址 ETCD_LISTEN_CLIENT_URLS:客户端访问监听地址 ETCD_INITIAL_ADVERTISE_PEER_URLS:集群通告地址 ETCD_ADVERTISE_CLIENT_URLS:客户端通告地址 ETCD_INITIAL_CLUSTER:集群节点地址 ETCD_INITIAL_CLUSTER_TOKEN:集群Token ETCD_INITIAL_CLUSTER_STATE:加入集群的当前状态,new是新集群,existing表示加入已有集群 ``` (2) systemd管理etcd ``` cat > /usr/lib/systemd/system/etcd.service << EOF [Unit] Description=Etcd Server After=network.target After=network-online.target Wants=network-online.target [Service] Type=notify EnvironmentFile=/opt/etcd/cfg/etcd.conf ExecStart=/opt/etcd/bin/etcd \ --cert-file=/opt/etcd/ssl/server.pem \ 
--key-file=/opt/etcd/ssl/server-key.pem \ --peer-cert-file=/opt/etcd/ssl/server.pem \ --peer-key-file=/opt/etcd/ssl/server-key.pem \ --trusted-ca-file=/opt/etcd/ssl/ca.pem \ --peer-trusted-ca-file=/opt/etcd/ssl/ca.pem \ --logger=zap Restart=on-failure LimitNOFILE=65536 [Install] WantedBy=multi-user.target EOF ``` (3) 启动并设置开机启动 ``` systemctl daemon-reload systemctl start etcd systemctl enable etcd ``` 第一次启动都是会失败的,因为第二个节点还没有启动etcd 查看关于etcd 服务最后40行日志, 有时候还可以通过:,tail -f /var/log/message 查看哪里出现了问题。 ``` journalctl -n 40 -u etcd ``` (4) 在其他节点上启动etcd服务 ``` 1. 将master的相关配置复制到node节点 scp -r /opt/etcd/ root@192.168.0.5:/opt/ scp /usr/lib/systemd/system/etcd.service root@192.168.0.5:/usr/lib/systemd/system/ 2. 在node修改不一致的地方 root@k8s-dnode:~# cat /opt/etcd/cfg/etcd.conf #[Member] ETCD_NAME="etcd02" ETCD_DATA_DIR="/var/lib/etcd/default.etcd" ETCD_LISTEN_PEER_URLS="https://192.168.0.5:2380" ETCD_LISTEN_CLIENT_URLS="https://192.168.0.5:2379" #[Clustering] ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.0.5:2380" ETCD_ADVERTISE_CLIENT_URLS="https://192.168.0.5:2379" ETCD_INITIAL_CLUSTER="etcd01=https://192.168.0.4:2380,etcd02=https://192.168.0.5:2380" ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster" ETCD_INITIAL_CLUSTER_STATE="new" 3.设置开机启动 systemctl daemon-reload systemctl start etcd systemctl enable etcd ``` (5)检查etcd集群是否正常运行 ``` root@k8s-master:/usr/lib/systemd/system# systemctl enable etcd Created symlink /etc/systemd/system/multi-user.target.wants/etcd.service → /lib/systemd/system/etcd.service. root@k8s-master:/usr/lib/systemd/system# root@k8s-master:/usr/lib/systemd/system# root@k8s-master:/usr/lib/systemd/system# systemctl status etcd ● etcd.service - Etcd Server Loaded: loaded (/lib/systemd/system/etcd.service; enabled; vendor preset: enabled) Active: active (running) since Sat 2021-10-23 15:58:02 CST; 20s ago Main PID: 3728 (etcd) Tasks: 10 (limit: 4700) Memory: 23.8M CGroup: /system.slice/etcd.service └─3728 /opt/etcd/bin/etcd --cert-file=/opt/etcd/ssl/server.pem --key-file=/opt/etcd/ssl/server-key.pem --peer-cert-file=/opt/etcd/ssl/server.pem --peer-key-file=/opt/etc Oct 23 15:58:02 k8s-master etcd[3728]: {"level":"info","ts":"2021-10-23T15:58:02.698+0800","caller":"raft/raft.go:765","msg":"5ac283d796e472ba became leader at term 579"} Oct 23 15:58:02 k8s-master etcd[3728]: {"level":"info","ts":"2021-10-23T15:58:02.698+0800","caller":"raft/node.go:325","msg":"raft.node: 5ac283d796e472ba elected leader 5ac283d796e Oct 23 15:58:02 k8s-master etcd[3728]: {"level":"warn","ts":"2021-10-23T15:58:02.703+0800","caller":"etcdserver/server.go:2045","msg":"failed to publish local member to cluster thr Oct 23 15:58:02 k8s-master etcd[3728]: {"level":"info","ts":"2021-10-23T15:58:02.707+0800","caller":"etcdserver/server.go:2016","msg":"published local member to cluster through raf Oct 23 15:58:02 k8s-master etcd[3728]: {"level":"info","ts":"2021-10-23T15:58:02.709+0800","caller":"embed/serve.go:191","msg":"serving client traffic securely","address":"192.168. Oct 23 15:58:02 k8s-master systemd[1]: Started Etcd Server. 
Oct 23 15:58:02 k8s-master etcd[3728]: {"level":"info","ts":"2021-10-23T15:58:02.719+0800","caller":"etcdserver/server.go:2501","msg":"setting up initial cluster version","cluster- Oct 23 15:58:02 k8s-master etcd[3728]: {"level":"info","ts":"2021-10-23T15:58:02.722+0800","caller":"membership/cluster.go:558","msg":"set initial cluster version","cluster-id":"a8 Oct 23 15:58:02 k8s-master etcd[3728]: {"level":"info","ts":"2021-10-23T15:58:02.722+0800","caller":"api/capability.go:76","msg":"enabled capabilities for version","cluster-version Oct 23 15:58:02 k8s-master etcd[3728]: {"level":"info","ts":"2021-10-23T15:58:02.722+0800","caller":"etcdserver/server.go:2533","msg":"cluster version is updated","cluster-version" root@k8s-master:/usr/lib/systemd/system# 查看集群健康状态 root@k8s-master:/usr/lib/systemd/system# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints="https://192.168.0.4:2379,https://192.168.0.5:2379" endpoint health https://192.168.0.4:2379 is healthy: successfully committed proposal: took = 12.092244ms https://192.168.0.5:2379 is healthy: successfully committed proposal: took = 12.96782m ```
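除了用 etcdctl,也可以写一小段 Go 程序验证集群可用(与上面部署的 etcd 3.4 对应,使用 go.etcd.io/etcd/clientv3;证书路径和 endpoints 与本文部署保持一致,按实际环境调整,key /health-check 只是示例):

```
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
	"go.etcd.io/etcd/pkg/transport"
)

func main() {
	// 用部署时生成的证书构造 TLS 配置
	tlsInfo := transport.TLSInfo{
		CertFile:      "/opt/etcd/ssl/server.pem",
		KeyFile:       "/opt/etcd/ssl/server-key.pem",
		TrustedCAFile: "/opt/etcd/ssl/ca.pem",
	}
	tlsConfig, err := tlsInfo.ClientConfig()
	if err != nil {
		log.Fatalf("build tls config: %v", err)
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://192.168.0.4:2379", "https://192.168.0.5:2379"},
		DialTimeout: 5 * time.Second,
		TLS:         tlsConfig,
	})
	if err != nil {
		log.Fatalf("connect etcd: %v", err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// 写入再读出一个测试 key,等价于一次简单的可用性检查
	if _, err := cli.Put(ctx, "/health-check", "ok"); err != nil {
		log.Fatalf("put failed: %v", err)
	}
	resp, err := cli.Get(ctx, "/health-check")
	if err != nil {
		log.Fatalf("get failed: %v", err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```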
### 3. node和master 安装docker 这里我master节点也想使用docker,所以在每个节点都安装了。 具体步骤如下: (1)下载二进制 下载地址:https://download.docker.com/linux/static/stable/x86_64/docker-19.03.9.tgz (2)解压二进制包 ``` tar zxvf docker-19.03.9.tgz mv docker/* /usr/bin ``` (3) systemd管理docker ``` cat > /usr/lib/systemd/system/docker.service << EOF [Unit] Description=Docker Application Container Engine Documentation=https://docs.docker.com After=network-online.target firewalld.service Wants=network-online.target [Service] Type=notify ExecStart=/usr/bin/dockerd ExecReload=/bin/kill -s HUP $MAINPID LimitNOFILE=infinity LimitNPROC=infinity LimitCORE=infinity TimeoutStartSec=0 Delegate=yes KillMode=process Restart=on-failure StartLimitBurst=3 StartLimitInterval=60s [Install] WantedBy=multi-user.target EOF ``` (4) 创建配置文件 registry-mirrors 阿里云镜像加速器 ``` mkdir /etc/docker cat > /etc/docker/daemon.json << EOF { "registry-mirrors": ["https://b9pmyelo.mirror.aliyuncs.com"] } EOF ``` (5) 启动并设置开机启动 ``` systemctl daemon-reload systemctl start docker systemctl enable docker ```
### 4. 部署kmaster组件 #### 4.1 部署kube-apiserver ##### 4.1.1 生成kube-apiserver证书 (1) 自签证书颁发机构(CA) 在 ~/TLS/k8s目录下生成 ``` cat > ca-config.json << EOF { "signing": { "default": { "expiry": "87600h" }, "profiles": { "kubernetes": { "expiry": "87600h", "usages": [ "signing", "key encipherment", "server auth", "client auth" ] } } } } EOF cat > ca-csr.json << EOF { "CN": "kubernetes", "key": { "algo": "rsa", "size": 2048 }, "names": [ { "C": "CN", "L": "Beijing", "ST": "Beijing", "O": "k8s", "OU": "System" } ] } EOF ``` (2) 生成ca证书: ``` root@k8s-master:~/TLS/k8s# cfssl gencert -initca ca-csr.json | cfssljson -bare ca - 2021/10/23 16:27:02 [INFO] generating a new CA key and certificate from CSR 2021/10/23 16:27:02 [INFO] generate received request 2021/10/23 16:27:02 [INFO] received CSR 2021/10/23 16:27:02 [INFO] generating key: rsa-2048 2021/10/23 16:27:02 [INFO] encoded CSR 2021/10/23 16:27:02 [INFO] signed certificate with serial number 691553883019556193564185774219449501300204309030 root@k8s-master:~/TLS/k8s# ls *pem ca-key.pem ca.pem ``` (3) 使用自签CA签发kube-apiserver HTTPS证书 ``` cat > server-csr.json << EOF { "CN": "kubernetes", "hosts": [ "10.0.0.1", "127.0.0.1", "192.168.0.4", "192.168.0.5", "kubernetes", "kubernetes.default", "kubernetes.default.svc", "kubernetes.default.svc.cluster", "kubernetes.default.svc.cluster.local" ], "key": { "algo": "rsa", "size": 2048 }, "names": [ { "C": "CN", "L": "BeiJing", "ST": "BeiJing", "O": "k8s", "OU": "System" } ] } EOF ``` 注:上述文件hosts字段中IP为所有Master/LB/VIP IP,一个都不能少!为了方便后期扩容可以多写几个预留的IP。 (4) 生成证书: ``` root@k8s-master:~/TLS/k8s# cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes server-csr.json | cfssljson -bare server 2021/10/23 16:30:16 [INFO] generate received request 2021/10/23 16:30:16 [INFO] received CSR 2021/10/23 16:30:16 [INFO] generating key: rsa-2048 2021/10/23 16:30:16 [INFO] encoded CSR 2021/10/23 16:30:16 [INFO] signed certificate with serial number 85202347845231770518313014605424297876620496751 2021/10/23 16:30:16 [WARNING] This certificate lacks a "hosts" field. This makes it unsuitable for websites. For more information see the Baseline Requirements for the Issuance and Management of Publicly-Trusted Certificates, v.1.1.6, from the CA/Browser Forum (https://cabforum.org); specifically, section 10.2.3 ("Information Requirements"). 
root@k8s-master:~/TLS/k8s# ls server*pem server-key.pem server.pem ``` ##### 4.2.1 确定二进制文件和配置文件路径 (1) 从Github下载二进制文件 下载地址: https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.17.md 注:打开链接你会发现里面有很多包,下载一个server包就够了,包含了Master和Worker Node二进制文件 (2)bin目录 ``` mkdir -p /opt/kubernetes/{bin,cfg,ssl,logs} tar zxvf kubernetes-server-linux-amd64.tar.gz cd kubernetes/server/bin cp kube-apiserver kube-scheduler kube-controller-manager /opt/kubernetes/bin cp kubectl /usr/bin/ ``` (3)cfg目录 ``` cat > /opt/kubernetes/cfg/kube-apiserver.conf << EOF KUBE_APISERVER_OPTS="--logtostderr=false \\ --v=4 \\ --log-dir=/opt/kubernetes/logs \\ --etcd-servers=https://192.168.0.4:2379,https://192.168.0.4:2379 \\ --bind-address=192.168.0.4 \\ --secure-port=6443 \\ --advertise-address=192.168.0.4 \\ --allow-privileged=true \\ --service-cluster-ip-range=10.0.0.0/24 \\ --enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota,NodeRestriction \\ --authorization-mode=RBAC,Node \\ --enable-bootstrap-token-auth=true \\ --token-auth-file=/opt/kubernetes/cfg/token.csv \\ --service-node-port-range=30000-32767 \\ --kubelet-client-certificate=/opt/kubernetes/ssl/server.pem \\ --kubelet-client-key=/opt/kubernetes/ssl/server-key.pem \\ --tls-cert-file=/opt/kubernetes/ssl/server.pem \\ --tls-private-key-file=/opt/kubernetes/ssl/server-key.pem \\ --client-ca-file=/opt/kubernetes/ssl/ca.pem \\ --service-account-key-file=/opt/kubernetes/ssl/ca-key.pem \\ --etcd-cafile=/opt/etcd/ssl/ca.pem \\ --etcd-certfile=/opt/etcd/ssl/server.pem \\ --etcd-keyfile=/opt/etcd/ssl/server-key.pem \\ --audit-log-maxage=30 \\ --audit-log-maxbackup=3 \\ --audit-log-maxsize=100 \\ --audit-log-path=/opt/kubernetes/logs/k8s-audit.log" EOF ``` 注:上面两个\ \ 第一个是转义符,第二个是换行符,使用转义符是为了使用EOF保留换行符。 –logtostderr:启用日志 —v:日志等级 –log-dir:日志目录 –etcd-servers:etcd集群地址 –bind-address:监听地址 –secure-port:https安全端口 –advertise-address:集群通告地址 –allow-privileged:启用授权 –service-cluster-ip-range:Service虚拟IP地址段 –enable-admission-plugins:准入控制模块 –authorization-mode:认证授权,启用RBAC授权和节点自管理 –enable-bootstrap-token-auth:启用TLS bootstrap机制 –token-auth-file:bootstrap token文件 –service-node-port-range:Service nodeport类型默认分配端口范围 –kubelet-client-xxx:apiserver访问kubelet客户端证书 –tls-xxx-file:apiserver https证书 –etcd-xxxfile:连接Etcd集群证书 –audit-log-xxx:审计日志 (4)ssl目录 把刚才生成的证书拷贝到配置文件中的路径: ``` cp ~/TLS/k8s/ca*pem ~/TLS/k8s/server*pem /opt/kubernetes/ssl/ ``` ##### 4.2.2 启用 TLS Bootstrapping 机制 TLS Bootstraping:Master apiserver启用TLS认证后,Node节点kubelet和kube-proxy要与kube-apiserver进行通信,必须使用CA签发的有效证书才可以,当Node节点很多时,这种客户端证书颁发需要大量工作,同样也会增加集群扩展复杂度。为了简化流程,Kubernetes引入了TLS bootstraping机制来自动颁发客户端证书,kubelet会以一个低权限用户自动向apiserver申请证书,kubelet的证书由apiserver动态签署。所以强烈建议在Node上使用这种方式,目前主要用于kubelet,kube-proxy还是由我们统一颁发一个证书。 TLS bootstraping 工作流程: ![bootstraping](../images/bootstraping.png) 创建上述配置文件中token文件: ``` cat > /opt/kubernetes/cfg/token.csv << EOF c47ffb939f5ca36231d9e3121a252940,kubelet-bootstrap,10001,"system:node-bootstrapper" EOF ``` 格式:token,用户名,UID,用户组 token也可用这个命令自行生成替换: ``` head -c 16 /dev/urandom | od -An -t x | tr -d ' ' ``` ##### 4.2.3 systemd管理apiserver ``` cat > /usr/lib/systemd/system/kube-apiserver.service << EOF [Unit] Description=Kubernetes API Server Documentation=https://github.com/kubernetes/kubernetes [Service] EnvironmentFile=/opt/kubernetes/cfg/kube-apiserver.conf ExecStart=/opt/kubernetes/bin/kube-apiserver \$KUBE_APISERVER_OPTS Restart=on-failure [Install] WantedBy=multi-user.target EOF ``` systemctl daemon-reload systemctl start kube-apiserver systemctl enable 
kube-apiserver 这个时候用 systemclt status kube-apiserver 是running的。 并且kubectl get svc有输出的 ``` root@k8s-master:~/kubernetes/server/bin# kubectl get svc NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE kubernetes ClusterIP 10.0.0.1 443/TCP 44s ``` ##### 4.2.4 授权kubelet-bootstrap用户允许请求证书 ``` kubectl create clusterrolebinding kubelet-bootstrap --clusterrole=system:node-bootstrapper --user=kubelet-bootstrap ``` #### 4.2 部署kube-controller-manager ##### 4.2.1 创建配置文件 ``` cat > /opt/kubernetes/cfg/kube-controller-manager.conf << EOF KUBE_CONTROLLER_MANAGER_OPTS="--logtostderr=false \\ --v=4 \\ --log-dir=/opt/kubernetes/logs \\ --leader-elect=true \\ --master=127.0.0.1:8080 \\ --bind-address=127.0.0.1 \\ --allocate-node-cidrs=true \\ --cluster-cidr=10.244.0.0/16 \\ --service-cluster-ip-range=10.0.0.0/24 \\ --cluster-signing-cert-file=/opt/kubernetes/ssl/ca.pem \\ --cluster-signing-key-file=/opt/kubernetes/ssl/ca-key.pem \\ --root-ca-file=/opt/kubernetes/ssl/ca.pem \\ --service-account-private-key-file=/opt/kubernetes/ssl/ca-key.pem \\ --experimental-cluster-signing-duration=87600h0m0s" EOF ``` –master:通过本地非安全本地端口8080连接apiserver。 –leader-elect:当该组件启动多个时,自动选举(HA) –cluster-signing-cert-file/–cluster-signing-key-file:自动为kubelet颁发证书的CA,与apiserver保持一致 ##### 4.2.2 systemd管理controller-manager ``` cat > /usr/lib/systemd/system/kube-controller-manager.service << EOF [Unit] Description=Kubernetes Controller Manager Documentation=https://github.com/kubernetes/kubernetes [Service] EnvironmentFile=/opt/kubernetes/cfg/kube-controller-manager.conf ExecStart=/opt/kubernetes/bin/kube-controller-manager \$KUBE_CONTROLLER_MANAGER_OPTS Restart=on-failure [Install] WantedBy=multi-user.target EOF ``` systemctl daemon-reload systemctl start kube-controller-manager systemctl enable kube-controller-manager 这个时候kcm状态是running的 ``` root@k8s-master:/opt/kubernetes/cfg# systemctl status kube-controller-manager ● kube-controller-manager.service - Kubernetes Controller Manager Loaded: loaded (/lib/systemd/system/kube-controller-manager.service; enabled; vendor preset: enabled) Active: active (running) since Sat 2021-10-23 17:03:50 CST; 22s ago Docs: https://github.com/kubernetes/kubernetes Main PID: 4957 (kube-controller) Tasks: 9 (limit: 4700) Memory: 29.0M CGroup: /system.slice/kube-controller-manager.service └─4957 /opt/kubernetes/bin/kube-controller-manager --logtostderr=false --v=4 --log-dir=/opt/kubernetes/logs --leader-elect=true --master=127.0.0.1:8080 --bind-address=12 Oct 23 17:03:50 k8s-master systemd[1]: Started Kubernetes Controller Manager. 
Oct 23 17:03:52 k8s-master kube-controller-manager[4957]: E1023 17:03:52.290939 4957 core.go:91] Failed to start service controller: WARNING: no cloud provider provided, service Oct 23 17:03:52 k8s-master kube-controller-manager[4957]: E1023 17:03:52.545623 4957 core.go:232] failed to start cloud node lifecycle controller: no cloud provider provided Oct 23 17:04:02 k8s-master kube-controller-manager[4957]: E1023 17:04:02.670438 4957 clusterroleaggregation_controller.go:180] admin failed with : Operation cannot be fulfilled Oct 23 17:04:02 k8s-master kube-controller-manager[4957]: E1023 17:04:02.683306 4957 clusterroleaggregation_controller.go:180] admin failed with : Operation cannot be fulfilled root@k8s-master:/opt/kubernetes/cfg# ``` #### 4.3 部署kube-scheduler ##### 4.3.1 创建配置文件 ``` cat > /opt/kubernetes/cfg/kube-scheduler.conf << EOF KUBE_SCHEDULER_OPTS="--logtostderr=false \ --v=4 \ --log-dir=/opt/kubernetes/logs \ --leader-elect \ --master=127.0.0.1:8080 \ --bind-address=127.0.0.1" EOF ``` –master:通过本地非安全本地端口8080连接apiserver。 –leader-elect:当该组件启动多个时,自动选举(HA) ##### 4.3.2 systemd管理scheduler ``` cat > /usr/lib/systemd/system/kube-scheduler.service << EOF [Unit] Description=Kubernetes Scheduler Documentation=https://github.com/kubernetes/kubernetes [Service] EnvironmentFile=/opt/kubernetes/cfg/kube-scheduler.conf ExecStart=/opt/kubernetes/bin/kube-scheduler \$KUBE_SCHEDULER_OPTS Restart=on-failure [Install] WantedBy=multi-user.target EOF ``` ##### 4.3.3 启动并设置开机启动 systemctl daemon-reload systemctl start kube-scheduler systemctl enable kube-scheduler ##### 4.3.4 查看集群状态 如下输出说明Master节点组件运行正常。 ``` root@k8s-master:/opt/kubernetes/cfg# kubectl get cs NAME STATUS MESSAGE ERROR scheduler Healthy ok controller-manager Healthy ok etcd-0 Healthy {"health":"true"} ``` ### 5.部署dnode节点 #### 5.1 文件和目录准备 下面还是在Master Node上操作,即同时也作为Node **master节点:** 从master节点拷贝: cd kubernetes/server/bin cp kubelet kube-proxy /opt/kubernetes/bin # 本地拷贝 **node节点** 在所有worker node创建工作目录: mkdir -p /opt/kubernetes/{bin,cfg,ssl,logs} 从master节点拷贝: scp -r /root/kubernetes/server/bin/ root@192.168.0.5:/root/kubernetes/server/bin cd kubernetes/server/bin cp kubelet kube-proxy /opt/kubernetes/bin # 本地拷贝 #### 5.2 部署kubelet ##### 5.2.1. 
创建配置文件 ``` cat > /opt/kubernetes/cfg/kubelet.conf << EOF KUBELET_OPTS="--logtostderr=false \\ --v=4 \\ --log-dir=/opt/kubernetes/logs \\ --hostname-override=k8s-master \\ --network-plugin=cni \\ --kubeconfig=/opt/kubernetes/cfg/kubelet.kubeconfig \\ --bootstrap-kubeconfig=/opt/kubernetes/cfg/bootstrap.kubeconfig \\ --config=/opt/kubernetes/cfg/kubelet-config.yml \\ --cert-dir=/opt/kubernetes/ssl \\ --pod-infra-container-image=lizhenliang/pause-amd64:3.0" EOF ``` –hostname-override:显示名称,集群中唯一 –network-plugin:启用CNI –kubeconfig:空路径,会自动生成,后面用于连接apiserver –bootstrap-kubeconfig:首次启动向apiserver申请证书 –config:配置参数文件 –cert-dir:kubelet证书生成目录 –pod-infra-container-image:管理Pod网络容器的镜像 ##### 5.2.2 配置参数文件 ``` cat > /opt/kubernetes/cfg/kubelet-config.yml << EOF kind: KubeletConfiguration apiVersion: kubelet.config.k8s.io/v1beta1 address: 0.0.0.0 port: 10250 readOnlyPort: 10255 cgroupDriver: cgroupfs clusterDNS: - 10.0.0.2 clusterDomain: cluster.local failSwapOn: false authentication: anonymous: enabled: false webhook: cacheTTL: 2m0s enabled: true x509: clientCAFile: /opt/kubernetes/ssl/ca.pem authorization: mode: Webhook webhook: cacheAuthorizedTTL: 5m0s cacheUnauthorizedTTL: 30s evictionHard: imagefs.available: 15% memory.available: 100Mi nodefs.available: 10% nodefs.inodesFree: 5% maxOpenFiles: 1000000 maxPods: 110 EOF ``` ##### 5.2.3 生成bootstrap.kubeconfig文件 ``` KUBE_APISERVER="https://192.168.0.4:6443" # apiserver IP:PORT TOKEN="c47ffb939f5ca36231d9e3121a252940" # 与token.csv里保持一致 cd /opt/kubernetes/cfg/ # 生成 kubelet bootstrap kubeconfig 配置文件 kubectl config set-cluster kubernetes --certificate-authority=/opt/kubernetes/ssl/ca.pem --embed-certs=true --server=${KUBE_APISERVER} --kubeconfig=bootstrap.kubeconfig kubectl config set-credentials "kubelet-bootstrap" --token=${TOKEN} --kubeconfig=bootstrap.kubeconfig kubectl config set-context default --cluster=kubernetes --user="kubelet-bootstrap" --kubeconfig=bootstrap.kubeconfig kubectl config use-context default --kubeconfig=bootstrap.kubeconfig ``` ##### 5.2.4 systemd管理kubelet ``` cat > /usr/lib/systemd/system/kubelet.service << EOF [Unit] Description=Kubernetes Kubelet After=docker.service [Service] EnvironmentFile=/opt/kubernetes/cfg/kubelet.conf ExecStart=/opt/kubernetes/bin/kubelet \$KUBELET_OPTS Restart=on-failure LimitNOFILE=65536 [Install] WantedBy=multi-user.target EOF ``` 启动并设置开机启动 systemctl daemon-reload systemctl start kubelet systemctl enable kubelet ##### 5.2.5 批准kubelet证书申请并加入集群 查看kubelet证书请求 ``` root@k8s-master:/opt/kubernetes/cfg# kubectl get csr NAME AGE REQUESTOR CONDITION node-csr-uYm2cSUxv0HWPXQ4JNj5bYPaR_B2rLbkCM257un0iV4 41s kubelet-bootstrap Pending ``` 批准申请 ``` kubectl certificate approve node-csr-uYm2cSUxv0HWPXQ4JNj5bYPaR_B2rLbkCM257un0iV4 ``` 查看节点 ``` root@k8s-master:/opt/kubernetes/cfg# kubectl get node NAME STATUS ROLES AGE VERSION k8s-master NotReady 4s v1.17.3 ``` 注:由于网络插件还没有部署,节点会没有准备就绪 NotReady #### 5.3 部署kube-proxy ##### 5.3.1 创建配置文件 ``` cat > /opt/kubernetes/cfg/kube-proxy.conf << EOF KUBE_PROXY_OPTS="--logtostderr=false \\ --v=2 \\ --log-dir=/opt/kubernetes/logs \\ --config=/opt/kubernetes/cfg/kube-proxy-config.yml" EOF ``` ##### 5.3.2 配置参数文件 ``` cat > /opt/kubernetes/cfg/kube-proxy-config.yml << EOF kind: KubeProxyConfiguration apiVersion: kubeproxy.config.k8s.io/v1alpha1 bindAddress: 0.0.0.0 metricsBindAddress: 0.0.0.0:10249 clientConnection: kubeconfig: /opt/kubernetes/cfg/kube-proxy.kubeconfig hostnameOverride: k8s-master clusterCIDR: 10.0.0.0/24 EOF ``` ##### 5.3.3. 
生成kube-proxy.kubeconfig文件 生成kube-proxy证书: 切换工作目录 cd TLS/k8s (1) 创建证书请求文件 ``` cat > kube-proxy-csr.json << EOF { "CN": "system:kube-proxy", "hosts": [], "key": { "algo": "rsa", "size": 2048 }, "names": [ { "C": "CN", "L": "BeiJing", "ST": "BeiJing", "O": "k8s", "OU": "System" } ] } EOF ``` (2) 生成证书 cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes kube-proxy-csr.json | cfssljson -bare kube-proxy ``` ls kube-proxy*pem kube-proxy-key.pem kube-proxy.pem ``` 将证书拷贝到/opt/kubernetes/ssl/ 目录: cp kube-proxy-key.pem kube-proxy.pem /opt/kubernetes/ssl/ (3) 生成kubeconfig文件: ``` cd /opt/kubernetes/cfg/ KUBE_APISERVER="https://192.168.0.4:6443" kubectl config set-cluster kubernetes --certificate-authority=/opt/kubernetes/ssl/ca.pem --embed-certs=true --server=${KUBE_APISERVER} --kubeconfig=kube-proxy.kubeconfig kubectl config set-credentials kube-proxy --client-certificate=/opt/kubernetes/ssl/kube-proxy.pem --client-key=/opt/kubernetes/ssl/kube-proxy-key.pem --embed-certs=true --kubeconfig=kube-proxy.kubeconfig kubectl config set-context default --cluster=kubernetes --user=kube-proxy --kubeconfig=kube-proxy.kubeconfig kubectl config use-context default --kubeconfig=kube-proxy.kubeconfig ``` ##### 5.3.4. systemd管理kube-proxy ``` cat > /usr/lib/systemd/system/kube-proxy.service << EOF [Unit] Description=Kubernetes Proxy After=network.target [Service] EnvironmentFile=/opt/kubernetes/cfg/kube-proxy.conf ExecStart=/opt/kubernetes/bin/kube-proxy \$KUBE_PROXY_OPTS Restart=on-failure LimitNOFILE=65536 [Install] WantedBy=multi-user.target EOF ``` 启动并设置开机启动 systemctl daemon-reload systemctl start kube-proxy systemctl enable kube-proxy #### 5.4 部署网络环境 先准备好CNI二进制文件: 下载地址:https://github.com/containernetworking/plugins/releases/download/v0.8.6/cni-plugins-linux-amd64-v0.8.6.tgz 解压二进制包并移动到默认工作目录: mkdir /opt/cni/bin tar zxvf cni-plugins-linux-amd64-v0.8.6.tgz -C /opt/cni/bin 部署CNI网络: ``` wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml sed -i -r "s#quay.io/coreos/flannel:.*-amd64#lizhenliang/flannel:v0.12.0-amd64#g" kube-flannel.yml ``` 默认镜像地址无法访问,修改为docker hub镜像仓库。 ``` root@k8s-master:~# kubectl get pod -n kube-system NAME READY STATUS RESTARTS AGE kube-flannel-ds-mwmmn 1/1 Running 0 72s root@k8s-master:~# root@k8s-master:~# root@k8s-master:~# kubectl get node NAME STATUS ROLES AGE VERSION k8s-master Ready 23m v1.17.3 ``` 部署好网络插件,Node准备就绪。 #### 5.5 授权apiserver访问kubelet 如何没有这个,kubectl exec -it pod会报错 ``` cat > apiserver-to-kubelet-rbac.yaml << EOF apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: annotations: rbac.authorization.kubernetes.io/autoupdate: "true" labels: kubernetes.io/bootstrapping: rbac-defaults name: system:kube-apiserver-to-kubelet rules: - apiGroups: - "" resources: - nodes/proxy - nodes/stats - nodes/log - nodes/spec - nodes/metrics - pods/log verbs: - "*" --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: system:kube-apiserver namespace: "" roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: system:kube-apiserver-to-kubelet subjects: - apiGroup: rbac.authorization.k8s.io kind: User name: kubernetes EOF kubectl apply -f apiserver-to-kubelet-rbac.yaml ``` ### 6 新增加Node ##### 6.1. 
拷贝已部署好的Node相关文件到新节点 在Master节点将Worker Node涉及文件拷贝到新节点 scp -r /opt/kubernetes root@192.168.0.5:/opt/ scp -r /usr/lib/systemd/system/{kubelet,kube-proxy}.service root@192.168.0.5:/usr/lib/systemd/system scp -r /opt/cni/ root@192.168.0.5:/opt/ scp /opt/kubernetes/ssl/ca.pem root@192.168.0.5:/opt/kubernetes/ssl ##### 6.2 删除kubelet证书和kubeconfig文件 ``` rm /opt/kubernetes/cfg/kubelet.kubeconfig rm -f /opt/kubernetes/ssl/kubelet* ``` 注:这几个文件是证书申请审批后自动生成的,每个Node不同,必须删除重新生成。 ##### 6.3. 修改主机名 ``` vi /opt/kubernetes/cfg/kubelet.conf --hostname-override=k8s-node1 vi /opt/kubernetes/cfg/kube-proxy-config.yml hostnameOverride: k8s-node1 ``` ##### 6.4. 启动并设置开机启动 systemctl daemon-reload systemctl start kubelet systemctl enable kubelet systemctl start kube-proxy systemctl enable kube-proxy ##### 6.5. 在Master上批准新Node kubelet证书申请 ``` root@k8s-master:~# kubectl get csr NAME AGE REQUESTOR CONDITION node-csr-hqhgEI8ez2hjy5Cm0nJ_OeP2s7pPow99b3c8PUDnmIE 32s kubelet-bootstrap Pending node-csr-uYm2cSUxv0HWPXQ4JNj5bYPaR_B2rLbkCM257un0iV4 73m kubelet-bootstrap Approved,Issued root@k8s-master:~# root@k8s-master:~# kubectl certificate approve node-csr-hqhgEI8ez2hjy5Cm0nJ_OeP2s7pPow99b3c8PUDnmIE certificatesigningrequest.certificates.k8s.io/node-csr-hqhgEI8ez2hjy5Cm0nJ_OeP2s7pPow99b3c8PUDnmIE approved ``` ##### 6.6. 查看Node状态 ``` root@k8s-master:~# kubectl get node NAME STATUS ROLES AGE VERSION k8s-master Ready 73m v1.17.3 k8s-node Ready 55s v1.17.3 ``` 正常创建pod测试 ``` root@k8s-master:~# kubectl get pod -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES nginx 1/1 Running 0 114s 10.244.1.2 k8s-node ``` ### 7.可能遇到的坑 https://blog.csdn.net/zhuzhuxiazst/article/details/103887137 ================================================ FILE: k8s/install-k8s-from source code/2.window配置goland环境阅读kubernetes源码.md ================================================ Table of Contents ================= * [1. 代码下载](#1-代码下载) ### 1. 代码下载 (1)管理员运行git (2) 然后使用-c core.symlinks=true 来下载链接关系 ``` git clone -c core.symlinks=true https://github.com/kubernetes/kubernetes.git -b v1.17.4 ``` (3)goland 可以使用eval reset 插件,每次打开时激活30天的免费使用,从而达到白嫖 https://blog.csdn.net/qq_37699336/article/details/116528062 (4)goland配置如下 kubernetes 源码要放到 gopath/src 目录下 ![windows-read-sourcecode](../images/windows-read-sourcecode.png) 然后代码就不会变红了,到处乱跳了 参考链接:https://zhuanlan.zhihu.com/p/52056165 ================================================ FILE: k8s/kcm/0-kcm启动流程.md ================================================ Table of Contents ================= * [1. 定义-main](#1-定义-main) * [1.1 NewKubeControllerManagerOptions](#11-newkubecontrollermanageroptions) * [1.2 s.config 实例化一个kubecontrollerconfig.Config](#12-sconfig--实例化一个kubecontrollerconfigconfig) * [1.2.1 s.applyTo](#121-sapplyto) * [1.2.2 结构体定义](#122-结构体定义) * [1.3 Run](#13-run) * [1.4 run函数](#14-run函数) * [1.5 StartControllers](#15-startcontrollers) * [1.6 总结](#16-总结) * [1.6.1 整体流程](#161-整体流程) * [1.6.2 一些思考](#162-一些思考) * [2. 附录](#2-附录) * [2.1 cobra实践](#21-cobra实践) * [2.2 k8s中的选举机制](#22-k8s中的选举机制) ### 1. 定义-main cmd\kube-controller-manager\controller-manager.go ``` func main() { rand.Seed(time.Now().UTC().UnixNano()) command := app.NewControllerManagerCommand() // TODO: once we switch everything over to Cobra commands, we can go back to calling // utilflag.InitFlags() (by removing its pflag.Parse() call). For now, we have to set the // normalize func and add the go flag set by hand. 
pflag.CommandLine.SetNormalizeFunc(utilflag.WordSepNormalizeFunc) pflag.CommandLine.AddGoFlagSet(goflag.CommandLine) // utilflag.InitFlags() logs.InitLogs() defer logs.FlushLogs() if err := command.Execute(); err != nil { fmt.Fprintf(os.Stderr, "%v\n", err) os.Exit(1) } } ```
```go // NewControllerManagerCommand creates a *cobra.Command object with default parameters func NewControllerManagerCommand() *cobra.Command { // 1.初始化config配置。包括每个controller的配置,例如hpacontroller的 HorizontalPodAutoscalerSyncPeriod // 详见 cmd\kube-controller-manager\app\options\options.go s, err := options.NewKubeControllerManagerOptions() if err != nil { glog.Fatalf("unable to initialize command options: %v", err) } cmd := &cobra.Command{ Use: "kube-controller-manager", Long: `The Kubernetes controller manager is a daemon that embeds the core control loops shipped with Kubernetes. In applications of robotics and automation, a control loop is a non-terminating loop that regulates the state of the system. In Kubernetes, a controller is a control loop that watches the shared state of the cluster through the apiserver and makes changes attempting to move the current state towards the desired state. Examples of controllers that ship with Kubernetes today are the replication controller, endpoints controller, namespace controller, and serviceaccounts controller.`, Run: func(cmd *cobra.Command, args []string) { // 打印一些信息 verflag.PrintAndExitIfRequested() utilflag.PrintFlags(cmd.Flags()) // 2. 实例化一个kubecontrollerconfig.Config c, err := s.Config(KnownControllers(), ControllersDisabledByDefault.List()) if err != nil { fmt.Fprintf(os.Stderr, "%v\n", err) os.Exit(1) } // 最关键的Run,这里是 neverStop if err := Run(c.Complete(), wait.NeverStop); err != nil { fmt.Fprintf(os.Stderr, "%v\n", err) os.Exit(1) } }, } fs := cmd.Flags() // 定义cobra的flags,这里就是定义参数的名称,默认值啥的。例如 --url --port等 namedFlagSets := s.Flags(KnownControllers(), ControllersDisabledByDefault.List()) for _, f := range namedFlagSets.FlagSets { fs.AddFlagSet(f) } //4.设置 help, usage函数 usageFmt := "Usage:\n %s\n" cols, _, _ := apiserverflag.TerminalSize(cmd.OutOrStdout()) cmd.SetUsageFunc(func(cmd *cobra.Command) error { fmt.Fprintf(cmd.OutOrStderr(), usageFmt, cmd.UseLine()) apiserverflag.PrintSections(cmd.OutOrStderr(), namedFlagSets, cols) return nil }) cmd.SetHelpFunc(func(cmd *cobra.Command, args []string) { fmt.Fprintf(cmd.OutOrStdout(), "%s\n\n"+usageFmt, cmd.Long, cmd.UseLine()) apiserverflag.PrintSections(cmd.OutOrStdout(), namedFlagSets, cols) }) return cmd } ```
这个就是 s.flags namedFlagSets := s.Flags(KnownControllers(), ControllersDisabledByDefault.List()) ``` // Flags returns flags for a specific APIServer by section name // 依次调用其他controller-manager的flags。 func (s *KubeControllerManagerOptions) Flags(allControllers []string, disabledByDefaultControllers []string) apiserverflag.NamedFlagSets { fss := apiserverflag.NamedFlagSets{} s.Generic.AddFlags(&fss, allControllers, disabledByDefaultControllers) s.KubeCloudShared.AddFlags(fss.FlagSet("generic")) s.ServiceController.AddFlags(fss.FlagSet("service controller")) s.SecureServing.AddFlags(fss.FlagSet("secure serving")) s.InsecureServing.AddUnqualifiedFlags(fss.FlagSet("insecure serving")) s.Authentication.AddFlags(fss.FlagSet("authentication")) s.Authorization.AddFlags(fss.FlagSet("authorization")) s.AttachDetachController.AddFlags(fss.FlagSet("attachdetach controller")) s.CSRSigningController.AddFlags(fss.FlagSet("csrsigning controller")) s.DeploymentController.AddFlags(fss.FlagSet("deployment controller")) s.DaemonSetController.AddFlags(fss.FlagSet("daemonset controller")) s.DeprecatedFlags.AddFlags(fss.FlagSet("deprecated")) s.EndpointController.AddFlags(fss.FlagSet("endpoint controller")) s.GarbageCollectorController.AddFlags(fss.FlagSet("garbagecollector controller")) s.HPAController.AddFlags(fss.FlagSet("horizontalpodautoscaling controller")) s.JobController.AddFlags(fss.FlagSet("job controller")) s.NamespaceController.AddFlags(fss.FlagSet("namespace controller")) s.NodeIPAMController.AddFlags(fss.FlagSet("nodeipam controller")) s.NodeLifecycleController.AddFlags(fss.FlagSet("nodelifecycle controller")) s.PersistentVolumeBinderController.AddFlags(fss.FlagSet("persistentvolume-binder controller")) s.PodGCController.AddFlags(fss.FlagSet("podgc controller")) s.ReplicaSetController.AddFlags(fss.FlagSet("replicaset controller")) s.ReplicationController.AddFlags(fss.FlagSet("replicationcontroller")) s.ResourceQuotaController.AddFlags(fss.FlagSet("resourcequota controller")) s.SAController.AddFlags(fss.FlagSet("serviceaccount controller")) s.TTLAfterFinishedController.AddFlags(fss.FlagSet("ttl-after-finished controller")) fs := fss.FlagSet("misc") fs.StringVar(&s.Master, "master", s.Master, "The address of the Kubernetes API server (overrides any value in kubeconfig).") fs.StringVar(&s.Kubeconfig, "kubeconfig", s.Kubeconfig, "Path to kubeconfig file with authorization and master location information.") var dummy string fs.MarkDeprecated("insecure-experimental-approve-all-kubelet-csrs-for-group", "This flag does nothing.") fs.StringVar(&dummy, "insecure-experimental-approve-all-kubelet-csrs-for-group", "", "This flag does nothing.") utilfeature.DefaultFeatureGate.AddFlag(fss.FlagSet("generic")) return fss } ``` ``` // AddFlags adds flags related to DeploymentController for controller manager to the specified FlagSet. func (o *DeploymentControllerOptions) AddFlags(fs *pflag.FlagSet) { if o == nil { return } fs.Int32Var(&o.ConcurrentDeploymentSyncs, "concurrent-deployment-syncs", o.ConcurrentDeploymentSyncs, "The number of deployment objects that are allowed to sync concurrently. Larger number = more responsive deployments, but more CPU (and network) load") fs.DurationVar(&o.DeploymentControllerSyncPeriod.Duration, "deployment-controller-sync-period", o.DeploymentControllerSyncPeriod.Duration, "Period for syncing the deployments.") } ``` 比如,以DeploymentControllerOptions.AddFlags为例,这里就是定义了concurrent-deployment-syncs,deployment-controller-sync-period这两个参数,并且赋了默认值。 参考附录,可以加深理解。
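The appendix shows a full cobra example; as a smaller, self-contained illustration of just the AddFlags pattern, the sketch below registers the same two deployment-controller flags on a spf13/pflag FlagSet. The struct, its defaults, and the parsed arguments are illustrative stand-ins, not the real kcm types.

```go
package main

import (
	"fmt"
	"time"

	"github.com/spf13/pflag"
)

// deploymentControllerOptions mirrors the AddFlags pattern above.
// The struct and defaults are illustrative, not the real kcm types.
type deploymentControllerOptions struct {
	ConcurrentDeploymentSyncs      int32
	DeploymentControllerSyncPeriod time.Duration
}

// AddFlags registers the flags and their default values on a pflag.FlagSet.
func (o *deploymentControllerOptions) AddFlags(fs *pflag.FlagSet) {
	if o == nil {
		return
	}
	fs.Int32Var(&o.ConcurrentDeploymentSyncs, "concurrent-deployment-syncs",
		o.ConcurrentDeploymentSyncs, "The number of deployment objects that are allowed to sync concurrently.")
	fs.DurationVar(&o.DeploymentControllerSyncPeriod, "deployment-controller-sync-period",
		o.DeploymentControllerSyncPeriod, "Period for syncing the deployments.")
}

func main() {
	// The defaults play the role of componentConfig in NewKubeControllerManagerOptions.
	opts := &deploymentControllerOptions{
		ConcurrentDeploymentSyncs:      5,
		DeploymentControllerSyncPeriod: 30 * time.Second,
	}

	fs := pflag.NewFlagSet("deployment controller", pflag.ExitOnError)
	opts.AddFlags(fs)

	// Simulate a command line that overrides one of the defaults.
	_ = fs.Parse([]string{"--concurrent-deployment-syncs=10"})
	fmt.Println(opts.ConcurrentDeploymentSyncs, opts.DeploymentControllerSyncPeriod) // 10 30s
}
```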
#### 1.1 NewKubeControllerManagerOptions 看起来这里是通过获取默认的参数配置,然后赋值给 KubeControllerManagerOptions ``` // NewKubeControllerManagerOptions creates a new KubeControllerManagerOptions with a default config. func NewKubeControllerManagerOptions() (*KubeControllerManagerOptions, error) { componentConfig, err := NewDefaultComponentConfig(ports.InsecureKubeControllerManagerPort) if err != nil { return nil, err } s := KubeControllerManagerOptions{ Generic: cmoptions.NewGenericControllerManagerConfigurationOptions(componentConfig.Generic), KubeCloudShared: cmoptions.NewKubeCloudSharedOptions(componentConfig.KubeCloudShared), AttachDetachController: &AttachDetachControllerOptions{ ReconcilerSyncLoopPeriod: componentConfig.AttachDetachController.ReconcilerSyncLoopPeriod, }, CSRSigningController: &CSRSigningControllerOptions{ ClusterSigningCertFile: componentConfig.CSRSigningController.ClusterSigningCertFile, ClusterSigningKeyFile: componentConfig.CSRSigningController.ClusterSigningKeyFile, ClusterSigningDuration: componentConfig.CSRSigningController.ClusterSigningDuration, }, DaemonSetController: &DaemonSetControllerOptions{ ConcurrentDaemonSetSyncs: componentConfig.DaemonSetController.ConcurrentDaemonSetSyncs, }, DeploymentController: &DeploymentControllerOptions{ ConcurrentDeploymentSyncs: componentConfig.DeploymentController.ConcurrentDeploymentSyncs, DeploymentControllerSyncPeriod: componentConfig.DeploymentController.DeploymentControllerSyncPeriod, }, DeprecatedFlags: &DeprecatedControllerOptions{ RegisterRetryCount: componentConfig.DeprecatedController.RegisterRetryCount, }, EndpointController: &EndpointControllerOptions{ ConcurrentEndpointSyncs: componentConfig.EndpointController.ConcurrentEndpointSyncs, }, GarbageCollectorController: &GarbageCollectorControllerOptions{ ConcurrentGCSyncs: componentConfig.GarbageCollectorController.ConcurrentGCSyncs, EnableGarbageCollector: componentConfig.GarbageCollectorController.EnableGarbageCollector, }, HPAController: &HPAControllerOptions{ HorizontalPodAutoscalerSyncPeriod: componentConfig.HPAController.HorizontalPodAutoscalerSyncPeriod, HorizontalPodAutoscalerUpscaleForbiddenWindow: componentConfig.HPAController.HorizontalPodAutoscalerUpscaleForbiddenWindow, HorizontalPodAutoscalerDownscaleForbiddenWindow: componentConfig.HPAController.HorizontalPodAutoscalerDownscaleForbiddenWindow, HorizontalPodAutoscalerDownscaleStabilizationWindow: componentConfig.HPAController.HorizontalPodAutoscalerDownscaleStabilizationWindow, HorizontalPodAutoscalerCPUInitializationPeriod: componentConfig.HPAController.HorizontalPodAutoscalerCPUInitializationPeriod, HorizontalPodAutoscalerInitialReadinessDelay: componentConfig.HPAController.HorizontalPodAutoscalerInitialReadinessDelay, HorizontalPodAutoscalerTolerance: componentConfig.HPAController.HorizontalPodAutoscalerTolerance, HorizontalPodAutoscalerUseRESTClients: componentConfig.HPAController.HorizontalPodAutoscalerUseRESTClients, }, JobController: &JobControllerOptions{ ConcurrentJobSyncs: componentConfig.JobController.ConcurrentJobSyncs, }, NamespaceController: &NamespaceControllerOptions{ NamespaceSyncPeriod: componentConfig.NamespaceController.NamespaceSyncPeriod, ConcurrentNamespaceSyncs: componentConfig.NamespaceController.ConcurrentNamespaceSyncs, }, NodeIPAMController: &NodeIPAMControllerOptions{ NodeCIDRMaskSize: componentConfig.NodeIPAMController.NodeCIDRMaskSize, }, NodeLifecycleController: &NodeLifecycleControllerOptions{ EnableTaintManager: componentConfig.NodeLifecycleController.EnableTaintManager, 
NodeMonitorGracePeriod: componentConfig.NodeLifecycleController.NodeMonitorGracePeriod, NodeStartupGracePeriod: componentConfig.NodeLifecycleController.NodeStartupGracePeriod, PodEvictionTimeout: componentConfig.NodeLifecycleController.PodEvictionTimeout, }, PersistentVolumeBinderController: &PersistentVolumeBinderControllerOptions{ PVClaimBinderSyncPeriod: componentConfig.PersistentVolumeBinderController.PVClaimBinderSyncPeriod, VolumeConfiguration: componentConfig.PersistentVolumeBinderController.VolumeConfiguration, }, PodGCController: &PodGCControllerOptions{ TerminatedPodGCThreshold: componentConfig.PodGCController.TerminatedPodGCThreshold, }, ReplicaSetController: &ReplicaSetControllerOptions{ ConcurrentRSSyncs: componentConfig.ReplicaSetController.ConcurrentRSSyncs, }, ReplicationController: &ReplicationControllerOptions{ ConcurrentRCSyncs: componentConfig.ReplicationController.ConcurrentRCSyncs, }, ResourceQuotaController: &ResourceQuotaControllerOptions{ ResourceQuotaSyncPeriod: componentConfig.ResourceQuotaController.ResourceQuotaSyncPeriod, ConcurrentResourceQuotaSyncs: componentConfig.ResourceQuotaController.ConcurrentResourceQuotaSyncs, }, SAController: &SAControllerOptions{ ConcurrentSATokenSyncs: componentConfig.SAController.ConcurrentSATokenSyncs, }, ServiceController: &cmoptions.ServiceControllerOptions{ ConcurrentServiceSyncs: componentConfig.ServiceController.ConcurrentServiceSyncs, }, TTLAfterFinishedController: &TTLAfterFinishedControllerOptions{ ConcurrentTTLSyncs: componentConfig.TTLAfterFinishedController.ConcurrentTTLSyncs, }, SecureServing: apiserveroptions.NewSecureServingOptions().WithLoopback(), InsecureServing: (&apiserveroptions.DeprecatedInsecureServingOptions{ BindAddress: net.ParseIP(componentConfig.Generic.Address), BindPort: int(componentConfig.Generic.Port), BindNetwork: "tcp", }).WithLoopback(), Authentication: apiserveroptions.NewDelegatingAuthenticationOptions(), Authorization: apiserveroptions.NewDelegatingAuthorizationOptions(), } s.Authentication.RemoteKubeConfigFileOptional = true s.Authorization.RemoteKubeConfigFileOptional = true s.Authorization.AlwaysAllowPaths = []string{"/healthz"} s.SecureServing.ServerCert.CertDirectory = "/var/run/kubernetes" s.SecureServing.ServerCert.PairName = "kube-controller-manager" s.SecureServing.BindPort = ports.KubeControllerManagerPort gcIgnoredResources := make([]kubectrlmgrconfig.GroupResource, 0, len(garbagecollector.DefaultIgnoredResources())) for r := range garbagecollector.DefaultIgnoredResources() { gcIgnoredResources = append(gcIgnoredResources, kubectrlmgrconfig.GroupResource{Group: r.Group, Resource: r.Resource}) } s.GarbageCollectorController.GCIgnoredResources = gcIgnoredResources return &s, nil } ``` 可以看出来这里的关键就是: (1)config函数 (2)Run函数
#### 1.2 s.config 实例化一个kubecontrollerconfig.Config 这个函数的参数是:allControllers []string, disabledByDefaultControllers []string 核心就是:kubecontrollerconfig.Config ``` // Config return a controller manager config objective func (s KubeControllerManagerOptions) Config(allControllers []string, disabledByDefaultControllers []string) (*kubecontrollerconfig.Config, error) { if err := s.Validate(allControllers, disabledByDefaultControllers); err != nil { return nil, err } if err := s.SecureServing.MaybeDefaultWithSelfSignedCerts("localhost", nil, []net.IP{net.ParseIP("127.0.0.1")}); err != nil { return nil, fmt.Errorf("error creating self-signed certificates: %v", err) } kubeconfig, err := clientcmd.BuildConfigFromFlags(s.Master, s.Kubeconfig) if err != nil { return nil, err } kubeconfig.ContentConfig.ContentType = s.Generic.ClientConnection.ContentType kubeconfig.QPS = s.Generic.ClientConnection.QPS kubeconfig.Burst = int(s.Generic.ClientConnection.Burst) client, err := clientset.NewForConfig(restclient.AddUserAgent(kubeconfig, KubeControllerManagerUserAgent)) if err != nil { return nil, err } // shallow copy, do not modify the kubeconfig.Timeout. config := *kubeconfig config.Timeout = s.Generic.LeaderElection.RenewDeadline.Duration leaderElectionClient := clientset.NewForConfigOrDie(restclient.AddUserAgent(&config, "leader-election")) eventRecorder := createRecorder(client, KubeControllerManagerUserAgent) // 核心就是定义好这样一个结构体 c := &kubecontrollerconfig.Config{ Client: client, //用于api-server通信 Kubeconfig: kubeconfig, //kube-config EventRecorder: eventRecorder, //event上报 LeaderElectionClient: leaderElectionClient, //选举的客户端 } if err := s.ApplyTo(c); err != nil { return nil, err } return c, nil } ```
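The core of Config() above is turning --master / --kubeconfig into a rest.Config and a clientset. Below is a minimal sketch of that step under assumed values; the kubeconfig path, QPS and Burst are illustrative, not taken from the kcm defaults.

```go
package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// buildClients sketches what Config() does with --master / --kubeconfig:
// build a rest.Config, tune QPS/Burst, and create a clientset with a user agent.
func buildClients(master, kubeconfig string) (*kubernetes.Clientset, error) {
	cfg, err := clientcmd.BuildConfigFromFlags(master, kubeconfig)
	if err != nil {
		return nil, err
	}
	cfg.QPS = 20   // illustrative values; kcm takes these from its
	cfg.Burst = 30 // client-connection options
	return kubernetes.NewForConfig(rest.AddUserAgent(cfg, "kube-controller-manager"))
}

func main() {
	// The kubeconfig path is illustrative.
	if _, err := buildClients("", "/root/.kube/config"); err != nil {
		log.Fatalf("failed to build clients: %v", err)
	}
}
```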
##### 1.2.1 s.applyTo ``` // ApplyTo fills up controller manager config with options. func (s *KubeControllerManagerOptions) ApplyTo(c *kubecontrollerconfig.Config) error { if err := s.Generic.ApplyTo(&c.ComponentConfig.Generic); err != nil { return err } if err := s.KubeCloudShared.ApplyTo(&c.ComponentConfig.KubeCloudShared); err != nil { return err } if err := s.AttachDetachController.ApplyTo(&c.ComponentConfig.AttachDetachController); err != nil { return err } if err := s.CSRSigningController.ApplyTo(&c.ComponentConfig.CSRSigningController); err != nil { return err } if err := s.DaemonSetController.ApplyTo(&c.ComponentConfig.DaemonSetController); err != nil { return err } if err := s.DeploymentController.ApplyTo(&c.ComponentConfig.DeploymentController); err != nil { return err } if err := s.DeprecatedFlags.ApplyTo(&c.ComponentConfig.DeprecatedController); err != nil { return err } if err := s.EndpointController.ApplyTo(&c.ComponentConfig.EndpointController); err != nil { return err } if err := s.GarbageCollectorController.ApplyTo(&c.ComponentConfig.GarbageCollectorController); err != nil { return err } if err := s.HPAController.ApplyTo(&c.ComponentConfig.HPAController); err != nil { return err } if err := s.JobController.ApplyTo(&c.ComponentConfig.JobController); err != nil { return err } if err := s.NamespaceController.ApplyTo(&c.ComponentConfig.NamespaceController); err != nil { return err } if err := s.NodeIPAMController.ApplyTo(&c.ComponentConfig.NodeIPAMController); err != nil { return err } if err := s.NodeLifecycleController.ApplyTo(&c.ComponentConfig.NodeLifecycleController); err != nil { return err } if err := s.PersistentVolumeBinderController.ApplyTo(&c.ComponentConfig.PersistentVolumeBinderController); err != nil { return err } if err := s.PodGCController.ApplyTo(&c.ComponentConfig.PodGCController); err != nil { return err } if err := s.ReplicaSetController.ApplyTo(&c.ComponentConfig.ReplicaSetController); err != nil { return err } if err := s.ReplicationController.ApplyTo(&c.ComponentConfig.ReplicationController); err != nil { return err } if err := s.ResourceQuotaController.ApplyTo(&c.ComponentConfig.ResourceQuotaController); err != nil { return err } if err := s.SAController.ApplyTo(&c.ComponentConfig.SAController); err != nil { return err } if err := s.ServiceController.ApplyTo(&c.ComponentConfig.ServiceController); err != nil { return err } if err := s.TTLAfterFinishedController.ApplyTo(&c.ComponentConfig.TTLAfterFinishedController); err != nil { return err } if err := s.InsecureServing.ApplyTo(&c.InsecureServing, &c.LoopbackClientConfig); err != nil { return err } if err := s.SecureServing.ApplyTo(&c.SecureServing, &c.LoopbackClientConfig); err != nil { return err } if s.SecureServing.BindPort != 0 || s.SecureServing.Listener != nil { if err := s.Authentication.ApplyTo(&c.Authentication, c.SecureServing, nil); err != nil { return err } if err := s.Authorization.ApplyTo(&c.Authorization); err != nil { return err } } // sync back to component config // TODO: find more elegant way than syncing back the values. c.ComponentConfig.Generic.Port = int32(s.InsecureServing.BindPort) c.ComponentConfig.Generic.Address = s.InsecureServing.BindAddress.String() return nil } ``` applyto 函数的逻辑就是根据KubeControllerManagerOptions,赋值给c *kubecontrollerconfig.Config。 这里随便找一个applyto具体实现看看就知道了 ``` // ApplyTo fills up AttachDetachController config with options. 
func (o *AttachDetachControllerOptions) ApplyTo(cfg *kubectrlmgrconfig.AttachDetachControllerConfiguration) error { if o == nil { return nil } cfg.DisableAttachDetachReconcilerSync = o.DisableAttachDetachReconcilerSync cfg.ReconcilerSyncLoopPeriod = o.ReconcilerSyncLoopPeriod return nil } ``` ##### 1.2.2 结构体定义 cmd\kube-controller-manager\app\config\config.go ApplyTO函数的最终目的就是实例化这样一个结构体。 ``` kubecontrollerconfig.Config // Config is the main context object for the controller manager. type Config struct { ComponentConfig kubectrlmgrconfig.KubeControllerManagerConfiguration //这个是各种manager的config,如下 SecureServing *apiserver.SecureServingInfo // LoopbackClientConfig is a config for a privileged loopback connection LoopbackClientConfig *restclient.Config // TODO: remove deprecated insecure serving InsecureServing *apiserver.DeprecatedInsecureServingInfo Authentication apiserver.AuthenticationInfo Authorization apiserver.AuthorizationInfo // the general kube client Client *clientset.Clientset // the client only used for leader election LeaderElectionClient *clientset.Clientset // the rest config for the master Kubeconfig *restclient.Config // the event sink EventRecorder record.EventRecorder } ``` pkg\controller\apis\config\types.go ``` // KubeControllerManagerConfiguration contains elements describing kube-controller manager. type KubeControllerManagerConfiguration struct { metav1.TypeMeta // Generic holds configuration for a generic controller-manager Generic GenericControllerManagerConfiguration // KubeCloudSharedConfiguration holds configuration for shared related features // both in cloud controller manager and kube-controller manager. KubeCloudShared KubeCloudSharedConfiguration // AttachDetachControllerConfiguration holds configuration for // AttachDetachController related features. AttachDetachController AttachDetachControllerConfiguration // CSRSigningControllerConfiguration holds configuration for // CSRSigningController related features. CSRSigningController CSRSigningControllerConfiguration // DaemonSetControllerConfiguration holds configuration for DaemonSetController // related features. DaemonSetController DaemonSetControllerConfiguration // DeploymentControllerConfiguration holds configuration for // DeploymentController related features. DeploymentController DeploymentControllerConfiguration // DeprecatedControllerConfiguration holds configuration for some deprecated // features. DeprecatedController DeprecatedControllerConfiguration // EndpointControllerConfiguration holds configuration for EndpointController // related features. EndpointController EndpointControllerConfiguration // GarbageCollectorControllerConfiguration holds configuration for // GarbageCollectorController related features. GarbageCollectorController GarbageCollectorControllerConfiguration // HPAControllerConfiguration holds configuration for HPAController related features. HPAController HPAControllerConfiguration // JobControllerConfiguration holds configuration for JobController related features. JobController JobControllerConfiguration // NamespaceControllerConfiguration holds configuration for NamespaceController // related features. NamespaceController NamespaceControllerConfiguration // NodeIPAMControllerConfiguration holds configuration for NodeIPAMController // related features. NodeIPAMController NodeIPAMControllerConfiguration // NodeLifecycleControllerConfiguration holds configuration for // NodeLifecycleController related features. 
NodeLifecycleController NodeLifecycleControllerConfiguration // PersistentVolumeBinderControllerConfiguration holds configuration for // PersistentVolumeBinderController related features. PersistentVolumeBinderController PersistentVolumeBinderControllerConfiguration // PodGCControllerConfiguration holds configuration for PodGCController // related features. PodGCController PodGCControllerConfiguration // ReplicaSetControllerConfiguration holds configuration for ReplicaSet related features. ReplicaSetController ReplicaSetControllerConfiguration // ReplicationControllerConfiguration holds configuration for // ReplicationController related features. ReplicationController ReplicationControllerConfiguration // ResourceQuotaControllerConfiguration holds configuration for // ResourceQuotaController related features. ResourceQuotaController ResourceQuotaControllerConfiguration // SAControllerConfiguration holds configuration for ServiceAccountController // related features. SAController SAControllerConfiguration // ServiceControllerConfiguration holds configuration for ServiceController // related features. ServiceController ServiceControllerConfiguration // TTLAfterFinishedControllerConfiguration holds configuration for // TTLAfterFinishedController related features. TTLAfterFinishedController TTLAfterFinishedControllerConfiguration } ```
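Put together, the options/config pair follows a simple copy pattern: AddFlags fills the options from the command line, and ApplyTo copies them into the per-controller configuration. A toy sketch of that pattern is shown below; the types and field names are stand-ins, not the kcm ones.

```go
package main

import "fmt"

// Toy illustration of the Options -> Config copy that ApplyTo performs.
// The types and field names are stand-ins, not the kcm ones.
type demoControllerOptions struct{ ConcurrentSyncs int32 }
type demoControllerConfiguration struct{ ConcurrentSyncs int32 }

// ApplyTo copies the user-facing option values into the runtime configuration.
func (o *demoControllerOptions) ApplyTo(cfg *demoControllerConfiguration) error {
	if o == nil {
		return nil
	}
	cfg.ConcurrentSyncs = o.ConcurrentSyncs
	return nil
}

func main() {
	opts := &demoControllerOptions{ConcurrentSyncs: 5} // in kcm this value comes from AddFlags
	var cfg demoControllerConfiguration
	if err := opts.ApplyTo(&cfg); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg) // {ConcurrentSyncs:5}
}
```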
#### 1.3 Run 所以,经过1.1 config函数 c就是补全了所有的config ``` // Run runs the KubeControllerManagerOptions. This should never exit. func Run(c *config.CompletedConfig, stopCh <-chan struct{}) error { // To help debugging, immediately log version glog.Infof("Version: %+v", version.Get()) if cfgz, err := configz.New("componentconfig"); err == nil { cfgz.Set(c.ComponentConfig) } else { glog.Errorf("unable to register configz: %c", err) } // 1.开启http server。默认暴露的端口号:10252。用于controller-manager服务性能检测(如:/debug/profile)及暴露服务相关的metrics供promtheus用于监控。 // Start the controller manager HTTP server // unsecuredMux is the handler for these controller *after* authn/authz filters have been applied var unsecuredMux *mux.PathRecorderMux if c.SecureServing != nil { unsecuredMux = genericcontrollermanager.NewBaseHandler(&c.ComponentConfig.Generic.Debugging) handler := genericcontrollermanager.BuildHandlerChain(unsecuredMux, &c.Authorization, &c.Authentication) if err := c.SecureServing.Serve(handler, 0, stopCh); err != nil { return err } } if c.InsecureServing != nil { unsecuredMux = genericcontrollermanager.NewBaseHandler(&c.ComponentConfig.Generic.Debugging) insecureSuperuserAuthn := server.AuthenticationInfo{Authenticator: &server.InsecureSuperuser{}} handler := genericcontrollermanager.BuildHandlerChain(unsecuredMux, nil, &insecureSuperuserAuthn) if err := c.InsecureServing.Serve(handler, 0, stopCh); err != nil { return err } } // 2. 定义好run函数 run := func(ctx context.Context) { rootClientBuilder := controller.SimpleControllerClientBuilder{ ClientConfig: c.Kubeconfig, } var clientBuilder controller.ControllerClientBuilder if c.ComponentConfig.KubeCloudShared.UseServiceAccountCredentials { if len(c.ComponentConfig.SAController.ServiceAccountKeyFile) == 0 { // It'c possible another controller process is creating the tokens for us. // If one isn't, we'll timeout and exit when our client builder is unable to create the tokens. glog.Warningf("--use-service-account-credentials was specified without providing a --service-account-private-key-file") } clientBuilder = controller.SAControllerClientBuilder{ ClientConfig: restclient.AnonymousClientConfig(c.Kubeconfig), CoreClient: c.Client.CoreV1(), AuthenticationClient: c.Client.AuthenticationV1(), Namespace: "kube-system", } } else { clientBuilder = rootClientBuilder } controllerContext, err := CreateControllerContext(c, rootClientBuilder, clientBuilder, ctx.Done()) if err != nil { glog.Fatalf("error building controller context: %v", err) } saTokenControllerInitFunc := serviceAccountTokenControllerStarter{rootClientBuilder: rootClientBuilder}.startServiceAccountTokenController if err := StartControllers(controllerContext, saTokenControllerInitFunc, NewControllerInitializers(controllerContext.LoopMode), unsecuredMux); err != nil { glog.Fatalf("error starting controllers: %v", err) } controllerContext.InformerFactory.Start(controllerContext.Stop) close(controllerContext.InformersStarted) select {} } // 3. 
如果没有多个就直接run if !c.ComponentConfig.Generic.LeaderElection.LeaderElect { run(context.TODO()) panic("unreachable") } id, err := os.Hostname() if err != nil { return err } // add a uniquifier so that two processes on the same host don't accidentally both become active id = id + "_" + string(uuid.NewUUID()) rl, err := resourcelock.New(c.ComponentConfig.Generic.LeaderElection.ResourceLock, "kube-system", "kube-controller-manager", c.LeaderElectionClient.CoreV1(), resourcelock.ResourceLockConfig{ Identity: id, EventRecorder: c.EventRecorder, }) if err != nil { glog.Fatalf("error creating lock: %v", err) } // 4.设置了选举 leaderelection.RunOrDie(context.TODO(), leaderelection.LeaderElectionConfig{ Lock: rl, LeaseDuration: c.ComponentConfig.Generic.LeaderElection.LeaseDuration.Duration, RenewDeadline: c.ComponentConfig.Generic.LeaderElection.RenewDeadline.Duration, RetryPeriod: c.ComponentConfig.Generic.LeaderElection.RetryPeriod.Duration, Callbacks: leaderelection.LeaderCallbacks{ OnStartedLeading: run, //leader 运行run函数,这个就是第二步定义的函数 OnStoppedLeading: func() { // 非leader就打印这个日志。 glog.Fatalf("leaderelection lost") }, }, }) panic("unreachable") } ```
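The tail of Run is client-go's leaderelection.RunOrDie. Below is a minimal, hedged sketch of the same flow, using the Lease lock favoured by current client-go examples rather than the endpoints lock kcm uses here (see 2.2); the kubeconfig path, lock name and durations are illustrative.

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	// Illustrative kubeconfig path.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	id, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Namespace: "kube-system", Name: "demo-controller"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				log.Println("became leader, starting controllers")
				<-ctx.Done() // kcm would run its controllers here
			},
			OnStoppedLeading: func() {
				log.Fatal("leaderelection lost")
			},
		},
	})
}
```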
#### 1.4 run函数 这里就是初始化clientBuilder,然后就StartControllers。 ``` run := func(ctx context.Context) { rootClientBuilder := controller.SimpleControllerClientBuilder{ ClientConfig: c.Kubeconfig, } var clientBuilder controller.ControllerClientBuilder if c.ComponentConfig.KubeCloudShared.UseServiceAccountCredentials { if len(c.ComponentConfig.SAController.ServiceAccountKeyFile) == 0 { // It'c possible another controller process is creating the tokens for us. // If one isn't, we'll timeout and exit when our client builder is unable to create the tokens. glog.Warningf("--use-service-account-credentials was specified without providing a --service-account-private-key-file") } clientBuilder = controller.SAControllerClientBuilder{ ClientConfig: restclient.AnonymousClientConfig(c.Kubeconfig), CoreClient: c.Client.CoreV1(), AuthenticationClient: c.Client.AuthenticationV1(), Namespace: "kube-system", } } else { clientBuilder = rootClientBuilder } controllerContext, err := CreateControllerContext(c, rootClientBuilder, clientBuilder, ctx.Done()) if err != nil { glog.Fatalf("error building controller context: %v", err) } saTokenControllerInitFunc := serviceAccountTokenControllerStarter{rootClientBuilder: rootClientBuilder}.startServiceAccountTokenController if err := StartControllers(controllerContext, saTokenControllerInitFunc, NewControllerInitializers(controllerContext.LoopMode), unsecuredMux); err != nil { glog.Fatalf("error starting controllers: %v", err) } controllerContext.InformerFactory.Start(controllerContext.Stop) close(controllerContext.InformersStarted) select {} } ``` startController这里有一个参数是函数NewControllerInitializers,从这里可以看到有这么多controller ``` // NewControllerInitializers is a public map of named controller groups (you can start more than one in an init func) // paired to their InitFunc. This allows for structured downstream composition and subdivision. func NewControllerInitializers(loopMode ControllerLoopMode) map[string]InitFunc { controllers := map[string]InitFunc{} controllers["endpoint"] = startEndpointController controllers["replicationcontroller"] = startReplicationController controllers["podgc"] = startPodGCController controllers["resourcequota"] = startResourceQuotaController controllers["namespace"] = startNamespaceController controllers["serviceaccount"] = startServiceAccountController controllers["garbagecollector"] = startGarbageCollectorController controllers["daemonset"] = startDaemonSetController controllers["job"] = startJobController controllers["deployment"] = startDeploymentController controllers["replicaset"] = startReplicaSetController controllers["horizontalpodautoscaling"] = startHPAController controllers["disruption"] = startDisruptionController controllers["statefulset"] = startStatefulSetController controllers["cronjob"] = startCronJobController controllers["csrsigning"] = startCSRSigningController controllers["csrapproving"] = startCSRApprovingController controllers["csrcleaner"] = startCSRCleanerController controllers["ttl"] = startTTLController controllers["bootstrapsigner"] = startBootstrapSignerController controllers["tokencleaner"] = startTokenCleanerController controllers["nodeipam"] = startNodeIpamController if loopMode == IncludeCloudLoops { controllers["service"] = startServiceController controllers["route"] = startRouteController // TODO: volume controller into the IncludeCloudLoops only set. // TODO: Separate cluster in cloud check from node lifecycle controller. 
} controllers["nodelifecycle"] = startNodeLifecycleController controllers["persistentvolume-binder"] = startPersistentVolumeBinderController controllers["attachdetach"] = startAttachDetachController controllers["persistentvolume-expander"] = startVolumeExpandController controllers["clusterrole-aggregation"] = startClusterRoleAggregrationController controllers["pvc-protection"] = startPVCProtectionController controllers["pv-protection"] = startPVProtectionController controllers["ttl-after-finished"] = startTTLAfterFinishedController return controllers } ```
#### 1.5 StartControllers ``` func StartControllers(ctx ControllerContext, startSATokenController InitFunc, controllers map[string]InitFunc, unsecuredMux *mux.PathRecorderMux) error { // Always start the SA token controller first using a full-power client, since it needs to mint tokens for the rest // If this fails, just return here and fail since other controllers won't be able to get credentials. if _, _, err := startSATokenController(ctx); err != nil { return err } // Initialize the cloud provider with a reference to the clientBuilder only after token controller // has started in case the cloud provider uses the client builder. if ctx.Cloud != nil { ctx.Cloud.Initialize(ctx.ClientBuilder) } // 依次启动controller,这里为啥不用协程呢? for controllerName, initFn := range controllers { if !ctx.IsControllerEnabled(controllerName) { glog.Warningf("%q is disabled", controllerName) continue } time.Sleep(wait.Jitter(ctx.ComponentConfig.Generic.ControllerStartInterval.Duration, ControllerStartJitter)) glog.V(1).Infof("Starting %q", controllerName) // 注意这里的 initFn就是NewControllerInitializers 中指定了。 debugHandler, started, err := initFn(ctx) if err != nil { glog.Errorf("Error starting %q", controllerName) return err } if !started { glog.Warningf("Skipping %q", controllerName) continue } if debugHandler != nil && unsecuredMux != nil { basePath := "/debug/controllers/" + controllerName unsecuredMux.UnlistedHandle(basePath, http.StripPrefix(basePath, debugHandler)) unsecuredMux.UnlistedHandlePrefix(basePath+"/", http.StripPrefix(basePath, debugHandler)) } glog.Infof("Started %q", controllerName) } return nil } ```
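To make the InitFunc/StartControllers pattern concrete, here is a toy, self-contained sketch; the types below are stand-ins for kcm's ControllerContext and InitFunc, not the real API.

```go
package main

import (
	"fmt"
	"net/http"
)

// Toy stand-ins for kcm's ControllerContext and InitFunc; the real types
// carry informer factories, clients, and the component config.
type controllerContext struct {
	stop <-chan struct{}
}

type initFunc func(ctx controllerContext) (debugHandler http.Handler, started bool, err error)

// startDemoController plays the role of startDeploymentController, startJobController, etc.
func startDemoController(ctx controllerContext) (http.Handler, bool, error) {
	go func() {
		<-ctx.stop // a real controller runs its sync loop until the stop channel closes
	}()
	return nil, true, nil
}

func main() {
	controllers := map[string]initFunc{"demo": startDemoController}
	ctx := controllerContext{stop: make(chan struct{})}

	// The same loop shape as StartControllers: skip disabled/failed controllers,
	// otherwise record that the controller started.
	for name, initFn := range controllers {
		_, started, err := initFn(ctx)
		if err != nil || !started {
			fmt.Println("skipping", name, err)
			continue
		}
		fmt.Println("started", name)
	}
}
```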
#### 1.6 总结 ##### 1.6.1 整体流程 (1) NewControllerManagerCommand中定义了 NewKubeControllerManagerOptions,名为s。同时调用这个,将命令行的参数,赋值给s ``` fs := cmd.Flags() // 定义cobra的flags,这里就是定义参数的名称,默认值啥的。例如 --url --port等 namedFlagSets := s.Flags(KnownControllers(), ControllersDisabledByDefault.List()) for _, f := range namedFlagSets.FlagSets { fs.AddFlagSet(f) } ``` (2)然后通过 s.config 实例化一个kubecontrollerconfig.Config, 名为c (3)通过ApplyTo,将s的值赋值给 C。 这样C对应的每个controller都有自己的config (4)然后就开始Run逻辑。 ``` Run逻辑: (1) 首先初始化clientBuilder (2) 然后定义好真正运行的run函数。run函数依次运行所有的controller的init函数。这样每个controller的起点就是这个 init函数。 (3) 调用选举函数,leader运行run。非leader打印,失去leader锁的日志。 ```
##### 1.6.2 一些思考

(1) Why does parameter handling need both an options struct and a config struct? Why not assign the values directly, the way the appendix example does?

One way to look at it: kube-controller-manager tries to separate mechanism from policy. The options are oriented toward cmd.Flags and exist to receive the startup parameters the user passes to kcm; the config is oriented toward kcm itself, and in particular makes it easier to start each controller inside kcm, since every controller gets its own config.

The benefit is that options and config stay decoupled: options are populated through AddFlags, while config is populated through ApplyTo.

### 2. 附录

#### 2.1 cobra实践

```
package main

import (
	"flag"
	"fmt"

	"github.com/spf13/cobra"
)

type Config struct {
	url string
}

func main() {
	var config = &Config{}
	var rootCmd = &cobra.Command{
		Use: "test cobra",
		Run: func(cmd *cobra.Command, args []string) {
			fmt.Println(config.url)
		},
	}
	rootCmd.PersistentFlags().AddGoFlagSet(flag.CommandLine)
	rootCmd.Flags().StringVarP(&config.url, "arg-url", "", "www.baidu.com", "the url used to connect to baidu")
	rootCmd.Execute()
}

E:\goWork\src\practice>cobra.exe --arg-url aaa
aaa

E:\goWork\src\practice>cobra.exe
www.baidu.com
```
#### 2.2 k8s中的选举机制 k8s中的选举机制在Client-go包中实现。具体的做法是:多个客户端创建一起创建成功资源,哪一个goroutine 获得锁,哪一个就是主。 选择config, ep的原因在于他们被list-watcher比较少,后期由于svc, ingres等发展,现在主要是用configmap来做。 比如这个:当前kcm的锁就在k8s-master这个节点上。 ``` root@k8s-master:~# kubectl get ep -n kube-system kube-controller-manager -o yaml apiVersion: v1 kind: Endpoints metadata: annotations: control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"k8s-master_904e0225-871d-4ff2-becc-62b58a20e3c7","leaseDurationSeconds":15,"acquireTime":"2021-07-16T21:03:58Z","renewTime":"2021-07-17T10:35:50Z","leaderTransitions":24}' creationTimestamp: "2021-06-05T12:50:04Z" name: kube-controller-manager namespace: kube-system resourceVersion: "8831710" selfLink: /api/v1/namespaces/kube-system/endpoints/kube-controller-manager uid: 5d530096-9b10-45bb-a11e-43f1f8733fa5 ``` ================================================ FILE: k8s/kcm/1-rs controller-manager源码分析.md ================================================ Table of Contents ================= * [1. startReplicaSetController](#1-startreplicasetcontroller) * [1.1 rs中的expectations机制](#11-rs中的expectations机制) * [2. Pod,rs变化时对应的处理逻辑](#2-podrs变化时对应的处理逻辑) * [2.1 addPod](#21-addpod) * [2.2 updatePod](#22-updatepod) * [2.3 deletePod](#23-deletepod) * [2.4 addRS](#24-addrs) * [2.5 updateRS](#25-updaters) * [2.6 deleteRS](#26-deleters) * [3. rs的处理逻辑](#3-rs的处理逻辑) * [3.1 过滤pod](#31-过滤pod) * [3.2 manageReplicas](#32-managereplicas) * [3.2.1 创建pod](#321-创建pod) * [3.2.2 删除pod](#322-删除pod) * [3.3 calculateStatus](#33-calculatestatus) * [4 总结](#4-总结) ### 1. startReplicaSetController 和deployController一样,kcm中定义了startReplicaSetController,startReplicaSetController和所有的控制器一样,先New一个对象,然后调用run函数。 这里可以看出来,rs控制器监听rs, 和pod的变化。 ``` func startReplicaSetController(ctx ControllerContext) (http.Handler, bool, error) { if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "replicasets"}] { return nil, false, nil } go replicaset.NewReplicaSetController( ctx.InformerFactory.Apps().V1().ReplicaSets(), ctx.InformerFactory.Core().V1().Pods(), ctx.ClientBuilder.ClientOrDie("replicaset-controller"), replicaset.BurstReplicas, ).Run(int(ctx.ComponentConfig.ReplicaSetController.ConcurrentRSSyncs), ctx.Stop) return nil, true, nil } ```
先NewReplicaSetController,再run ``` // NewReplicaSetController configures a replica set controller with the specified event recorder func NewReplicaSetController(rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, kubeClient clientset.Interface, burstReplicas int) *ReplicaSetController { // event上传 eventBroadcaster := record.NewBroadcaster() eventBroadcaster.StartLogging(glog.Infof) eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")}) return NewBaseController(rsInformer, podInformer, kubeClient, burstReplicas, apps.SchemeGroupVersion.WithKind("ReplicaSet"), "replicaset_controller", "replicaset", controller.RealPodControl{ KubeClient: kubeClient, Recorder: eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "replicaset-controller"}), }, ) } ``` ``` // NewBaseController is the implementation of NewReplicaSetController with additional injected // parameters so that it can also serve as the implementation of NewReplicationController. func NewBaseController(rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, kubeClient clientset.Interface, burstReplicas int, gvk schema.GroupVersionKind, metricOwnerName, queueName string, podControl controller.PodControlInterface) *ReplicaSetController { if kubeClient != nil && kubeClient.CoreV1().RESTClient().GetRateLimiter() != nil { metrics.RegisterMetricAndTrackRateLimiterUsage(metricOwnerName, kubeClient.CoreV1().RESTClient().GetRateLimiter()) } rsc := &ReplicaSetController{ GroupVersionKind: gvk, kubeClient: kubeClient, podControl: podControl, burstReplicas: burstReplicas, expectations: controller.NewUIDTrackingControllerExpectations(controller.NewControllerExpectations()), queue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), queueName), } rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ AddFunc: rsc.enqueueReplicaSet, UpdateFunc: rsc.updateRS, // This will enter the sync loop and no-op, because the replica set has been deleted from the store. // Note that deleting a replica set immediately after scaling it to 0 will not work. The recommended // way of achieving this is by performing a `stop` operation on the replica set. DeleteFunc: rsc.enqueueReplicaSet, }) rsc.rsLister = rsInformer.Lister() rsc.rsListerSynced = rsInformer.Informer().HasSynced podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ AddFunc: rsc.addPod, // This invokes the ReplicaSet for every pod change, eg: host assignment. Though this might seem like // overkill the most frequent pod update is status, and the associated ReplicaSet will only list from // local storage, so it should be ok. UpdateFunc: rsc.updatePod, DeleteFunc: rsc.deletePod, }) rsc.podLister = podInformer.Lister() rsc.podListerSynced = podInformer.Informer().HasSynced rsc.syncHandler = rsc.syncReplicaSet return rsc } ``` 这里注意一点,syncHandler函数是 syncReplicaSet。
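NewBaseController is mostly about wiring event handlers onto the rs and pod informers. Below is a minimal, self-contained sketch of that wiring for the pod informer; the kubeconfig path and resync period are illustrative assumptions, not values from the rs controller.

```go
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Illustrative kubeconfig path and resync period.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods()

	// The same kind of wiring NewBaseController does for the pod informer.
	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { fmt.Println("add", obj.(*v1.Pod).Name) },
		UpdateFunc: func(old, cur interface{}) { fmt.Println("update", cur.(*v1.Pod).Name) },
		DeleteFunc: func(obj interface{}) { fmt.Println("delete") }, // obj may be a tombstone
	})

	stop := make(chan struct{})
	factory.Start(stop)
	cache.WaitForCacheSync(stop, podInformer.Informer().HasSynced)
	<-stop
}
```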
#### 1.1 rs中的expectations机制

Before looking at how the rs controller reacts to rs and pod changes, it helps to understand the expectations mechanism first, because the handlers below (addPod, updatePod, deletePod, addRS, deleteRS, ...) all rely on it.

expectations can be thought of as a map with four key fields per entry:

- Key: composed of the rs namespace and the rs name (ns/name)
- Add: how many more pod creations this rs still expects to observe
- Del: how many more pod deletions this rs still expects to observe
- Time: when the expectations for this rs were last set, used to decide whether the record has expired (the expiry period is 5 minutes, see isExpired below)

| Key         | Add  | Del  | Time                |
| ----------- | ---- | ---- | ------------------- |
| Default/zx1 | 0    | 0    | 2021.07.04 16:00:00 |
| zx/zx1      | 1    | 0    | 2021.07.04 16:00:00 |
**GetExpectations**: 输入是key, 输出整个map; **SatisfiedExpectations**: 输入key, 输出bool;判断某个rs是否符合预期。符合预期: add<=0 && del<=0 或者 超过了同步周期; 其他情况都是不符合预期。 **DeleteExpectations**:输入key, 无输出;从map(缓存)中删除这个key **SetExpectations**:输入(key, add, del); 在map中新增加一行。 **这个会更新时间,将time复制为time.Now** **ExpectCreations**: 输入(key, add); 覆盖map中的内容,del=0, add等于函数的参数。 **这个会更新时间,将time复制为time.Now** **ExpectDeletions**: 输入(key, del); 覆盖map中的内容,add=0, del等于函数的参数。 **这个会更新时间,将time复制为time.Now** **CreationObserved**: 输入(key) ; map中对应的行中 add-1 **DeletionObserved**: 输入(key); map中对应的行中 del-1 **RaiseExpectations**: 输入(key, add, del); map中对应的行中 Add+add, Del+del **LowerExpectations**: 输入(key, add, del); map中对应的行中 Add-add, Del-del ``` // A TTLCache of pod creates/deletes each rc expects to see. expectations *controller.UIDTrackingControllerExpectations type UIDTrackingControllerExpectations struct { ControllerExpectationsInterface // 原生锁, 这里带了时间操作 sync/mutex.go uidStoreLock sync.Mutex // 缓存 uidStore cache.Store } type ControllerExpectationsInterface interface { GetExpectations(controllerKey string) (*ControlleeExpectations, bool, error) SatisfiedExpectations(controllerKey string) bool DeleteExpectations(controllerKey string) SetExpectations(controllerKey string, add, del int) error ExpectCreations(controllerKey string, adds int) error ExpectDeletions(controllerKey string, dels int) error CreationObserved(controllerKey string) DeletionObserved(controllerKey string) RaiseExpectations(controllerKey string, add, del int) LowerExpectations(controllerKey string, add, del int) } // ControlleeExpectations track controllee creates/deletes. type ControlleeExpectations struct { // Important: Since these two int64 fields are using sync/atomic, they have to be at the top of the struct due to a bug on 32-bit platforms // See: https://golang.org/pkg/sync/atomic/ for more information add int64 del int64 key string timestamp time.Time } // SatisfiedExpectations returns true if the required adds/dels for the given controller have been observed. // Add/del counts are established by the controller at sync time, and updated as controllees are observed by the controller // manager. func (r *ControllerExpectations) SatisfiedExpectations(controllerKey string) bool { if exp, exists, err := r.GetExpectations(controllerKey); exists { // Fulfilled就是 add<=0并且del<=0 if exp.Fulfilled() { klog.V(4).Infof("Controller expectations fulfilled %#v", exp) return true } else if exp.isExpired() { klog.V(4).Infof("Controller expectations expired %#v", exp) return true } else { klog.V(4).Infof("Controller still waiting on expectations %#v", exp) return false } } else if err != nil { klog.V(2).Infof("Error encountered while checking expectations %#v, forcing sync", err) } else { // When a new controller is created, it doesn't have expectations. // When it doesn't see expected watch events for > TTL, the expectations expire. // - In this case it wakes up, creates/deletes controllees, and sets expectations again. // When it has satisfied expectations and no controllees need to be created/destroyed > TTL, the expectations expire. // - In this case it continues without setting expectations till it needs to create/delete controllees. klog.V(4).Infof("Controller %v either never recorded expectations, or the ttl expired.", controllerKey) } // Trigger a sync if we either encountered and error (which shouldn't happen since we're // getting from local store) or this controller hasn't established expectations. 
return true } // Fulfilled就是 add<=0并且del<=0 // Fulfilled returns true if this expectation has been fulfilled. func (e *ControlleeExpectations) Fulfilled() bool { // TODO: think about why this line being atomic doesn't matter return atomic.LoadInt64(&e.add) <= 0 && atomic.LoadInt64(&e.del) <= 0 } // 判断是否超过同步周期,同步周期是5分钟 func (exp *ControlleeExpectations) isExpired() bool { return clock.RealClock{}.Since(exp.timestamp) > ExpectationsTimeout } // 这个会覆盖之前的行,并且del=0 func (r *ControllerExpectations) ExpectCreations(controllerKey string, adds int) error { return r.SetExpectations(controllerKey, adds, 0) } ```
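To check the understanding above, here is a toy re-implementation of the expectations bookkeeping (not the real controller.UIDTrackingControllerExpectations), showing the SetExpectations / CreationObserved / SatisfiedExpectations lifecycle a sync cycle goes through:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// A toy, self-contained version of the expectations bookkeeping described above.
type expectation struct {
	add, del  int64
	timestamp time.Time
}

type expectations struct {
	mu    sync.Mutex
	store map[string]*expectation
}

const expectationsTimeout = 5 * time.Minute

// SetExpectations records how many creates/deletes a sync still expects to see.
func (e *expectations) SetExpectations(key string, add, del int64) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.store[key] = &expectation{add: add, del: del, timestamp: time.Now()}
}

// CreationObserved is called from the addPod handler: one expected create arrived.
func (e *expectations) CreationObserved(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if exp, ok := e.store[key]; ok {
		exp.add--
	}
}

// SatisfiedExpectations: fulfilled (add<=0 && del<=0), expired (>5min), or never recorded.
func (e *expectations) SatisfiedExpectations(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	exp, ok := e.store[key]
	if !ok {
		return true // no expectations recorded -> trigger a sync
	}
	return (exp.add <= 0 && exp.del <= 0) || time.Since(exp.timestamp) > expectationsTimeout
}

func main() {
	e := &expectations{store: map[string]*expectation{}}
	key := "default/zx1"

	e.SetExpectations(key, 2, 0)              // a sync decided to create 2 pods
	fmt.Println(e.SatisfiedExpectations(key)) // false: still waiting for 2 creations
	e.CreationObserved(key)                   // addPod watch events arrive
	e.CreationObserved(key)
	fmt.Println(e.SatisfiedExpectations(key)) // true: the next sync may proceed
}
```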
**总结:** (1)expectations就是通过一个类似map结构的对象,来表示所有rs期望pod和当前现状的差距 (2) ### 2. Pod,rs变化时对应的处理逻辑 #### 2.1 addPod (1)如果pod有DeletionTimestamp,表明这个pod要被删除。将对应rs的Del+1,然后将rs加入队列。 (2) 如果pod有OwnerReference,判断OwnerReference是否是 rs。如果不是或者是rs,但是指定的rs不存在直接返回。否则rs对应的Add+1,并且将rs加入队列。因为pod数量更新了,rs也要更新。 (3)否则(pod没有OwnerReference)。所以他是一个孤儿,这个时候看有没有rs可以匹配它,如果有也可能要更新。匹配的逻辑: 判断pod的ns 和 rs的ns相等,并且 pod的labels能匹配上 rs。 找出来所有能匹配的rs,然后入队列 ``` // When a pod is created, enqueue the replica set that manages it and update its expectations. func (rsc *ReplicaSetController) addPod(obj interface{}) { pod := obj.(*v1.Pod) // 1.如果pod有DeletionTimestamp,表明这个pod要被删除。 // 2. deletePod就是将对应rs的Del+1,然后将rs加入队列。 if pod.DeletionTimestamp != nil { // on a restart of the controller manager, it's possible a new pod shows up in a state that // is already pending deletion. Prevent the pod from being a creation observation. rsc.deletePod(pod) return } // 2. 如果pod有OwnerReference,判断OwnerReference是否是 rs // 如果不是或者是rs,但是指定的rs不存在直接返回。否则rs对应的Add+1,并且将rs加入队列。因为pod数量更新了,rs也要更新。 // If it has a ControllerRef, that's all that matters. if controllerRef := metav1.GetControllerOf(pod); controllerRef != nil { rs := rsc.resolveControllerRef(pod.Namespace, controllerRef) if rs == nil { return } rsKey, err := controller.KeyFunc(rs) if err != nil { return } glog.V(4).Infof("Pod %s created: %#v.", pod.Name, pod) // 对应rs的add+1 rsc.expectations.CreationObserved(rsKey) rsc.enqueueReplicaSet(rs) return } // 3. 否则(pod没有OwnerReference)。所以他是一个孤儿,这个时候看有没有rs可以匹配它,如果可以也更新。 // 匹配的逻辑: 判断pod的ns 和 rs的ns相等,并且 pod的labels能匹配上 rs // 找出来所有能匹配的rs,然后入队列 // Otherwise, it's an orphan. Get a list of all matching ReplicaSets and sync // them to see if anyone wants to adopt it. // DO NOT observe creation because no controller should be waiting for an // orphan. rss := rsc.getPodReplicaSets(pod) if len(rss) == 0 { return } glog.V(4).Infof("Orphan Pod %s created: %#v.", pod.Name, pod) for _, rs := range rss { rsc.enqueueReplicaSet(rs) } } ``` ``` // When a pod is deleted, enqueue the replica set that manages the pod and update its expectations. // obj could be an *v1.Pod, or a DeletionFinalStateUnknown marker item. func (rsc *ReplicaSetController) deletePod(obj interface{}) { pod, ok := obj.(*v1.Pod) // When a delete is dropped, the relist will notice a pod in the store not // in the list, leading to the insertion of a tombstone object which contains // the deleted key/value. Note that this value might be stale. If the pod // changed labels the new ReplicaSet will not be woken up till the periodic resync. if !ok { tombstone, ok := obj.(cache.DeletedFinalStateUnknown) if !ok { utilruntime.HandleError(fmt.Errorf("couldn't get object from tombstone %+v", obj)) return } pod, ok = tombstone.Obj.(*v1.Pod) if !ok { utilruntime.HandleError(fmt.Errorf("tombstone contained object that is not a pod %#v", obj)) return } } controllerRef := metav1.GetControllerOf(pod) if controllerRef == nil { // No controller should care about orphans being deleted. return } rs := rsc.resolveControllerRef(pod.Namespace, controllerRef) if rs == nil { return } // 这里keyfunc就是 ns/rsName rsKey, err := controller.KeyFunc(rs) if err != nil { utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", rs, err)) return } klog.V(4).Infof("Pod %s/%s deleted through %v, timestamp %+v: %#v.", pod.Namespace, pod.Name, utilruntime.GetCaller(), pod.DeletionTimestamp, pod) // 调用 expectations.DeletionObserved,然后入队列 rsc.expectations.DeletionObserved(rsKey, controller.PodKey(pod)) rsc.queue.Add(rsKey) } ```
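The orphan-adoption path in addPod (getPodReplicaSets) boils down to "same namespace plus a label-selector match". A small sketch of that matching using apimachinery's selector helpers is shown below; the label values are illustrative.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	// Selector of a hypothetical rs and labels of a hypothetical orphan pod.
	rsSelector := &metav1.LabelSelector{MatchLabels: map[string]string{"app": "nginx"}}
	podLabels := labels.Set{"app": "nginx", "pod-template-hash": "abc"}

	selector, err := metav1.LabelSelectorAsSelector(rsSelector)
	if err != nil {
		panic(err)
	}
	// true -> the rs would be enqueued so it can consider adopting the pod.
	fmt.Println(selector.Matches(podLabels))
}
```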
#### 2.2 updatePod (1)ResourceVersion判断pod是否真的更新了 (2)判断pod的DeletionTimestamp是否为空,如果不为空,表明这个pod是要删除的,对应rs的Del-1 (3)如果是pod的ownerRef改变了,首先将旧rs入队,这个是肯定要更新的 (4)判断pod新的ownerRef是否是rs,如果是加入队列,如果设置了MinReadySeconds,等延迟结束再将rs添加到队列,因为到时候pod ready可能会导致rs更新。 (5)和addPod一样,判断出来pod没有OwnerReference。所以他是一个孤儿,这个时候看有没有rs可以匹配它,如果有也可能要更新。匹配的逻辑: 判断pod的ns 和 rs的ns相等,并且 pod的labels能匹配上 rs。 找出来所有能匹配的rs,然后入队列 ``` // When a pod is updated, figure out what replica set/s manage it and wake them // up. If the labels of the pod have changed we need to awaken both the old // and new replica set. old and cur must be *v1.Pod types. func (rsc *ReplicaSetController) updatePod(old, cur interface{}) { curPod := cur.(*v1.Pod) oldPod := old.(*v1.Pod) // 1.判断是否是否一致,为啥 ResourceVersion就能判断呢。参考https://fankangbest.github.io/2018/01/16/Kubernetes-resourceVersion%E6%9C%BA%E5%88%B6%E5%88%86%E6%9E%90/ if curPod.ResourceVersion == oldPod.ResourceVersion { // Periodic resync will send update events for all known pods. // Two different versions of the same pod will always have different RVs. return } labelChanged := !reflect.DeepEqual(curPod.Labels, oldPod.Labels) // 2.判断pod是否删除,因为删除分为两步:(1)update DeletionTimestamp, (2)删除 if curPod.DeletionTimestamp != nil { // when a pod is deleted gracefully it's deletion timestamp is first modified to reflect a grace period, // and after such time has passed, the kubelet actually deletes it from the store. We receive an update // for modification of the deletion timestamp and expect an rs to create more replicas asap, not wait // until the kubelet actually deletes the pod. This is different from the Phase of a pod changing, because // an rs never initiates a phase change, and so is never asleep waiting for the same. // 从对应的rs的列表中删除pod rsc.deletePod(curPod) if labelChanged { // we don't need to check the oldPod.DeletionTimestamp because DeletionTimestamp cannot be unset. rsc.deletePod(oldPod) } return } curControllerRef := metav1.GetControllerOf(curPod) oldControllerRef := metav1.GetControllerOf(oldPod) controllerRefChanged := !reflect.DeepEqual(curControllerRef, oldControllerRef) // 3.如果是old rs->new rs。先将old rs进入更新队列。 if controllerRefChanged && oldControllerRef != nil { // The ControllerRef was changed. Sync the old controller, if any. if rs := rsc.resolveControllerRef(oldPod.Namespace, oldControllerRef); rs != nil { rsc.enqueueReplicaSet(rs) } } // 4. 如果pod有新的 ownerRef, // If it has a ControllerRef, that's all that matters. if curControllerRef != nil { // 4.1 新的ownerRef不是 rs。啥都不干。 rs := rsc.resolveControllerRef(curPod.Namespace, curControllerRef) if rs == nil { return } glog.V(4).Infof("Pod %s updated, objectMeta %+v -> %+v.", curPod.Name, oldPod.ObjectMeta, curPod.ObjectMeta) rsc.enqueueReplicaSet(rs) // TODO: MinReadySeconds in the Pod will generate an Available condition to be added in // the Pod status which in turn will trigger a requeue of the owning replica set thus // having its status updated with the newly available replica. For now, we can fake the // update by resyncing the controller MinReadySeconds after the it is requeued because // a Pod transitioned to Ready. // Note that this still suffers from #29229, we are just moving the problem one level // "closer" to kubelet (from the deployment to the replica set controller). 
		// 4.2 如果oldPod not ready、curPod ready,并且设置了 MinReadySeconds,则延迟 MinReadySeconds 后再把rs添加到队列(到时候pod变为Available,rs状态需要再同步一次)。
		if !podutil.IsPodReady(oldPod) && podutil.IsPodReady(curPod) && rs.Spec.MinReadySeconds > 0 {
			glog.V(2).Infof("%v %q will be enqueued after %ds for availability check", rsc.Kind, rs.Name, rs.Spec.MinReadySeconds)
			// Add a second to avoid milliseconds skew in AddAfter.
			// See https://github.com/kubernetes/kubernetes/issues/39785#issuecomment-279959133 for more info.
			rsc.enqueueReplicaSetAfter(rs, (time.Duration(rs.Spec.MinReadySeconds)*time.Second)+time.Second)
		}
		return
	}

	// 5. 和addPod一样,判断孤儿pod。
	// Otherwise, it's an orphan. If anything changed, sync matching controllers
	// to see if anyone wants to adopt it now.
	if labelChanged || controllerRefChanged {
		rss := rsc.getPodReplicaSets(curPod)
		if len(rss) == 0 {
			return
		}
		glog.V(4).Infof("Orphan Pod %s updated, objectMeta %+v -> %+v.", curPod.Name, oldPod.ObjectMeta, curPod.ObjectMeta)
		for _, rs := range rss {
			rsc.enqueueReplicaSet(rs)
		}
	}
}
```

spec.minReadySeconds: 新创建的 Pod 处于 Ready 状态持续的时间至少达到 `spec.minReadySeconds`,才认为该 Pod Available(可用)。
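为了说明 Ready 和 Available 的区别,这里给一个简化的判断示意(真实实现是 `podutil.IsPodAvailable`,基于 pod 的 Ready condition 及其 LastTransitionTime,下面只是按同样思路写的草图):

```
import "time"

// 简化示意:pod 处于 Ready 状态并且已经持续了 minReadySeconds,才算 Available
func isPodAvailableSketch(ready bool, readySince time.Time, minReadySeconds int32, now time.Time) bool {
	if !ready {
		return false
	}
	if minReadySeconds == 0 {
		return true
	}
	return !readySince.IsZero() &&
		readySince.Add(time.Duration(minReadySeconds)*time.Second).Before(now)
}
```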
#### 2.3 deletePod deletepod就很简单: (1)判断墓碑状态的pod是否ok。 (2)找出pod对应的rsA,从rsA中删除该pod,然后将rs加入队列。 ``` // When a pod is deleted, enqueue the replica set that manages the pod and update its expectations. // obj could be an *v1.Pod, or a DeletionFinalStateUnknown marker item. func (rsc *ReplicaSetController) deletePod(obj interface{}) { pod, ok := obj.(*v1.Pod) // When a delete is dropped, the relist will notice a pod in the store not // in the list, leading to the insertion of a tombstone object which contains // the deleted key/value. Note that this value might be stale. If the pod // changed labels the new ReplicaSet will not be woken up till the periodic resync. // 墓碑状态,这个是存储在etcd中,资源被删除时候的一个状态。 可以参考:https://draveness.me/etcd-introduction/ if !ok { tombstone, ok := obj.(cache.DeletedFinalStateUnknown) if !ok { utilruntime.HandleError(fmt.Errorf("couldn't get object from tombstone %+v", obj)) return } pod, ok = tombstone.Obj.(*v1.Pod) if !ok { utilruntime.HandleError(fmt.Errorf("tombstone contained object that is not a pod %#v", obj)) return } } controllerRef := metav1.GetControllerOf(pod) if controllerRef == nil { // No controller should care about orphans being deleted. return } rs := rsc.resolveControllerRef(pod.Namespace, controllerRef) if rs == nil { return } rsKey, err := controller.KeyFunc(rs) if err != nil { return } glog.V(4).Infof("Pod %s/%s deleted through %v, timestamp %+v: %#v.", pod.Namespace, pod.Name, utilruntime.GetCaller(), pod.DeletionTimestamp, pod) rsc.expectations.DeletionObserved(rsKey, controller.PodKey(pod)) rsc.enqueueReplicaSet(rs) } // 这里会先判断是否存在 // DeletionObserved records the given deleteKey as a deletion, for the given rc. func (u *UIDTrackingControllerExpectations) DeletionObserved(rcKey, deleteKey string) { u.uidStoreLock.Lock() defer u.uidStoreLock.Unlock() uids := u.GetUIDs(rcKey) if uids != nil && uids.Has(deleteKey) { klog.V(4).Infof("Controller %v received delete for pod %v", rcKey, deleteKey) u.ControllerExpectationsInterface.DeletionObserved(rcKey) uids.Delete(deleteKey) } } ```
从上面可以看出来,Pod 的 add、update、delete 都会将 rs 重新加入队列。

#### 2.4 addRS

直接入队列

```
func (rsc *ReplicaSetController) addRS(obj interface{}) {
	rs := obj.(*apps.ReplicaSet)
	klog.V(4).Infof("Adding %s %s/%s", rsc.Kind, rs.Namespace, rs.Name)
	rsc.enqueueRS(rs)
}
```

#### 2.5 updateRS

注意这里并不是"只有真的更新才入队":无论如何都会把 rs 入队,只是在期望副本数发生变化时额外打一条日志。注释里解释了原因:每次都入队虽然会带来一些多余的 sync,但更安全,例如可以借助 informer 的全量 resync 把之前创建 pod 失败、又没有新事件的 rs 重新捞起来。

```
// callback when RS is updated
func (rsc *ReplicaSetController) updateRS(old, cur interface{}) {
	oldRS := old.(*apps.ReplicaSet)
	curRS := cur.(*apps.ReplicaSet)

	// You might imagine that we only really need to enqueue the
	// replica set when Spec changes, but it is safer to sync any
	// time this function is triggered. That way a full informer
	// resync can requeue any replica set that don't yet have pods
	// but whose last attempts at creating a pod have failed (since
	// we don't block on creation of pods) instead of those
	// replica sets stalling indefinitely. Enqueueing every time
	// does result in some spurious syncs (like when Status.Replica
	// is updated and the watch notification from it retriggers
	// this function), but in general extra resyncs shouldn't be
	// that bad as ReplicaSets that haven't met expectations yet won't
	// sync, and all the listing is done using local stores.
	if *(oldRS.Spec.Replicas) != *(curRS.Spec.Replicas) {
		glog.V(4).Infof("%v %v updated. Desired pod count change: %d->%d", rsc.Kind, curRS.Name, *(oldRS.Spec.Replicas), *(curRS.Spec.Replicas))
	}
	rsc.enqueueReplicaSet(cur)
}
```
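addRS/updateRS 里用到的 enqueueRS / enqueueReplicaSet,本质上就是把对象转换成 "namespace/name" 形式的 key 之后放进限速队列,大致逻辑如下(示意写法,非逐行源码):

```
// 示意:controller.KeyFunc 生成 "namespace/name" 形式的 key,例如 "default/nginx-rs"
func (rsc *ReplicaSetController) enqueueReplicaSet(obj interface{}) {
	key, err := controller.KeyFunc(obj)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("couldn't get key for object %+v: %v", obj, err))
		return
	}
	rsc.queue.Add(key)
}
```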
#### 2.6 deleteRS 先判断tombstone,再进行map中对应行的删除。然后入队列。 个人认为,这里每次都判断tombstone的原因在于: k8s删除对象分为两步:(1)设置deletionTimestamp,这是个更新时间。(2)删除对象,这是个删除事件。 所以到了删除的时候,update已经做了一下处理,所以这里要通过tombstone再额外判断一次。 ``` func (rsc *ReplicaSetController) deleteRS(obj interface{}) { rs, ok := obj.(*apps.ReplicaSet) if !ok { tombstone, ok := obj.(cache.DeletedFinalStateUnknown) if !ok { utilruntime.HandleError(fmt.Errorf("couldn't get object from tombstone %#v", obj)) return } rs, ok = tombstone.Obj.(*apps.ReplicaSet) if !ok { utilruntime.HandleError(fmt.Errorf("tombstone contained object that is not a ReplicaSet %#v", obj)) return } } key, err := controller.KeyFunc(rs) if err != nil { utilruntime.HandleError(fmt.Errorf("couldn't get key for object %#v: %v", rs, err)) return } klog.V(4).Infof("Deleting %s %q", rsc.Kind, key) // Delete expectations for the ReplicaSet so if we create a new one with the same name it starts clean rsc.expectations.DeleteExpectations(key) rsc.queue.Add(key) } ``` ### 3. rs的处理逻辑 接下来看看rsController是如何处理队列中的对象。 ``` // Run begins watching and syncing. func (rsc *ReplicaSetController) Run(workers int, stopCh <-chan struct{}) { defer utilruntime.HandleCrash() defer rsc.queue.ShutDown() controllerName := strings.ToLower(rsc.Kind) glog.Infof("Starting %v controller", controllerName) defer glog.Infof("Shutting down %v controller", controllerName) if !controller.WaitForCacheSync(rsc.Kind, stopCh, rsc.podListerSynced, rsc.rsListerSynced) { return } for i := 0; i < workers; i++ { go wait.Until(rsc.worker, time.Second, stopCh) } <-stopCh } ``` 一样的套路,最后是 syncHandler。初始化NewBaseController的时候 `rsc.syncHandler = rsc.syncReplicaSet` syncReplicaSet就是处理队列中一个一个的元素了。 ``` // worker runs a worker thread that just dequeues items, processes them, and marks them done. // It enforces that the syncHandler is never invoked concurrently with the same key. func (rsc *ReplicaSetController) worker() { for rsc.processNextWorkItem() { } } func (rsc *ReplicaSetController) processNextWorkItem() bool { key, quit := rsc.queue.Get() if quit { return false } defer rsc.queue.Done(key) err := rsc.syncHandler(key.(string)) if err == nil { rsc.queue.Forget(key) return true } utilruntime.HandleError(fmt.Errorf("Sync %q failed with %v", key, err)) rsc.queue.AddRateLimited(key) return true } ```
**syncReplicaSet** (1)判断是否需要 rsNeedsSync, 如果 add<=0 && del<=0 或者 超过了同步周期,则需要同步 (2)获得所有该rs下的pod (3)如果要同步,并且rs没有删除,调用manageReplicas对pod进行创建/删除 (4)计算当前rs的状态 (5)更新rs的状态 (6)判断是否需要将 rs 加入到延迟队列中 ``` // syncReplicaSet will sync the ReplicaSet with the given key if it has had its expectations fulfilled, // meaning it did not expect to see any more of its pods created or deleted. This function is not meant to be // invoked concurrently with the same key. func (rsc *ReplicaSetController) syncReplicaSet(key string) error { startTime := time.Now() defer func() { glog.V(4).Infof("Finished syncing %v %q (%v)", rsc.Kind, key, time.Since(startTime)) }() namespace, name, err := cache.SplitMetaNamespaceKey(key) if err != nil { return err } rs, err := rsc.rsLister.ReplicaSets(namespace).Get(name) if errors.IsNotFound(err) { glog.V(4).Infof("%v %v has been deleted", rsc.Kind, key) rsc.expectations.DeleteExpectations(key) return nil } if err != nil { return err } // 1.判断是否需要 rsNeedsSync,这里调用了SatisfiedExpectations rsNeedsSync := rsc.expectations.SatisfiedExpectations(key) selector, err := metav1.LabelSelectorAsSelector(rs.Spec.Selector) if err != nil { utilruntime.HandleError(fmt.Errorf("Error converting pod selector to selector: %v", err)) return nil } // 2. 获得namespaces下的所有pod // list all pods to include the pods that don't match the rs`s selector // anymore but has the stale controller ref. // TODO: Do the List and Filter in a single pass, or use an index. allPods, err := rsc.podLister.Pods(rs.Namespace).List(labels.Everything()) if err != nil { return err } // 2.1 过滤inactive的pods // Ignore inactive pods. var filteredPods []*v1.Pod for _, pod := range allPods { if controller.IsPodActive(pod) { filteredPods = append(filteredPods, pod) } } // NOTE: filteredPods are pointing to objects from cache - if you need to // modify them, you need to copy it first. // 2.2 重新洗牌,获得真正属于该rs的podlist filteredPods, err = rsc.claimPods(rs, selector, filteredPods) if err != nil { return err } // 3. 如果要同步,并且rs没有删除,调用manageReplicas对pod进行创建/删除 var manageReplicasErr error if rsNeedsSync && rs.DeletionTimestamp == nil { manageReplicasErr = rsc.manageReplicas(filteredPods, rs) } // 4. 计算 rs 当前的 status rs = rs.DeepCopy() newStatus := calculateStatus(rs, filteredPods, manageReplicasErr) // Always updates status as pods come up or die. // 5. 更新 status updatedRS, err := updateReplicaSetStatus(rsc.kubeClient.AppsV1().ReplicaSets(rs.Namespace), rs, newStatus) if err != nil { // Multiple things could lead to this update failing. Requeuing the replica set ensures // Returning an error causes a requeue without forcing a hotloop return err } // 6. 判断是否需要将 rs 加入到延迟队列中。这里判断的标准也是很简单: ReadyReplicas满足了,但是AvailableReplicas还没满足,那肯定还有pod在启动中 // Resync the ReplicaSet after MinReadySeconds as a last line of defense to guard against clock-skew. if manageReplicasErr == nil && updatedRS.Spec.MinReadySeconds > 0 && updatedRS.Status.ReadyReplicas == *(updatedRS.Spec.Replicas) && updatedRS.Status.AvailableReplicas != *(updatedRS.Spec.Replicas) { rsc.enqueueReplicaSetAfter(updatedRS, time.Duration(updatedRS.Spec.MinReadySeconds)*time.Second) } return manageReplicasErr } ```
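第 1 步用到的 SatisfiedExpectations,就是把前面介绍的 Fulfilled 和 isExpired 组合起来,大致逻辑如下(示意,省略了日志与异常分支):

```
func (r *ControllerExpectations) SatisfiedExpectations(controllerKey string) bool {
	if exp, exists, err := r.GetExpectations(controllerKey); exists && err == nil {
		if exp.Fulfilled() {
			return true // add<=0 && del<=0,上一轮期望已兑现,需要同步
		}
		if exp.isExpired() {
			return true // 距上次 SetExpectations 超过 5 分钟,强制同步
		}
		return false // 期望未满足且未过期,本轮跳过 manageReplicas
	}
	// 没有该 rs 的 expectations 记录(第一次 sync 或记录刚被删除),需要同步
	return true
}
```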
#### 3.1 过滤pod

(1)过滤掉 inactive 的 pod:Phase 为 PodSucceeded、PodFailed,或者 DeletionTimestamp != nil 的 pod 会被剔除,只保留 active 的 pod

(2)重新洗牌(claimPods),获得真正属于该 rs 的 pod 列表
adopt 就是根据lable(原来不匹配,现在匹配了),绑定 rs与pod release就是根据lable(原来匹配了,现在不匹配),释放原来的绑定关系 ``` func IsPodActive(p *v1.Pod) bool { return v1.PodSucceeded != p.Status.Phase && v1.PodFailed != p.Status.Phase && p.DeletionTimestamp == nil } func (rsc *ReplicaSetController) claimPods(rs *apps.ReplicaSet, selector labels.Selector, filteredPods []*v1.Pod) ([]*v1.Pod, error) { // If any adoptions are attempted, we should first recheck for deletion with // an uncached quorum read sometime after listing Pods (see #42639). canAdoptFunc := controller.RecheckDeletionTimestamp(func() (metav1.Object, error) { fresh, err := rsc.kubeClient.AppsV1().ReplicaSets(rs.Namespace).Get(rs.Name, metav1.GetOptions{}) if err != nil { return nil, err } if fresh.UID != rs.UID { return nil, fmt.Errorf("original %v %v/%v is gone: got uid %v, wanted %v", rsc.Kind, rs.Namespace, rs.Name, fresh.UID, rs.UID) } return fresh, nil }) cm := controller.NewPodControllerRefManager(rsc.podControl, rs, selector, rsc.GroupVersionKind, canAdoptFunc) return cm.ClaimPods(filteredPods) } ``` ``` // ClaimPods tries to take ownership of a list of Pods. // // It will reconcile the following: // * Adopt orphans if the selector matches. // * Release owned objects if the selector no longer matches. // // Optional: If one or more filters are specified, a Pod will only be claimed if // all filters return true. // // A non-nil error is returned if some form of reconciliation was attempted and // failed. Usually, controllers should try again later in case reconciliation // is still needed. // // If the error is nil, either the reconciliation succeeded, or no // reconciliation was necessary. The list of Pods that you now own is returned. func (m *PodControllerRefManager) ClaimPods(pods []*v1.Pod, filters ...func(*v1.Pod) bool) ([]*v1.Pod, error) { var claimed []*v1.Pod var errlist []error match := func(obj metav1.Object) bool { pod := obj.(*v1.Pod) // Check selector first so filters only run on potentially matching Pods. if !m.Selector.Matches(labels.Set(pod.Labels)) { return false } for _, filter := range filters { if !filter(pod) { return false } } return true } adopt := func(obj metav1.Object) error { return m.AdoptPod(obj.(*v1.Pod)) } release := func(obj metav1.Object) error { return m.ReleasePod(obj.(*v1.Pod)) } for _, pod := range pods { ok, err := m.ClaimObject(pod, match, adopt, release) if err != nil { errlist = append(errlist, err) continue } if ok { claimed = append(claimed, pod) } } return claimed, utilerrors.NewAggregate(errlist) } ```
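ClaimPods 内部对每个 pod 调用 ClaimObject,adopt/release 的取舍可以用下面这个小函数来示意(仅为帮助理解的草图,真实实现还要处理对象正在删除、ownerRef 指向其他控制器等细节):

```
// 示意:根据"是否已被当前 rs 控制"和"selector 是否匹配"决定对 pod 的动作
func claimDecision(controlledByMe, selectorMatch bool) string {
	switch {
	case controlledByMe && selectorMatch:
		return "keep" // 已属于该 rs 且 label 匹配,计入返回的 claimed 列表
	case controlledByMe && !selectorMatch:
		return "release" // label 不再匹配,解除 ownerReference(释放)
	case !controlledByMe && selectorMatch:
		return "adopt" // 孤儿且 label 匹配,打上 ownerReference(收养)
	default:
		return "ignore" // 与该 rs 无关
	}
}
```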
#### 3.2 manageReplicas (1)计算当前pod和期望pod的数量差距 (2)进行pod的创建和删除 ``` func (rsc *ReplicaSetController) manageReplicas(......) error { // 1.计算当前pod数量的差距 diff := len(filteredPods) - int(*(rs.Spec.Replicas)) rsKey, err := controller.KeyFunc(rs) if err != nil { ...... } // 2.diff<0,表示需要创建 pod if diff < 0 { diff *= -1 // 2.1 pod创建一轮的上限是500 if diff > rsc.burstReplicas { diff = rsc.burstReplicas } // 2.2 更新map的数据,表示当前只需要创建diff个pod rsc.expectations.ExpectCreations(rsKey, diff) // 2.3 调用slowStartBatch创建pod successfulCreations, err := slowStartBatch(diff, controller.SlowStartInitialBatchSize, func() error { err := rsc.podControl.CreatePodsWithControllerRef(rs.Namespace, &rs.Spec.Template, rs, metav1.NewControllerRef(rs, rsc.GroupVersionKind)) if err != nil && errors.IsTimeout(err) { return nil } return err }) // 2.3 根据创建的结果,更新map的数据 if skippedPods := diff - successfulCreations; skippedPods > 0 { for i := 0; i < skippedPods; i++ { rsc.expectations.CreationObserved(rsKey) } } return err } else if diff > 0 { // 3. 如果是删除pod,同样一轮最多只能删除500个 if diff > rsc.burstReplicas { diff = rsc.burstReplicas } // 3.1 选择需要删除的pod列表,这个是有优先级的 podsToDelete := getPodsToDelete(filteredPods, diff) // 3.2 覆盖map表中的数据 rsc.expectations.ExpectDeletions(rsKey, getPodKeys(podsToDelete)) // 3.3 进行并发删除 errCh := make(chan error, diff) var wg sync.WaitGroup wg.Add(diff) for _, pod := range podsToDelete { go func(targetPod *v1.Pod) { defer wg.Done() if err := rsc.podControl.DeletePod(rs.Namespace, targetPod.Name, rs); err != nil { podKey := controller.PodKey(targetPod) rsc.expectations.DeletionObserved(rsKey, podKey) errCh <- err } }(pod) } wg.Wait() select { case err := <-errCh: if err != nil { return err } default: } } return nil } ```
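用一个简单的数值例子说明 diff 的含义(仅为示意):

```
// 期望副本数 5,当前属于该 rs 的 active pod 有 2 个
diff := 2 - 5 // diff = -3 < 0,需要创建 3 个 pod(走 slowStartBatch)

// 期望副本数 5,当前有 8 个
diff = 8 - 5 // diff = 3 > 0,需要删除 3 个 pod(由 getPodsToDelete 按优先级挑选)

// 无论创建还是删除,单轮处理的数量都不会超过 burstReplicas(默认 500)
_ = diff
```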
##### 3.2.1 创建pod `slowStartBatch` 创建的 pod 数依次为 1,2,4,8 。以2的指数级增长,如果失败了,直接返回(当前成功创建了多少)。 ``` func slowStartBatch(count int, initialBatchSize int, fn func() error) (int, error) { remaining := count successes := 0 for batchSize := integer.IntMin(remaining, initialBatchSize); batchSize > 0; batchSize = integer.IntMin(2*batchSize, remaining) { errCh := make(chan error, batchSize) var wg sync.WaitGroup wg.Add(batchSize) for i := 0; i < batchSize; i++ { go func() { defer wg.Done() if err := fn(); err != nil { errCh <- err } }() } wg.Wait() curSuccesses := batchSize - len(errCh) successes += curSuccesses if len(errCh) > 0 { return successes, <-errCh } remaining -= batchSize } return successes, nil } ```
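沿用上面的 slowStartBatch,可以用一个小例子直观看到批次的增长(这里假设 initialBatchSize 取源码中 controller.SlowStartInitialBatchSize 的默认值 1):

```
// 需要 import "fmt" 和 "sync/atomic"
var created int64
// count=20,initialBatchSize=1:各批次并发数依次为 1, 2, 4, 8, 5
successes, err := slowStartBatch(20, 1, func() error {
	atomic.AddInt64(&created, 1) // 模拟创建一个 pod,全部成功
	return nil
})
fmt.Println(successes, err, atomic.LoadInt64(&created)) // 输出: 20 <nil> 20
```

如果中途某一批出现失败,slowStartBatch 会在这一批结束后直接返回已成功的数量,后面的批次不再执行,避免在明显会失败(例如配额不足)的情况下向 apiserver 发送大量注定失败的请求。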
##### 3.2.2 删除pod 给pod定义优先级,从优先级最高的依次往下删,优先级越高,表示这个pod越应该删除,根据以下的条件判断优先级: (1)没有绑定node的pod优先级比 绑定了的高 (2)pod状态是PodPending的高于PodUnknown,PodUnknown高于PodRunning (3)pod unready的高于 ready (4)根据运行时间排序,越短优先级越高 (5)pod中容器重启次数越多的,优先级越高 (6)pod创建时间越短,优先级越高 ``` func getPodsToDelete(filteredPods []*v1.Pod, diff int) []*v1.Pod { if diff < len(filteredPods) { sort.Sort(controller.ActivePods(filteredPods)) } return filteredPods[:diff] } ``` ``` type ActivePods []*v1.Pod func (s ActivePods) Len() int { return len(s) } func (s ActivePods) Swap(i, j int) { s[i], s[j] = s[j], s[i] } func (s ActivePods) Less(i, j int) bool { // 1.没有绑定node的pod优先级比绑定了的高 if s[i].Spec.NodeName != s[j].Spec.NodeName && (len(s[i].Spec.NodeName) == 0 || len(s[j].Spec.NodeName) == 0) { return len(s[i].Spec.NodeName) == 0 } // 2. pod状态是PodPending的高于PodUnknown,PodUnknown高于PodRunning m := map[v1.PodPhase]int{v1.PodPending: 0, v1.PodUnknown: 1, v1.PodRunning: 2} if m[s[i].Status.Phase] != m[s[j].Status.Phase] { return m[s[i].Status.Phase] < m[s[j].Status.Phase] } // 3. pod unready的高于 ready if podutil.IsPodReady(s[i]) != podutil.IsPodReady(s[j]) { return !podutil.IsPodReady(s[i]) } // 4. 根据运行时间排序,越短优先级越高 if podutil.IsPodReady(s[i]) && podutil.IsPodReady(s[j]) && !podReadyTime(s[i]).Equal(podReadyTime(s[j])) { return afterOrZero(podReadyTime(s[i]), podReadyTime(s[j])) } // 5. pod中容器重启次数越多的优先级越高 if maxContainerRestarts(s[i]) != maxContainerRestarts(s[j]) { return maxContainerRestarts(s[i]) > maxContainerRestarts(s[j]) } // 6. pod创建时间越短,优先级越高 if !s[i].CreationTimestamp.Equal(&s[j].CreationTimestamp) { return afterOrZero(&s[i].CreationTimestamp, &s[j].CreationTimestamp) } return false } ```
#### 3.3 calculateStatus calculateStatus 会通过当前 pod 的状态计算出 rs 中 status 字段值,status 字段如下所示: replicas 实际的 pod 副本数 availableReplicas 现在可用的 Pod 的副本数量,有的副本可能还处在未准备好,或者初始化状态 readyReplicas 是处于 ready 状态的 Pod 的副本数量 fullyLabeledReplicas 意思是这个 ReplicaSet 的标签 selector 对应的副本数量,不同纬度的一种统计 ``` 随便一个rs都有 status: availableReplicas: 1 fullyLabeledReplicas: 1 observedGeneration: 1 readyReplicas: 1 replicas: 1 ``` ``` func calculateStatus(rs *apps.ReplicaSet, filteredPods []*v1.Pod, manageReplicasErr error) apps.ReplicaSetStatus { newStatus := rs.Status // Count the number of pods that have labels matching the labels of the pod // template of the replica set, the matching pods may have more // labels than are in the template. Because the label of podTemplateSpec is // a superset of the selector of the replica set, so the possible // matching pods must be part of the filteredPods. fullyLabeledReplicasCount := 0 readyReplicasCount := 0 availableReplicasCount := 0 templateLabel := labels.Set(rs.Spec.Template.Labels).AsSelectorPreValidated() for _, pod := range filteredPods { if templateLabel.Matches(labels.Set(pod.Labels)) { fullyLabeledReplicasCount++ } if podutil.IsPodReady(pod) { readyReplicasCount++ if podutil.IsPodAvailable(pod, rs.Spec.MinReadySeconds, metav1.Now()) { availableReplicasCount++ } } } failureCond := GetCondition(rs.Status, apps.ReplicaSetReplicaFailure) if manageReplicasErr != nil && failureCond == nil { var reason string if diff := len(filteredPods) - int(*(rs.Spec.Replicas)); diff < 0 { reason = "FailedCreate" } else if diff > 0 { reason = "FailedDelete" } cond := NewReplicaSetCondition(apps.ReplicaSetReplicaFailure, v1.ConditionTrue, reason, manageReplicasErr.Error()) SetCondition(&newStatus, cond) } else if manageReplicasErr == nil && failureCond != nil { RemoveCondition(&newStatus, apps.ReplicaSetReplicaFailure) } newStatus.Replicas = int32(len(filteredPods)) newStatus.FullyLabeledReplicas = int32(fullyLabeledReplicasCount) newStatus.ReadyReplicas = int32(readyReplicasCount) newStatus.AvailableReplicas = int32(availableReplicasCount) return newStatus } ```
### 4 总结 (1)expectations确实是一个很巧妙的方法,这种思想可以借鉴 (2)rs根本不感知deploy的存在 ================================================ FILE: k8s/kcm/10-kcm-NodeLifecycleController源码分析.md ================================================ * [1\. startNodeLifecycleController](#1-startnodelifecyclecontroller) * [2\. NewNodeLifecycleController](#2-newnodelifecyclecontroller) * [2\.1 NodeLifecycleController结构体介绍](#21-nodelifecyclecontroller结构体介绍) * [2\.2 NewNodeLifecycleController](#22-newnodelifecyclecontroller) * [3\. NodeLifecycleController\.run](#3-nodelifecyclecontrollerrun) * [3\.1 nc\.taintManager\.Run](#31-nctaintmanagerrun) * [3\.1\.1 worker处理](#311-worker处理) * [3\.1\.2 handleNodeUpdate](#312-handlenodeupdate) * [3\.1\.2\.1 processPodOnNode](#3121-processpodonnode) * [3\.1\.3 handlePodUpdate](#313-handlepodupdate) * [3\.1\.3 nc\.taintManager\.Run总结](#313-nctaintmanagerrun总结) * [3\.2 doNodeProcessingPassWorker](#32-donodeprocessingpassworker) * [3\.2\.1 doNoScheduleTaintingPass](#321-donoscheduletaintingpass) * [3\.2\.2 reconcileNodeLabels](#322-reconcilenodelabels) * [3\.3 doPodProcessingWorker](#33-dopodprocessingworker) * [3\.3\.1 processNoTaintBaseEviction](#331-processnotaintbaseeviction) * [3\.4 doEvictionPass(if useTaintBasedEvictions==false)](#34-doevictionpassif-usetaintbasedevictionsfalse) * [3\.5 doNoExecuteTaintingPass(if useTaintBasedEvictions==true)](#35-donoexecutetaintingpassif-usetaintbasedevictionstrue) * [3\.6 monitorNodeHealth](#36-monitornodehealth) * [3\.6\.1 node分类并初始化](#361--node分类并初始化) * [3\.6\.2 处理node status](#362-处理node-status) * [3\.6\.3 集群健康状态处理](#363-集群健康状态处理) * [4 总结](#4-总结) 代码版本:1.17.4 ### 1. startNodeLifecycleController 可以看到startNodeLifecycleController就是分为2个步骤: * NodeLifecycleController * NodeLifecycleController.run ``` func startNodeLifecycleController(ctx ControllerContext) (http.Handler, bool, error) { lifecycleController, err := lifecyclecontroller.NewNodeLifecycleController( ctx.InformerFactory.Coordination().V1().Leases(), ctx.InformerFactory.Core().V1().Pods(), ctx.InformerFactory.Core().V1().Nodes(), ctx.InformerFactory.Apps().V1().DaemonSets(), // node lifecycle controller uses existing cluster role from node-controller ctx.ClientBuilder.ClientOrDie("node-controller"), // 就是node-monitor-period参数 ctx.ComponentConfig.KubeCloudShared.NodeMonitorPeriod.Duration, // 就是node-startup-grace-period参数 ctx.ComponentConfig.NodeLifecycleController.NodeStartupGracePeriod.Duration, // 就是node-monitor-grace-period参数 ctx.ComponentConfig.NodeLifecycleController.NodeMonitorGracePeriod.Duration, // 就是pod-eviction-timeout参数 ctx.ComponentConfig.NodeLifecycleController.PodEvictionTimeout.Duration, // 就是node-eviction-rate参数 ctx.ComponentConfig.NodeLifecycleController.NodeEvictionRate, // 就是secondary-node-eviction-rate参数 ctx.ComponentConfig.NodeLifecycleController.SecondaryNodeEvictionRate, // 就是large-cluster-size-threshold参数 ctx.ComponentConfig.NodeLifecycleController.LargeClusterSizeThreshold, // 就是unhealthy-zone-threshold参数 ctx.ComponentConfig.NodeLifecycleController.UnhealthyZoneThreshold, // 就是enable-taint-manager参数 (默认打开的) ctx.ComponentConfig.NodeLifecycleController.EnableTaintManager, // 就是这个是否打开--feature-gates=TaintBasedEvictions=true (默认打开的) utilfeature.DefaultFeatureGate.Enabled(features.TaintBasedEvictions), ) if err != nil { return nil, true, err } go lifecycleController.Run(ctx.Stop) return nil, true, nil } ``` 具体参数介绍 * enable-taint-manager 默认为true, 表示允许NoExecute污点,并且将会驱逐pod * large-cluster-size-threshold 
默认50,基于这个阈值来判断所在集群是否为大规模集群。当集群规模小于等于这个值的时候,会将--secondary-node-eviction-rate参数强制赋值为0 * secondary-node-eviction-rate 默认0.01。 当zone unhealthy时候,一秒内多少个node进行驱逐node上pod。二级驱赶速率,当集群中宕机节点过多时,相应的驱赶速率也降低,默认为0.01。 * node-eviction-rate float32 默认为0.1。驱赶速率,即驱赶Node的速率,由令牌桶流控算法实现,默认为0.1,即每秒驱赶0.1个节点,注意这里不是驱赶Pod的速率,而是驱赶节点的速率。相当于每隔10s,清空一个节点。 * node-monitor-grace-period duration 默认40s, 多久node没有响应认为node为unhealthy * node-startup-grace-period duration 默认1分钟。多久允许刚启动的node未响应,认为unhealthy * pod-eviction-timeout duration 默认5min。当node unhealthy时候多久删除上面的pod(只在taint manager未启用时候生效) * unhealthy-zone-threshold float32 默认55%,多少比例的unhealthy node认为zone unhealthy
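上面几个阈值类参数(unhealthy-zone-threshold、large-cluster-size-threshold 等)主要用于判断 zone 的健康状态、进而选择驱逐速率。判断 zone 状态的大致思路可以用下面的草图表示(简化示意,真实函数是后文提到的 nc.ComputeZoneState,它基于各 node 的 Ready condition 统计):

```
// 简化示意:根据 ready / notReady 的 node 数量判断一个 zone 的状态
func zoneStateSketch(ready, notReady int, unhealthyZoneThreshold float32) string {
	switch {
	case ready == 0 && notReady > 0:
		return "fullDisruption" // 该 zone 的 node 全部 notReady
	case notReady > 2 && float32(notReady)/float32(ready+notReady) >= unhealthyZoneThreshold:
		return "partialDisruption" // notReady 比例达到 unhealthy-zone-threshold(默认 55%)
	default:
		return "normal"
	}
}
```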
### 2. NewNodeLifecycleController #### 2.1 NodeLifecycleController结构体介绍 ``` // Controller is the controller that manages node's life cycle. type Controller struct { // taintManager监听节点的Taint/Toleration变化,用于驱逐pod taintManager *scheduler.NoExecuteTaintManager // 监听pod podLister corelisters.PodLister podInformerSynced cache.InformerSynced kubeClient clientset.Interface // This timestamp is to be used instead of LastProbeTime stored in Condition. We do this // to avoid the problem with time skew across the cluster. now func() metav1.Time // 返回secondary-node-eviction-rate参数值。就是根据集群是否为大集群,如果是大集群,返回secondary-node-eviction-rate,否则返回0 enterPartialDisruptionFunc func(nodeNum int) float32 // 返回evictionLimiterQPS参数 enterFullDisruptionFunc func(nodeNum int) float32 // 返回集群有多少nodeNotReady, 并且返回bool值ZoneState用于判断zone是否健康。利用了unhealthyZoneThreshold参数 computeZoneStateFunc func(nodeConditions []*v1.NodeCondition) (int, ZoneState) // node map knownNodeSet map[string]*v1.Node // node健康信息map表 // per Node map storing last observed health together with a local time when it was observed. nodeHealthMap *nodeHealthMap // evictorLock protects zonePodEvictor and zoneNoExecuteTainter. // TODO(#83954): API calls shouldn't be executed under the lock. evictorLock sync.Mutex // 存放node上pod是否已经执行驱逐的状态, 从这读取node eviction的状态是evicted、tobeeviced nodeEvictionMap *nodeEvictionMap // workers that evicts pods from unresponsive nodes. // zone的需要pod evictor的node列表 zonePodEvictor map[string]*scheduler.RateLimitedTimedQueue // 存放需要更新taint的unready node列表--令牌桶队列 // workers that are responsible for tainting nodes. zoneNoExecuteTainter map[string]*scheduler.RateLimitedTimedQueue // 重试列表 nodesToRetry sync.Map // 存放每个zone的健康状态,有stateFullDisruption、statePartialDisruption、stateNormal、stateInitial zoneStates map[string]ZoneState // 监听ds相关 daemonSetStore appsv1listers.DaemonSetLister daemonSetInformerSynced cache.InformerSynced // 监听node相关 leaseLister coordlisters.LeaseLister leaseInformerSynced cache.InformerSynced nodeLister corelisters.NodeLister nodeInformerSynced cache.InformerSynced getPodsAssignedToNode func(nodeName string) ([]*v1.Pod, error) recorder record.EventRecorder // 之前推到的一对参数 // Value controlling Controller monitoring period, i.e. how often does Controller // check node health signal posted from kubelet. This value should be lower than // nodeMonitorGracePeriod. // TODO: Change node health monitor to watch based. nodeMonitorPeriod time.Duration // When node is just created, e.g. cluster bootstrap or node creation, we give // a longer grace period. nodeStartupGracePeriod time.Duration // Controller will not proactively sync node health, but will monitor node // health signal updated from kubelet. There are 2 kinds of node healthiness // signals: NodeStatus and NodeLease. NodeLease signal is generated only when // NodeLease feature is enabled. If it doesn't receive update for this amount // of time, it will start posting "NodeReady==ConditionUnknown". The amount of // time before which Controller start evicting pods is controlled via flag // 'pod-eviction-timeout'. // Note: be cautious when changing the constant, it must work with // nodeStatusUpdateFrequency in kubelet and renewInterval in NodeLease // controller. The node health signal update frequency is the minimal of the // two. // There are several constraints: // 1. nodeMonitorGracePeriod must be N times more than the node health signal // update frequency, where N means number of retries allowed for kubelet to // post node status/lease. 
It is pointless to make nodeMonitorGracePeriod // be less than the node health signal update frequency, since there will // only be fresh values from Kubelet at an interval of node health signal // update frequency. The constant must be less than podEvictionTimeout. // 2. nodeMonitorGracePeriod can't be too large for user experience - larger // value takes longer for user to see up-to-date node health. nodeMonitorGracePeriod time.Duration podEvictionTimeout time.Duration evictionLimiterQPS float32 secondaryEvictionLimiterQPS float32 largeClusterThreshold int32 unhealthyZoneThreshold float32 // if set to true Controller will start TaintManager that will evict Pods from // tainted nodes, if they're not tolerated. runTaintManager bool // if set to true Controller will taint Nodes with 'TaintNodeNotReady' and 'TaintNodeUnreachable' // taints instead of evicting Pods itself. useTaintBasedEvictions bool // pod, node队列 nodeUpdateQueue workqueue.Interface podUpdateQueue workqueue.RateLimitingInterface } ```
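结构体里的 enterFullDisruptionFunc / enterPartialDisruptionFunc 分别指向 HealthyQPSFunc 和 ReducedQPSFunc,它们决定不同 zone 状态下的驱逐速率,大致逻辑如下(示意,省略了源码注释):

```
// zone 正常(或从 FullDisruption 恢复)时,使用 node-eviction-rate
func (nc *Controller) HealthyQPSFunc(nodeNum int) float32 {
	return nc.evictionLimiterQPS
}

// zone 部分故障时:大集群降速为 secondary-node-eviction-rate,小集群直接停止驱逐
func (nc *Controller) ReducedQPSFunc(nodeNum int) float32 {
	if int32(nodeNum) > nc.largeClusterThreshold {
		return nc.secondaryEvictionLimiterQPS
	}
	return 0
}
```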
#### 2.2 NewNodeLifecycleController 核心逻辑如下: (1)根据参数初始化Controller (2)定义了pod的监听处理逻辑。都是先nc.podUpdated,如果enable-taint-manager=true,还会经过nc.taintManager.PodUpdated函数处理 (3)实现找出所有node上pod的函数 (4)如果enable-taint-manager=true,node有变化都需要经过 nc.taintManager.NodeUpdated函数 (5)实现node的监听处理,这里不管开没开taint-manager,都是要监听 (6)实现node, ds, lease的list,用于获取对象 ``` // NewNodeLifecycleController returns a new taint controller. func NewNodeLifecycleController( leaseInformer coordinformers.LeaseInformer, podInformer coreinformers.PodInformer, nodeInformer coreinformers.NodeInformer, daemonSetInformer appsv1informers.DaemonSetInformer, kubeClient clientset.Interface, nodeMonitorPeriod time.Duration, nodeStartupGracePeriod time.Duration, nodeMonitorGracePeriod time.Duration, podEvictionTimeout time.Duration, evictionLimiterQPS float32, secondaryEvictionLimiterQPS float32, largeClusterThreshold int32, unhealthyZoneThreshold float32, runTaintManager bool, useTaintBasedEvictions bool, ) (*Controller, error) { // 1.根据参数初始化Controller nc := &Controller{ 省略代码 .... } if useTaintBasedEvictions { klog.Infof("Controller is using taint based evictions.") } nc.enterPartialDisruptionFunc = nc.ReducedQPSFunc nc.enterFullDisruptionFunc = nc.HealthyQPSFunc nc.computeZoneStateFunc = nc.ComputeZoneState // 2.定义了pod的监听处理逻辑。都是先nc.podUpdated,如果enable-taint-manager=true,还会经过nc.taintManager.PodUpdated podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ 。。。 省略代码 }) // 3.实现找出所有node上pod的函数 nc.podInformerSynced = podInformer.Informer().HasSynced podInformer.Informer().AddIndexers(cache.Indexers{ nodeNameKeyIndex: func(obj interface{}) ([]string, error) { pod, ok := obj.(*v1.Pod) if !ok { return []string{}, nil } if len(pod.Spec.NodeName) == 0 { return []string{}, nil } return []string{pod.Spec.NodeName}, nil }, }) podIndexer := podInformer.Informer().GetIndexer() nc.getPodsAssignedToNode = func(nodeName string) ([]*v1.Pod, error) { objs, err := podIndexer.ByIndex(nodeNameKeyIndex, nodeName) if err != nil { return nil, err } pods := make([]*v1.Pod, 0, len(objs)) for _, obj := range objs { pod, ok := obj.(*v1.Pod) if !ok { continue } pods = append(pods, pod) } return pods, nil } nc.podLister = podInformer.Lister() // 4.如果enable-taint-manager=true,node有变化都需要经过 nc.taintManager.NodeUpdated函数 if nc.runTaintManager { podGetter := func(name, namespace string) (*v1.Pod, error) { return nc.podLister.Pods(namespace).Get(name) } nodeLister := nodeInformer.Lister() nodeGetter := func(name string) (*v1.Node, error) { return nodeLister.Get(name) } nc.taintManager = scheduler.NewNoExecuteTaintManager(kubeClient, podGetter, nodeGetter, nc.getPodsAssignedToNode) nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ AddFunc: nodeutil.CreateAddNodeHandler(func(node *v1.Node) error { nc.taintManager.NodeUpdated(nil, node) return nil }), UpdateFunc: nodeutil.CreateUpdateNodeHandler(func(oldNode, newNode *v1.Node) error { nc.taintManager.NodeUpdated(oldNode, newNode) return nil }), DeleteFunc: nodeutil.CreateDeleteNodeHandler(func(node *v1.Node) error { nc.taintManager.NodeUpdated(node, nil) return nil }), }) } // 5. 
实现node的监听处理,这里不管开没开taint-manager,都是要监听 klog.Infof("Controller will reconcile labels.") nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ AddFunc: nodeutil.CreateAddNodeHandler(func(node *v1.Node) error { nc.nodeUpdateQueue.Add(node.Name) nc.nodeEvictionMap.registerNode(node.Name) return nil }), UpdateFunc: nodeutil.CreateUpdateNodeHandler(func(_, newNode *v1.Node) error { nc.nodeUpdateQueue.Add(newNode.Name) return nil }), DeleteFunc: nodeutil.CreateDeleteNodeHandler(func(node *v1.Node) error { nc.nodesToRetry.Delete(node.Name) nc.nodeEvictionMap.unregisterNode(node.Name) return nil }), }) // 6. 实现node, ds, lease的list,用于获取对象 nc.leaseLister = leaseInformer.Lister() nc.leaseInformerSynced = leaseInformer.Informer().HasSynced nc.nodeLister = nodeInformer.Lister() nc.nodeInformerSynced = nodeInformer.Informer().HasSynced nc.daemonSetStore = daemonSetInformer.Lister() nc.daemonSetInformerSynced = daemonSetInformer.Informer().HasSynced return nc, nil } ``` ### 3. NodeLifecycleController.run 逻辑如下: (1)等待leaseInformer、nodeInformer、podInformerSynced、daemonSetInformerSynced同步完成。 (2)如果enable-taint-manager=true,开启nc.taintManager.Run (3)执行doNodeProcessingPassWorker,这个是处理nodeUpdateQueue队列的node (4)doPodProcessingWorker,这个是处理podUpdateQueue队列的pod (5)如果开启了feature-gates=TaintBasedEvictions=true,执行doNoExecuteTaintingPass函数。否则执行doEvictionPass函数 (6)一直监听node状态是否健康 ``` // Run starts an asynchronous loop that monitors the status of cluster nodes. func (nc *Controller) Run(stopCh <-chan struct{}) { defer utilruntime.HandleCrash() klog.Infof("Starting node controller") defer klog.Infof("Shutting down node controller") // 1.等待leaseInformer、nodeInformer、podInformerSynced、daemonSetInformerSynced同步完成。 if !cache.WaitForNamedCacheSync("taint", stopCh, nc.leaseInformerSynced, nc.nodeInformerSynced, nc.podInformerSynced, nc.daemonSetInformerSynced) { return } // 2.如果enable-taint-manager=true,开启nc.taintManager.Run if nc.runTaintManager { go nc.taintManager.Run(stopCh) } // Close node update queue to cleanup go routine. defer nc.nodeUpdateQueue.ShutDown() defer nc.podUpdateQueue.ShutDown() // 3.执行doNodeProcessingPassWorker,这个是处理nodeUpdateQueue队列的node // Start workers to reconcile labels and/or update NoSchedule taint for nodes. for i := 0; i < scheduler.UpdateWorkerSize; i++ { // Thanks to "workqueue", each worker just need to get item from queue, because // the item is flagged when got from queue: if new event come, the new item will // be re-queued until "Done", so no more than one worker handle the same item and // no event missed. go wait.Until(nc.doNodeProcessingPassWorker, time.Second, stopCh) } // 4.doPodProcessingWorker,这个是处理podUpdateQueue队列的pod for i := 0; i < podUpdateWorkerSize; i++ { go wait.Until(nc.doPodProcessingWorker, time.Second, stopCh) } // 5. 如果开启了feature-gates=TaintBasedEvictions=true,执行doNoExecuteTaintingPass函数。否则执行doEvictionPass函数 if nc.useTaintBasedEvictions { // Handling taint based evictions. Because we don't want a dedicated logic in TaintManager for NC-originated // taints and we normally don't rate limit evictions caused by taints, we need to rate limit adding taints. go wait.Until(nc.doNoExecuteTaintingPass, scheduler.NodeEvictionPeriod, stopCh) } else { // Managing eviction of nodes: // When we delete pods off a node, if the node was not empty at the time we then // queue an eviction watcher. If we hit an error, retry deletion. 
go wait.Until(nc.doEvictionPass, scheduler.NodeEvictionPeriod, stopCh) } // 6.一直监听node状态是否健康 // Incorporate the results of node health signal pushed from kubelet to master. go wait.Until(func() { if err := nc.monitorNodeHealth(); err != nil { klog.Errorf("Error monitoring node health: %v", err) } }, nc.nodeMonitorPeriod, stopCh) <-stopCh } ``` #### 3.1 nc.taintManager.Run 在newNodeLifecycleContainer的时候就初始化了NewNoExecuteTaintManager。 taint manager是由pod和node事件触发执行,根据node或pod绑定的node是否有的noExcute taint,如果有则对node上所有的pod或这个pod执行删除。 具体逻辑为:如果启用了taint manager就会调用NewNoExecuteTaintManager对taint manager进行初始化。可以看出来这里就是初始化了nodeUpdateQueue,podUpdateQueue队列以及事件上报。 核心数据机构: * nodeUpdateQueue 在nodelifecycleController的时候定义了,node变化会扔进这个队列 * podUpdateQueue 在nodelifecycleController的时候定义了,pod变化会扔进这个队列 * taintedNodes是存放node上所有的noExecute taint,handlePodUpdate会从taintedNodes查询node的noExecute taint。 * taintEvictionQueuetaintEvictionQueue是一个TimedWorkerQueue–定时自动执行队列。因为有的pod设置了污点容忍时间,所以需要一个时间队列来定时删除。 ``` // NewNoExecuteTaintManager creates a new NoExecuteTaintManager that will use passed clientset to // communicate with the API server. func NewNoExecuteTaintManager(c clientset.Interface, getPod GetPodFunc, getNode GetNodeFunc, getPodsAssignedToNode GetPodsByNodeNameFunc) *NoExecuteTaintManager { eventBroadcaster := record.NewBroadcaster() recorder := eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "taint-controller"}) eventBroadcaster.StartLogging(klog.Infof) if c != nil { klog.V(0).Infof("Sending events to api server.") eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: c.CoreV1().Events("")}) } else { klog.Fatalf("kubeClient is nil when starting NodeController") } tm := &NoExecuteTaintManager{ client: c, recorder: recorder, getPod: getPod, getNode: getNode, getPodsAssignedToNode: getPodsAssignedToNode, taintedNodes: make(map[string][]v1.Taint), nodeUpdateQueue: workqueue.NewNamed("noexec_taint_node"), podUpdateQueue: workqueue.NewNamed("noexec_taint_pod"), } tm.taintEvictionQueue = CreateWorkerQueue(deletePodHandler(c, tm.emitPodDeletionEvent)) return tm } ``` run函数逻辑如下: 这里的核心其实就是从nodeUpdateQueue, UpdateWorkerSize 取出一个元素,然后执行worker处理。和一般的controller思想是一样的。 **注意**: 这里用了负载均衡的思想。因为worker数量是UpdateWorkerSize个,所以这里就定义UpdateWorkerSize个channel。然后开启UpdateWorkerSize个协程,处理对应的channel。这样通过哈希取模的方式,就相当于尽可能使得每个channel的元素尽可能相等。 ``` // Run starts NoExecuteTaintManager which will run in loop until `stopCh` is closed. func (tc *NoExecuteTaintManager) Run(stopCh <-chan struct{}) { klog.V(0).Infof("Starting NoExecuteTaintManager") for i := 0; i < UpdateWorkerSize; i++ { tc.nodeUpdateChannels = append(tc.nodeUpdateChannels, make(chan nodeUpdateItem, NodeUpdateChannelSize)) tc.podUpdateChannels = append(tc.podUpdateChannels, make(chan podUpdateItem, podUpdateChannelSize)) } // Functions that are responsible for taking work items out of the workqueues and putting them // into channels. 
go func(stopCh <-chan struct{}) { for { item, shutdown := tc.nodeUpdateQueue.Get() if shutdown { break } nodeUpdate := item.(nodeUpdateItem) hash := hash(nodeUpdate.nodeName, UpdateWorkerSize) select { case <-stopCh: tc.nodeUpdateQueue.Done(item) return case tc.nodeUpdateChannels[hash] <- nodeUpdate: // tc.nodeUpdateQueue.Done is called by the nodeUpdateChannels worker } } }(stopCh) go func(stopCh <-chan struct{}) { for { item, shutdown := tc.podUpdateQueue.Get() if shutdown { break } // The fact that pods are processed by the same worker as nodes is used to avoid races // between node worker setting tc.taintedNodes and pod worker reading this to decide // whether to delete pod. // It's possible that even without this assumption this code is still correct. podUpdate := item.(podUpdateItem) hash := hash(podUpdate.nodeName, UpdateWorkerSize) select { case <-stopCh: tc.podUpdateQueue.Done(item) return case tc.podUpdateChannels[hash] <- podUpdate: // tc.podUpdateQueue.Done is called by the podUpdateChannels worker } } }(stopCh) wg := sync.WaitGroup{} wg.Add(UpdateWorkerSize) for i := 0; i < UpdateWorkerSize; i++ { go tc.worker(i, wg.Done, stopCh) } wg.Wait() } ```
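上面说的"哈希取模"分发,大致等价于下面这个小函数(示意写法,思路与 taint_manager.go 里的 hash 函数一致):

```
import "hash/fnv"

// 按 nodeName 做哈希再取模,保证同一个 node 的事件总是落到同一个 worker channel
func shard(nodeName string, workers int) int {
	h := fnv.New32a()
	h.Write([]byte(nodeName))
	return int(h.Sum32() % uint32(workers))
}
```

这样同一个 node(及其上 pod)的事件不会被不同 worker 并发处理,同时多个 node 的事件又能尽量均匀地分摊到各个 worker 上。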
##### 3.1.1 worker处理 worker的处理逻辑其实很简单。就是每个worker协程从对应的chanel取出一个nodeUpdate/podUpdate 事件进行处理。 分别对应:handleNodeUpdate函数和handlePodUpdate函数 **但是**:这里又得注意的是:worker会优先处理nodeUpdate事件。(很好理解,因为处理node事件是驱逐整个节点的Pod, 这个可能包括了Pod) ``` func (tc *NoExecuteTaintManager) worker(worker int, done func(), stopCh <-chan struct{}) { defer done() // When processing events we want to prioritize Node updates over Pod updates, // as NodeUpdates that interest NoExecuteTaintManager should be handled as soon as possible - // we don't want user (or system) to wait until PodUpdate queue is drained before it can // start evicting Pods from tainted Nodes. for { select { case <-stopCh: return case nodeUpdate := <-tc.nodeUpdateChannels[worker]: tc.handleNodeUpdate(nodeUpdate) tc.nodeUpdateQueue.Done(nodeUpdate) case podUpdate := <-tc.podUpdateChannels[worker]: // If we found a Pod update we need to empty Node queue first. priority: for { select { case nodeUpdate := <-tc.nodeUpdateChannels[worker]: tc.handleNodeUpdate(nodeUpdate) tc.nodeUpdateQueue.Done(nodeUpdate) default: break priority } } // After Node queue is emptied we process podUpdate. tc.handlePodUpdate(podUpdate) tc.podUpdateQueue.Done(podUpdate) } } } ``` ##### 3.1.2 handleNodeUpdate 核心逻辑: (1)先得到该node上所有的taint (2)得到这个node上所有的pod (3)for循环执行processPodOnNode来一个个的处理pod ``` func (tc *NoExecuteTaintManager) handleNodeUpdate(nodeUpdate nodeUpdateItem) { node, err := tc.getNode(nodeUpdate.nodeName) if err != nil { if apierrors.IsNotFound(err) { // Delete klog.V(4).Infof("Noticed node deletion: %#v", nodeUpdate.nodeName) tc.taintedNodesLock.Lock() defer tc.taintedNodesLock.Unlock() delete(tc.taintedNodes, nodeUpdate.nodeName) return } utilruntime.HandleError(fmt.Errorf("cannot get node %s: %v", nodeUpdate.nodeName, err)) return } // 1.先得到该node上所有的taint // Create or Update klog.V(4).Infof("Noticed node update: %#v", nodeUpdate) taints := getNoExecuteTaints(node.Spec.Taints) func() { tc.taintedNodesLock.Lock() defer tc.taintedNodesLock.Unlock() klog.V(4).Infof("Updating known taints on node %v: %v", node.Name, taints) if len(taints) == 0 { delete(tc.taintedNodes, node.Name) } else { tc.taintedNodes[node.Name] = taints } }() // 2. 得到这个node上所有的pod // This is critical that we update tc.taintedNodes before we call getPodsAssignedToNode: // getPodsAssignedToNode can be delayed as long as all future updates to pods will call // tc.PodUpdated which will use tc.taintedNodes to potentially delete delayed pods. pods, err := tc.getPodsAssignedToNode(node.Name) if err != nil { klog.Errorf(err.Error()) return } if len(pods) == 0 { return } // Short circuit, to make this controller a bit faster. if len(taints) == 0 { klog.V(4).Infof("All taints were removed from the Node %v. Cancelling all evictions...", node.Name) for i := range pods { tc.cancelWorkWithEvent(types.NamespacedName{Namespace: pods[i].Namespace, Name: pods[i].Name}) } return } // 3. 
for循环执行processPodOnNode来一个个的处理pod now := time.Now() for _, pod := range pods { podNamespacedName := types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name} tc.processPodOnNode(podNamespacedName, node.Name, pod.Spec.Tolerations, taints, now) } } ``` ###### 3.1.2.1 processPodOnNode 核心逻辑如下: (1) 如果node没有taint了,那就取消该pod的处理(可能在定时队列中挂着) (2)通过pod的Tolerations和node的taints进行对比,看该pod有没有完全容忍。 (3)如果没有完全容忍,那就先取消对该pod的处理(防止如果pod已经在队列中,不能添加到队列中去),然后再通过AddWork重新挂进去。注意这里设置的时间都是time.now,意思就是马上删除 (4)如果完全容忍,找出来最短能够容忍的时间。看这个函数就知道。如果没有身容忍时间或者容忍时间为负数,都赋值为0,表示马上删除。如果设置了最大值math.MaxInt64。表示一直容忍,永远不删除。否则就找设置的最小的容忍时间 (5)接下里就是根据最小时间来设置等多久触发删除pod了,但是设置之前还要和之前已有的触发再判断一下 * 如果之前就有在等着到时间删除的,并且这次的触发删除时间在那之前。不删除。举例,podA应该是11点删除,这次更新发现pod应该是10.50删除,那么这次就忽略,还是以上次为准 * 否则删除后,再次设置这次的删除时间 ``` func (tc *NoExecuteTaintManager) processPodOnNode( podNamespacedName types.NamespacedName, nodeName string, tolerations []v1.Toleration, taints []v1.Taint, now time.Time, ) { // 1. 如果node没有taint了,那就取消该pod的处理(可能在定时队列中挂着) if len(taints) == 0 { tc.cancelWorkWithEvent(podNamespacedName) } // 2.通过pod的Tolerations和node的taints进行对比,看该pod有没有完全容忍。 allTolerated, usedTolerations := v1helper.GetMatchingTolerations(taints, tolerations) // 3.如果没有完全容忍,那就先取消对该pod的处理(防止如果pod已经在队列中,不能添加到队列中去),然后再通过AddWork重新挂进去。注意这里设置的时间都是time.now,意思就是马上删除 if !allTolerated { klog.V(2).Infof("Not all taints are tolerated after update for Pod %v on %v", podNamespacedName.String(), nodeName) // We're canceling scheduled work (if any), as we're going to delete the Pod right away. tc.cancelWorkWithEvent(podNamespacedName) tc.taintEvictionQueue.AddWork(NewWorkArgs(podNamespacedName.Name, podNamespacedName.Namespace), time.Now(), time.Now()) return } // 4.如果完全容忍,找出来最短能够容忍的时间。看这个函数就知道。如果没有身容忍时间或者容忍时间为负数,都赋值为0,表示马上删除。如果设置了最大值math.MaxInt64。表示一直容忍,永远不删除。否则就找设置的最小的容忍时间 minTolerationTime := getMinTolerationTime(usedTolerations) // getMinTolerationTime returns negative value to denote infinite toleration. if minTolerationTime < 0 { klog.V(4).Infof("New tolerations for %v tolerate forever. Scheduled deletion won't be cancelled if already scheduled.", podNamespacedName.String()) return } // 5. 接下里就是根据最小时间来设置等多久触发删除pod了 startTime := now triggerTime := startTime.Add(minTolerationTime) scheduledEviction := tc.taintEvictionQueue.GetWorkerUnsafe(podNamespacedName.String()) if scheduledEviction != nil { startTime = scheduledEviction.CreatedAt // 5.1 如果之前就有在等着到时间删除的,并且这次的触发删除时间在那之前。不删除。举例,podA应该是11点删除,这次更新发现pod应该是10.50删除,那么这次就忽略,还是以上次为准 if startTime.Add(minTolerationTime).Before(triggerTime) { return } // 5.2 否则删除后,再次设置这次的删除时间 tc.cancelWorkWithEvent(podNamespacedName) } tc.taintEvictionQueue.AddWork(NewWorkArgs(podNamespacedName.Name, podNamespacedName.Namespace), startTime, triggerTime) } ``` ##### 3.1.3 handlePodUpdate handlePodUpdate是handleNodeUpdate的子集。核心逻辑就是processPodOnNode。这个上面分析了,不在分析了 ``` func (tc *NoExecuteTaintManager) handlePodUpdate(podUpdate podUpdateItem) { pod, err := tc.getPod(podUpdate.podName, podUpdate.podNamespace) if err != nil { if apierrors.IsNotFound(err) { // Delete podNamespacedName := types.NamespacedName{Namespace: podUpdate.podNamespace, Name: podUpdate.podName} klog.V(4).Infof("Noticed pod deletion: %#v", podNamespacedName) tc.cancelWorkWithEvent(podNamespacedName) return } utilruntime.HandleError(fmt.Errorf("could not get pod %s/%s: %v", podUpdate.podName, podUpdate.podNamespace, err)) return } // We key the workqueue and shard workers by nodeName. If we don't match the current state we should not be the one processing the current object. 
if pod.Spec.NodeName != podUpdate.nodeName { return } // Create or Update podNamespacedName := types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name} klog.V(4).Infof("Noticed pod update: %#v", podNamespacedName) nodeName := pod.Spec.NodeName if nodeName == "" { return } taints, ok := func() ([]v1.Taint, bool) { tc.taintedNodesLock.Lock() defer tc.taintedNodesLock.Unlock() taints, ok := tc.taintedNodes[nodeName] return taints, ok }() // It's possible that Node was deleted, or Taints were removed before, which triggered // eviction cancelling if it was needed. if !ok { return } tc.processPodOnNode(podNamespacedName, nodeName, pod.Spec.Tolerations, taints, time.Now()) } ``` ##### 3.1.3 nc.taintManager.Run总结 **可以看出来nc.taintManager针对NoExecute污点立即生效的,只要节点有污点,我就要开始驱逐,pod你自身通过设置容忍时间来避免马上驱逐** (1)监听pod, node的add/update事件 (2)通过多个channel的方式,hash打断pod/node事件到不容的chanenl,这样让n个worker负载均衡处理 (3)优先处理node事件,但实际node处理和pod处理是一样的。处理node是将上面的pod一个一个的判断,是否需要驱逐。判断驱逐逻辑核心就是: * 如果node没有taint了,那就取消该pod的处理(可能在定时队列中挂着) * 通过pod的Tolerations和node的taints进行对比,看该pod有没有完全容忍。 * 如果没有完全容忍,那就先取消对该pod的处理(防止如果pod已经在队列中,不能添加到队列中去),然后再通过AddWork重新挂进去。注意这里设置的时间都是time.now,意思就是马上删除 * 如果完全容忍,找出来最短能够容忍的时间。看这个函数就知道。如果没有身容忍时间或者容忍时间为负数,都赋值为0,表示马上删除。如果设置了最大值math.MaxInt64。表示一直容忍,永远不删除。否则就找设置的最小的容忍时间 * 接下里就是根据最小时间来设置等多久触发删除pod了,但是设置之前还要和之前已有的触发再判断一下 * 如果之前就有在等着到时间删除的,并且这次的触发删除时间在那之前。不删除。举例,podA应该是11点删除,这次更新发现pod应该是10.50删除,那么这次就忽略,还是以上次为准 * 否则删除后,再次设置这次的删除时间 ![image-20220811113528003](../images/taintManager.png) #### 3.2 doNodeProcessingPassWorker 可以看出来doNodeProcessingPassWorker核心就是2件事: (1)给node添加NoScheduleTaint (2)给node添加lables ``` func (nc *Controller) doNodeProcessingPassWorker() { for { obj, shutdown := nc.nodeUpdateQueue.Get() // "nodeUpdateQueue" will be shutdown when "stopCh" closed; // we do not need to re-check "stopCh" again. if shutdown { return } nodeName := obj.(string) if err := nc.doNoScheduleTaintingPass(nodeName); err != nil { klog.Errorf("Failed to taint NoSchedule on node <%s>, requeue it: %v", nodeName, err) // TODO(k82cn): Add nodeName back to the queue } // TODO: re-evaluate whether there are any labels that need to be // reconcile in 1.19. Remove this function if it's no longer necessary. if err := nc.reconcileNodeLabels(nodeName); err != nil { klog.Errorf("Failed to reconcile labels for node <%s>, requeue it: %v", nodeName, err) // TODO(yujuhong): Add nodeName back to the queue } nc.nodeUpdateQueue.Done(nodeName) } } ``` ##### 3.2.1 doNoScheduleTaintingPass 核心逻辑就是检查该 node 是否需要添加对应的NoSchedule 逻辑为: - 1、从 nodeLister 中获取该 node 对象; - 2、判断该 node 是否存在以下几种 Condition:(1) False 或 Unknown 状态的 NodeReady Condition;(2) MemoryPressureCondition;(3) DiskPressureCondition;(4) NetworkUnavailableCondition;(5) PIDPressureCondition;若任一一种存在会添加对应的 `NoSchedule` taint; - 3、判断 node 是否处于 `Unschedulable` 状态,若为 `Unschedulable` 也添加对应的 `NoSchedule` taint; - 4、对比 node 已有的 taints 以及需要添加的 taints,以需要添加的 taints 为准,调用 `nodeutil.SwapNodeControllerTaint` 为 node 添加不存在的 taints 并删除不需要的 taints; ``` func (nc *Controller) doNoScheduleTaintingPass(nodeName string) error { node, err := nc.nodeLister.Get(nodeName) if err != nil { // If node not found, just ignore it. if apierrors.IsNotFound(err) { return nil } return err } // Map node's condition to Taints. 
var taints []v1.Taint for _, condition := range node.Status.Conditions { if taintMap, found := nodeConditionToTaintKeyStatusMap[condition.Type]; found { if taintKey, found := taintMap[condition.Status]; found { taints = append(taints, v1.Taint{ Key: taintKey, Effect: v1.TaintEffectNoSchedule, }) } } } if node.Spec.Unschedulable { // If unschedulable, append related taint. taints = append(taints, v1.Taint{ Key: v1.TaintNodeUnschedulable, Effect: v1.TaintEffectNoSchedule, }) } // Get exist taints of node. nodeTaints := taintutils.TaintSetFilter(node.Spec.Taints, func(t *v1.Taint) bool { // only NoSchedule taints are candidates to be compared with "taints" later if t.Effect != v1.TaintEffectNoSchedule { return false } // Find unschedulable taint of node. if t.Key == v1.TaintNodeUnschedulable { return true } // Find node condition taints of node. _, found := taintKeyToNodeConditionMap[t.Key] return found }) taintsToAdd, taintsToDel := taintutils.TaintSetDiff(taints, nodeTaints) // If nothing to add not delete, return true directly. if len(taintsToAdd) == 0 && len(taintsToDel) == 0 { return nil } if !nodeutil.SwapNodeControllerTaint(nc.kubeClient, taintsToAdd, taintsToDel, node) { return fmt.Errorf("failed to swap taints of node %+v", node) } return nil } nodeConditionToTaintKeyStatusMap = map[v1.NodeConditionType]map[v1.ConditionStatus]string{ v1.NodeReady: { v1.ConditionFalse: v1.TaintNodeNotReady, v1.ConditionUnknown: v1.TaintNodeUnreachable, }, v1.NodeMemoryPressure: { v1.ConditionTrue: v1.TaintNodeMemoryPressure, }, v1.NodeDiskPressure: { v1.ConditionTrue: v1.TaintNodeDiskPressure, }, v1.NodeNetworkUnavailable: { v1.ConditionTrue: v1.TaintNodeNetworkUnavailable, }, v1.NodePIDPressure: { v1.ConditionTrue: v1.TaintNodePIDPressure, }, } ``` ##### 3.2.2 reconcileNodeLabels reconcileNodeLabels就是及时给node更新: ``` beta.kubernetes.io/arch: amd64 beta.kubernetes.io/os: linux kubernetes.io/arch: amd64 kubernetes.io/os: linux ```
``` // reconcileNodeLabels reconciles node labels. func (nc *Controller) reconcileNodeLabels(nodeName string) error { node, err := nc.nodeLister.Get(nodeName) if err != nil { // If node not found, just ignore it. if apierrors.IsNotFound(err) { return nil } return err } if node.Labels == nil { // Nothing to reconcile. return nil } labelsToUpdate := map[string]string{} for _, r := range labelReconcileInfo { primaryValue, primaryExists := node.Labels[r.primaryKey] secondaryValue, secondaryExists := node.Labels[r.secondaryKey] if !primaryExists { // The primary label key does not exist. This should not happen // within our supported version skew range, when no external // components/factors modifying the node object. Ignore this case. continue } if secondaryExists && primaryValue != secondaryValue { // Secondary label exists, but not consistent with the primary // label. Need to reconcile. labelsToUpdate[r.secondaryKey] = primaryValue } else if !secondaryExists && r.ensureSecondaryExists { // Apply secondary label based on primary label. labelsToUpdate[r.secondaryKey] = primaryValue } } if len(labelsToUpdate) == 0 { return nil } if !nodeutil.AddOrUpdateLabelsOnNode(nc.kubeClient, labelsToUpdate, node) { return fmt.Errorf("failed update labels for node %+v", node) } return nil } ``` #### 3.3 doPodProcessingWorker doPodProcessingWorker从podUpdateQueue读取一个pod,执行processPod。(注意这里的podUpdateQueue和tainManger的podUpdateQueue不是一个队列,是同名而已) processPod和新逻辑如下: (1) 判断NodeCondition是否notReady (2)如果feature-gates=TaintBasedEvictions=false,则执行processNoTaintBaseEviction (3)最终都会判断node ReadyCondition是否不为true,如果不为true, 执行MarkPodsNotReady–如果pod的ready condition不为false, 将pod的ready condition设置为false,并更新LastTransitionTimestamp;否则不更新pod ``` func (nc *Controller) doPodProcessingWorker() { for { obj, shutdown := nc.podUpdateQueue.Get() // "podUpdateQueue" will be shutdown when "stopCh" closed; // we do not need to re-check "stopCh" again. if shutdown { return } podItem := obj.(podUpdateItem) nc.processPod(podItem) } } // processPod is processing events of assigning pods to nodes. In particular: // 1. for NodeReady=true node, taint eviction for this pod will be cancelled // 2. for NodeReady=false or unknown node, taint eviction of pod will happen and pod will be marked as not ready // 3. if node doesn't exist in cache, it will be skipped and handled later by doEvictionPass func (nc *Controller) processPod(podItem podUpdateItem) { defer nc.podUpdateQueue.Done(podItem) pod, err := nc.podLister.Pods(podItem.namespace).Get(podItem.name) if err != nil { if apierrors.IsNotFound(err) { // If the pod was deleted, there is no need to requeue. return } klog.Warningf("Failed to read pod %v/%v: %v.", podItem.namespace, podItem.name, err) nc.podUpdateQueue.AddRateLimited(podItem) return } nodeName := pod.Spec.NodeName nodeHealth := nc.nodeHealthMap.getDeepCopy(nodeName) if nodeHealth == nil { // Node data is not gathered yet or node has beed removed in the meantime. // Pod will be handled by doEvictionPass method. return } node, err := nc.nodeLister.Get(nodeName) if err != nil { klog.Warningf("Failed to read node %v: %v.", nodeName, err) nc.podUpdateQueue.AddRateLimited(podItem) return } // 1. 判断NodeCondition是否notReady _, currentReadyCondition := nodeutil.GetNodeCondition(nodeHealth.status, v1.NodeReady) if currentReadyCondition == nil { // Lack of NodeReady condition may only happen after node addition (or if it will be maliciously deleted). 
// In both cases, the pod will be handled correctly (evicted if needed) during processing // of the next node update event. return } // 2.如果feature-gates=TaintBasedEvictions=false,则执行processNoTaintBaseEviction pods := []*v1.Pod{pod} // In taint-based eviction mode, only node updates are processed by NodeLifecycleController. // Pods are processed by TaintManager. if !nc.useTaintBasedEvictions { if err := nc.processNoTaintBaseEviction(node, currentReadyCondition, nc.nodeMonitorGracePeriod, pods); err != nil { klog.Warningf("Unable to process pod %+v eviction from node %v: %v.", podItem, nodeName, err) nc.podUpdateQueue.AddRateLimited(podItem) return } } // 3.最终都会判断node ReadyCondition是否不为true,如果不为true, 执行MarkPodsNotReady–如果pod的ready condition不为false, 将pod的ready condition设置为false,并更新LastTransitionTimestamp;否则不更新pod if currentReadyCondition.Status != v1.ConditionTrue { if err := nodeutil.MarkPodsNotReady(nc.kubeClient, pods, nodeName); err != nil { klog.Warningf("Unable to mark pod %+v NotReady on node %v: %v.", podItem, nodeName, err) nc.podUpdateQueue.AddRateLimited(podItem) } } } ``` ##### 3.3.1 processNoTaintBaseEviction 核心逻辑如下: (1)node最后发现ReadyCondition为false,如果nodeHealthMap里的readyTransitionTimestamp加上podEvictionTimeout的时间是过去的时间–ReadyCondition为false状态已经持续了至少podEvictionTimeout,执行evictPods。 (2)node最后发现ReadyCondition为unknown,如果nodeHealthMap里的probeTimestamp加上podEvictionTimeout的时间是过去的时间–ReadyCondition为false状态已经持续了至少podEvictionTimeout,执行evictPods。 (3)node最后发现ReadyCondition为true,则执行cancelPodEviction–在nodeEvictionMap设置status为unmarked,然后node从zonePodEvictor队列中移除。 **evictPods并不会马上驱逐pod,他还是看node是否已经是驱逐状态。** evictPods先从nodeEvictionMap获取node驱逐的状态,如果是evicted说明node已经发生驱逐,则把node上的这个pod删除。否则设置状态为toBeEvicted,然后node加入**zonePodEvictor**队列等待执行驱逐pod ``` func (nc *Controller) processNoTaintBaseEviction(node *v1.Node, observedReadyCondition *v1.NodeCondition, gracePeriod time.Duration, pods []*v1.Pod) error { decisionTimestamp := nc.now() nodeHealthData := nc.nodeHealthMap.getDeepCopy(node.Name) if nodeHealthData == nil { return fmt.Errorf("health data doesn't exist for node %q", node.Name) } // Check eviction timeout against decisionTimestamp switch observedReadyCondition.Status { case v1.ConditionFalse: if decisionTimestamp.After(nodeHealthData.readyTransitionTimestamp.Add(nc.podEvictionTimeout)) { enqueued, err := nc.evictPods(node, pods) if err != nil { return err } if enqueued { klog.V(2).Infof("Node is NotReady. Adding Pods on Node %s to eviction queue: %v is later than %v + %v", node.Name, decisionTimestamp, nodeHealthData.readyTransitionTimestamp, nc.podEvictionTimeout, ) } } case v1.ConditionUnknown: if decisionTimestamp.After(nodeHealthData.probeTimestamp.Add(nc.podEvictionTimeout)) { enqueued, err := nc.evictPods(node, pods) if err != nil { return err } if enqueued { klog.V(2).Infof("Node is unresponsive. Adding Pods on Node %s to eviction queues: %v is later than %v + %v", node.Name, decisionTimestamp, nodeHealthData.readyTransitionTimestamp, nc.podEvictionTimeout-gracePeriod, ) } } case v1.ConditionTrue: if nc.cancelPodEviction(node) { klog.V(2).Infof("Node %s is ready again, cancelled pod eviction", node.Name) } } return nil } // evictPods: // - adds node to evictor queue if the node is not marked as evicted. // Returns false if the node name was already enqueued. // - deletes pods immediately if node is already marked as evicted. // Returns false, because the node wasn't added to the queue. 
func (nc *Controller) evictPods(node *v1.Node, pods []*v1.Pod) (bool, error) { nc.evictorLock.Lock() defer nc.evictorLock.Unlock() status, ok := nc.nodeEvictionMap.getStatus(node.Name) if ok && status == evicted { // Node eviction already happened for this node. // Handling immediate pod deletion. _, err := nodeutil.DeletePods(nc.kubeClient, pods, nc.recorder, node.Name, string(node.UID), nc.daemonSetStore) if err != nil { return false, fmt.Errorf("unable to delete pods from node %q: %v", node.Name, err) } return false, nil } if !nc.nodeEvictionMap.setStatus(node.Name, toBeEvicted) { klog.V(2).Infof("node %v was unregistered in the meantime - skipping setting status", node.Name) } return nc.zonePodEvictor[utilnode.GetZoneKey(node)].Add(node.Name, string(node.UID)), nil } ``` #### 3.4 doEvictionPass(if useTaintBasedEvictions==false) **doEvictionPass是一个令牌桶限速队列(受参数evictionLimiterQPS影响,默认0.1也就是10s驱逐一个node)**,+加入这个队列的node都是 unready状态持续时间大于podEvictionTimeout。(这个就是processNoTaintBaseEviction将node加入了队列) - 遍历zonePodEvictor,获取一个zone里的node队列,从队列中获取一个node,执行下面步骤 - 获取node的uid,从缓存中获取node上的所有pod - 执行DeletePods–删除daemonset之外的所有pod,保留daemonset的pod 1. 遍历所由的pod,检查pod绑定的node是否跟提供的一样,不一样则跳过这个pod 2. 执行SetPodTerminationReason–设置pod Status.Reason为`NodeLost`,Status.Message为`"Node %v which was running pod %v is unresponsive"`,并更新pod。 3. 如果pod 设置了DeletionGracePeriodSeconds,说明pod已经被删除,则跳过这个pod 4. 判断pod是否为daemonset的pod,如果是则跳过这个pod 5. 删除这个pod - 在nodeEvictionMap设置node的状态为evicted #### 3.5 doNoExecuteTaintingPass(if useTaintBasedEvictions==true) 启用taint manager 执行doNoExecuteTaintingPass–添加NoExecute的taint。这里不执行驱逐,驱逐单独在taint manager里处理。 doNoExecuteTaintingPass是一个令牌桶限速队列(也是受受参数evictionLimiterQPS影响,默认0.1也就是10s驱逐一个node) - 遍历zoneNoExecuteTainter,获得一个zone的node队列,从队列中获取一个node,执行下面步骤 - 从缓存中获取node - 如果node ready condition为false,移除“node.kubernetes.io/unreachable”的taint,添加“node.kubernetes.io/not-ready” 的taint,Effect为NoExecute。 - 如果node ready condition为unknown,移除“node.kubernetes.io/not-ready” 的taint,添加“node.kubernetes.io/unreachable” 的taint,Effect为NoExecute。 #### 3.6 monitorNodeHealth (3.6该章节摘自https://midbai.com/post/node-lifecycle-controller-manager/) 无论是否启用了 `TaintBasedEvictions` 特性,需要打 taint 或者驱逐 pod 的 node 都会被放在 zoneNoExecuteTainter 或者 zonePodEvictor 队列中,而 `nc.monitorNodeHealth` 就是这两个队列中数据的生产者。`nc.monitorNodeHealth` 的主要功能是持续监控 node 的状态,当 node 处于异常状态时更新 node 的 taint 以及 node 上 pod 的状态或者直接驱逐 node 上的 pod,此外还会为集群下的所有 node 划分 zoneStates 并为每个 zoneStates 设置对应的驱逐速率。 每隔nodeMonitorPeriod周期,执行一次monitorNodeHealth,维护node状态和zone的状态,更新未响应的node–设置node status为unknown和根据集群不同状态设置zone的速率。 ##### 3.6.1 node分类并初始化 从缓存中获取所有node列表,借助两个字段knownNodeSet(用来存放已经发现的node集合)和zoneStates(用来存储已经发现zone的状态–状态有Initial、Normal、FullDisruption、PartialDisruption)来进行对node进行分类,分为新加的–add、删除的deleted、新的zone node–newZoneRepresentatives。 对新发现的zone进行初始化–启用taint manager,设置执行node设置taint 队列zoneNoExecuteTainter(存放node为unready,需要添加taint)的速率为evictionLimiterQPS。未启用taint manager,设置安排node执行驱逐队列zonePodEvictor(存放zone里的需要执行pod evictor的node列表)的速率evictionLimiterQPS。同时在zoneStates里设置zone状态为stateInitial。 对新发现的node,添加到knownNodeSet,同时在zoneStates里设置zone状态为stateInitial,如果node的所属的zone未初始化,则进行初始化。启用taint manager,标记node为健康的–移除node上unreachable和notready taint(如果存在),从zoneNoExecuteTainter(存放node为unready,需要添加taint)队列中移除(如果存在)。未启用taint manager,初始化nodeEvictionMap(存放node驱逐执行pod的进度)–设置node的状态为unmarked,从zonePodEvictor(存放zone的需要pod evictor的node列表)队列中移除。 对删除的node,发送一个RemovingNode事件并从knownNodeSet里移除。 ##### 3.6.2 处理node status **超时时间** 如果当前node的ready 
condition为空,说明node刚注册,所以它的超时时间为nodeStartupGracePeriod,否则它的超时时间为nodeMonitorGracePeriod。 **心跳时间** 最后的心跳时间(probeTimestamp和readyTransitionTimestamp),由下面规则从上往下执行。 如果node刚注册,则nodeHealthMap保存的probeTimestamp和readyTransitionTimestamp都为node的创建时间。 如果nodeHealthMap里没有该node数据,则probeTimestamp和readyTransitionTimestamp都为现在。 如果nodeHealthMap里的 ready condition没有,而现在有ready condition,则probeTimestamp和readyTransitionTimestamp都为现在,status为现在的status。 如果nodeHealthMap里的有ready condition,而现在的ready condition没有,说明发生了未知的异常情况(一般不会发生,只是预防性的代码),则probeTimestamp和readyTransitionTimestamp都为现在,status为现在的status。 如果nodeHealthMap里有ready condition,而现在的ready condition也有,且保存的LastHeartbeatTime与现在不一样。probeTimestamp为现在、status为现在的status。 如果保存的LastTransitionTime与现在的不一样,说明node状态发生了变化,则设置nodeHealthMap的readyTransitionTimestamp为现在。 如果现在的lease存在,且lease的RenewTime在nodeHealthMap保存的RenewTime之后,或者nodeHealthMap里不存在。则probeTimestamp为现在,保存现在lease到nodeHealthMap里。 **尝试更新node状态** 如果probeTimestamp加上超时时间,在现在之前–即status状态更新已经超时,则会更新update node。 更新ready、memorypressure、diskpressure、pidpressure的condition为: 相应condition不存在 ``` v1.NodeCondition{ Type: nodeConditionType,//上面的四种类型 Status: v1.ConditionUnknown,// unknown Reason: "NodeStatusNeverUpdated", Message: "Kubelet never posted node status.", LastHeartbeatTime: node.CreationTimestamp,//node创建时间 LastTransitionTime: nowTimestamp, //现在时间 } ``` 相应的condition存在 ```` currentCondition.Status = v1.ConditionUnknown currentCondition.Reason = "NodeStatusUnknown" currentCondition.Message = "Kubelet stopped posting node status." currentCondition.LastTransitionTime = nowTimestamp ```` 如果现在node与之前的node不一样的–发生了更新,则对node执行update。 update成功,同时更新nodeHealthMap上的状态–readyTransitionTimestamp改为现在,status改为现在的node.status。 **对unready node进行处理–驱逐pod** node当前的ReadyCondition–执行尝试更新node状态之后的node的ReadyCondition node最后发现ReadyCondition–执行尝试更新node状态之前node的ReadyCondition 如果当前的ReadyCondition不为空,执行下面操作 1. 从缓存中获取node上pod列表 2. 如果启用taint manager,执行processTaintBaseEviction–根据node最后发现ReadyCondition 对node的taint进行操作 1. node最后发现ReadyCondition为false,如果已经有“node.kubernetes.io/unreachable”的taint,将该taint删除,添加“node.kubernetes.io/not-ready” 的taint。否则将node添加到zoneNoExecuteTainter队列中,等待添加taint。 2. node最后发现ReadyCondition为unknown,如果已经有“node.kubernetes.io/not-ready” 的taint,将该taint删除,添加“node.kubernetes.io/unreachable”的taint。否则将node添加到zoneNoExecuteTainter队列中,等待添加taint。 3. node最后发现ReadyCondition为true,移除“node.kubernetes.io/not-ready” 和“node.kubernetes.io/unreachable”的taint,如果存在的话,同时从zoneNoExecuteTainter队列中移除。 3. 未启用taint manager,则执行processNoTaintBaseEviction - node最后发现ReadyCondition为false,nodeHealthMap里的readyTransitionTimestamp加上podEvictionTimeout的时间是过去的时间–ReadyCondition为false状态已经持续了至少podEvictionTimeout,执行evictPods。 - node最后发现ReadyCondition为unknown,nodeHealthMap里的readyTransitionTimestamp加上podEvictionTimeout的时间是过去的时间–ReadyCondition为false状态已经持续了至少podEvictionTimeout,执行evictPods。 - node最后发现ReadyCondition为true,则执行cancelPodEviction–在nodeEvictionMap设置status为unmarked,然后node从zonePodEvictor队列中移除。 - evictPods–先从nodeEvictionMap获取node驱逐的状态,如果是evicted说明node已经发生驱逐,则把node上所有的pod删除。否则设置状态为toBeEvicted,然后node加入zonePodEvictor队列等待执行驱逐pod。 **这里有个疑问**: 为什么要用observedReadyCondition 而不用currentReadyCondition,observedReadyCondition和currentReadyCondition不一定一样? 
比如 node 挂了,currentReadyCondition 变为 unknown,而 observedReadyCondition 仍为 ready,这样明显有问题:这一周期不会做驱逐或打 taint;但下一周期 observedReadyCondition 和 currentReadyCondition 都为 unknown,一定会驱逐 pod 或添加 taint。可能是考虑到 nodeMonitorPeriod 很短,晚一个周期再执行驱逐或打 taint 没有什么大问题。

##### 3.6.3 集群健康状态处理

每个 zone 有四种状态:stateInitial(刚加入的 zone)、stateFullDisruption(全挂)、statePartialDisruption(挂的 node 比例超出了 unhealthyZoneThreshold)、stateNormal(剩下的所有情况)。

allAreFullyDisrupted 为 true 代表现在所有 zone 的状态都是 stateFullDisruption(全挂);allWasFullyDisrupted 为 true 代表过去所有 zone 的状态都是 stateFullDisruption(全挂)。

集群状态有四种:

- allAreFullyDisrupted 为 true,allWasFullyDisrupted 为 true
- allAreFullyDisrupted 为 true,allWasFullyDisrupted 为 false
- allAreFullyDisrupted 为 false,allWasFullyDisrupted 为 true
- allAreFullyDisrupted 为 false,allWasFullyDisrupted 为 false

**计算现在集群的状态**

遍历现在所有的 zone,每个 zone 遍历所有 node 的 ready condition,计算出 zone 的状态;根据 zone 的状态设置 allAreFullyDisrupted 的值;如果 zone 不在 zoneStates 里,添加进 zoneStates 并设置状态为 stateInitial。

**计算过去集群的状态**

从 zoneStates 读取保存的 zone 列表,如果不在现在的 zone 列表里,则从 zoneStates 移除;根据 zoneStates 里保存的 zone 状态设置 allWasFullyDisrupted 的值。

**设置每个 zone 每秒安排多少个 node 来执行 taint 或驱逐**

当 allAreFullyDisrupted 为 true、allWasFullyDisrupted 为 false–之前 zone 未全挂,现在所有 zone 全挂:

1. 遍历所有 node,设置 node 为正常状态。
   - 启用 taint manager:执行 markNodeAsReachable–移除“node.kubernetes.io/not-ready”和“node.kubernetes.io/unreachable”的 taint(如果存在的话),同时将 node 从 zoneNoExecuteTainter 队列中移除
   - 未启用 taint manager:执行 cancelPodEviction–在 nodeEvictionMap 设置 status 为 unmarked,然后将 node 从 zonePodEvictor 队列中移除
2. 从 zoneStates 读取保存的 zone 列表,设置 zone 每秒安排多少个 node 来执行 taint 或驱逐
   - 启用 taint manager:设置 zoneNoExecuteTainter 的速率为 0
   - 未启用 taint manager:设置 zonePodEvictor 的速率为 0
3. 设置所有 zoneStates 里的 zone 为 stateFullDisruption

当 allAreFullyDisrupted 为 false、allWasFullyDisrupted 为 true–过去所有 zone 全挂,现在所有 zone 未全挂:

1. 遍历所有 node,更新 nodeHealthMap 里的 probeTimestamp、readyTransitionTimestamp 为现在的时间戳
2. 遍历 zoneStates,重新评估每个 zone 每秒安排多少个 node 来执行 taint 或驱逐
   - 当 zone 的状态为 stateNormal:如果启用 taint manager,则 zoneNoExecuteTainter 速率设置为 evictionLimiterQPS;否则设置 zonePodEvictor 的速率为 evictionLimiterQPS
   - 当 zone 状态为 statePartialDisruption:如果启用 taint manager,根据 zone 里的 node 数量,当 node 数量大于 largeClusterThreshold 时设置 zoneNoExecuteTainter 速率为 secondaryEvictionLimiterQPS,小于等于 largeClusterThreshold 时设置为 0;未启用 taint manager 时同理,对 zonePodEvictor 做同样的设置
   - 当 zone 状态为 stateFullDisruption:如果启用 taint manager,则 zoneNoExecuteTainter 速率设置为 evictionLimiterQPS;否则设置 zonePodEvictor 的速率为 evictionLimiterQPS
   - 这里不处理 stateInitial 状态的 zone,因为到下一周期,zone 就会变成非 stateInitial,由下面的情况来处理

除了上面两种情况,还有一种情况要处理:allAreFullyDisrupted 为 false、allWasFullyDisrupted 为 false,即没有发生“所有 zone 全挂”。这个时候 zone 有可能发生状态转换,所以需要重新评估 zone 的速率:

1. 遍历 zoneStates,当保存的状态和新的状态不一致时–zone 状态发生了变化,按照上面相同的规则(stateNormal / statePartialDisruption / stateFullDisruption 三种情况)重新设置该 zone 的速率
2. 将 zoneStates 里的状态更新为新的状态

而 allAreFullyDisrupted 为 true、allWasFullyDisrupted 为 true,说明集群一直都是全挂的状态,zone 状态没有发生改变,不需要处理。
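上面按 zone 状态挑选速率的规则,可以用一小段示意代码概括(非源码原文,函数名 rateForZone 以及 0.1、0.01、50 这些默认值是笔者按 v1.17 默认配置做的假设):

```go
package main

import "fmt"

// zone 状态,仅为示意,对应源码中的 ZoneState
type zoneState string

const (
	stateNormal            zoneState = "Normal"
	statePartialDisruption zoneState = "PartialDisruption"
	stateFullDisruption    zoneState = "FullDisruption"
)

// rateForZone 按上文描述返回某个 zone 每秒允许处理(打 taint 或驱逐)的 node 数。
// 注意:“所有 zone 全挂”时速率统一降为 0 的特殊情况在上文单独处理,这里不体现。
func rateForZone(state zoneState, zoneSize, largeClusterThreshold int,
	evictionLimiterQPS, secondaryEvictionLimiterQPS float32) float32 {
	switch state {
	case stateNormal, stateFullDisruption:
		// 正常,或只有个别 zone 全挂:按默认速率处理
		return evictionLimiterQPS
	case statePartialDisruption:
		// 部分异常:大集群降速,小集群直接停止打 taint / 驱逐
		if zoneSize > largeClusterThreshold {
			return secondaryEvictionLimiterQPS
		}
		return 0
	}
	return evictionLimiterQPS
}

func main() {
	fmt.Println(rateForZone(statePartialDisruption, 100, 50, 0.1, 0.01)) // 0.01
	fmt.Println(rateForZone(statePartialDisruption, 20, 50, 0.1, 0.01))  // 0
	fmt.Println(rateForZone(stateNormal, 20, 50, 0.1, 0.01))             // 0.1
}
```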
### 4 总结

nodeLifecycleController 核心逻辑如下,启动了以下协程:

(1)monitorNodeHealth:维护 node 的状态,并根据是否开启 TaintBasedEvictions,把需要处理的 node 加入 zoneNoExecuteTainter 或 zonePodEvictor 队列,实现按速率驱逐。

(2)doNodeProcessingPassWorker:监听 node,根据 node 状态设置 NoSchedule 污点并 reconcile node labels(只影响调度,与驱逐无关)。

(3)如果开启了 TaintBasedEvictions,就会执行 doNoExecuteTaintingPass,从 zoneNoExecuteTainter 取出 node 设置 NoExecute 污点(这里可以控制打污点的速率);同时 taintManager 也会 run,由它来完成 pod 的驱逐。

(4)如果没有开启 TaintBasedEvictions,就会启动 doEvictionPass,从 zonePodEvictor 取出 node,直接进行 pod 驱逐。

(5)doPodProcessingWorker:监听 pod,设置 pod 状态;如果没有开启 TaintBasedEvictions,还会进行 pod 的驱逐。

![image-20220811161358504](../images/taintManager-2.png)
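上面这些协程都是在 Run 方法里被拉起的,简化后的骨架大致如下(示意代码,省略了等待缓存同步、worker 数量等细节,以 v1.17 源码为参照,细节可能略有出入):

```go
// 简化示意:NodeLifecycleController 的 Run 中启动各个工作协程
func (nc *Controller) Run(stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()

	// 开启 taint manager 时,由它单独负责基于污点的 pod 驱逐
	if nc.runTaintManager {
		go nc.taintManager.Run(stopCh)
	}

	// (2)(5)消费 nodeUpdateQueue / podUpdateQueue 的 worker
	for i := 0; i < scheduler.UpdateWorkerSize; i++ {
		go wait.Until(nc.doNodeProcessingPassWorker, time.Second, stopCh)
		go wait.Until(nc.doPodProcessingWorker, time.Second, stopCh)
	}

	// (3)(4)二选一:按速率打 NoExecute 污点,或按速率直接驱逐 pod
	if nc.useTaintBasedEvictions {
		go wait.Until(nc.doNoExecuteTaintingPass, scheduler.NodeEvictionPeriod, stopCh)
	} else {
		go wait.Until(nc.doEvictionPass, scheduler.NodeEvictionPeriod, stopCh)
	}

	// (1)周期性维护 node 状态和 zone 状态
	go wait.Until(func() {
		if err := nc.monitorNodeHealth(); err != nil {
			klog.Errorf("Error monitoring node health: %v", err)
		}
	}, nc.nodeMonitorPeriod, stopCh)

	<-stopCh
}
```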
一般而言,kcm都是有2中设置: pod-eviction-timeout:默认5分钟 enable-taint-manager,TaintBasedEvictions默认true (1)开启驱逐,或者使用默认值 ``` --pod-eviction-timeout=5m --enable-taint-manager=true --feature-gates=TaintBasedEvictions=true ``` 这个时候**pod-eviction-timeout是不起作用的**,只要node有污点,Pod会马上驱逐。(变更Kubelet的时候要小心这个坑) (2)不开启污点驱逐 ``` --pod-eviction-timeout=5m --enable-taint-manager=false --feature-gates=TaintBasedEvictions=false ``` 这个时候**pod-eviction-timeout起作用的**,node notReady 5分钟后,pod会被驱逐。 ================================================ FILE: k8s/kcm/11.k8s node状态更新机制 .md ================================================ **注意** 为了防止参考链接失效,本文摘抄自:https://www.qikqiak.com/post/kubelet-sync-node-status/ 当 Kubernetes 中 Node 节点出现状态异常的情况下,节点上的 Pod 会被重新调度到其他节点上去,但是有的时候我们会发现节点 Down 掉以后,Pod 并不会立即触发重新调度,这实际上就是和 Kubelet 的状态更新机制密切相关的,Kubernetes 提供了一些参数配置来触发重新调度到嗯时间,下面我们来分析下 Kubelet 状态更新的基本流程。 1. kubelet 自身会定期更新状态到 apiserver,通过参数`--node-status-update-frequency`指定上报频率,默认是 10s 上报一次。 2. kube-controller-manager 会每隔`--node-monitor-period`时间去检查 kubelet 的状态,默认是 5s。 3. 当 node 失联一段时间后,kubernetes 判定 node 为 `notready` 状态,这段时长通过`--node-monitor-grace-period`参数配置,默认 40s。 4. 当 node 失联一段时间后,kubernetes 判定 node 为 `unhealthy` 状态,这段时长通过`--node-startup-grace-period`参数配置,默认 1m0s。 5. 当 node 失联一段时间后,kubernetes 开始删除原 node 上的 pod,这段时长是通过`--pod-eviction-timeout`参数配置,默认 5m0s。 > kube-controller-manager 和 kubelet 是异步工作的,这意味着延迟可能包括任何的网络延迟、apiserver 的延迟、etcd 延迟,一个节点上的负载引起的延迟等等。因此,如果`--node-status-update-frequency`设置为5s,那么实际上 etcd 中的数据变化会需要 6-7s,甚至更长时间。 Kubelet在更新状态失败时,会进行`nodeStatusUpdateRetry`次重试,默认为 5 次。 Kubelet 会在函数`tryUpdateNodeStatus`中尝试进行状态更新。Kubelet 使用了 Golang 中的`http.Client()`方法,但是没有指定超时时间,因此,如果 API Server 过载时,当建立 TCP 连接时可能会出现一些故障。 因此,在`nodeStatusUpdateRetry` * `--node-status-update-frequency`时间后才会更新一次节点状态。 同时,Kubernetes 的 controller manager 将尝试每`--node-monitor-period`时间周期内检查`nodeStatusUpdateRetry`次。在`--node-monitor-grace-period`之后,会认为节点 unhealthy,然后会在`--pod-eviction-timeout`后删除 Pod。 kube proxy 有一个 watcher API,一旦 Pod 被驱逐了,kube proxy 将会通知更新节点的 iptables 规则,将 Pod 从 Service 的 Endpoints 中移除,这样就不会访问到来自故障节点的 Pod 了。 ## 配置 对于这些参数的配置,需要根据不通的集群规模场景来进行配置。 ### 社区默认的配置 | 参数 | 值 | | :---------------------------- | :--- | | –node-status-update-frequency | 10s | | –node-monitor-period | 5s | | –node-monitor-grace-period | 40s | | –pod-eviction-timeout | 5m | ### 快速更新和快速响应 | 参数 | 值 | | :---------------------------- | :--- | | –node-status-update-frequency | 4s | | –node-monitor-period | 2s | | –node-monitor-grace-period | 20s | | –pod-eviction-timeout | 30s | 在这种情况下,Pod 将在 50s 被驱逐,因为该节点在 20s 后被视为Down掉了,`--pod-eviction-timeout`在 30s 之后发生,但是,这种情况会给 etcd 产生很大的开销,因为每个节点都会尝试每 2s 更新一次状态。 如果环境有1000个节点,那么每分钟将有15000次节点更新操作,这可能需要大型 etcd 容器甚至是 etcd 的专用节点。 > 如果我们计算尝试次数,则除法将给出5,但实际上每次尝试的 nodeStatusUpdateRetry 尝试将从3到5。 由于所有组件的延迟,尝试总次数将在15到25之间变化。 ### 中等更新和平均响应 | 参数 | 值 | | :---------------------------- | :--- | | –node-status-update-frequency | 20s | | –node-monitor-period | 5s | | –node-monitor-grace-period | 2m | | –pod-eviction-timeout | 1m | 这种场景下会 20s 更新一次 node 状态,controller manager 认为 node 状态不正常之前,会有 2m*60⁄20*5=30 次的 node 状态更新,Node 状态为 down 之后 1m,就会触发驱逐操作。 如果有 1000 个节点,1分钟之内就会有 60s/20s*1000=3000 次的节点状态更新操作。 ### 低更新和慢响应 | 参数 | 值 | | :---------------------------- | :--- | | –node-status-update-frequency | 1m | | –node-monitor-period | 5s | | –node-monitor-grace-period | 5m | | –pod-eviction-timeout | 1m | Kubelet 将会 1m 更新一次节点的状态,在认为不健康之后会有 5m/1m*5=25 次重试更新的机会。Node为不健康的时候,1m 之后 pod开始被驱逐。 可以有不同的组合,例如快速更新和慢反应以满足特定情况。 原文链接: 
https://github.com/kubernetes-sigs/kubespray/blob/master/docs/kubernetes-reliability.md ================================================ FILE: k8s/kcm/2-deployment controller-manager源码分析.md ================================================ Table of Contents ================= * [1. deploy基础概念](#1-deploy基础概念) * [1.1. metadata.generation & status.observedGeneration](#11-metadatageneration--statusobservedgeneration) * [1.2. metadata.resourceVersion](#12-metadataresourceversion) * [1.3 status](#13-status) * [2. startDeploymentController](#2-startdeploymentcontroller) * [3. NewDeploymentController](#3-newdeploymentcontroller) * [4. 对deploy, rs, pod的处理](#4-对deploy-rs-pod的处理) * [4.1 add,update, del deploy](#41-addupdate-del-deploy) * [4.2 add,update,del ReplicaSet](#42-addupdatedel-replicaset) * [4.3 del pod](#43-del-pod) * [4.4 getDeploymentForPod](#44-getdeploymentforpod) * [4.5 总结](#45-总结) * [5. syncDeployment](#5-syncdeployment) * [5.1 删除deploy](#51-删除deploy) * [5.1.1 getAllReplicaSetsAndSyncRevision](#511-getallreplicasetsandsyncrevision) * [5.1.2 syncDeploymentStatus](#512-syncdeploymentstatus) * [5.1.3 总结](#513-总结) * [5.2 pause操作](#52-pause操作) * [5.3 Rollback操作](#53-rollback操作) * [5.4 scale操作](#54-scale操作) * [5.4.1 获得最新的一个activeRs](#541-获得最新的一个activers) * [5.4.2 如果newRS已经是期望状态,将所有的oldRS缩到0](#542-如果newrs已经是期望状态将所有的oldrs缩到0) * [5.5 recreate更新](#55-recreate更新) * [5.6 rolloutRolling更新](#56-rolloutrolling更新) * [5.6.1 如果是scaledUp(针对news),返回 syncRolloutStatus](#561-如果是scaledup针对news返回-syncrolloutstatus) * [5.7 scaleReplicaSetAndRecordEvent](#57-scalereplicasetandrecordevent) ### 1. deploy基础概念 ``` root@k8s-master# kubectl get deploy nginx-deployment -oyaml apiVersion: apps/v1 kind: Deployment metadata: annotations: deployment.kubernetes.io/revision: "2" // 这个是版本号,说明这是第二个版本。 generation: 4 // 这里有个 generation labels: app: nginx name: nginx-deployment resourceVersion: "59522723" selfLink: /apis/apps/v1/namespaces/default/deployments/nginx-deployment uid: a6830e24-a479-452d-bbb2-3cb3cad82ebf spec: progressDeadlineSeconds: 600 replicas: 2 revisionHistoryLimit: 2 // 这个表明只保留2个版本。 selector: matchLabels: app: nginx strategy: rollingUpdate: maxSurge: 25% // 滚动更新的时候,不是一次就更新完了,而是一批一批的更新 maxUnavailable: 25% // 升级过程中最多有多少个 pod 处于无法提供服务的状态 type: RollingUpdate template: metadata: labels: app: nginx spec: containers: - image: nginx imagePullPolicy: Always name: nginx ports: - containerPort: 8080 name: test1 protocol: TCP resources: {} terminationMessagePath: /dev/termination-log terminationMessagePolicy: File dnsPolicy: ClusterFirst restartPolicy: Always schedulerName: default-scheduler securityContext: {} terminationGracePeriodSeconds: 30 status: availableReplicas: 2 conditions: - lastTransitionTime: "2020-11-28T08:35:07Z" lastUpdateTime: "2020-12-01T02:36:27Z" message: ReplicaSet "nginx-deployment-59bc6679cd" has successfully progressed. reason: NewReplicaSetAvailable status: "True" type: Progressing - lastTransitionTime: "2020-12-01T02:44:17Z" lastUpdateTime: "2020-12-01T02:44:17Z" message: Deployment has minimum availability. reason: MinimumReplicasAvailable status: "True" type: Available observedGeneration: 4 //这里也有一个 readyReplicas: 2 replicas: 2 updatedReplicas: 2 ``` #### 1.1. 
metadata.generation & status.observedGeneration

这两个是对应的。metadata.generation 表示这个 Deployment 的 spec 被修改了多少次,这里就有个版本迭代的概念:每次我们使用 kubectl edit 修改 Deployment 的配置,或者更新镜像,generation 都会加 1,表示增加了一个版本;只要 spec 有改动就会迭代。status.observedGeneration 则是 controller 最近一次处理完之后观察到的 generation。这两个值通常相等,只有在 spec 刚被修改、controller 还没有处理完(例如镜像升级过程中)时才会不同。当我们使用 `kubectl rollout status` 来探测一个 deployment 的状态时,第一步就是检查 observedGeneration 是否大于等于 generation。

```
root@k8s-master:~# kubectl rollout status deployment kube-hpa -n kube-system
deployment "kube-hpa" successfully rolled out
```
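顺带给一个用 client-go 做同样检查的小例子(示意代码:kubeconfig 路径、namespace、deployment 名称都是假设值;接口按本文对应的 v1.17 时代 client-go 写法,不带 context,新版本 client-go 需要额外传入 context):

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// 假设使用本地默认 kubeconfig
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(config)

	d, err := cs.AppsV1().Deployments("default").Get("nginx-deployment", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// kubectl rollout status 的第一步判断:controller 是否已经观察到最新一代 spec
	if d.Status.ObservedGeneration < d.Generation {
		fmt.Println("Waiting for deployment spec update to be observed...")
		return
	}
	fmt.Println("spec has been observed; continue checking updated/available replicas")
}
```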
#### 1.2. metadata.resourceVersion

每个资源在底层存储里都有版本的概念,我们可以使用 watch 来观察某个资源在某个版本之后发生的操作,这些操作记录保存在 etcd 中。当然,并不是所有的操作都会永久保存,只会保留有限时间内的操作。resourceVersion 就是这个资源对象当前的版本号。

#### 1.3 status

- replicas:实际存在的 pod 副本数
- availableReplicas:当前可用的 pod 副本数(有的副本可能还处于未就绪或初始化状态,不计入其中)
- readyReplicas:处于 ready 状态的 pod 副本数
- fullyLabeledReplicas:label 与 ReplicaSet 的 selector 完全匹配的副本数,是另一个维度的统计
### 2. startDeploymentController kcm启动时,NewControllerInitializers里面定义了所有要启动的manager,如下: ``` // NewControllerInitializers is a public map of named controller groups (you can start more than one in an init func) // paired to their InitFunc. This allows for structured downstream composition and subdivision. func NewControllerInitializers(loopMode ControllerLoopMode) map[string]InitFunc { controllers := map[string]InitFunc{} controllers["endpoint"] = startEndpointController controllers["endpointslice"] = startEndpointSliceController controllers["replicationcontroller"] = startReplicationController controllers["podgc"] = startPodGCController controllers["resourcequota"] = startResourceQuotaController controllers["namespace"] = startNamespaceController controllers["serviceaccount"] = startServiceAccountController controllers["garbagecollector"] = startGarbageCollectorController controllers["daemonset"] = startDaemonSetController controllers["job"] = startJobController controllers["deployment"] = startDeploymentController //启动 deploymentController controllers["replicaset"] = startReplicaSetController controllers["horizontalpodautoscaling"] = startHPAController controllers["disruption"] = startDisruptionController controllers["statefulset"] = startStatefulSetController controllers["cronjob"] = startCronJobController controllers["csrsigning"] = startCSRSigningController controllers["csrapproving"] = startCSRApprovingController controllers["csrcleaner"] = startCSRCleanerController controllers["ttl"] = startTTLController controllers["bootstrapsigner"] = startBootstrapSignerController controllers["tokencleaner"] = startTokenCleanerController controllers["nodeipam"] = startNodeIpamController controllers["nodelifecycle"] = startNodeLifecycleController if loopMode == IncludeCloudLoops { controllers["service"] = startServiceController controllers["route"] = startRouteController controllers["cloud-node-lifecycle"] = startCloudNodeLifecycleController // TODO: volume controller into the IncludeCloudLoops only set. } controllers["persistentvolume-binder"] = startPersistentVolumeBinderController controllers["attachdetach"] = startAttachDetachController controllers["persistentvolume-expander"] = startVolumeExpandController controllers["clusterrole-aggregation"] = startClusterRoleAggregrationController controllers["pvc-protection"] = startPVCProtectionController controllers["pv-protection"] = startPVProtectionController controllers["ttl-after-finished"] = startTTLAfterFinishedController controllers["root-ca-cert-publisher"] = startRootCACertPublisher return controllers } ```
deployment 的本质是控制 replicaSet,replicaSet 会控制 pod,然后由 controller 驱动各个对象达到期望状态。所以deployController需要监听pod, rs, deploy三种资源的变化。 ```go cmd/kube-controller-manager/app/apps.go func startDeploymentController(ctx ControllerContext) (http.Handler, bool, error) { // 判断当前是否支持deployment这种资源 if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}] { return nil, false, nil } dc, err := deployment.NewDeploymentController( ctx.InformerFactory.Apps().V1().Deployments(), ctx.InformerFactory.Apps().V1().ReplicaSets(), ctx.InformerFactory.Core().V1().Pods(), ctx.ClientBuilder.ClientOrDie("deployment-controller"), ) if err != nil { return nil, true, fmt.Errorf("error creating Deployment controller: %v", err) } go dc.Run(int(ctx.ComponentConfig.DeploymentController.ConcurrentDeploymentSyncs), ctx.Stop) return nil, true, nil } ``` 和其他控制器一样。new deploycontroller之后,就是run。run调用如下: run->work->processNextItem->syncHandler 定义时(new deploy),dc.syncHandler = dc.syncDeployment
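这条 Run -> worker -> processNextWorkItem -> syncHandler 的调用链是各个 controller 的通用套路,简化后的骨架如下(按 v1.17 源码简化的示意,省略了 Run 中等待缓存同步和错误重试的细节):

```go
// 简化示意:worker 不断从限速队列取 key,交给 syncHandler(即 syncDeployment)处理,
// 处理失败则由 handleErr 决定是否重新入队。
func (dc *DeploymentController) worker() {
	for dc.processNextWorkItem() {
	}
}

func (dc *DeploymentController) processNextWorkItem() bool {
	key, quit := dc.queue.Get()
	if quit {
		return false
	}
	defer dc.queue.Done(key)

	err := dc.syncHandler(key.(string)) // 即 dc.syncDeployment
	dc.handleErr(err, key)

	return true
}
```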
### 3. NewDeploymentController ```go // NewDeploymentController creates a new DeploymentController. func NewDeploymentController(dInformer appsinformers.DeploymentInformer, rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, client clientset.Interface) (*DeploymentController, error) { // 记录event eventBroadcaster := record.NewBroadcaster() eventBroadcaster.StartLogging(glog.Infof) eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: client.CoreV1().Events("")}) if client != nil && client.CoreV1().RESTClient().GetRateLimiter() != nil { if err := metrics.RegisterMetricAndTrackRateLimiterUsage("deployment_controller", client.CoreV1().RESTClient().GetRateLimiter()); err != nil { return nil, err } } dc := &DeploymentController{ client: client, eventRecorder: eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "deployment-controller"}), queue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "deployment"), } dc.rsControl = controller.RealRSControl{ KubeClient: client, Recorder: dc.eventRecorder, } dInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ AddFunc: dc.addDeployment, UpdateFunc: dc.updateDeployment, // This will enter the sync loop and no-op, because the deployment has been deleted from the store. DeleteFunc: dc.deleteDeployment, }) rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ AddFunc: dc.addReplicaSet, UpdateFunc: dc.updateReplicaSet, DeleteFunc: dc.deleteReplicaSet, }) // pod只关注删除? podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ DeleteFunc: dc.deletePod, }) dc.syncHandler = dc.syncDeployment dc.enqueueDeployment = dc.enqueue dc.dLister = dInformer.Lister() dc.rsLister = rsInformer.Lister() dc.podLister = podInformer.Lister() dc.dListerSynced = dInformer.Informer().HasSynced dc.rsListerSynced = rsInformer.Informer().HasSynced dc.podListerSynced = podInformer.Informer().HasSynced return dc, nil } ``` 从这里看出来,这里关注: deploy的增删改,rs的增删改,pod的删除。 接下来就是 run->works->processNextWorkItem()->syncDeployment()
### 4. 对deploy, rs, pod的处理 在之前的分析中,有addDeployment,deleteDeployment,addReplicaSet等函数。这里看一下这些函数做了什么事情。 #### 4.1 add,update, del deploy deploy相关的变化都是入队列 ```go func (dc *DeploymentController) addDeployment(obj interface{}) { d := obj.(*apps.Deployment) glog.V(4).Infof("Adding deployment %s", d.Name) dc.enqueueDeployment(d) } func (dc *DeploymentController) updateDeployment(old, cur interface{}) { oldD := old.(*apps.Deployment) curD := cur.(*apps.Deployment) glog.V(4).Infof("Updating deployment %s", oldD.Name) dc.enqueueDeployment(curD) } func (dc *DeploymentController) deleteDeployment(obj interface{}) { d, ok := obj.(*apps.Deployment) if !ok { tombstone, ok := obj.(cache.DeletedFinalStateUnknown) if !ok { utilruntime.HandleError(fmt.Errorf("Couldn't get object from tombstone %#v", obj)) return } d, ok = tombstone.Obj.(*apps.Deployment) if !ok { utilruntime.HandleError(fmt.Errorf("Tombstone contained object that is not a Deployment %#v", obj)) return } } glog.V(4).Infof("Deleting deployment %s", d.Name) dc.enqueueDeployment(d) } ```
#### 4.2 add,update,del ReplicaSet ```go // addReplicaSet enqueues the deployment that manages a ReplicaSet when the ReplicaSet is created. func (dc *DeploymentController) addReplicaSet(obj interface{}) { rs := obj.(*apps.ReplicaSet) // 1.如果是删除,删除后返回 if rs.DeletionTimestamp != nil { // On a restart of the controller manager, it's possible for an object to // show up in a state that is already pending deletion. dc.deleteReplicaSet(rs) return } // 2.判断owneref是否是deploy,是的话,讲对应的rs加入队列。 // If it has a ControllerRef, that's all that matters. if controllerRef := metav1.GetControllerOf(rs); controllerRef != nil { d := dc.resolveControllerRef(rs.Namespace, controllerRef) if d == nil { return } klog.V(4).Infof("ReplicaSet %s added.", rs.Name) dc.enqueueDeployment(d) return } // 3. 否则,就是孤儿rs,通过label判断 rs是否属于某个deploy // Otherwise, it's an orphan. Get a list of all matching Deployments and sync // them to see if anyone wants to adopt it. ds := dc.getDeploymentsForReplicaSet(rs) if len(ds) == 0 { return } klog.V(4).Infof("Orphan ReplicaSet %s added.", rs.Name) for _, d := range ds { dc.enqueueDeployment(d) } } ``` ```go // updateReplicaSet figures out what deployment(s) manage a ReplicaSet when the ReplicaSet // is updated and wake them up. If the anything of the ReplicaSets have changed, we need to // awaken both the old and new deployments. old and cur must be *apps.ReplicaSet // types. func (dc *DeploymentController) updateReplicaSet(old, cur interface{}) { curRS := cur.(*apps.ReplicaSet) oldRS := old.(*apps.ReplicaSet) // 1. 同样的,ResourceVersion可以判断资源有没有发生改变 if curRS.ResourceVersion == oldRS.ResourceVersion { // Periodic resync will send update events for all known replica sets. // Two different versions of the same replica set will always have different RVs. return } curControllerRef := metav1.GetControllerOf(curRS) oldControllerRef := metav1.GetControllerOf(oldRS) controllerRefChanged := !reflect.DeepEqual(curControllerRef, oldControllerRef) // 2.先将旧对象删除。旧对象是deploy if controllerRefChanged && oldControllerRef != nil { // The ControllerRef was changed. Sync the old controller, if any. if d := dc.resolveControllerRef(oldRS.Namespace, oldControllerRef); d != nil { dc.enqueueDeployment(d) } } // 3. 处理新对象,如果新对象还是受deploy管,加入队列 // If it has a ControllerRef, that's all that matters. if curControllerRef != nil { d := dc.resolveControllerRef(curRS.Namespace, curControllerRef) if d == nil { return } klog.V(4).Infof("ReplicaSet %s updated.", curRS.Name) dc.enqueueDeployment(d) return } // 4. 孤儿rs。因为是更新,所以如果label都没有改,肯定就不用动。 // Otherwise, it's an orphan. If anything changed, sync matching controllers // to see if anyone wants to adopt it now. labelChanged := !reflect.DeepEqual(curRS.Labels, oldRS.Labels) if labelChanged || controllerRefChanged { ds := dc.getDeploymentsForReplicaSet(curRS) if len(ds) == 0 { return } klog.V(4).Infof("Orphan ReplicaSet %s updated.", curRS.Name) for _, d := range ds { dc.enqueueDeployment(d) } } } ``` ```go // deleteReplicaSet enqueues the deployment that manages a ReplicaSet when // the ReplicaSet is deleted. obj could be an *apps.ReplicaSet, or // a DeletionFinalStateUnknown marker item. func (dc *DeploymentController) deleteReplicaSet(obj interface{}) { rs, ok := obj.(*apps.ReplicaSet) // When a delete is dropped, the relist will notice a pod in the store not // in the list, leading to the insertion of a tombstone object which contains // the deleted key/value. Note that this value might be stale. 
If the ReplicaSet // changed labels the new deployment will not be woken up till the periodic resync. if !ok { tombstone, ok := obj.(cache.DeletedFinalStateUnknown) if !ok { utilruntime.HandleError(fmt.Errorf("Couldn't get object from tombstone %#v", obj)) return } rs, ok = tombstone.Obj.(*apps.ReplicaSet) if !ok { utilruntime.HandleError(fmt.Errorf("Tombstone contained object that is not a ReplicaSet %#v", obj)) return } } controllerRef := metav1.GetControllerOf(rs) if controllerRef == nil { // No controller should care about orphans being deleted. return } d := dc.resolveControllerRef(rs.Namespace, controllerRef) if d == nil { return } klog.V(4).Infof("ReplicaSet %s deleted.", rs.Name) // 加入队列 dc.enqueueDeployment(d) } ``` #### 4.3 del pod ``` // deletePod will enqueue a Recreate Deployment once all of its pods have stopped running. func (dc *DeploymentController) deletePod(obj interface{}) { pod, ok := obj.(*v1.Pod) // When a delete is dropped, the relist will notice a pod in the store not // in the list, leading to the insertion of a tombstone object which contains // the deleted key/value. Note that this value might be stale. If the Pod // changed labels the new deployment will not be woken up till the periodic resync. if !ok { tombstone, ok := obj.(cache.DeletedFinalStateUnknown) if !ok { utilruntime.HandleError(fmt.Errorf("Couldn't get object from tombstone %#v", obj)) return } pod, ok = tombstone.Obj.(*v1.Pod) if !ok { utilruntime.HandleError(fmt.Errorf("Tombstone contained object that is not a pod %#v", obj)) return } } glog.V(4).Infof("Pod %s deleted.", pod.Name) // 只有当pod全删除,才更新 deploy。这个判断说明是 recreate 了 if d := dc.getDeploymentForPod(pod); d != nil && d.Spec.Strategy.Type == apps.RecreateDeploymentStrategyType { // Sync if this Deployment now has no more Pods. rsList, err := util.ListReplicaSets(d, util.RsListFromClient(dc.client.AppsV1())) if err != nil { return } podMap, err := dc.getPodMapForDeployment(d, rsList) if err != nil { return } numPods := 0 for _, podList := range podMap { numPods += len(podList.Items) } if numPods == 0 { dc.enqueueDeployment(d) } } } ``` deployment升级方案: Recreate:删除所有已存在的pod,重新创建新的; RollingUpdate:滚动升级,逐步替换的策略,同时滚动升级时,支持更多的附加参数,例如设置最大不可用pod数量,最小升级间隔时间等等。
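这两种升级策略在 Deployment API 里对应 spec.strategy 字段,用 Go 结构体示意如下(字段取值仅为示例):

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// RollingUpdate:通过 maxSurge / maxUnavailable 控制一批一批地替换
	maxSurge := intstr.FromString("25%")
	maxUnavailable := intstr.FromString("25%")
	rolling := appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxSurge:       &maxSurge,
			MaxUnavailable: &maxUnavailable,
		},
	}

	// Recreate:先删光旧 pod,再创建新 pod,没有额外参数
	recreate := appsv1.DeploymentStrategy{
		Type: appsv1.RecreateDeploymentStrategyType,
	}

	fmt.Println(rolling.Type, recreate.Type)
}
```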
#### 4.4 getDeploymentForPod

根据 pod 获得 rs,然后根据 rs 获得 deployment。

```
// getDeploymentForPod returns the deployment managing the given Pod.
func (dc *DeploymentController) getDeploymentForPod(pod *v1.Pod) *apps.Deployment {
	// Find the owning replica set
	var rs *apps.ReplicaSet
	var err error
	controllerRef := metav1.GetControllerOf(pod)
	if controllerRef == nil {
		// No controller owns this Pod.
		return nil
	}
	if controllerRef.Kind != apps.SchemeGroupVersion.WithKind("ReplicaSet").Kind {
		// Not a pod owned by a replica set.
		return nil
	}
	rs, err = dc.rsLister.ReplicaSets(pod.Namespace).Get(controllerRef.Name)
	if err != nil || rs.UID != controllerRef.UID {
		klog.V(4).Infof("Cannot get replicaset %q for pod %q: %v", controllerRef.Name, pod.Name, err)
		return nil
	}

	// Now find the Deployment that owns that ReplicaSet.
	controllerRef = metav1.GetControllerOf(rs)
	if controllerRef == nil {
		return nil
	}
	return dc.resolveControllerRef(rs.Namespace, controllerRef)
}
```

#### 4.5 总结

从这里也可以看出来,deployment、rs 的 add、del、update 都可能会导致 deployment 入队列,然后进入 syncDeployment。

pod 这里只关注删除,原因在于 Recreate 更新时,deploy 要等旧 pod 全部删除完,才能开始创建新的 pod。
### 5. syncDeployment
``` // syncDeployment will sync the deployment with the given key. // This function is not meant to be invoked concurrently with the same key. func (dc *DeploymentController) syncDeployment(key string) error { startTime := time.Now() klog.V(4).Infof("Started syncing deployment %q (%v)", key, startTime) defer func() { klog.V(4).Infof("Finished syncing deployment %q (%v)", key, time.Since(startTime)) }() namespace, name, err := cache.SplitMetaNamespaceKey(key) if err != nil { return err } deployment, err := dc.dLister.Deployments(namespace).Get(name) if errors.IsNotFound(err) { klog.V(2).Infof("Deployment %v has been deleted", key) return nil } if err != nil { return err } // Deep-copy otherwise we are mutating our cache. // TODO: Deep-copy only when needed. d := deployment.DeepCopy() // 1. 如果一个deploy的label是everything,会直接返回。(虽然会有判断是否要更新状态的说法) everything := metav1.LabelSelector{} if reflect.DeepEqual(d.Spec.Selector, &everything) { dc.eventRecorder.Eventf(d, v1.EventTypeWarning, "SelectingAll", "This deployment is selecting all pods. A non-empty selector is required.") if d.Status.ObservedGeneration < d.Generation { d.Status.ObservedGeneration = d.Generation dc.client.AppsV1().Deployments(d.Namespace).UpdateStatus(d) } return nil } // 2. 根据deploy获得rslist。以及根据rslist获得所有的pod(pod是一个map) // List ReplicaSets owned by this Deployment, while reconciling ControllerRef // through adoption/orphaning. rsList, err := dc.getReplicaSetsForDeployment(d) if err != nil { return err } // List all Pods owned by this Deployment, grouped by their ReplicaSet. // Current uses of the podMap are: // // * check if a Pod is labeled correctly with the pod-template-hash label. // * check that no old Pods are running in the middle of Recreate Deployments. podMap, err := dc.getPodMapForDeployment(d, rsList) if err != nil { return err } // 3.如果是删除,则直接调用syncStatusOnly if d.DeletionTimestamp != nil { return dc.syncStatusOnly(d, rsList) } // 4.检查是否处于 pause 状态 // Update deployment conditions with an Unknown condition when pausing/resuming // a deployment. In this way, we can be sure that we won't timeout when a user // resumes a Deployment with a set progressDeadlineSeconds. if err = dc.checkPausedConditions(d); err != nil { return err } // 如果是 pause 状态,同步状态 if d.Spec.Paused { return dc.sync(d, rsList) } // rollback is not re-entrant in case the underlying replica sets are updated with a new // revision so we should ensure that we won't proceed to update replica sets until we // make sure that the deployment has cleaned up its rollback spec in subsequent enqueues. // 5.如果annotations中有 deprecated.deployment.rollback.to 这个字段,则进行回滚 if getRollbackTo(d) != nil { return dc.rollback(d, rsList) } // 6.检查 deployment 是否处于 scale 状态 scalingEvent, err := dc.isScalingEvent(d, rsList) if err != nil { return err } if scalingEvent { return dc.sync(d, rsList) } // 7.更新deployment状态 switch d.Spec.Strategy.Type { case apps.RecreateDeploymentStrategyType: return dc.rolloutRecreate(d, rsList, podMap) case apps.RollingUpdateDeploymentStrategyType: return dc.rolloutRolling(d, rsList) } return fmt.Errorf("unexpected deployment strategy type: %s", d.Spec.Strategy.Type) } ``` syncDeployment的大流程如下: (1)如果一个deploy的label是everything,会直接返回。 (2)根据deploy获得rslist。以及根据deploy的label获得所有的pod,然后以rs为key,返回一个podMap(pod是一个map) (3)如果是删除,则直接调用syncStatusOnly,并返回 (4)检查是否处于 pause 状态,如果是pause,同步状态,并返回 (5)如果需要rollback,进行rollback,然后返回 (5)检查 deployment 是否处于 scale 状态,如果是scale, 同步状态,并返回 (6)如果是滚动更新或者是recreate更新,更新deployment状态,并返回
接下来从第三步开始,具体做了什么。 #### 5.1 删除deploy 删除deploy调用了syncStatusOnly函数。 syncStatusOn函数中主要调用了 getAllReplicaSetsAndSyncRevision 和 syncDeploymentStatus 函数。 ##### 5.1.1 getAllReplicaSetsAndSyncRevision getAllReplicaSetsAndSyncRevision 就是找出来 newRS, oldRSs。 newRs 就是:**最近的**,满足 rs.spec.template = deploy.spec.temp 的rs。 **使用最近的原因在于rs.spec.template = deploy.spec.temp 的rs可能有多个 ** oldRss 就是所有的rs中去掉 newRs。 ``` // syncStatusOnly only updates Deployments Status and doesn't take any mutating actions. func (dc *DeploymentController) syncStatusOnly(d *apps.Deployment, rsList []*apps.ReplicaSet) error { newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false) if err != nil { return err } // 这里有点没看懂。 oldRSs + newRS = rsList。 为啥又要来一个allRSs allRSs := append(oldRSs, newRS) return dc.syncDeploymentStatus(allRSs, newRS, d) } // rsList should come from getReplicaSetsForDeployment(d). // // 1. Get all old RSes this deployment targets, and calculate the max revision number among them (maxOldV). // 2. Get new RS this deployment targets (whose pod template matches deployment's), and update new RS's revision number to (maxOldV + 1), // only if its revision number is smaller than (maxOldV + 1). If this step failed, we'll update it in the next deployment sync loop. // 3. Copy new RS's revision number to deployment (update deployment's revision). If this step failed, we'll update it in the next deployment sync loop. // // Note that currently the deployment controller is using caches to avoid querying the server for reads. // This may lead to stale reads of replica sets, thus incorrect deployment status. func (dc *DeploymentController) getAllReplicaSetsAndSyncRevision(d *apps.Deployment, rsList []*apps.ReplicaSet, createIfNotExisted bool) (*apps.ReplicaSet, []*apps.ReplicaSet, error) { _, allOldRSs := deploymentutil.FindOldReplicaSets(d, rsList) // Get new replica set with the updated revision number newRS, err := dc.getNewReplicaSet(d, rsList, allOldRSs, createIfNotExisted) if err != nil { return nil, nil, err } return newRS, allOldRSs, nil } // FindOldReplicaSets returns the old replica sets targeted by the given Deployment, with the given slice of RSes. // Note that the first set of old replica sets doesn't include the ones with no pods, and the second set of old replica sets include all old replica sets. func FindOldReplicaSets(deployment *apps.Deployment, rsList []*apps.ReplicaSet) ([]*apps.ReplicaSet, []*apps.ReplicaSet) { var requiredRSs []*apps.ReplicaSet var allRSs []*apps.ReplicaSet newRS := FindNewReplicaSet(deployment, rsList) for _, rs := range rsList { // Filter out new replica set if newRS != nil && rs.UID == newRS.UID { continue } allRSs = append(allRSs, rs) if *(rs.Spec.Replicas) != 0 { requiredRSs = append(requiredRSs, rs) } } return requiredRSs, allRSs } // FindNewReplicaSet returns the new RS this given deployment targets (the one with the same pod template). func FindNewReplicaSet(deployment *apps.Deployment, rsList []*apps.ReplicaSet) *apps.ReplicaSet { sort.Sort(controller.ReplicaSetsByCreationTimestamp(rsList)) for i := range rsList { if EqualIgnoreHash(&rsList[i].Spec.Template, &deployment.Spec.Template) { // In rare cases, such as after cluster upgrades, Deployment may end up with // having more than one new ReplicaSets that have the same template as its template, // see https://github.com/kubernetes/kubernetes/issues/40415 // We deterministically choose the oldest new ReplicaSet. return rsList[i] } } // new ReplicaSet does not exist. 
return nil } ``` ##### 5.1.2 syncDeploymentStatus calculateStatus 就是根据allRSs,newRS得到deploy最新的状态。然后再更新。 ``` // syncDeploymentStatus checks if the status is up-to-date and sync it if necessary func (dc *DeploymentController) syncDeploymentStatus(allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, d *apps.Deployment) error { newStatus := calculateStatus(allRSs, newRS, d) if reflect.DeepEqual(d.Status, newStatus) { return nil } newDeployment := d newDeployment.Status = newStatus _, err := dc.client.AppsV1().Deployments(newDeployment.Namespace).UpdateStatus(newDeployment) return err } // calculateStatus calculates the latest status for the provided deployment by looking into the provided replica sets. func calculateStatus(allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, deployment *apps.Deployment) apps.DeploymentStatus { availableReplicas := deploymentutil.GetAvailableReplicaCountForReplicaSets(allRSs) totalReplicas := deploymentutil.GetReplicaCountForReplicaSets(allRSs) unavailableReplicas := totalReplicas - availableReplicas // If unavailableReplicas is negative, then that means the Deployment has more available replicas running than // desired, e.g. whenever it scales down. In such a case we should simply default unavailableReplicas to zero. if unavailableReplicas < 0 { unavailableReplicas = 0 } status := apps.DeploymentStatus{ // TODO: Ensure that if we start retrying status updates, we won't pick up a new Generation value. ObservedGeneration: deployment.Generation, Replicas: deploymentutil.GetActualReplicaCountForReplicaSets(allRSs), UpdatedReplicas: deploymentutil.GetActualReplicaCountForReplicaSets([]*apps.ReplicaSet{newRS}), ReadyReplicas: deploymentutil.GetReadyReplicaCountForReplicaSets(allRSs), AvailableReplicas: availableReplicas, UnavailableReplicas: unavailableReplicas, CollisionCount: deployment.Status.CollisionCount, } // Copy conditions one by one so we won't mutate the original object. conditions := deployment.Status.Conditions for i := range conditions { status.Conditions = append(status.Conditions, conditions[i]) } if availableReplicas >= *(deployment.Spec.Replicas)-deploymentutil.MaxUnavailable(*deployment) { minAvailability := deploymentutil.NewDeploymentCondition(apps.DeploymentAvailable, v1.ConditionTrue, deploymentutil.MinimumReplicasAvailable, "Deployment has minimum availability.") deploymentutil.SetDeploymentCondition(&status, *minAvailability) } else { noMinAvailability := deploymentutil.NewDeploymentCondition(apps.DeploymentAvailable, v1.ConditionFalse, deploymentutil.MinimumReplicasUnavailable, "Deployment does not have minimum availability.") deploymentutil.SetDeploymentCondition(&status, *noMinAvailability) } return status } ```
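补充说明一下 5.1.1 中 FindNewReplicaSet 用到的 EqualIgnoreHash:它在比较 rs.Spec.Template 和 deploy.Spec.Template 时会先忽略 pod-template-hash 这个 label,大致实现如下(按 v1.17 的 deploymentutil 简化的示意,细节可能略有出入):

```go
// EqualIgnoreHash 比较两个 PodTemplateSpec 是否相同,但忽略 pod-template-hash label 的差异
func EqualIgnoreHash(template1, template2 *v1.PodTemplateSpec) bool {
	t1Copy := template1.DeepCopy()
	t2Copy := template2.DeepCopy()
	// rs 的 template 会被 controller 注入 pod-template-hash label,比较前先删掉
	delete(t1Copy.Labels, apps.DefaultDeploymentUniqueLabelKey) // 即 "pod-template-hash"
	delete(t2Copy.Labels, apps.DefaultDeploymentUniqueLabelKey)
	return apiequality.Semantic.DeepEqual(t1Copy, t2Copy)
}
```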
##### 5.1.3 总结

deploy controller 根据 DeletionTimestamp 判断该 deploy 是否正在被删除,但它并没有真正执行 deploy 的删除,而是仅仅更新了状态。deploy 及其关联资源的删除是由 gc(garbage collector)来做的,后文再详细分析。

这里注意一个问题:deploy 的 DeletionTimestamp 到底是谁加上去的?

答案是:apiserver。当 kubectl delete 的时候,kube-apiserver 最终会调用到 registry store 里面的 Delete 函数,这里的操作就是给 DeletionTimestamp 赋值。
#### 5.2 pause操作 目前比较少用到。暂时忽略。 #### 5.3 Rollback操作 (1)判断deploy的annotations中是否有"deprecated.deployment.rollback.to" 字段的key,如果有需要rollback (2)获取deprecated.deployment.rollback.to对应的value, 这个就是表示是需要rollback到哪个rs (3)将rs.sepc.template 赋值给 deployment.spec.template (4)更新deploy, 删除annotations中 deprecated.deployment.rollback.to字段 特殊情况:如果value=0,则更新到最近的版本。如果value不存在,则忽略。 ``` if getRollbackTo(d) != nil { return dc.rollback(d, rsList) } // getRollbackTo 就是判断deploy的annotations中是否有"deprecated.deployment.rollback.to" 字段的key // TODO: Remove this when extensions/v1beta1 and apps/v1beta1 Deployment are dropped. func getRollbackTo(d *apps.Deployment) *extensions.RollbackConfig { // Extract the annotation used for round-tripping the deprecated RollbackTo field. revision := d.Annotations[apps.DeprecatedRollbackTo] if revision == "" { return nil } revision64, err := strconv.ParseInt(revision, 10, 64) if err != nil { // If it's invalid, ignore it. return nil } return &extensions.RollbackConfig{ Revision: revision64, } } // 这里的核心思想就是找到对应版本的rs。然后将rs.sepc.template 赋值给 deployment.spec.template // 然后更新deploy, 删除annotations中 deprecated.deployment.rollback.to // rollback the deployment to the specified revision. In any case cleanup the rollback spec. func (dc *DeploymentController) rollback(d *apps.Deployment, rsList []*apps.ReplicaSet) error { newRS, allOldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, true) if err != nil { return err } allRSs := append(allOldRSs, newRS) rollbackTo := getRollbackTo(d) // If rollback revision is 0, rollback to the last revision if rollbackTo.Revision == 0 { if rollbackTo.Revision = deploymentutil.LastRevision(allRSs); rollbackTo.Revision == 0 { // If we still can't find the last revision, gives up rollback dc.emitRollbackWarningEvent(d, deploymentutil.RollbackRevisionNotFound, "Unable to find last revision.") // Gives up rollback return dc.updateDeploymentAndClearRollbackTo(d) } } for _, rs := range allRSs { v, err := deploymentutil.Revision(rs) if err != nil { klog.V(4).Infof("Unable to extract revision from deployment's replica set %q: %v", rs.Name, err) continue } if v == rollbackTo.Revision { klog.V(4).Infof("Found replica set %q with desired revision %d", rs.Name, v) // rollback by copying podTemplate.Spec from the replica set // revision number will be incremented during the next getAllReplicaSetsAndSyncRevision call // no-op if the spec matches current deployment's podTemplate.Spec performedRollback, err := dc.rollbackToTemplate(d, rs) if performedRollback && err == nil { dc.emitRollbackNormalEvent(d, fmt.Sprintf("Rolled back deployment %q to revision %d", d.Name, rollbackTo.Revision)) } return err } } dc.emitRollbackWarningEvent(d, deploymentutil.RollbackRevisionNotFound, "Unable to find the revision to rollback to.") // Gives up rollback return dc.updateDeploymentAndClearRollbackTo(d) } ```
#### 5.4 scale操作 (1)判断是否需要scale, 这里通过deploy.spec.Replicas是否等于 rs中的annotations中的desired来判断,不相等就要scale (2)调用scale进行扩缩 ``` scalingEvent, err := dc.isScalingEvent(d, rsList) if err != nil { return err } if scalingEvent { return dc.sync(d, rsList) } // 这里就是判断 deploy.spec.Replicas是否等于 rs中的annotations中的desired,不相等就要scale // isScalingEvent checks whether the provided deployment has been updated with a scaling event // by looking at the desired-replicas annotation in the active replica sets of the deployment. // // rsList should come from getReplicaSetsForDeployment(d). // podMap should come from getPodMapForDeployment(d, rsList). func (dc *DeploymentController) isScalingEvent(d *apps.Deployment, rsList []*apps.ReplicaSet) (bool, error) { newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false) if err != nil { return false, err } allRSs := append(oldRSs, newRS) for _, rs := range controller.FilterActiveReplicaSets(allRSs) { desired, ok := deploymentutil.GetDesiredReplicasAnnotation(rs) if !ok { continue } if desired != *(d.Spec.Replicas) { return true, nil } } return false, nil } //以一个rs为例,annotations中确实有desired-replicas apiVersion: apps/v1 kind: ReplicaSet metadata: annotations: deployment.kubernetes.io/desired-replicas: "1" deployment.kubernetes.io/max-replicas: "2" deployment.kubernetes.io/revision: "1" creationTimestamp: "2021-06-12T14:47:22Z" ```
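isScalingEvent 里读取 desired-replicas 注解的逻辑(deploymentutil.GetDesiredReplicasAnnotation)大致等价于下面的示意代码,注解 key 与上面 yaml 中看到的一致:

```go
package main

import (
	"fmt"
	"strconv"

	appsv1 "k8s.io/api/apps/v1"
)

// getDesiredReplicas 从 rs 注解中取出 deployment 记录的期望副本数(示意实现)
func getDesiredReplicas(rs *appsv1.ReplicaSet) (int32, bool) {
	v, ok := rs.Annotations["deployment.kubernetes.io/desired-replicas"]
	if !ok {
		return 0, false
	}
	n, err := strconv.ParseInt(v, 10, 32)
	if err != nil {
		return 0, false
	}
	return int32(n), true
}

func main() {
	rs := &appsv1.ReplicaSet{}
	rs.Annotations = map[string]string{"deployment.kubernetes.io/desired-replicas": "1"}
	if desired, ok := getDesiredReplicas(rs); ok {
		// deploy.spec.replicas 如果不等于 desired,就认为发生了 scale 事件
		fmt.Println("desired:", desired)
	}
}
```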
调用sync->scale 进行扩缩,主要逻辑如下: (1) 获得最新的一个activeRs,进行扩缩容 (2)如果newRS已经是期望状态,将所有的oldRS缩到0 (3)如果是滚动更新,根据MaxSurge等字段,一步一步的更新,oldRs和newRs。最终的状态是newrs是期望状态,oldrs都是0。 这里如果是recreate更新,则什么都不会做,等到旧pod删除完了之后,自然会进入(1),就直接扩缩容就行了。 ``` // sync is responsible for reconciling deployments on scaling events or when they // are paused. func (dc *DeploymentController) sync(d *apps.Deployment, rsList []*apps.ReplicaSet) error { newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false) if err != nil { return err } if err := dc.scale(d, newRS, oldRSs); err != nil { // If we get an error while trying to scale, the deployment will be requeued // so we can abort this resync return err } // Clean up the deployment when it's paused and no rollback is in flight. if d.Spec.Paused && getRollbackTo(d) == nil { if err := dc.cleanupDeployment(oldRSs, d); err != nil { return err } } allRSs := append(oldRSs, newRS) return dc.syncDeploymentStatus(allRSs, newRS, d) } // scale scales proportionally in order to mitigate risk. Otherwise, scaling up can increase the size // of the new replica set and scaling down can decrease the sizes of the old ones, both of which would // have the effect of hastening the rollout progress, which could produce a higher proportion of unavailable // replicas in the event of a problem with the rolled out template. Should run only on scaling events or // when a deployment is paused and not during the normal rollout process. func (dc *DeploymentController) scale(deployment *apps.Deployment, newRS *apps.ReplicaSet, oldRSs []*apps.ReplicaSet) error { // If there is only one active replica set then we should scale that up to the full count of the // deployment. If there is no active replica set, then we should scale up the newest replica set. // 1. 获得最新的一个activeRs,进行扩缩容 if activeOrLatest := deploymentutil.FindActiveOrLatest(newRS, oldRSs); activeOrLatest != nil { if *(activeOrLatest.Spec.Replicas) == *(deployment.Spec.Replicas) { return nil } _, _, err := dc.scaleReplicaSetAndRecordEvent(activeOrLatest, *(deployment.Spec.Replicas), deployment) return err } // 2. 如果newRS已经是期望状态,将所有的oldRS缩到0 // If the new replica set is saturated, old replica sets should be fully scaled down. // This case handles replica set adoption during a saturated new replica set. if deploymentutil.IsSaturated(deployment, newRS) { for _, old := range controller.FilterActiveReplicaSets(oldRSs) { if _, _, err := dc.scaleReplicaSetAndRecordEvent(old, 0, deployment); err != nil { return err } } return nil } // 3. 如果是滚动更新,根据MaxSurge等字段,一步一步的更新,oldRs和newRs。最终的状态是newrs是期望状态,oldrs都是0。 // There are old replica sets with pods and the new replica set is not saturated. // We need to proportionally scale all replica sets (new and old) in case of a // rolling deployment. if deploymentutil.IsRollingUpdate(deployment) { allRSs := controller.FilterActiveReplicaSets(append(oldRSs, newRS)) allRSsReplicas := deploymentutil.GetReplicaCountForReplicaSets(allRSs) allowedSize := int32(0) if *(deployment.Spec.Replicas) > 0 { allowedSize = *(deployment.Spec.Replicas) + deploymentutil.MaxSurge(*deployment) } // Number of additional replicas that can be either added or removed from the total // replicas count. These replicas should be distributed proportionally to the active // replica sets. deploymentReplicasToAdd := allowedSize - allRSsReplicas // The additional replicas should be distributed proportionally amongst the active // replica sets from the larger to the smaller in size replica set. 
Scaling direction // drives what happens in case we are trying to scale replica sets of the same size. // In such a case when scaling up, we should scale up newer replica sets first, and // when scaling down, we should scale down older replica sets first. var scalingOperation string switch { case deploymentReplicasToAdd > 0: sort.Sort(controller.ReplicaSetsBySizeNewer(allRSs)) scalingOperation = "up" case deploymentReplicasToAdd < 0: sort.Sort(controller.ReplicaSetsBySizeOlder(allRSs)) scalingOperation = "down" } // Iterate over all active replica sets and estimate proportions for each of them. // The absolute value of deploymentReplicasAdded should never exceed the absolute // value of deploymentReplicasToAdd. deploymentReplicasAdded := int32(0) nameToSize := make(map[string]int32) for i := range allRSs { rs := allRSs[i] // Estimate proportions if we have replicas to add, otherwise simply populate // nameToSize with the current sizes for each replica set. if deploymentReplicasToAdd != 0 { proportion := deploymentutil.GetProportion(rs, *deployment, deploymentReplicasToAdd, deploymentReplicasAdded) nameToSize[rs.Name] = *(rs.Spec.Replicas) + proportion deploymentReplicasAdded += proportion } else { nameToSize[rs.Name] = *(rs.Spec.Replicas) } } // Update all replica sets for i := range allRSs { rs := allRSs[i] // Add/remove any leftovers to the largest replica set. if i == 0 && deploymentReplicasToAdd != 0 { leftover := deploymentReplicasToAdd - deploymentReplicasAdded nameToSize[rs.Name] = nameToSize[rs.Name] + leftover if nameToSize[rs.Name] < 0 { nameToSize[rs.Name] = 0 } } // TODO: Use transactions when we have them. if _, _, err := dc.scaleReplicaSet(rs, nameToSize[rs.Name], deployment, scalingOperation); err != nil { // Return as soon as we fail, the deployment is requeued return err } } } return nil } ```
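按比例扩缩这一段比较绕,下面用一个假设的数值例子演示大致思路(示意计算;真实实现里 GetProportion 还会参考 rs 注解中的 max-replicas 并处理四舍五入的余数,具体数值会略有出入):

```go
package main

import "fmt"

func main() {
	// 假设:滚动更新进行到一半,oldRS=6、newRS=4,此时用户把 deployment 从 10 扩到 20,maxSurge=25%。
	oldRS, newRS := 6, 4
	deployReplicas, maxSurge := 20, 5 // 5 = ceil(20 * 25%)

	allowedSize := deployReplicas + maxSurge  // 25
	toAdd := allowedSize - (oldRS + newRS)    // 15,需要在各 rs 之间分摊的副本数
	oldAdd := toAdd * oldRS / (oldRS + newRS) // 9,按当前占比分给 oldRS
	newAdd := toAdd - oldAdd                  // 6,余数补给其中一个 rs

	// 扩容后 oldRS≈15、newRS≈10,新旧比例仍约为 6:4,
	// 避免 scale 事件把副本全部压到某一个模板上,后续滚动更新再逐步把 oldRS 缩到 0
	fmt.Println(oldRS+oldAdd, newRS+newAdd) // 15 10
}
```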
##### 5.4.1 获得最新的一个activeRs 从这里可以看出来:activeRs 就是 rs.Spec.Replica>0 的rs 这里的逻辑就是: * 如果没有一个rs是active的,那就当newRS是当前要扩缩容的。newRs 就是:**最近的**,满足 rs.spec.template = deploy.spec.temp 的rs。 * 如果有一个active的rs。那么当其作为要扩缩容的。 * 如果找到多个active的rs, 那么表示这个可能是滚动更新等复杂情况,走后面的逻辑。 * 扩缩容直接调用了scaleReplicaSetAndRecordEvent函数,这个最后分析。 ``` if activeOrLatest := deploymentutil.FindActiveOrLatest(newRS, oldRSs); activeOrLatest != nil { if *(activeOrLatest.Spec.Replicas) == *(deployment.Spec.Replicas) { return nil } _, _, err := dc.scaleReplicaSetAndRecordEvent(activeOrLatest, *(deployment.Spec.Replicas), deployment) return err } // FindActiveOrLatest returns the only active or the latest replica set in case there is at most one active // replica set. If there are more active replica sets, then we should proportionally scale them. func FindActiveOrLatest(newRS *apps.ReplicaSet, oldRSs []*apps.ReplicaSet) *apps.ReplicaSet { if newRS == nil && len(oldRSs) == 0 { return nil } sort.Sort(sort.Reverse(controller.ReplicaSetsByCreationTimestamp(oldRSs))) allRSs := controller.FilterActiveReplicaSets(append(oldRSs, newRS)) switch len(allRSs) { case 0: // If there is no active replica set then we should return the newest. if newRS != nil { return newRS } return oldRSs[0] case 1: return allRSs[0] default: return nil } } // FilterActiveReplicaSets returns replica sets that have (or at least ought to have) pods. func FilterActiveReplicaSets(replicaSets []*apps.ReplicaSet) []*apps.ReplicaSet { activeFilter := func(rs *apps.ReplicaSet) bool { return rs != nil && *(rs.Spec.Replicas) > 0 } return FilterReplicaSets(replicaSets, activeFilter) } type filterRS func(rs *apps.ReplicaSet) bool // FilterReplicaSets returns replica sets that are filtered by filterFn (all returned ones should match filterFn). func FilterReplicaSets(RSes []*apps.ReplicaSet, filterFn filterRS) []*apps.ReplicaSet { var filtered []*apps.ReplicaSet for i := range RSes { if filterFn(RSes[i]) { filtered = append(filtered, RSes[i]) } } return filtered } ```
##### 5.4.2 如果newRS已经是期望状态,将所有的oldRS缩到0 从这里很直观就可以看出来 ``` // If the new replica set is saturated, old replica sets should be fully scaled down. // This case handles replica set adoption during a saturated new replica set. if deploymentutil.IsSaturated(deployment, newRS) { for _, old := range controller.FilterActiveReplicaSets(oldRSs) { if _, _, err := dc.scaleReplicaSetAndRecordEvent(old, 0, deployment); err != nil { return err } } return nil } // IsSaturated checks if the new replica set is saturated by comparing its size with its deployment size. // Both the deployment and the replica set have to believe this replica set can own all of the desired // replicas in the deployment and the annotation helps in achieving that. All pods of the ReplicaSet // need to be available. func IsSaturated(deployment *apps.Deployment, rs *apps.ReplicaSet) bool { if rs == nil { return false } desiredString := rs.Annotations[DesiredReplicasAnnotation] desired, err := strconv.Atoi(desiredString) if err != nil { return false } return *(rs.Spec.Replicas) == *(deployment.Spec.Replicas) && int32(desired) == *(deployment.Spec.Replicas) && rs.Status.AvailableReplicas == *(deployment.Spec.Replicas) } ```
#### 5.5 recreate更新 这种策略就非常简单。先将所有旧rs scaledown到0。然后再将newRs扩到期望值。这里需要注意的是,如果旧rs还有pod running,这这个时候是再次同步,也就是说新的rs是等所有旧pod全部删除完了之后,才会开始创建。 ``` // rolloutRecreate implements the logic for recreating a replica set. func (dc *DeploymentController) rolloutRecreate(d *apps.Deployment, rsList []*apps.ReplicaSet, podMap map[types.UID][]*v1.Pod) error { // Don't create a new RS if not already existed, so that we avoid scaling up before scaling down. newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false) if err != nil { return err } allRSs := append(oldRSs, newRS) activeOldRSs := controller.FilterActiveReplicaSets(oldRSs) // scale down old replica sets. scaledDown, err := dc.scaleDownOldReplicaSetsForRecreate(activeOldRSs, d) if err != nil { return err } if scaledDown { // Update DeploymentStatus. return dc.syncRolloutStatus(allRSs, newRS, d) } // 如果旧rs还有pod running,这个时候是再次同步。 // Do not process a deployment when it has old pods running. if oldPodsRunning(newRS, oldRSs, podMap) { return dc.syncRolloutStatus(allRSs, newRS, d) } // If we need to create a new RS, create it now. if newRS == nil { newRS, oldRSs, err = dc.getAllReplicaSetsAndSyncRevision(d, rsList, true) if err != nil { return err } allRSs = append(oldRSs, newRS) } // scale up new replica set. if _, err := dc.scaleUpNewReplicaSetForRecreate(newRS, d); err != nil { return err } if util.DeploymentComplete(d, &d.Status) { if err := dc.cleanupDeployment(oldRSs, d); err != nil { return err } } // Sync deployment status. return dc.syncRolloutStatus(allRSs, newRS, d) } // scaleDownOldReplicaSetsForRecreate scales down old replica sets when deployment strategy is "Recreate". func (dc *DeploymentController) scaleDownOldReplicaSetsForRecreate(oldRSs []*apps.ReplicaSet, deployment *apps.Deployment) (bool, error) { scaled := false for i := range oldRSs { rs := oldRSs[i] // Scaling not required. if *(rs.Spec.Replicas) == 0 { continue } scaledRS, updatedRS, err := dc.scaleReplicaSetAndRecordEvent(rs, 0, deployment) if err != nil { return false, err } if scaledRS { oldRSs[i] = updatedRS scaled = true } } return scaled, nil } ```
#### 5.6 rolloutRolling更新 (1)获得newRS, oldRSs (2)如果是scaledUp,返回 syncRolloutStatus (3)如果是scaledDown,返回syncRolloutStatus (4)如果到了这里,说明不是scaledUp也不是scaledDown,那说明可能是达到了期望值,通过DeploymentComplete判断一下 (5)同步状态 ``` // rolloutRolling implements the logic for rolling a new replica set. func (dc *DeploymentController) rolloutRolling(d *apps.Deployment, rsList []*apps.ReplicaSet) error { newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, true) if err != nil { return err } allRSs := append(oldRSs, newRS) // Scale up, if we can. scaledUp, err := dc.reconcileNewReplicaSet(allRSs, newRS, d) if err != nil { return err } if scaledUp { // Update DeploymentStatus return dc.syncRolloutStatus(allRSs, newRS, d) } // Scale down, if we can. scaledDown, err := dc.reconcileOldReplicaSets(allRSs, controller.FilterActiveReplicaSets(oldRSs), newRS, d) if err != nil { return err } if scaledDown { // Update DeploymentStatus return dc.syncRolloutStatus(allRSs, newRS, d) } if deploymentutil.DeploymentComplete(d, &d.Status) { if err := dc.cleanupDeployment(oldRSs, d); err != nil { return err } } // Sync deployment status return dc.syncRolloutStatus(allRSs, newRS, d) } ```
##### 5.6.1 Scale-up (for the newRS): return syncRolloutStatus

reconcileNewReplicaSet decides whether a scale-up is needed and, if so, computes how many replicas the newRS should have, taking the update strategy and factors such as MaxSurge into account. It then calls scaleReplicaSetAndRecordEvent to update the RS and record an event.

```
func (dc *DeploymentController) reconcileNewReplicaSet(allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, deployment *apps.Deployment) (bool, error) {
	if *(newRS.Spec.Replicas) == *(deployment.Spec.Replicas) {
		// Scaling not required.
		return false, nil
	}
	if *(newRS.Spec.Replicas) > *(deployment.Spec.Replicas) {
		// Scale down.
		scaled, _, err := dc.scaleReplicaSetAndRecordEvent(newRS, *(deployment.Spec.Replicas), deployment)
		return scaled, err
	}
	newReplicasCount, err := deploymentutil.NewRSNewReplicas(deployment, allRSs, newRS)
	if err != nil {
		return false, err
	}
	scaled, _, err := dc.scaleReplicaSetAndRecordEvent(newRS, newReplicasCount, deployment)
	return scaled, err
}

// NewRSNewReplicas calculates the number of replicas a deployment's new RS should have.
// When one of the followings is true, we're rolling out the deployment; otherwise, we're scaling it.
// 1) The new RS is saturated: newRS's replicas == deployment's replicas
// 2) Max number of pods allowed is reached: deployment's replicas + maxSurge == all RSs' replicas
func NewRSNewReplicas(deployment *apps.Deployment, allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet) (int32, error) {
	switch deployment.Spec.Strategy.Type {
	case apps.RollingUpdateDeploymentStrategyType:
		// Check if we can scale up.
		maxSurge, err := intstrutil.GetValueFromIntOrPercent(deployment.Spec.Strategy.RollingUpdate.MaxSurge, int(*(deployment.Spec.Replicas)), true)
		if err != nil {
			return 0, err
		}
		// Find the total number of pods
		currentPodCount := GetReplicaCountForReplicaSets(allRSs)
		maxTotalPods := *(deployment.Spec.Replicas) + int32(maxSurge)
		if currentPodCount >= maxTotalPods {
			// Cannot scale up.
			return *(newRS.Spec.Replicas), nil
		}
		// Scale up.
		scaleUpCount := maxTotalPods - currentPodCount
		// Do not exceed the number of desired replicas.
		scaleUpCount = int32(integer.IntMin(int(scaleUpCount), int(*(deployment.Spec.Replicas)-*(newRS.Spec.Replicas))))
		return *(newRS.Spec.Replicas) + scaleUpCount, nil
	case apps.RecreateDeploymentStrategyType:
		return *(deployment.Spec.Replicas), nil
	default:
		return 0, fmt.Errorf("deployment type %v isn't supported", deployment.Spec.Strategy.Type)
	}
}
```
The scale-down path follows much the same logic: it computes how many replicas the old RSs should lose while keeping enough pods available (maxUnavailable). A worked example of both calculations is sketched below.
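To make the maxSurge / maxUnavailable arithmetic concrete, here is a minimal standalone sketch. It is plain Go, not the controller's real helpers, and it assumes replicas=10 and 25% for both fields; only the rounding rules (surge rounds up, unavailable rounds down) mirror the real behaviour.

```
package main

import (
	"fmt"
	"math"
)

// resolvePercent mimics how a percentage-style maxSurge/maxUnavailable is
// resolved against the desired replica count.
func resolvePercent(percent float64, desired int32, roundUp bool) int32 {
	v := float64(desired) * percent / 100
	if roundUp {
		return int32(math.Ceil(v))
	}
	return int32(math.Floor(v))
}

func main() {
	var desired int32 = 10                                // deployment.Spec.Replicas
	maxSurge := resolvePercent(25, desired, true)         // 3
	maxUnavailable := resolvePercent(25, desired, false)  // 2

	// Scale-up side (reconcileNewReplicaSet / NewRSNewReplicas):
	var currentPodCount int32 = 10 // replicas across all RSs right now
	var newRSReplicas int32 = 0
	maxTotalPods := desired + maxSurge // 13
	scaleUpCount := maxTotalPods - currentPodCount
	if scaleUpCount > desired-newRSReplicas {
		scaleUpCount = desired - newRSReplicas
	}
	fmt.Printf("new RS can grow from %d to %d replicas\n", newRSReplicas, newRSReplicas+scaleUpCount)

	// Scale-down side (reconcileOldReplicaSets): the old RSs may only shrink
	// while at least desired-maxUnavailable pods stay available.
	minAvailable := desired - maxUnavailable
	fmt.Printf("at least %d pods must stay available while scaling down\n", minAvailable)
}
```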
#### 5.7 scaleReplicaSetAndRecordEvent

This function does exactly what its name says: it scales the RS through the REST client and records the change as an event.

```
func (dc *DeploymentController) scaleReplicaSetAndRecordEvent(rs *apps.ReplicaSet, newScale int32, deployment *apps.Deployment) (bool, *apps.ReplicaSet, error) {
	// No need to scale
	if *(rs.Spec.Replicas) == newScale {
		return false, rs, nil
	}
	var scalingOperation string
	if *(rs.Spec.Replicas) < newScale {
		scalingOperation = "up"
	} else {
		scalingOperation = "down"
	}
	scaled, newRS, err := dc.scaleReplicaSet(rs, newScale, deployment, scalingOperation)
	return scaled, newRS, err
}

func (dc *DeploymentController) scaleReplicaSet(rs *apps.ReplicaSet, newScale int32, deployment *apps.Deployment, scalingOperation string) (bool, *apps.ReplicaSet, error) {

	sizeNeedsUpdate := *(rs.Spec.Replicas) != newScale

	annotationsNeedUpdate := deploymentutil.ReplicasAnnotationsNeedUpdate(rs, *(deployment.Spec.Replicas), *(deployment.Spec.Replicas)+deploymentutil.MaxSurge(*deployment))

	scaled := false
	var err error
	if sizeNeedsUpdate || annotationsNeedUpdate {
		rsCopy := rs.DeepCopy()
		*(rsCopy.Spec.Replicas) = newScale
		deploymentutil.SetReplicasAnnotations(rsCopy, *(deployment.Spec.Replicas), *(deployment.Spec.Replicas)+deploymentutil.MaxSurge(*deployment))
		rs, err = dc.client.AppsV1().ReplicaSets(rsCopy.Namespace).Update(rsCopy)
		if err == nil && sizeNeedsUpdate {
			scaled = true
			dc.eventRecorder.Eventf(deployment, v1.EventTypeNormal, "ScalingReplicaSet", "Scaled %s replica set %s to %d", scalingOperation, rs.Name, newScale)
		}
	}
	return scaled, rs, err
}
```
================================================
FILE: k8s/kcm/3-k8s gc源码分析.md
================================================

Table of Contents
=================

* [1. K8s 的垃圾回收策略](#1-k8s-的垃圾回收策略)
* [2 gc 源码分析](#2-gc-源码分析)
  * [2.1 初始化 garbageCollector 对象](#21-初始化-garbagecollector-对象)
    * [2.1.1 garbageCollector包含的结构体对象](#211-garbagecollector包含的结构体对象)
    * [2.1.2 NewGarbageCollector](#212-newgarbagecollector)
  * [2.2 启动garbageCollector](#22-启动garbagecollector)
    * [2.2.1 启动dependencyGraphBuilder](#221-启动dependencygraphbuilder)
    * [2.2.2 runAttemptToDeleteWorker](#222-runattempttodeleteworker)
    * [2.2.3 runAttemptToOrphanWorker](#223-runattempttoorphanworker)
    * [2.2.4 总结](#224-总结)
  * [2.3 runProcessGraphChanges](#23--runprocessgraphchanges)
  * [2.4 processTransitions函数的处理逻辑](#24-processtransitions函数的处理逻辑)
  * [2.5 runAttemptToOrphanWorker](#25-runattempttoorphanworker)
  * [2.6 attemptToDeleteWorker](#26-attempttodeleteworker)
  * [2.7 uidToNode到底是什么](#27-uidtonode到底是什么)
* [3.总结](#3总结)

### 1. K8s garbage collection (deletion) policies

Kubernetes currently supports three deletion policies:

**(1) Foreground cascading deletion**: the owner object's deletion does not complete until all of its dependents have been deleted. Once the owner is deleted it enters a "deletion in progress" state, in which:

* the object is still visible through the REST API (it can still be queried via kubectl or kuboard)
* the object's deletionTimestamp field is set
* the object's metadata.finalizers contains foregroundDeletion

**(2) Background cascading deletion**: much simpler. The owner object is deleted immediately, and the garbage collector then deletes its dependents in the background. This is much faster than foreground deletion, because nothing waits on the dependents.

**(3) Orphan**: deleting the owner only removes the owner itself from the cluster and leaves all of its dependents behind as "orphans".

Example: given a Deployment deployA, its ReplicaSet rsA and its Pod podA:

(1) Foreground deletion: podA is deleted first, then rsA, then deployA. If podA's deletion gets stuck, rsA (and deployA) will be stuck too.
(2) Background deletion: deployA is deleted first, then rsA, then podA. Whether rsA and podA are eventually deleted successfully has no effect on deployA.
(3) Orphan deletion: only deployA is deleted; rsA and podA are untouched, except that deployA is no longer rsA's owner.

A small client-go sketch of the three propagation policies follows.
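The three policies correspond to the three `metav1.DeletionPropagation` values carried in `DeleteOptions`. A minimal sketch, assuming the k8s.io/apimachinery module is available; the clientset call in the comment uses the v1.17-style signature and is illustrative only.

```
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// kubectl and controllers all end up filling in DeleteOptions like this.
	for _, policy := range []metav1.DeletionPropagation{
		metav1.DeletePropagationOrphan,     // keep rsA/podA, only strip deployA from their ownerReferences
		metav1.DeletePropagationBackground, // delete deployA immediately, GC cleans up rsA/podA afterwards
		metav1.DeletePropagationForeground, // deployA waits (via finalizer) until rsA/podA are gone
	} {
		p := policy
		opts := &metav1.DeleteOptions{PropagationPolicy: &p}
		fmt.Printf("DeleteOptions for %s: %+v\n", policy, opts)
		// With a clientset this would be passed to a call such as
		// client.AppsV1().Deployments("default").Delete("deployA", opts)
	}
}
```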
### 2 GC source code analysis

Like the deployment and replicaset controllers, GarbageCollectorController is one of the controllers inside kube-controller-manager (kcm).

GarbageCollectorController is started by `startGarbageCollectorController`, whose main steps are:

**Every step from step 3 onward is expanded in detail below; step 3 corresponds to section 2.1.**

(1) Build the clients used to discover the resources in the cluster (not covered here).
(2) Compute deletableResources and ignoredResources.
    deletableResources: every resource that supports the "delete", "list" and "watch" verbs.
    ignoredResources: specified in the GarbageCollectorController config when kcm starts.
(3) Construct the garbageCollector object.
(4) Start the garbageCollector.
(5) Periodically sync the garbageCollector.
(6) Enable the debug handler.

```
func startGarbageCollectorController(ctx ControllerContext) (http.Handler, bool, error) {
	// 1. build the clients
	if !ctx.ComponentConfig.GarbageCollectorController.EnableGarbageCollector {
		return nil, false, nil
	}

	gcClientset := ctx.ClientBuilder.ClientOrDie("generic-garbage-collector")
	discoveryClient := cacheddiscovery.NewMemCacheClient(gcClientset.Discovery())

	config := ctx.ClientBuilder.ConfigOrDie("generic-garbage-collector")
	metadataClient, err := metadata.NewForConfig(config)
	if err != nil {
		return nil, true, err
	}

	// 2. compute deletableResources and ignoredResources
	// Get an initial set of deletable resources to prime the garbage collector.
	deletableResources := garbagecollector.GetDeletableResources(discoveryClient)
	ignoredResources := make(map[schema.GroupResource]struct{})
	for _, r := range ctx.ComponentConfig.GarbageCollectorController.GCIgnoredResources {
		ignoredResources[schema.GroupResource{Group: r.Group, Resource: r.Resource}] = struct{}{}
	}

	// 3. NewGarbageCollector
	garbageCollector, err := garbagecollector.NewGarbageCollector(
		metadataClient,
		ctx.RESTMapper,
		deletableResources,
		ignoredResources,
		ctx.ObjectOrMetadataInformerFactory,
		ctx.InformersStarted,
	)
	if err != nil {
		return nil, true, fmt.Errorf("failed to start the generic garbage collector: %v", err)
	}

	// 4. start the garbageCollector
	// Start the garbage collector.
	workers := int(ctx.ComponentConfig.GarbageCollectorController.ConcurrentGCSyncs)
	go garbageCollector.Run(workers, ctx.Stop)

	// Periodically refresh the RESTMapper with new discovery information and sync
	// the garbage collector.
	// 5. keep the garbage collector in sync with discovery
	go garbageCollector.Sync(gcClientset.Discovery(), 30*time.Second, ctx.Stop)

	// 6. enable the debug handler
	return garbagecollector.NewDebugHandler(garbageCollector), true, nil
}
```
#### 2.1 Constructing the garbageCollector object

##### 2.1.1 Structures inside the garbageCollector

The garbageCollector relies on a few extra structures:

attemptToDelete, attemptToOrphan: rate-limited work queues.

uidToNode: a graph that caches the dependency relationships we know about; it is a map keyed by UID whose values are node structures.

```
type GarbageCollector struct {
	restMapper     resettableRESTMapper
	metadataClient metadata.Interface

	attemptToDelete workqueue.RateLimitingInterface
	attemptToOrphan workqueue.RateLimitingInterface

	dependencyGraphBuilder *GraphBuilder

	absentOwnerCache *UIDCache

	workerLock sync.RWMutex
}

// GraphBuilder: based on the events supplied by the informers, GraphBuilder updates
// uidToNode, a graph that caches the dependencies as we know, and enqueues
// items to the attemptToDelete and attemptToOrphan.
type GraphBuilder struct {
	restMapper meta.RESTMapper

	// one monitor per resource type
	monitors    monitors
	monitorLock sync.RWMutex

	informersStarted <-chan struct{}

	stopCh <-chan struct{}

	running bool

	metadataClient metadata.Interface

	graphChanges workqueue.RateLimitingInterface

	uidToNode *concurrentUIDToNode

	attemptToDelete workqueue.RateLimitingInterface
	attemptToOrphan workqueue.RateLimitingInterface

	absentOwnerCache *UIDCache
	sharedInformers  controller.InformerFactory
	ignoredResources map[schema.GroupResource]struct{}
}

type concurrentUIDToNode struct {
	uidToNodeLock sync.RWMutex
	uidToNode     map[types.UID]*node
}

type node struct {
	identity objectReference

	dependentsLock sync.RWMutex
	dependents     map[*node]struct{} // all dependents of this node

	deletingDependents     bool
	deletingDependentsLock sync.RWMutex

	beingDeleted     bool
	beingDeletedLock sync.RWMutex

	virtual     bool
	virtualLock sync.RWMutex

	owners []metav1.OwnerReference // all owners of this node
}
```

For example, assume the cluster contains three objects: deployA, rsA and podA.

The monitors watch these three resource types and, depending on what they observe, enqueue items into attemptToDelete or attemptToOrphan.

The GraphBuilder maintains a graph; in this case its content is:

Node1 (key = deployA.uid): its owners are empty, dependents = {rsA}.

Node2 (key = rsA.uid): its owner is deployA, dependents = {podA}.

Node3 (key = podA.uid): its owner is rsA, dependents are empty.
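To make the deployA/rsA/podA example concrete, here is a toy, single-threaded sketch of such a graph. The real node additionally stores metav1.OwnerReference owners, per-field locks and flags such as virtual; everything below is simplified for illustration.

```
package main

import "fmt"

// A trimmed-down version of the GC's node, keeping only what the example needs.
type node struct {
	uid        string
	owners     []string // owner UIDs (from ownerReferences)
	dependents map[*node]struct{}
}

func main() {
	uidToNode := map[string]*node{}

	// add mimics insertNode: register the node and add it to every known owner's dependents.
	add := func(uid string, owners ...string) {
		n := &node{uid: uid, owners: owners, dependents: map[*node]struct{}{}}
		uidToNode[uid] = n
		for _, o := range owners {
			if owner, ok := uidToNode[o]; ok {
				owner.dependents[n] = struct{}{}
			}
		}
	}

	add("deployA.uid")
	add("rsA.uid", "deployA.uid")
	add("podA.uid", "rsA.uid")

	for uid, n := range uidToNode {
		fmt.Printf("%s: owners=%v dependents=%d\n", uid, n.owners, len(n.dependents))
	}
}
```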
In addition, every node carries fields such as beingDeleted and deletingDependents, so with this graph the GC can conveniently carry out deletion under any of the policies.

##### 2.1.2 NewGarbageCollector

NewGarbageCollector does just two things:

(1) initialize the GarbageCollector structure;

(2) call controllerFor (via syncMonitors) to define the handlers for object changes: any add, update or delete is packaged into an event and pushed onto the graphChanges queue.

```
func NewGarbageCollector(
	metadataClient metadata.Interface,
	mapper resettableRESTMapper,
	deletableResources map[schema.GroupVersionResource]struct{},
	ignoredResources map[schema.GroupResource]struct{},
	sharedInformers controller.InformerFactory,
	informersStarted <-chan struct{},
) (*GarbageCollector, error) {
	attemptToDelete := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "garbage_collector_attempt_to_delete")
	attemptToOrphan := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "garbage_collector_attempt_to_orphan")
	absentOwnerCache := NewUIDCache(500)
	gc := &GarbageCollector{
		metadataClient:   metadataClient,
		restMapper:       mapper,
		attemptToDelete:  attemptToDelete,
		attemptToOrphan:  attemptToOrphan,
		absentOwnerCache: absentOwnerCache,
	}
	gb := &GraphBuilder{
		metadataClient:   metadataClient,
		informersStarted: informersStarted,
		restMapper:       mapper,
		graphChanges:     workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "garbage_collector_graph_changes"),
		uidToNode: &concurrentUIDToNode{
			uidToNode: make(map[types.UID]*node),
		},
		attemptToDelete:  attemptToDelete,
		attemptToOrphan:  attemptToOrphan,
		absentOwnerCache: absentOwnerCache,
		sharedInformers:  sharedInformers,
		ignoredResources: ignoredResources,
	}

	if err := gb.syncMonitors(deletableResources); err != nil {
		utilruntime.HandleError(fmt.Errorf("failed to sync all monitors: %v", err))
	}
	gc.dependencyGraphBuilder = gb

	return gc, nil
}
```
syncMonitors就是同步更新哪些资源需要监听,然后调用controllerFor注册事件处理。 ``` func (gb *GraphBuilder) syncMonitors(resources map[schema.GroupVersionResource]struct{}) error { gb.monitorLock.Lock() defer gb.monitorLock.Unlock() toRemove := gb.monitors if toRemove == nil { toRemove = monitors{} } current := monitors{} errs := []error{} kept := 0 added := 0 for resource := range resources { if _, ok := gb.ignoredResources[resource.GroupResource()]; ok { continue } if m, ok := toRemove[resource]; ok { current[resource] = m delete(toRemove, resource) kept++ continue } kind, err := gb.restMapper.KindFor(resource) if err != nil { errs = append(errs, fmt.Errorf("couldn't look up resource %q: %v", resource, err)) continue } c, s, err := gb.controllerFor(resource, kind) if err != nil { errs = append(errs, fmt.Errorf("couldn't start monitor for resource %q: %v", resource, err)) continue } current[resource] = &monitor{store: s, controller: c} added++ } gb.monitors = current for _, monitor := range toRemove { if monitor.stopCh != nil { close(monitor.stopCh) } } klog.V(4).Infof("synced monitors; added %d, kept %d, removed %d", added, kept, len(toRemove)) // NewAggregate returns nil if errs is 0-length return utilerrors.NewAggregate(errs) } ``` controllerFor无论是监听到add, update, del都是将其打包成一个event事件,然后加入graphChanges队列。 ``` func (gb *GraphBuilder) controllerFor(resource schema.GroupVersionResource, kind schema.GroupVersionKind) (cache.Controller, cache.Store, error) { handlers := cache.ResourceEventHandlerFuncs{ // add the event to the dependencyGraphBuilder's graphChanges. AddFunc: func(obj interface{}) { event := &event{ eventType: addEvent, obj: obj, gvk: kind, } gb.graphChanges.Add(event) }, UpdateFunc: func(oldObj, newObj interface{}) { // TODO: check if there are differences in the ownerRefs, // finalizers, and DeletionTimestamp; if not, ignore the update. event := &event{ eventType: updateEvent, obj: newObj, oldObj: oldObj, gvk: kind, } gb.graphChanges.Add(event) }, DeleteFunc: func(obj interface{}) { // delta fifo may wrap the object in a cache.DeletedFinalStateUnknown, unwrap it if deletedFinalStateUnknown, ok := obj.(cache.DeletedFinalStateUnknown); ok { obj = deletedFinalStateUnknown.Obj } event := &event{ eventType: deleteEvent, obj: obj, gvk: kind, } gb.graphChanges.Add(event) }, } shared, err := gb.sharedInformers.ForResource(resource) if err != nil { klog.V(4).Infof("unable to use a shared informer for resource %q, kind %q: %v", resource.String(), kind.String(), err) return nil, nil, err } klog.V(4).Infof("using a shared informer for resource %q, kind %q", resource.String(), kind.String()) // need to clone because it's from a shared cache shared.Informer().AddEventHandlerWithResyncPeriod(handlers, ResourceResyncTime) return shared.Informer().GetController(), shared.Informer().GetStore(), nil } ```
#### 2.2 启动garbageCollector ``` func (gc *GarbageCollector) Run(workers int, stopCh <-chan struct{}) { defer utilruntime.HandleCrash() defer gc.attemptToDelete.ShutDown() defer gc.attemptToOrphan.ShutDown() defer gc.dependencyGraphBuilder.graphChanges.ShutDown() klog.Infof("Starting garbage collector controller") defer klog.Infof("Shutting down garbage collector controller") // 1.启动dependencyGraphBuilder go gc.dependencyGraphBuilder.Run(stopCh) if !cache.WaitForNamedCacheSync("garbage collector", stopCh, gc.dependencyGraphBuilder.IsSynced) { return } klog.Infof("Garbage collector: all resource monitors have synced. Proceeding to collect garbage") // 启动runAttemptToDeleteWorker,runAttemptToOrphanWorker // gc workers for i := 0; i < workers; i++ { go wait.Until(gc.runAttemptToDeleteWorker, 1*time.Second, stopCh) go wait.Until(gc.runAttemptToOrphanWorker, 1*time.Second, stopCh) } <-stopCh } ```
##### 2.2.1 启动dependencyGraphBuilder ``` // Run sets the stop channel and starts monitor execution until stopCh is // closed. Any running monitors will be stopped before Run returns. func (gb *GraphBuilder) Run(stopCh <-chan struct{}) { klog.Infof("GraphBuilder running") defer klog.Infof("GraphBuilder stopping") // Set up the stop channel. gb.monitorLock.Lock() gb.stopCh = stopCh gb.running = true gb.monitorLock.Unlock() // Start monitors and begin change processing until the stop channel is // closed. // 1. 启动各个资源的监听 gb.startMonitors() // 2. runProcessGraphChanges开始处理各种事件 wait.Until(gb.runProcessGraphChanges, 1*time.Second, stopCh) // 这里就是有monitor关闭后的处理 // Stop any running monitors. gb.monitorLock.Lock() defer gb.monitorLock.Unlock() monitors := gb.monitors stopped := 0 for _, monitor := range monitors { if monitor.stopCh != nil { stopped++ close(monitor.stopCh) } } // reset monitors so that the graph builder can be safely re-run/synced. gb.monitors = nil klog.Infof("stopped %d of %d monitors", stopped, len(monitors)) } // 启动各个资源的监听 func (gb *GraphBuilder) startMonitors() { gb.monitorLock.Lock() defer gb.monitorLock.Unlock() if !gb.running { return } // we're waiting until after the informer start that happens once all the controllers are initialized. This ensures // that they don't get unexpected events on their work queues. <-gb.informersStarted monitors := gb.monitors started := 0 for _, monitor := range monitors { if monitor.stopCh == nil { monitor.stopCh = make(chan struct{}) gb.sharedInformers.Start(gb.stopCh) go monitor.Run() started++ } } klog.V(4).Infof("started %d new monitors, %d currently running", started, len(monitors)) } ```
##### 2.2.2 runAttemptToDeleteWorker runAttemptToDeleteWorker就是从attemptToDelete队列中取出来一个对象处理。 ``` func (gc *GarbageCollector) runAttemptToDeleteWorker() { for gc.attemptToDeleteWorker() { } } func (gc *GarbageCollector) attemptToDeleteWorker() bool { item, quit := gc.attemptToDelete.Get() ... err := gc.attemptToDeleteItem(n) ... return true } ``` ##### 2.2.3 runAttemptToOrphanWorker runAttemptToOrphanWorker就是从attemptToOrphan队列中取出来一个对象处理。 ``` func (gc *GarbageCollector) runAttemptToOrphanWorker() { for gc.attemptToOrphanWorker() { } } func (gc *GarbageCollector) attemptToOrphanWorker() bool { item, quit := gc.attemptToOrphan.Get() defer gc.attemptToOrphan.Done(item) owner, ok := item.(*node) if !ok { utilruntime.HandleError(fmt.Errorf("expect *node, got %#v", item)) return true } // we don't need to lock each element, because they never get updated owner.dependentsLock.RLock() dependents := make([]*node, 0, len(owner.dependents)) for dependent := range owner.dependents { dependents = append(dependents, dependent) } owner.dependentsLock.RUnlock() err := gc.orphanDependents(owner.identity, dependents) if err != nil { utilruntime.HandleError(fmt.Errorf("orphanDependents for %s failed with %v", owner.identity, err)) gc.attemptToOrphan.AddRateLimited(item) return true } // update the owner, remove "orphaningFinalizer" from its finalizers list err = gc.removeFinalizer(owner, metav1.FinalizerOrphanDependents) if err != nil { utilruntime.HandleError(fmt.Errorf("removeOrphanFinalizer for %s failed with %v", owner.identity, err)) gc.attemptToOrphan.AddRateLimited(item) } return true } ```
##### 2.2.4 Summary

(1) NewGarbageCollector initializes the graph builder and the attemptToDelete and attemptToOrphan queues, and defines how resource changes are handled.

(2) GarbageCollector.Run does three things. `First`, it gives every monitored resource the same handling logic: every add, update or delete is wrapped into an event and pushed onto the graphChanges queue. `Second`, it starts runProcessGraphChanges to process the objects on the graphChanges queue. `Third`, it starts the attemptToDelete and attemptToOrphan workers to do the actual GC handling.

(3) So, overall, the logic is:

* NewGarbageCollector watches every resource that supports the list, watch and delete verbs
* every add, update or delete of these objects is thrown onto the graphChanges queue
* runProcessGraphChanges then consumes graphChanges; it does two things: first, maintain the graph, and second, throw objects that may need deleting onto attemptToOrphan or attemptToDelete
* the attemptToDelete and attemptToOrphan workers perform the concrete GC handling (a stripped-down sketch of this pipeline follows below).
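As a one-shot illustration of that informer → graphChanges → worker pipeline, here is a minimal sketch using the client-go workqueue package (v1.17-era API). The strings stand in for real events and graph nodes, and the assertions about what each stage would do are comments, not real GC code.

```
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	graphChanges := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "graph_changes")
	attemptToDelete := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "attempt_to_delete")

	// "informer handler": every add/update/delete becomes an event on graphChanges.
	graphChanges.Add("delete deployA")

	// "processGraphChanges": maintain the graph, then enqueue the affected dependents.
	item, _ := graphChanges.Get()
	fmt.Println("graph event:", item)
	attemptToDelete.Add("rsA") // deployA's dependent
	graphChanges.Done(item)

	// "attemptToDeleteWorker": decide how rsA should be deleted (orphan/foreground/background).
	dep, _ := attemptToDelete.Get()
	fmt.Println("attempting to delete:", dep)
	attemptToDelete.Done(dep)
}
```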
At this point the GC's initialization and overall flow should be clear. Next we look at runProcessGraphChanges in detail, and then at how the attemptToOrphan and attemptToDelete workers process their queues.
#### 2.3 runProcessGraphChanges runProcessGraphChanges作用就是俩件事: (1)时刻uidToNode维护图的正确和完整 (2)将可能需要删除的对象扔进AttemptToOrphan,AttemptToDelete队列 **具体逻辑如下:** (1)从 graphChanges 取出一个 对象(event),然后判断图里面有没有这个对象。如果存在,将该节点标记为 observed。这个是表示,这个节点不是virtual节点。 (2)分三种情况进行处理。具体是: ``` func (gb *GraphBuilder) runProcessGraphChanges() { for gb.processGraphChanges() { } } // Dequeueing an event from graphChanges, updating graph, populating dirty_queue. func (gb *GraphBuilder) processGraphChanges() bool { item, quit := gb.graphChanges.Get() if quit { return false } defer gb.graphChanges.Done(item) event, ok := item.(*event) if !ok { utilruntime.HandleError(fmt.Errorf("expect a *event, got %v", item)) return true } obj := event.obj accessor, err := meta.Accessor(obj) if err != nil { utilruntime.HandleError(fmt.Errorf("cannot access obj: %v", err)) return true } klog.V(5).Infof("GraphBuilder process object: %s/%s, namespace %s, name %s, uid %s, event type %v", event.gvk.GroupVersion().String(), event.gvk.Kind, accessor.GetNamespace(), accessor.GetName(), string(accessor.GetUID()), event.eventType) // Check if the node already exists // 1.判断图里面有没有这个对象 existingNode, found := gb.uidToNode.Read(accessor.GetUID()) // 1.1 如果存在,将其标记为 observed。这个是表示,这个节点不是virtual节点。 if found { // this marks the node as having been observed via an informer event // 1. this depends on graphChanges only containing add/update events from the actual informer // 2. this allows things tracking virtual nodes' existence to stop polling and rely on informer events existingNode.markObserved() } // 2. 分三种情况进行处理。 switch { case (event.eventType == addEvent || event.eventType == updateEvent) && !found: newNode := &node{ identity: objectReference{ OwnerReference: metav1.OwnerReference{ APIVersion: event.gvk.GroupVersion().String(), Kind: event.gvk.Kind, UID: accessor.GetUID(), Name: accessor.GetName(), }, Namespace: accessor.GetNamespace(), }, dependents: make(map[*node]struct{}), owners: accessor.GetOwnerReferences(), deletingDependents: beingDeleted(accessor) && hasDeleteDependentsFinalizer(accessor), beingDeleted: beingDeleted(accessor), } gb.insertNode(newNode) // the underlying delta_fifo may combine a creation and a deletion into // one event, so we need to further process the event. gb.processTransitions(event.oldObj, accessor, newNode) case (event.eventType == addEvent || event.eventType == updateEvent) && found: // handle changes in ownerReferences added, removed, changed := referencesDiffs(existingNode.owners, accessor.GetOwnerReferences()) if len(added) != 0 || len(removed) != 0 || len(changed) != 0 { // check if the changed dependency graph unblock owners that are // waiting for the deletion of their dependents. gb.addUnblockedOwnersToDeleteQueue(removed, changed) // update the node itself existingNode.owners = accessor.GetOwnerReferences() // Add the node to its new owners' dependent lists. gb.addDependentToOwners(existingNode, added) // remove the node from the dependent list of node that are no longer in // the node's owners list. 
gb.removeDependentFromOwners(existingNode, removed) } if beingDeleted(accessor) { existingNode.markBeingDeleted() } gb.processTransitions(event.oldObj, accessor, existingNode) case event.eventType == deleteEvent: if !found { klog.V(5).Infof("%v doesn't exist in the graph, this shouldn't happen", accessor.GetUID()) return true } // removeNode updates the graph gb.removeNode(existingNode) existingNode.dependentsLock.RLock() defer existingNode.dependentsLock.RUnlock() if len(existingNode.dependents) > 0 { gb.absentOwnerCache.Add(accessor.GetUID()) } for dep := range existingNode.dependents { gb.attemptToDelete.Add(dep) } for _, owner := range existingNode.owners { ownerNode, found := gb.uidToNode.Read(owner.UID) if !found || !ownerNode.isDeletingDependents() { continue } // this is to let attempToDeleteItem check if all the owner's // dependents are deleted, if so, the owner will be deleted. gb.attemptToDelete.Add(ownerNode) } } return true } ```
**Case 1:** the node is not in the graph yet and the event is an add or update. Handling:

(1) Build a node and insert it into the map.

```
case (event.eventType == addEvent || event.eventType == updateEvent) && !found:
	newNode := &node{
		// the identity of this object: APIVersion, Kind, UID, Name
		identity: objectReference{
			OwnerReference: metav1.OwnerReference{
				APIVersion: event.gvk.GroupVersion().String(),
				Kind:       event.gvk.Kind,
				UID:        accessor.GetUID(),
				Name:       accessor.GetName(),
			},
			Namespace: accessor.GetNamespace(),
		},
		dependents: make(map[*node]struct{}), // still empty at this point
		owners:     accessor.GetOwnerReferences(),
		// is it currently deleting its dependents?
		deletingDependents: beingDeleted(accessor) && hasDeleteDependentsFinalizer(accessor),
		// is it currently being deleted?
		beingDeleted: beingDeleted(accessor),
	}
	gb.insertNode(newNode)
	// the underlying delta_fifo may combine a creation and a deletion into
	// one event, so we need to further process the event.
	gb.processTransitions(event.oldObj, accessor, newNode)
```

(2) insertNode puts the node into the map and also adds it to the dependents set of every owner node. For example, if the current node is rsA, this step puts rsA into the map and adds rsA as a dependent of deployA.

(3) processTransitions is called for further handling. processTransitions is a shared helper whose job is to put the object onto the AttemptToOrphan or AttemptToDelete queue; it is described in detail below.
**第二种**, 如果图中存在这个节点,并且事件为 add或者update,处理方法为: (1)处理references Diff * 首先根据节点的信息 和 对象最新的信息,判断OwnerReference的变化。这里分为三种变化: added 表示该对象的OwnerReference中新增了哪些 owner; removed表示该对象删除了哪些owner;changed表示哪些改变了 * 针对这三种变化做出的处理如下: a. 调用addUnblockedOwnersToDeleteQueue将可能阻塞的owner重新加入队列。具体可以看代码注释中的分析 b. existingNode.owners = accessor.GetOwnerReferences(), 让节点使用最新的owner c. 新增了owner,需要在新增owner中的Dependents增加一个Dependent, 就是该节点 d. 删除了owner,需要在原来的owner中的Dependents删除这个Dependent, 就是该节点 (2) 如果当前对象有deletionStamp,标记这个节点正在删除 (3)调用processTransitions进行进一步的处理。processTransitions是一个通用函数,它的作用就是将这个对象放入放到AttemptToOrphan或者AttemptToDelete队列,这个等下具体介绍 ``` case (event.eventType == addEvent || event.eventType == updateEvent) && found: // handle changes in ownerReferences added, removed, changed := referencesDiffs(existingNode.owners, accessor.GetOwnerReferences()) if len(added) != 0 || len(removed) != 0 || len(changed) != 0 { // check if the changed dependency graph unblock owners that are // waiting for the deletion of their dependents. // a.调用addUnblockedOwnersToDeleteQueue将可能阻塞的owner重新加入队列。具体可以看代码注释中的分析 gb.addUnblockedOwnersToDeleteQueue(removed, changed) // update the node itself // b.让节点使用最新的owner existingNode.owners = accessor.GetOwnerReferences() // Add the node to its new owners' dependent lists. // c. 新增了owner,需要在新增owner中的Dependents增加一个Dependent, 就是该节点 gb.addDependentToOwners(existingNode, added) // remove the node from the dependent list of node that are no longer in // the node's owners list. // d. 删除了owner,需要在原来的owner中的Dependents删除这个Dependent, 就是该节点 gb.removeDependentFromOwners(existingNode, removed) } if beingDeleted(accessor) { existingNode.markBeingDeleted() } gb.processTransitions(event.oldObj, accessor, existingNode) // TODO: profile this function to see if a naive N^2 algorithm performs better // when the number of references is small. func referencesDiffs(old []metav1.OwnerReference, new []metav1.OwnerReference) (added []metav1.OwnerReference, removed []metav1.OwnerReference, changed []ownerRefPair) { oldUIDToRef := make(map[string]metav1.OwnerReference) for _, value := range old { oldUIDToRef[string(value.UID)] = value } oldUIDSet := sets.StringKeySet(oldUIDToRef) for _, value := range new { newUID := string(value.UID) if oldUIDSet.Has(newUID) { if !reflect.DeepEqual(oldUIDToRef[newUID], value) { changed = append(changed, ownerRefPair{oldRef: oldUIDToRef[newUID], newRef: value}) } oldUIDSet.Delete(newUID) } else { added = append(added, value) } } for oldUID := range oldUIDSet { removed = append(removed, oldUIDToRef[oldUID]) } return added, removed, changed } // 以foreground方式删除deployA的时候,deployA会被Block,原因在于它在等 rsA的删除。 // 这个时候如果改变rsA的OwnerReference,比如删除owner, deployA。这个时候需要通知deployA,你不用等了,可以直接删除了。 // addUnblockedOwnersToDeleteQueue就是做这样的事情,检测到rsA的OwnerReference变化,将等待的deployA加入删除队列。 // if an blocking ownerReference points to an object gets removed, or gets set to // "BlockOwnerDeletion=false", add the object to the attemptToDelete queue. 
func (gb *GraphBuilder) addUnblockedOwnersToDeleteQueue(removed []metav1.OwnerReference, changed []ownerRefPair) { for _, ref := range removed { if ref.BlockOwnerDeletion != nil && *ref.BlockOwnerDeletion { node, found := gb.uidToNode.Read(ref.UID) if !found { klog.V(5).Infof("cannot find %s in uidToNode", ref.UID) continue } gb.attemptToDelete.Add(node) } } for _, c := range changed { wasBlocked := c.oldRef.BlockOwnerDeletion != nil && *c.oldRef.BlockOwnerDeletion isUnblocked := c.newRef.BlockOwnerDeletion == nil || (c.newRef.BlockOwnerDeletion != nil && !*c.newRef.BlockOwnerDeletion) if wasBlocked && isUnblocked { node, found := gb.uidToNode.Read(c.newRef.UID) if !found { klog.V(5).Infof("cannot find %s in uidToNode", c.newRef.UID) continue } gb.attemptToDelete.Add(node) } } } ```
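For intuition about the added/removed classification done by referencesDiffs, here is a toy re-implementation (the real helper also reports changed pairs), applied to rsA switching its owner from deployA to a hypothetical deployB. This is a sketch, not the controller's code.

```
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// diff reports which ownerReferences were added and which were removed.
func diff(old, new []metav1.OwnerReference) (added, removed []metav1.OwnerReference) {
	oldByUID := map[types.UID]metav1.OwnerReference{}
	for _, r := range old {
		oldByUID[r.UID] = r
	}
	for _, r := range new {
		if _, ok := oldByUID[r.UID]; ok {
			delete(oldByUID, r.UID)
			continue
		}
		added = append(added, r)
	}
	for _, r := range oldByUID {
		removed = append(removed, r)
	}
	return added, removed
}

func main() {
	deployA := metav1.OwnerReference{Kind: "Deployment", Name: "deployA", UID: "uid-deployA"}
	deployB := metav1.OwnerReference{Kind: "Deployment", Name: "deployB", UID: "uid-deployB"}

	added, removed := diff([]metav1.OwnerReference{deployA}, []metav1.OwnerReference{deployB})
	fmt.Printf("added=%v\nremoved=%v\n", added, removed)
	// deployB is added and deployA removed, so the graph moves rsA between
	// the two owners' dependents sets (addDependentToOwners / removeDependentFromOwners).
}
```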
**Case 3:** the object has been deleted. Handling:

(1) Remove the node from the graph. If the node had dependents, add its UID to absentOwnerCache. This cache is very useful: once deployA is deleted, rsA can tell from absentOwnerCache that deployA really did exist and has been deleted.

(2) Put all of its dependents onto the attemptToDelete queue.

(3) If the node has owners that are in the "deleting dependents" state, those owners are very likely waiting on this node; now that it is gone, put the owners back onto the delete queue.

```
case event.eventType == deleteEvent:
	if !found {
		klog.V(5).Infof("%v doesn't exist in the graph, this shouldn't happen", accessor.GetUID())
		return true
	}
	// removeNode updates the graph
	gb.removeNode(existingNode)
	existingNode.dependentsLock.RLock()
	defer existingNode.dependentsLock.RUnlock()
	if len(existingNode.dependents) > 0 {
		gb.absentOwnerCache.Add(accessor.GetUID())
	}
	for dep := range existingNode.dependents {
		gb.attemptToDelete.Add(dep)
	}
	for _, owner := range existingNode.owners {
		ownerNode, found := gb.uidToNode.Read(owner.UID)
		if !found || !ownerNode.isDeletingDependents() {
			continue
		}
		// this is to let attempToDeleteItem check if all the owner's
		// dependents are deleted, if so, the owner will be deleted.
		gb.attemptToDelete.Add(ownerNode)
	}
}
```
#### 2.4 The logic of processTransitions

From the analysis above, runProcessGraphChanges does exactly two things:

(1) keep the graph correct and complete at all times;

(2) throw objects that may need deleting onto the AttemptToOrphan or AttemptToDelete queue.

processTransitions is responsible for the second part. Its decision logic is simple:

(1) if the object is being deleted and carries the orphan finalizer, put it onto the attemptToOrphan queue;

(2) if the object is being deleted and carries the foregroundDeletion finalizer, put it and its dependents onto attemptToDelete.

```
func (gb *GraphBuilder) processTransitions(oldObj interface{}, newAccessor metav1.Object, n *node) {
	if startsWaitingForDependentsOrphaned(oldObj, newAccessor) {
		klog.V(5).Infof("add %s to the attemptToOrphan", n.identity)
		gb.attemptToOrphan.Add(n)
		return
	}
	if startsWaitingForDependentsDeleted(oldObj, newAccessor) {
		klog.V(2).Infof("add %s to the attemptToDelete, because it's waiting for its dependents to be deleted", n.identity)
		// if the n is added as a "virtual" node, its deletingDependents field is not properly set, so always set it here.
		n.markDeletingDependents()
		for dep := range n.dependents {
			gb.attemptToDelete.Add(dep)
		}
		gb.attemptToDelete.Add(n)
	}
}
```
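The two startsWaitingFor... checks ultimately come down to the object's deletionTimestamp and finalizers. Below is a simplified sketch of that decision; it ignores the old-vs-new comparison the real helpers perform and is only meant to show which finalizer routes the object to which queue.

```
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// classify reports where processTransitions would route an object.
func classify(obj metav1.Object) string {
	if obj.GetDeletionTimestamp() == nil {
		return "not being deleted"
	}
	for _, f := range obj.GetFinalizers() {
		switch f {
		case metav1.FinalizerOrphanDependents: // "orphan"
			return "enqueue into attemptToOrphan"
		case metav1.FinalizerDeleteDependents: // "foregroundDeletion"
			return "enqueue into attemptToDelete (plus its dependents)"
		}
	}
	return "no GC finalizer, nothing for processTransitions to do"
}

func main() {
	now := metav1.Now()
	obj := &metav1.ObjectMeta{
		Name:              "deployA",
		DeletionTimestamp: &now,
		Finalizers:        []string{metav1.FinalizerDeleteDependents},
	}
	fmt.Println(classify(obj)) // foreground deletion in progress
}
```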
#### 2.5 runAttemptToOrphanWorker runAttemptToOrphanWorker逻辑如下: (1)获得这个节点的所有orphanDependents (2)调用orphanDependents,删除它的orphanDependents的OwnerReferences。 (3)删除orphan这个Finalizer,让该对象可以被删除 ``` func (gc *GarbageCollector) runAttemptToOrphanWorker() { for gc.attemptToOrphanWorker() { } } // attemptToOrphanWorker dequeues a node from the attemptToOrphan, then finds its // dependents based on the graph maintained by the GC, then removes it from the // OwnerReferences of its dependents, and finally updates the owner to remove // the "Orphan" finalizer. The node is added back into the attemptToOrphan if any of // these steps fail. func (gc *GarbageCollector) attemptToOrphanWorker() bool { item, quit := gc.attemptToOrphan.Get() gc.workerLock.RLock() defer gc.workerLock.RUnlock() if quit { return false } defer gc.attemptToOrphan.Done(item) owner, ok := item.(*node) if !ok { utilruntime.HandleError(fmt.Errorf("expect *node, got %#v", item)) return true } // we don't need to lock each element, because they never get updated owner.dependentsLock.RLock() dependents := make([]*node, 0, len(owner.dependents)) // 1.获得这个节点的所有orphanDependents for dependent := range owner.dependents { dependents = append(dependents, dependent) } owner.dependentsLock.RUnlock() // 2.调用orphanDependents,删除它的orphanDependents的OwnerReferences。 // 举例来说,删除deployA时,删除rsA的OwnerReference,这样rsA就不受deployA控制了。 err := gc.orphanDependents(owner.identity, dependents) if err != nil { utilruntime.HandleError(fmt.Errorf("orphanDependents for %s failed with %v", owner.identity, err)) gc.attemptToOrphan.AddRateLimited(item) return true } // update the owner, remove "orphaningFinalizer" from its finalizers list // 3. 删除orphan这个Finalizer,让deployA可以被删除 err = gc.removeFinalizer(owner, metav1.FinalizerOrphanDependents) if err != nil { utilruntime.HandleError(fmt.Errorf("removeOrphanFinalizer for %s failed with %v", owner.identity, err)) gc.attemptToOrphan.AddRateLimited(item) } return true } ```
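The patch sent in step 2 is a strategic merge patch that strips one ownerReference from a dependent. The snippet below only prints roughly the shape of those patch bytes (the real construction is deleteOwnerRefStrategicMergePatch); the UIDs are placeholders.

```
package main

import "fmt"

func main() {
	dependentUID := "rsA-uid"
	ownerUID := "deployA-uid"

	// Strip the deployA ownerReference from rsA.
	patch := fmt.Sprintf(
		`{"metadata":{"ownerReferences":[{"$patch":"delete","uid":"%s"}],"uid":"%s"}}`,
		ownerUID, dependentUID)
	fmt.Println(patch)

	// After all dependents are patched, a second update removes the "orphan"
	// finalizer from the owner, which lets the apiserver finish deleting it.
}
```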
#### 2.6 attemptToDeleteWorker 主要调用attemptToDeleteItem函数。attemptToDeleteItem的逻辑如下: (1)如果该对象isBeingDeleted,并且没有在删除Dependents,直接返回 (2)如果该对象正在删除dependents, 将dependents加入attemptToDelete队列 (3)调用classifyReferences,计算solid,dangling,waitingForDependentsDeletion的情况,solid,dangling,waitingForDependentsDeletion是OwnerReferences数组 solid:当前节点的owner存在,并且owner的状态不是删除Dependents中 dangling:owner不存在 waitingForDependentsDeletion:owner存在,并且owner的状态是删除Dependents中 (4)根据solid,dangling,waitingForDependentsDeletion的情况进行不同的处理,具体如下: * 情况1: 如果有至少有一个owner存在,并且不处于删除依赖中。这个时候判断dangling,waitingForDependentsDeletion的数量是否为0。如果为0,说明当前不需要处理;否则,将该节点对应dangling,waitingForDependentsDeletion的节点删除dependents。 * 情况2: 到这里说明 len(solid)=0,这个时候如果有节点在等待这个节点删除,并且这个节点还有依赖,那么将这个节点的blockOwnerDeletion设置为true。然后后台删除这个节点。 这里举一个例子说明:当前台模式删除deployA时,rsA是当前要处理的节点。这个时候rsA发现deployA再等自己删除,但是自己又有依赖podA,所以这里马上将自己设置为前台删除。这样在deployA看来就实现了先删除podA, 再删除rsA,再删除deployA。 * 情况3: 除了上面的两种情况,根据设置的删除策略删除这个节点。 ​ 这里举一个例子说明:当后台模式删除deployA时,rsA是当前要处理的节点。这个时候deployA已经删除了,同时没有finalizer,因为只有Orphan, foreGround有finalizer,所以这个时候直接默认以background删除这个节点。 ``` func (gc *GarbageCollector) attemptToDeleteWorker() bool { item, quit := gc.attemptToDelete.Get() err := gc.attemptToDeleteItem(n) return true } func (gc *GarbageCollector) attemptToDeleteItem(item *node) error { klog.V(2).Infof("processing item %s", item.identity) // "being deleted" is an one-way trip to the final deletion. We'll just wait for the final deletion, and then process the object's dependents. // 1.如果该对象isBeingDeleted,并且没有在删除Dependents,直接返回 if item.isBeingDeleted() && !item.isDeletingDependents() { klog.V(5).Infof("processing item %s returned at once, because its DeletionTimestamp is non-nil", item.identity) return nil } // TODO: It's only necessary to talk to the API server if this is a // "virtual" node. The local graph could lag behind the real status, but in // practice, the difference is small. latest, err := gc.getObject(item.identity) switch { case errors.IsNotFound(err): // the GraphBuilder can add "virtual" node for an owner that doesn't // exist yet, so we need to enqueue a virtual Delete event to remove // the virtual node from GraphBuilder.uidToNode. klog.V(5).Infof("item %v not found, generating a virtual delete event", item.identity) gc.dependencyGraphBuilder.enqueueVirtualDeleteEvent(item.identity) // since we're manually inserting a delete event to remove this node, // we don't need to keep tracking it as a virtual node and requeueing in attemptToDelete item.markObserved() return nil case err != nil: return err } if latest.GetUID() != item.identity.UID { klog.V(5).Infof("UID doesn't match, item %v not found, generating a virtual delete event", item.identity) gc.dependencyGraphBuilder.enqueueVirtualDeleteEvent(item.identity) // since we're manually inserting a delete event to remove this node, // we don't need to keep tracking it as a virtual node and requeueing in attemptToDelete item.markObserved() return nil } // TODO: attemptToOrphanWorker() routine is similar. Consider merging // attemptToOrphanWorker() into attemptToDeleteItem() as well. // 2. 
如果该对象正在删除dependents, 将dependents加入attemptToDelete队列 if item.isDeletingDependents() { return gc.processDeletingDependentsItem(item) } // compute if we should delete the item ownerReferences := latest.GetOwnerReferences() if len(ownerReferences) == 0 { klog.V(2).Infof("object %s's doesn't have an owner, continue on next item", item.identity) return nil } // 3.计算solid,dangling,waitingForDependentsDeletion的情况。 solid, dangling, waitingForDependentsDeletion, err := gc.classifyReferences(item, ownerReferences) if err != nil { return err } klog.V(5).Infof("classify references of %s.\nsolid: %#v\ndangling: %#v\nwaitingForDependentsDeletion: %#v\n", item.identity, solid, dangling, waitingForDependentsDeletion) // 4.根据solid,dangling,waitingForDependentsDeletion的情况进行不同的处理 switch { // 情况1: 如果有至少有一个owner存在,并且不处于删除依赖中。这个时候判断dangling,waitingForDependentsDeletion的数量是否为0。如果为0,说明当前不需要处理;否则,将该节点对应dangling,waitingForDependentsDeletion的节点删除dependents。 case len(solid) != 0: klog.V(2).Infof("object %#v has at least one existing owner: %#v, will not garbage collect", item.identity, solid) if len(dangling) == 0 && len(waitingForDependentsDeletion) == 0 { return nil } klog.V(2).Infof("remove dangling references %#v and waiting references %#v for object %s", dangling, waitingForDependentsDeletion, item.identity) // waitingForDependentsDeletion needs to be deleted from the // ownerReferences, otherwise the referenced objects will be stuck with // the FinalizerDeletingDependents and never get deleted. ownerUIDs := append(ownerRefsToUIDs(dangling), ownerRefsToUIDs(waitingForDependentsDeletion)...) patch := deleteOwnerRefStrategicMergePatch(item.identity.UID, ownerUIDs...) _, err = gc.patch(item, patch, func(n *node) ([]byte, error) { return gc.deleteOwnerRefJSONMergePatch(n, ownerUIDs...) }) return err // 情况2: 到这里说明 len(solid)=0,这个时候如果有节点在等待这个节点删除,并且这个节点还有依赖,那么将这个节点的blockOwnerDeletion设置为true。然后后台删除这个节点。 case len(waitingForDependentsDeletion) != 0 && item.dependentsLength() != 0: deps := item.getDependents() for _, dep := range deps { if dep.isDeletingDependents() { // this circle detection has false positives, we need to // apply a more rigorous detection if this turns out to be a // problem. // there are multiple workers run attemptToDeleteItem in // parallel, the circle detection can fail in a race condition. klog.V(2).Infof("processing object %s, some of its owners and its dependent [%s] have FinalizerDeletingDependents, to prevent potential cycle, its ownerReferences are going to be modified to be non-blocking, then the object is going to be deleted with Foreground", item.identity, dep.identity) patch, err := item.unblockOwnerReferencesStrategicMergePatch() if err != nil { return err } if _, err := gc.patch(item, patch, gc.unblockOwnerReferencesJSONMergePatch); err != nil { return err } break } } klog.V(2).Infof("at least one owner of object %s has FinalizerDeletingDependents, and the object itself has dependents, so it is going to be deleted in Foreground", item.identity) // the deletion event will be observed by the graphBuilder, so the item // will be processed again in processDeletingDependentsItem. If it // doesn't have dependents, the function will remove the // FinalizerDeletingDependents from the item, resulting in the final // deletion of the item. policy := metav1.DeletePropagationForeground return gc.deleteObject(item.identity, &policy) // 情况3: 除了上面的两种情况,根据设置的删除策略删除这个节点 default: // item doesn't have any solid owner, so it needs to be garbage // collected. 
Also, none of item's owners is waiting for the deletion of // the dependents, so set propagationPolicy based on existing finalizers. var policy metav1.DeletionPropagation switch { case hasOrphanFinalizer(latest): // if an existing orphan finalizer is already on the object, honor it. policy = metav1.DeletePropagationOrphan case hasDeleteDependentsFinalizer(latest): // if an existing foreground finalizer is already on the object, honor it. policy = metav1.DeletePropagationForeground default: // otherwise, default to background. policy = metav1.DeletePropagationBackground } klog.V(2).Infof("delete object %s with propagation policy %s", item.identity, policy) return gc.deleteObject(item.identity, &policy) } } ```
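To summarize the three buckets, here is a toy classification using the foreground-deletion example from case 2 (rsA, whose only owner deployA still exists but is waiting on its dependents). It is a sketch; the real classifyReferences consults the graph and the apiserver for each owner.

```
package main

import "fmt"

type ownerState struct {
	exists             bool
	deletingDependents bool
}

// classify splits owner UIDs into solid / dangling / waitingForDependentsDeletion.
func classify(owners map[string]ownerState) (solid, dangling, waiting []string) {
	for uid, st := range owners {
		switch {
		case !st.exists:
			dangling = append(dangling, uid)
		case st.deletingDependents:
			waiting = append(waiting, uid)
		default:
			solid = append(solid, uid)
		}
	}
	return
}

func main() {
	// rsA during a foreground deletion of deployA: the owner exists but is blocked
	// on its dependents, so rsA itself gets deleted with Foreground (case 2 above).
	solid, dangling, waiting := classify(map[string]ownerState{
		"deployA-uid": {exists: true, deletingDependents: true},
	})
	fmt.Println("solid:", solid, "dangling:", dangling, "waitingForDependentsDeletion:", waiting)
}
```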
#### 2.7 What exactly is uidToNode

startGarbageCollectorController enables the debug handler:

```
return garbagecollector.NewDebugHandler(garbageCollector), true, nil
```

With this we can inspect the data inside uidToNode. There is a lot of it, so here we only look at the entries for the kube-hpa Deployment in the kube-system namespace. kcm listens on port 10252.

```
// 639d5269-d73d-4964-a7de-d6f386c9c7e4 is the uid of the kube-hpa Deployment
# curl http://127.0.0.1:10252/debug/controllers/garbagecollector/graph?uid=639d5269-d73d-4964-a7de-d6f386c9c7e4
strict digraph full {
  // Node definitions.
  0 [ label="\"uid=e66e45c0-5695-4c93-82f1-067b20aa035f\nnamespace=kube-system\nReplicaSet.v1.apps/kube-hpa-84c884f994\n\"" group="apps" version="v1" kind="ReplicaSet" namespace="kube-system" name="kube-hpa-84c884f994" uid="e66e45c0-5695-4c93-82f1-067b20aa035f" missing="false" beingDeleted="false" deletingDependents="false" virtual="false" ];
  1 [ label="\"uid=9833c399-b139-4432-98f7-cec13158f804\nnamespace=kube-system\nPod.v1/kube-hpa-84c884f994-7gwpz\n\"" group="" version="v1" kind="Pod" namespace="kube-system" name="kube-hpa-84c884f994-7gwpz" uid="9833c399-b139-4432-98f7-cec13158f804" missing="false" beingDeleted="false" deletingDependents="false" virtual="false" ];
  2 [ label="\"uid=639d5269-d73d-4964-a7de-d6f386c9c7e4\nnamespace=kube-system\nDeployment.v1.apps/kube-hpa\n\"" group="apps" version="v1" kind="Deployment" namespace="kube-system" name="kube-hpa" uid="639d5269-d73d-4964-a7de-d6f386c9c7e4" missing="false" beingDeleted="false" deletingDependents="false" virtual="false" ];

  // Edge definitions.
  0 -> 2;
  1 -> 0;
}
```

As you can see, the graph encodes each node's dependencies, while beingDeleted and deletingDependents record each node's current state.

The graph can also be rendered:

```
curl http://127.0.0.1:10252/debug/controllers/garbagecollector/graph?uid=639d5269-d73d-4964-a7de-d6f386c9c7e4 > tmp.dot
dot -Tsvg -o graph.svg tmp.dot
```

graph.svg looks like this:

![graph](../images/graph.svg)

### 3. Summary

The GC logic is convoluted and hard to follow at first, but after a few passes its elegance becomes apparent. To recap the whole flow:

(1) When kcm starts, the gc controller starts with it and performs the following initialization (see the figure below):

* periodically fetch all deletable resources, store them in the RESTMapper, and start monitors for these resources
* register add, update and delete handlers for these resources: every change is wrapped into an event and thrown onto the graphChanges queue

(2) runProcessGraphChanges processes the objects on the graphChanges queue. It does two main things:

* first, according to each change it maintains the uidToNode graph; every object corresponds to one node, which carries owners and dependents fields
* second, based on fields such as beingDeleted and deletingDependents it decides whether the node may need deleting and, if so, throws it onto the attemptToDelete or attemptToOrphan queue

(3) The attemptToDeleteWorker and attemptToOrphanWorker drain the attemptToDelete and attemptToOrphan queues and perform the deletion appropriate to each case.

![gc-1](../images/gc-1.png)
================================================
FILE: k8s/kcm/3-k8s中以不同的策略删除资源时发生了什么.md
================================================

Table of Contents
=================

* [1. 孤儿模式](#1-孤儿模式)
* [2. 后台模式](#2-后台模式)
* [3. 前台模式](#3-前台模式)
* [4. 总结](#4-总结)
* [5. 方法论](#5-方法论)
  * [5.1 看deployA的yaml发生了什么变化](#51-看deploya的yaml发生了什么变化)
  * [5.2 增大kcm的日志等级,查看gc的日志](#52-增大kcm的日志等级查看gc的日志)
  * [5.3 增大apiserver的日志等级,查看apiserver的处理](#53-增大apiserver的日志等级查看apiserver的处理)

Following on from the GC source analysis, this article summarizes what happens when a Kubernetes resource is deleted under each of the deletion policies (orphan, foreground, background).

Throughout, deployA, rsA and podA are used as the running example (the same applies to any resources with this kind of ownership chain).

### 1. Orphan mode

Deleting deployA in orphan mode: deployA is deleted, rsA is not, but deployA is removed from rsA's ownerReferences.

The flow is:

(1) The client runs kubectl delete deploy deployA --cascade=false.

(2) The apiserver receives the request and sees that the deletion mode is orphan. It then does two things:

* set deployA's deletionTimestamp
* add the orphan finalizer

**The apiserver returns immediately at this point; it does not block and wait.**

(3) Because the apiserver updated deployA, the GC receives an **update** event for deployA and starts working:

* first, it maintains the uidToNode graph: the deployA node is removed and deployA is dropped from the rsA node's owners;
* second, it removes deployA from the rsA object's ownerReferences;
* third, it removes the orphan finalizer from the deployA object.

(4) Removing the orphan finalizer from deployA is itself an update event. When the apiserver sees this update and finds that all of deployA's finalizers are gone, it performs the real deletion of deployA.
### 2. Background mode

Deleting deployA in background mode: deployA is deleted immediately, then rsA, and finally the pod.

The flow is:

(1) The client deletes deployA with "propagationPolicy": "Background".

(2) The apiserver receives the request, sees that the mode is Background, and deletes deployA right away.

(3) Because the apiserver deleted deployA, the GC receives a **delete** event for deployA and starts working:

* first, it maintains the uidToNode graph: the deployA node is removed and rsA is thrown onto the attemptToDelete queue;
* second, when rsA is processed, its owner no longer exists, so rsA is immediately deleted with the background policy as well;
* third, the same steps repeat: rsA is deleted first, and then the pod.
### 3. Foreground mode

Deleting deployA in foreground mode: podA is deleted first, then rsA, and finally deployA.

The flow is:

(1) The client deletes deployA with propagationPolicy: Foreground.

(2) The apiserver receives the request and sees that the mode is Foreground. It then does two things:

* set deployA's deletionTimestamp
* add the foregroundDeletion finalizer

**The apiserver returns immediately at this point; it does not block and wait.**

(3) Because the apiserver updated deployA, the GC receives an update event for deployA and starts working. Specifically:

First, it maintains the uidToNode graph: the deployA node is marked as "deleting dependents", and deployA's dependent (rsA) is put onto the attemptToDelete queue.

When rsA is processed, the GC sees that rsA's owner is waiting for its dependents to be deleted and that rsA has dependents of its own, so it calls the **foreground** deletion API to delete rsA.

Likewise, the foreground deletion of rsA first marks the rsA node as "deleting dependents" and then puts rsA's dependent (podA) onto the attemptToDelete queue.

When podA is processed, the GC sees that podA's owner is waiting for its dependents to be deleted but podA has no dependents of its own, so it calls the **background** deletion API to delete podA.

After podA is deleted in the background, the apiserver removes the podA object directly, so the GC receives a delete event: the podA node is removed from the graph and rsA is enqueued for deletion again.

rsA then finds that its dependents are gone, so its finalizer is removed and the apiserver deletes rsA.

The GC then receives rsA's delete event and, by the same steps, deployA is finally deleted.
### 4. Summary

The GC mechanism is quite elegant, and it works hand in hand with the apiserver. It is also very useful in practice: for example, two otherwise unrelated objects can be made to cascade-delete simply by setting an OwnerReference on one of them (see the sketch below).
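As a sketch of that idea, the snippet below attaches a hypothetical ConfigMap to an existing Deployment via an OwnerReference, so that deleting the Deployment also garbage-collects the ConfigMap. The names and the UID are placeholders; in practice the UID must be read from the live owner object, and owner and dependent must live in the same namespace.

```
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	blockOwnerDeletion := true
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "my-config",
			Namespace: "default",
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion:         "apps/v1",
				Kind:               "Deployment",
				Name:               "deployA",
				UID:                "uid-of-deployA",      // must be the owner's real UID
				BlockOwnerDeletion: &blockOwnerDeletion,   // foreground deletion of deployA will wait on this object
			}},
		},
		Data: map[string]string{"key": "value"},
	}
	// Creating this ConfigMap (e.g. via a clientset) makes it a dependent of deployA,
	// so deleting deployA with Background/Foreground propagation removes it too.
	fmt.Printf("%+v\n", cm.ObjectMeta.OwnerReferences)
}
```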
### 5. 方法论 以上的流程,通过代码和实践进行验证。 代码分析见上一篇。实践就是通过实验,主要做了以下观察: (1)看deployA的yaml发生了什么变化 (2)增大kcm的日志等级,查看gc的日志 (3)增大apiserver的日志等级,查看apiserver的处理 #### 5.1 看deployA的yaml发生了什么变化 ``` // -w 一直监控删除前后的变化 root@k8s-master:~/testyaml/hpa# kubectl get deploy zx-hpa -oyaml -w apiVersion: apps/v1 kind: Deployment metadata: annotations: deployment.kubernetes.io/revision: "1" creationTimestamp: "2021-07-09T07:21:48Z" generation: 1 labels: app: zx-hpa-test name: zx-hpa namespace: default resourceVersion: "6975175" selfLink: /apis/apps/v1/namespaces/default/deployments/zx-hpa uid: 6ccbe990-e4d3-4ba1-b67f-56a9bfbd69a0 spec: progressDeadlineSeconds: 600 replicas: 2 revisionHistoryLimit: 10 selector: matchLabels: app: zx-hpa-test strategy: rollingUpdate: maxSurge: 1 maxUnavailable: 25% type: RollingUpdate template: metadata: creationTimestamp: null labels: app: zx-hpa-test name: zx-hpa-test spec: containers: - command: - sleep - "3600" image: busybox:latest imagePullPolicy: IfNotPresent name: busybox resources: {} terminationMessagePath: /dev/termination-log terminationMessagePolicy: File dnsPolicy: ClusterFirst restartPolicy: Always schedulerName: default-scheduler securityContext: {} terminationGracePeriodSeconds: 5 status: availableReplicas: 2 conditions: - lastTransitionTime: "2021-07-09T07:21:50Z" lastUpdateTime: "2021-07-09T07:21:50Z" message: Deployment has minimum availability. reason: MinimumReplicasAvailable status: "True" type: Available - lastTransitionTime: "2021-07-09T07:21:49Z" lastUpdateTime: "2021-07-09T07:21:50Z" message: ReplicaSet "zx-hpa-7b56cddd95" has successfully progressed. reason: NewReplicaSetAvailable status: "True" type: Progressing observedGeneration: 1 readyReplicas: 2 replicas: 2 updatedReplicas: 2 --- apiVersion: apps/v1 kind: Deployment metadata: annotations: deployment.kubernetes.io/revision: "1" creationTimestamp: "2021-07-09T07:21:48Z" generation: 1 labels: app: zx-hpa-test name: zx-hpa namespace: default resourceVersion: "6975316" selfLink: /apis/apps/v1/namespaces/default/deployments/zx-hpa uid: 6ccbe990-e4d3-4ba1-b67f-56a9bfbd69a0 spec: progressDeadlineSeconds: 600 replicas: 2 revisionHistoryLimit: 10 selector: matchLabels: app: zx-hpa-test strategy: rollingUpdate: maxSurge: 1 maxUnavailable: 25% type: RollingUpdate template: metadata: creationTimestamp: null labels: app: zx-hpa-test name: zx-hpa-test spec: containers: - command: - sleep - "3600" image: busybox:latest imagePullPolicy: IfNotPresent name: busybox resources: {} terminationMessagePath: /dev/termination-log terminationMessagePolicy: File dnsPolicy: ClusterFirst restartPolicy: Always schedulerName: default-scheduler securityContext: {} terminationGracePeriodSeconds: 5 status: availableReplicas: 2 conditions: - lastTransitionTime: "2021-07-09T07:21:50Z" lastUpdateTime: "2021-07-09T07:21:50Z" message: Deployment has minimum availability. reason: MinimumReplicasAvailable status: "True" type: Available - lastTransitionTime: "2021-07-09T07:21:49Z" lastUpdateTime: "2021-07-09T07:21:50Z" message: ReplicaSet "zx-hpa-7b56cddd95" has successfully progressed. 
reason: NewReplicaSetAvailable status: "True" type: Progressing observedGeneration: 1 readyReplicas: 2 replicas: 2 updatedReplicas: 2 ``` #### 5.2 增大kcm的日志等级,查看gc的日志 ``` I0709 15:17:45.089271 3183 resource_quota_monitor.go:354] QuotaMonitor process object: apps/v1, Resource=deployments, namespace kube-system, name kube-hpa, uid 639d5269-d73d-4964-a7de-d6f386c9c7e4, event type delete I0709 15:17:45.089320 3183 graph_builder.go:543] GraphBuilder process object: apps/v1/Deployment, namespace kube-system, name kube-hpa, uid 639d5269-d73d-4964-a7de-d6f386c9c7e4, event type delete I0709 15:17:45.089346 3183 garbagecollector.go:404] processing item [apps/v1/ReplicaSet, namespace: kube-system, name: kube-hpa-84c884f994, uid: e66e45c0-5695-4c93-82f1-067b20aa035f] I0709 15:17:45.089576 3183 deployment_controller.go:193] Deleting deployment kube-hpa I0709 15:17:45.089591 3183 deployment_controller.go:564] Started syncing deployment "kube-system/kube-hpa" (2021-07-09 15:17:45.089588305 +0800 CST m=+38.708727198) I0709 15:17:45.089611 3183 deployment_controller.go:575] Deployment kube-system/kube-hpa has been deleted I0709 15:17:45.089615 3183 deployment_controller.go:566] Finished syncing deployment "kube-system/kube-hpa" (24.606µs) I0709 15:17:45.093463 3183 garbagecollector.go:329] according to the absentOwnerCache, object e66e45c0-5695-4c93-82f1-067b20aa035f's owner apps/v1/Deployment, kube-hpa does not exist I0709 15:17:45.093480 3183 garbagecollector.go:455] classify references of [apps/v1/ReplicaSet, namespace: kube-system, name: kube-hpa-84c884f994, uid: e66e45c0-5695-4c93-82f1-067b20aa035f]. solid: []v1.OwnerReference(nil) dangling: []v1.OwnerReference{v1.OwnerReference{APIVersion:"apps/v1", Kind:"Deployment", Name:"kube-hpa", UID:"639d5269-d73d-4964-a7de-d6f386c9c7e4", Controller:(*bool)(0xc000ab3817), BlockOwnerDeletion:(*bool)(0xc000ab3818)}} waitingForDependentsDeletion: []v1.OwnerReference(nil) I0709 15:17:45.093517 3183 garbagecollector.go:517] delete object [apps/v1/ReplicaSet, namespace: kube-system, name: kube-hpa-84c884f994, uid: e66e45c0-5695-4c93-82f1-067b20aa035f] with propagation policy Background I0709 15:17:45.107563 3183 resource_quota_monitor.go:354] QuotaMonitor process object: apps/v1, Resource=replicasets, namespace kube-system, name kube-hpa-84c884f994, uid e66e45c0-5695-4c93-82f1-067b20aa035f, event type delete I0709 15:17:45.107635 3183 replica_set.go:349] Deleting ReplicaSet "kube-system/kube-hpa-84c884f994" I0709 15:17:45.107687 3183 replica_set.go:658] ReplicaSet kube-system/kube-hpa-84c884f994 has been deleted I0709 15:17:45.107692 3183 replica_set.go:649] Finished syncing ReplicaSet "kube-system/kube-hpa-84c884f994" (16.069µs) I0709 15:17:45.107720 3183 graph_builder.go:543] GraphBuilder process object: apps/v1/ReplicaSet, namespace kube-system, name kube-hpa-84c884f994, uid e66e45c0-5695-4c93-82f1-067b20aa035f, event type delete I0709 15:17:45.107753 3183 garbagecollector.go:404] processing item [v1/Pod, namespace: kube-system, name: kube-hpa-84c884f994-7gwpz, uid: 9833c399-b139-4432-98f7-cec13158f804] I0709 15:17:45.111155 3183 garbagecollector.go:329] according to the absentOwnerCache, object 9833c399-b139-4432-98f7-cec13158f804's owner apps/v1/ReplicaSet, kube-hpa-84c884f994 does not exist I0709 15:17:45.111174 3183 garbagecollector.go:455] classify references of [v1/Pod, namespace: kube-system, name: kube-hpa-84c884f994-7gwpz, uid: 9833c399-b139-4432-98f7-cec13158f804]. 
solid: []v1.OwnerReference(nil) dangling: []v1.OwnerReference{v1.OwnerReference{APIVersion:"apps/v1", Kind:"ReplicaSet", Name:"kube-hpa-84c884f994", UID:"e66e45c0-5695-4c93-82f1-067b20aa035f", Controller:(*bool)(0xc000bde7bf), BlockOwnerDeletion:(*bool)(0xc000bde800)}} waitingForDependentsDeletion: []v1.OwnerReference(nil) I0709 15:17:45.111213 3183 garbagecollector.go:517] delete object [v1/Pod, namespace: kube-system, name: kube-hpa-84c884f994-7gwpz, uid: 9833c399-b139-4432-98f7-cec13158f804] with propagation policy Background I0709 15:17:45.124112 3183 graph_builder.go:543] GraphBuilder process object: v1/Pod, namespace kube-system, name kube-hpa-84c884f994-7gwpz, uid 9833c399-b139-4432-98f7-cec13158f804, event type update I0709 15:17:45.124236 3183 endpoints_controller.go:385] About to update endpoints for service "kube-system/kube-hpa" I0709 15:17:45.124275 3183 endpoints_controller.go:420] Pod is being deleted kube-system/kube-hpa-84c884f994-7gwpz I0709 15:17:45.124293 3183 endpoints_controller.go:512] Update endpoints for kube-system/kube-hpa, ready: 0 not ready: 0 I0709 15:17:45.124481 3183 disruption.go:394] updatePod called on pod "kube-hpa-84c884f994-7gwpz" I0709 15:17:45.124523 3183 disruption.go:457] No PodDisruptionBudgets found for pod kube-hpa-84c884f994-7gwpz, PodDisruptionBudget controller will avoid syncing. I0709 15:17:45.124527 3183 disruption.go:397] No matching pdb for pod "kube-hpa-84c884f994-7gwpz" I0709 15:17:45.131011 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-hpa, uid 17a8623b-2bd6-4253-b7cd-88a7af615220, event type update I0709 15:17:45.132261 3183 endpoints_controller.go:353] Finished syncing service "kube-system/kube-hpa" endpoints. (8.020508ms) I0709 15:17:45.132951 3183 graph_builder.go:543] GraphBuilder process object: events.k8s.io/v1beta1/Event, namespace kube-system, name kube-hpa-84c884f994-7gwpz.16900e30134087ab, uid 7c55e936-801b-4eb9-a828-085d92983134, event type add I0709 15:17:45.310041 3183 graph_builder.go:543] GraphBuilder process object: apiregistration.k8s.io/v1/APIService, namespace , name v1beta1.custom.metrics.k8s.io, uid 71617a10-8136-4a2a-af65-d64bcd6c78c3, event type update I0709 15:17:45.660593 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-scheduler, uid d1e00c1e-7803-4c0f-ab8a-b3eeb0644879, event type update I0709 15:17:45.668379 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-scheduler, uid 9aed1771-031a-4fce-826a-11d98ee81740, event type update I0709 15:17:46.143691 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-controller-manager, uid 5d530096-9b10-45bb-a11e-43f1f8733fa5, event type update I0709 15:17:46.143962 3183 graph_builder.go:543] GraphBuilder process object: v1/Pod, namespace kube-system, name kube-hpa-84c884f994-7gwpz, uid 9833c399-b139-4432-98f7-cec13158f804, event type update I0709 15:17:46.144055 3183 endpoints_controller.go:385] About to update endpoints for service "kube-system/kube-hpa" I0709 15:17:46.144095 3183 endpoints_controller.go:420] Pod is being deleted kube-system/kube-hpa-84c884f994-7gwpz I0709 15:17:46.144126 3183 endpoints_controller.go:512] Update endpoints for kube-system/kube-hpa, ready: 0 not ready: 0 I0709 15:17:46.144329 3183 disruption.go:394] updatePod called on pod "kube-hpa-84c884f994-7gwpz" I0709 15:17:46.144347 3183 disruption.go:457] No 
PodDisruptionBudgets found for pod kube-hpa-84c884f994-7gwpz, PodDisruptionBudget controller will avoid syncing. I0709 15:17:46.144350 3183 disruption.go:397] No matching pdb for pod "kube-hpa-84c884f994-7gwpz" I0709 15:17:46.144361 3183 pvc_protection_controller.go:342] Enqueuing PVCs for Pod kube-system/kube-hpa-84c884f994-7gwpz (UID=9833c399-b139-4432-98f7-cec13158f804) I0709 15:17:46.150410 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-hpa, uid 17a8623b-2bd6-4253-b7cd-88a7af615220, event type update I0709 15:17:46.150749 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-controller-manager, uid 036d9292-1152-4f8c-8a85-0879c5424cfb, event type update I0709 15:17:46.151231 3183 leaderelection.go:283] successfully renewed lease kube-system/kube-controller-manager I0709 15:17:46.151321 3183 endpoints_controller.go:353] Finished syncing service "kube-system/kube-hpa" endpoints. (7.269404ms) I0709 15:17:46.978486 3183 cronjob_controller.go:129] Found 4 jobs I0709 15:17:46.978503 3183 cronjob_controller.go:135] Found 1 groups I0709 15:17:46.982118 3183 event.go:281] Event(v1.ObjectReference{Kind:"CronJob", Namespace:"default", Name:"hello", UID:"b9648456-0b0a-44a4-b4c7-4c1db9be4085", APIVersion:"batch/v1beta1", ResourceVersion:"6974347", FieldPath:""}): type: 'Normal' reason: 'SawCompletedJob' Saw completed job: hello-1625815020, status: Complete I0709 15:17:46.986941 3183 graph_builder.go:543] GraphBuilder process object: batch/v1beta1/CronJob, namespace default, name hello, uid b9648456-0b0a-44a4-b4c7-4c1db9be4085, event type update I0709 15:17:46.987073 3183 cronjob_controller.go:278] No unmet start times for default/hello I0709 15:17:46.987091 3183 cronjob_controller.go:203] Cleaning up 1/4 jobs from default/hello I0709 15:17:46.987096 3183 cronjob_controller.go:207] Removing job hello-1625814840 from default/hello I0709 15:17:46.987694 3183 graph_builder.go:543] GraphBuilder process object: events.k8s.io/v1beta1/Event, namespace default, name hello.16900e3081ed9288, uid 21dc6f32-9c3b-479a-8a69-c71946be3b7a, event type add I0709 15:17:46.998396 3183 job_controller.go:452] Job has been deleted: default/hello-1625814840 I0709 15:17:46.998407 3183 job_controller.go:439] Finished syncing job "default/hello-1625814840" (42.057µs) I0709 15:17:46.998436 3183 graph_builder.go:543] GraphBuilder process object: batch/v1/Job, namespace default, name hello-1625814840, uid ce65b016-b3c4-4a65-b01d-f81381fca20a, event type delete I0709 15:17:46.998463 3183 garbagecollector.go:404] processing item [v1/Pod, namespace: default, name: hello-1625814840-9tmbk, uid: 7aabf04b-31c5-4602-af5e-87a7e0079d1a] I0709 15:17:46.998715 3183 resource_quota_monitor.go:354] QuotaMonitor process object: batch/v1, Resource=jobs, namespace default, name hello-1625814840, uid ce65b016-b3c4-4a65-b01d-f81381fca20a, event type delete I0709 15:17:46.999144 3183 event.go:281] Event(v1.ObjectReference{Kind:"CronJob", Namespace:"default", Name:"hello", UID:"b9648456-0b0a-44a4-b4c7-4c1db9be4085", APIVersion:"batch/v1beta1", ResourceVersion:"6974464", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted job hello-1625814840 I0709 15:17:47.002267 3183 garbagecollector.go:329] according to the absentOwnerCache, object 7aabf04b-31c5-4602-af5e-87a7e0079d1a's owner batch/v1/Job, hello-1625814840 does not exist I0709 15:17:47.002298 3183 garbagecollector.go:455] classify references of [v1/Pod, namespace: 
default, name: hello-1625814840-9tmbk, uid: 7aabf04b-31c5-4602-af5e-87a7e0079d1a]. solid: []v1.OwnerReference(nil) dangling: []v1.OwnerReference{v1.OwnerReference{APIVersion:"batch/v1", Kind:"Job", Name:"hello-1625814840", UID:"ce65b016-b3c4-4a65-b01d-f81381fca20a", Controller:(*bool)(0xc000bdf480), BlockOwnerDeletion:(*bool)(0xc000bdf481)}} waitingForDependentsDeletion: []v1.OwnerReference(nil) I0709 15:17:47.002325 3183 garbagecollector.go:517] delete object [v1/Pod, namespace: default, name: hello-1625814840-9tmbk, uid: 7aabf04b-31c5-4602-af5e-87a7e0079d1a] with propagation policy Background I0709 15:17:47.005713 3183 graph_builder.go:543] GraphBuilder process object: events.k8s.io/v1beta1/Event, namespace default, name hello.16900e3082f15365, uid 903283d1-63da-4ba7-b200-69d6a30a1d5c, event type add I0709 15:17:47.011868 3183 graph_builder.go:543] GraphBuilder process object: v1/Pod, namespace default, name hello-1625814840-9tmbk, uid 7aabf04b-31c5-4602-af5e-87a7e0079d1a, event type update I0709 15:17:47.011938 3183 disruption.go:394] updatePod called on pod "hello-1625814840-9tmbk" I0709 15:17:47.011960 3183 disruption.go:457] No PodDisruptionBudgets found for pod hello-1625814840-9tmbk, PodDisruptionBudget controller will avoid syncing. I0709 15:17:47.011964 3183 disruption.go:397] No matching pdb for pod "hello-1625814840-9tmbk" I0709 15:17:47.011977 3183 pvc_protection_controller.go:342] Enqueuing PVCs for Pod default/hello-1625814840-9tmbk (UID=7aabf04b-31c5-4602-af5e-87a7e0079d1a) I0709 15:17:47.026287 3183 graph_builder.go:543] GraphBuilder process object: v1/Pod, namespace default, name hello-1625814840-9tmbk, uid 7aabf04b-31c5-4602-af5e-87a7e0079d1a, event type delete I0709 15:17:47.026312 3183 deployment_controller.go:356] Pod hello-1625814840-9tmbk deleted. I0709 15:17:47.026350 3183 taint_manager.go:383] Noticed pod deletion: types.NamespacedName{Namespace:"default", Name:"hello-1625814840-9tmbk"} I0709 15:17:47.026389 3183 disruption.go:423] deletePod called on pod "hello-1625814840-9tmbk" I0709 15:17:47.026409 3183 disruption.go:457] No PodDisruptionBudgets found for pod hello-1625814840-9tmbk, PodDisruptionBudget controller will avoid syncing. I0709 15:17:47.026413 3183 disruption.go:426] No matching pdb for pod "hello-1625814840-9tmbk" I0709 15:17:47.026425 3183 pvc_protection_controller.go:342] Enqueuing PVCs for Pod default/hello-1625814840-9tmbk (UID=7aabf04b-31c5-4602-af5e-87a7e0079d1a) I0709 15:17:47.026449 3183 resource_quota_monitor.go:354] QuotaMonitor process object: /v1, Resource=pods, namespace default, name hello-1625814840-9tmbk, uid 7aabf04b-31c5-4602-af5e-87a7e0079d1a, event type delete I0709 15:17:47.164797 3183 graph_builder.go:543] GraphBuilder process object: v1/Pod, namespace kube-system, name kube-hpa-84c884f994-7gwpz, uid 9833c399-b139-4432-98f7-cec13158f804, event type update I0709 15:17:47.164886 3183 endpoints_controller.go:385] About to update endpoints for service "kube-system/kube-hpa" I0709 15:17:47.164929 3183 endpoints_controller.go:420] Pod is being deleted kube-system/kube-hpa-84c884f994-7gwpz I0709 15:17:47.164945 3183 endpoints_controller.go:512] Update endpoints for kube-system/kube-hpa, ready: 0 not ready: 0 I0709 15:17:47.165093 3183 disruption.go:394] updatePod called on pod "kube-hpa-84c884f994-7gwpz" I0709 15:17:47.165108 3183 disruption.go:457] No PodDisruptionBudgets found for pod kube-hpa-84c884f994-7gwpz, PodDisruptionBudget controller will avoid syncing. 
I0709 15:17:47.165111 3183 disruption.go:397] No matching pdb for pod "kube-hpa-84c884f994-7gwpz" I0709 15:17:47.165122 3183 pvc_protection_controller.go:342] Enqueuing PVCs for Pod kube-system/kube-hpa-84c884f994-7gwpz (UID=9833c399-b139-4432-98f7-cec13158f804) I0709 15:17:47.165142 3183 resource_quota_monitor.go:354] QuotaMonitor process object: /v1, Resource=pods, namespace kube-system, name kube-hpa-84c884f994-7gwpz, uid 9833c399-b139-4432-98f7-cec13158f804, event type update I0709 15:17:47.169973 3183 endpoints_controller.go:353] Finished syncing service "kube-system/kube-hpa" endpoints. (5.082912ms) I0709 15:17:47.172446 3183 graph_builder.go:543] GraphBuilder process object: v1/Pod, namespace kube-system, name kube-hpa-84c884f994-7gwpz, uid 9833c399-b139-4432-98f7-cec13158f804, event type delete I0709 15:17:47.172467 3183 deployment_controller.go:356] Pod kube-hpa-84c884f994-7gwpz deleted. I0709 15:17:47.172474 3183 deployment_controller.go:424] Cannot get replicaset "kube-hpa-84c884f994" for pod "kube-hpa-84c884f994-7gwpz": replicaset.apps "kube-hpa-84c884f994" not found I0709 15:17:47.172507 3183 taint_manager.go:383] Noticed pod deletion: types.NamespacedName{Namespace:"kube-system", Name:"kube-hpa-84c884f994-7gwpz"} I0709 15:17:47.172564 3183 endpoints_controller.go:385] About to update endpoints for service "kube-system/kube-hpa" I0709 15:17:47.172614 3183 endpoints_controller.go:512] Update endpoints for kube-system/kube-hpa, ready: 0 not ready: 0 I0709 15:17:47.172779 3183 disruption.go:423] deletePod called on pod "kube-hpa-84c884f994-7gwpz" I0709 15:17:47.172796 3183 disruption.go:457] No PodDisruptionBudgets found for pod kube-hpa-84c884f994-7gwpz, PodDisruptionBudget controller will avoid syncing. I0709 15:17:47.172799 3183 disruption.go:426] No matching pdb for pod "kube-hpa-84c884f994-7gwpz" I0709 15:17:47.172808 3183 pvc_protection_controller.go:342] Enqueuing PVCs for Pod kube-system/kube-hpa-84c884f994-7gwpz (UID=9833c399-b139-4432-98f7-cec13158f804) I0709 15:17:47.172843 3183 resource_quota_monitor.go:354] QuotaMonitor process object: /v1, Resource=pods, namespace kube-system, name kube-hpa-84c884f994-7gwpz, uid 9833c399-b139-4432-98f7-cec13158f804, event type delete I0709 15:17:47.173978 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-hpa, uid 17a8623b-2bd6-4253-b7cd-88a7af615220, event type update I0709 15:17:47.178093 3183 endpoints_controller.go:353] Finished syncing service "kube-system/kube-hpa" endpoints. (5.525822ms) I0709 15:17:47.178107 3183 endpoints_controller.go:340] Error syncing endpoints for service "kube-system/kube-hpa", retrying. 
Error: Operation cannot be fulfilled on endpoints "kube-hpa": the object has been modified; please apply your changes to the latest version and try again I0709 15:17:47.178372 3183 event.go:281] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"kube-system", Name:"kube-hpa", UID:"17a8623b-2bd6-4253-b7cd-88a7af615220", APIVersion:"v1", ResourceVersion:"6974462", FieldPath:""}): type: 'Warning' reason: 'FailedToUpdateEndpoint' Failed to update endpoint kube-system/kube-hpa: Operation cannot be fulfilled on endpoints "kube-hpa": the object has been modified; please apply your changes to the latest version and try again I0709 15:17:47.182381 3183 graph_builder.go:543] GraphBuilder process object: events.k8s.io/v1beta1/Event, namespace kube-system, name kube-hpa.16900e308da0917a, uid d136415c-0a51-40e2-b1ba-f63587af89a6, event type add I0709 15:17:47.183280 3183 endpoints_controller.go:385] About to update endpoints for service "kube-system/kube-hpa" I0709 15:17:47.183318 3183 endpoints_controller.go:512] Update endpoints for kube-system/kube-hpa, ready: 0 not ready: 0 I0709 15:17:47.186538 3183 endpoints_controller.go:353] Finished syncing service "kube-system/kube-hpa" endpoints. (3.266428ms) I0709 15:17:47.679672 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-scheduler, uid d1e00c1e-7803-4c0f-ab8a-b3eeb0644879, event type update I0709 15:17:47.686259 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-scheduler, uid 9aed1771-031a-4fce-826a-11d98ee81740, event type update I0709 15:17:48.166708 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-controller-manager, uid 5d530096-9b10-45bb-a11e-43f1f8733fa5, event type update I0709 15:17:48.175956 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-controller-manager, uid 036d9292-1152-4f8c-8a85-0879c5424cfb, event type update I0709 15:17:48.176356 3183 leaderelection.go:283] successfully renewed lease kube-system/kube-controller-manager I0709 15:17:49.277193 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-node-lease, name 192.168.0.5, uid 71ce7519-2999-4dbf-9118-227e5cb6d9ef, event type update I0709 15:17:49.701416 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-scheduler, uid d1e00c1e-7803-4c0f-ab8a-b3eeb0644879, event type update I0709 15:17:49.721102 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-scheduler, uid 9aed1771-031a-4fce-826a-11d98ee81740, event type update I0709 15:17:50.189139 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-controller-manager, uid 5d530096-9b10-45bb-a11e-43f1f8733fa5, event type update I0709 15:17:50.199890 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-controller-manager, uid 036d9292-1152-4f8c-8a85-0879c5424cfb, event type update I0709 15:17:50.200028 3183 leaderelection.go:283] successfully renewed lease kube-system/kube-controller-manager I0709 15:17:51.046632 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-node-lease, name 192.168.0.4, uid a6c1c902-8d7f-442e-89d2-407f1677247e, event type update I0709 15:17:51.734474 
3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-scheduler, uid d1e00c1e-7803-4c0f-ab8a-b3eeb0644879, event type update I0709 15:17:51.742571 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-scheduler, uid 9aed1771-031a-4fce-826a-11d98ee81740, event type update I0709 15:17:51.949675 3183 reflector.go:268] k8s.io/client-go/informers/factory.go:135: forcing resync E0709 15:17:51.960736 3183 horizontal.go:214] failed to query scale subresource for Deployment/default/zx-hpa: deployments/scale.apps "zx-hpa" not found I0709 15:17:51.961135 3183 event.go:281] Event(v1.ObjectReference{Kind:"HorizontalPodAutoscaler", Namespace:"default", Name:"nginx-hpa-zx-1", UID:"d49c5146-c5ef-4ac8-8039-c9b15f094360", APIVersion:"autoscaling/v2beta2", ResourceVersion:"4763928", FieldPath:""}): type: 'Warning' reason: 'FailedGetScale' deployments/scale.apps "zx-hpa" not found I0709 15:17:51.965206 3183 graph_builder.go:543] GraphBuilder process object: events.k8s.io/v1beta1/Event, namespace default, name nginx-hpa-zx-1.16900e31aab074d5, uid 3c9d8d3b-d63f-463c-8f8f-b8d2ba3f4fb3, event type add I0709 15:17:52.215733 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-controller-manager, uid 5d530096-9b10-45bb-a11e-43f1f8733fa5, event type update I0709 15:17:52.224070 3183 leaderelection.go:283] successfully renewed lease kube-system/kube-controller-manager I0709 15:17:52.224234 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-controller-manager, uid 036d9292-1152-4f8c-8a85-0879c5424cfb, event type update I0709 15:17:52.461003 3183 pv_controller_base.go:514] resyncing PV controller I0709 15:17:53.755870 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-scheduler, uid d1e00c1e-7803-4c0f-ab8a-b3eeb0644879, event type update I0709 15:17:53.766095 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-scheduler, uid 9aed1771-031a-4fce-826a-11d98ee81740, event type update I0709 15:17:53.886970 3183 discovery.go:214] Invalidating discovery information I0709 15:17:54.236384 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-controller-manager, uid 5d530096-9b10-45bb-a11e-43f1f8733fa5, event type update I0709 15:17:54.244313 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-controller-manager, uid 036d9292-1152-4f8c-8a85-0879c5424cfb, event type update I0709 15:17:54.244924 3183 leaderelection.go:283] successfully renewed lease kube-system/kube-controller-manager I0709 15:17:55.778133 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-scheduler, uid d1e00c1e-7803-4c0f-ab8a-b3eeb0644879, event type update I0709 15:17:55.785242 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-scheduler, uid 9aed1771-031a-4fce-826a-11d98ee81740, event type update I0709 15:17:56.264037 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-controller-manager, uid 5d530096-9b10-45bb-a11e-43f1f8733fa5, event type update I0709 15:17:56.271400 3183 graph_builder.go:543] GraphBuilder process object: 
coordination.k8s.io/v1/Lease, namespace kube-system, name kube-controller-manager, uid 036d9292-1152-4f8c-8a85-0879c5424cfb, event type update I0709 15:17:56.271774 3183 leaderelection.go:283] successfully renewed lease kube-system/kube-controller-manager I0709 15:17:57.011460 3183 cronjob_controller.go:129] Found 3 jobs I0709 15:17:57.011484 3183 cronjob_controller.go:135] Found 1 groups I0709 15:17:57.018598 3183 cronjob_controller.go:278] No unmet start times for default/hello I0709 15:17:57.436623 3183 gc_controller.go:163] GC'ing orphaned I0709 15:17:57.436642 3183 gc_controller.go:226] GC'ing unscheduled pods which are terminating. I0709 15:17:57.799012 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-scheduler, uid d1e00c1e-7803-4c0f-ab8a-b3eeb0644879, event type update I0709 15:17:57.807268 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-scheduler, uid 9aed1771-031a-4fce-826a-11d98ee81740, event type update I0709 15:17:58.282260 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-controller-manager, uid 5d530096-9b10-45bb-a11e-43f1f8733fa5, event type update I0709 15:17:58.288233 3183 leaderelection.go:283] successfully renewed lease kube-system/kube-controller-manager I0709 15:17:58.288746 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-controller-manager, uid 036d9292-1152-4f8c-8a85-0879c5424cfb, event type update I0709 15:17:59.286621 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-node-lease, name 192.168.0.5, uid 71ce7519-2999-4dbf-9118-227e5cb6d9ef, event type update I0709 15:17:59.819587 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-scheduler, uid d1e00c1e-7803-4c0f-ab8a-b3eeb0644879, event type update I0709 15:17:59.827855 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-scheduler, uid 9aed1771-031a-4fce-826a-11d98ee81740, event type update I0709 15:18:00.301289 3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-controller-manager, uid 5d530096-9b10-45bb-a11e-43f1f8733fa5, event type update I0709 15:18:00.310096 3183 leaderelection.go:283] successfully renewed lease kube-system/kube-controller-manager I0709 15:18:00.310445 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-controller-manager, uid 036d9292-1152-4f8c-8a85-0879c5424cfb, event type update I0709 15:18:01.054003 3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-node-lease, name 192.168.0.4, uid a6c1c902-8d7f-442e-89d2-407f1677247e, event type update ^Z ```
#### 5.3 增大apiserver的日志等级,查看apiserver的处理 至少开到5 ``` I0709 16:43:48.411395 28901 handler.go:143] kube-apiserver: PUT "/apis/apps/v1/namespaces/default/deployments/zx-hpa/status" satisfied by gorestful with webservice /apis/apps/v1 I0709 16:43:48.413431 28901 httplog.go:90] GET /apis/apps/v1/namespaces/default/deployments/zx-hpa: (2.677854ms) 200 [kube-controller-manager/v1.17.4 (linux/amd64) kubernetes/8d8aa39/generic-garbage-collector 192.168.0.4:48978] I0709 16:43:48.414076 28901 handler.go:153] kube-aggregator: GET "/apis/apps/v1/namespaces/default/deployments/zx-hpa" satisfied by nonGoRestful I0709 16:43:48.414089 28901 pathrecorder.go:247] kube-aggregator: "/apis/apps/v1/namespaces/default/deployments/zx-hpa" satisfied by prefix /apis/apps/v1/ I0709 16:43:48.414119 28901 handler.go:143] kube-apiserver: GET "/apis/apps/v1/namespaces/default/deployments/zx-hpa" satisfied by gorestful with webservice /apis/apps/v1 I0709 16:43:48.418663 28901 httplog.go:90] PUT /apis/apps/v1/namespaces/default/deployments/zx-hpa/status: (7.370204ms) 200 [kube-controller-manager/v1.17.4 (linux/amd64) kubernetes/8d8aa39/deployment-controller 192.168.0.4:49000] I0709 16:43:48.420303 28901 httplog.go:90] GET /apis/apps/v1/namespaces/default/deployments/zx-hpa: (6.309997ms) 200 [kube-controller-manager/v1.17.4 (linux/amd64) kubernetes/8d8aa39/generic-garbage-collector 192.168.0.4:48978] I0709 16:43:48.420817 28901 handler.go:153] kube-aggregator: PATCH "/apis/apps/v1/namespaces/default/deployments/zx-hpa" satisfied by nonGoRestful I0709 16:43:48.420828 28901 pathrecorder.go:247] kube-aggregator: "/apis/apps/v1/namespaces/default/deployments/zx-hpa" satisfied by prefix /apis/apps/v1/ I0709 16:43:48.420855 28901 handler.go:143] kube-apiserver: PATCH "/apis/apps/v1/namespaces/default/deployments/zx-hpa" satisfied by gorestful with webservice /apis/apps/v1 I0709 16:43:48.425221 28901 store.go:428] going to delete zx-hpa from registry, triggered by update ``` ================================================ FILE: k8s/kcm/4-hpa-自定义metric server.md ================================================ Table of Contents ================= * [1. custom-metrics-apiserver简介](#1-custom-metrics-apiserver简介) * [2. 定制自己的metric server](#2-定制自己的metric-server) * [2.1 代码部署和编译](#21-代码部署和编译) * [2.2 创建 Sv and APIService](#22-创建-sv-and-apiservice) * [2.3 system:anonymous授权](#23-systemanonymous授权) * [3. 创建hpa验证是否成功](#3-创建hpa验证是否成功) * [4. 追踪整个过程](#4-追踪整个过程) * [5. 总结](#5-总结) **本章重点:** 如何基于 custom-metrics-apiserver 项目,打造自己的 metric server ### 1. 
custom-metrics-apiserver简介

项目地址: https://github.com/kubernetes-sigs/custom-metrics-apiserver/tree/master

**自定义metric server,具体来说需要做以下几件事情:**

(1)实现 custom-metrics-apiserver 的三个接口,如下:

```
type CustomMetricsProvider interface {
    // 定义metric,例如 pod_cpu_used_1m
    ListAllMetrics() []CustomMetricInfo

    // 如何根据 metric 的信息,得到具体的值
    GetMetricByName(name types.NamespacedName, info CustomMetricInfo) (*custom_metrics.MetricValue, error)

    // 如何根据 metric selector 的信息,得到具体的值
    GetMetricBySelector(namespace string, selector labels.Selector, info CustomMetricInfo) (*custom_metrics.MetricValueList, error)
}
```

GetMetricBySelector、GetMetricByName 在 reststorage.go 中被使用。

https://github.com/kubernetes-sigs/custom-metrics-apiserver/blob/master/pkg/registry/custom_metrics/reststorage.go

restful接口在 installer.go 中被定义。

https://github.com/kubernetes-sigs/custom-metrics-apiserver/blob/master/pkg/apiserver/installer/installer.go

**总的来说,可以认为:**

(1)基于 custom-metrics-apiserver 这个项目,你只要实现上述三个接口就行。其他的事情这个包在你 new provider 的时候都自动实现了。

(2)ListAllMetrics 注册了所有的 metric,让 apiserver 知道有哪些自定义 metric。

(3)GetMetricByName、GetMetricBySelector 都是返回具体的 metric 数据。

(4)一般 apiserver 都是调用 GetMetricBySelector,因为 hpa 的对象基本都是 deploy,GetMetricBySelector 会循环调用 GetMetricByName 取得 deploy 下所有 pod 的 metric 信息。
### 2. 定制自己的metric server #### 2.1 代码部署和编译 这里我做了如下的修改。对于metric server而言,无论访问什么metric,都返回10。 ``` func (p *monitorProvider) GetMetricByName( name types.NamespacedName, info provider.CustomMetricInfo, metricSelector labels.Selector, ) (*custom_metrics.MetricValue, error) { ref, err := helpers.ReferenceFor(p.mapper, name, info) if err != nil { return nil, err } return &custom_metrics.MetricValue{ DescribedObject: ref, // MetricName: info.Metric, Metric: custom_metrics.MetricIdentifier{ Name: info.Metric, }, Timestamp: metav1.Time{time.Unix(int64(10), 0)}, Value: *resource.NewMilliQuantity(int64(10*1000.0), resource.DecimalSI), }, nil } ``` 更详细的可以参考我的github项目。
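上面提到 GetMetricBySelector 会循环调用 GetMetricByName。下面给出一个示意性的实现,帮助理解这个调用关系(非项目源码,仅为草图:假设 monitorProvider 里有一个 client-go clientset 字段 p.client,这是本文示例自行引入的假设;省略 import,所需包为 context、metav1、types、labels、provider、custom_metrics;不同版本 client-go 的 List 是否需要 context 略有差异):

```go
// 示意:按 label selector 找到所有 pod,再逐个取 metric
func (p *monitorProvider) GetMetricBySelector(
	namespace string,
	selector labels.Selector,
	info provider.CustomMetricInfo,
	metricSelector labels.Selector,
) (*custom_metrics.MetricValueList, error) {
	// 1. 先用 label selector 列出该 namespace 下匹配的 pod(hpa 传进来的就是 deploy 的 selector)
	pods, err := p.client.CoreV1().Pods(namespace).List(context.TODO(),
		metav1.ListOptions{LabelSelector: selector.String()})
	if err != nil {
		return nil, err
	}

	// 2. 对每个 pod 调用 GetMetricByName,把结果拼成 MetricValueList 返回
	res := custom_metrics.MetricValueList{}
	for _, pod := range pods.Items {
		value, err := p.GetMetricByName(
			types.NamespacedName{Namespace: namespace, Name: pod.Name},
			info, metricSelector)
		if err != nil {
			return nil, err
		}
		res.Items = append(res.Items, *value)
	}
	return &res, nil
}
```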
编译生成自己的镜像:zoux/hpa:v1。然后生成一下的deployment。 ``` apiVersion: apps/v1 kind: Deployment metadata: labels: app: kube-hpa name: kube-hpa namespace: kube-system spec: replicas: 1 selector: matchLabels: app: kube-hpa template: metadata: labels: app: kube-hpa name: kube-hpa spec: hostNetwork: true containers: - name: kube-hpa image: zoux/hpa:v1 imagePullPolicy: IfNotPresent command: - /metric-server args: - --master-url=XXX - --kube-config=/pkc/config - --tls-private-key-file=/pkc/server-key.pem - --secure-port=9997 - --v=10 ports: - containerPort: 9997 resources: limits: cpu: 2 memory: 2048Mi requests: cpu: 0.5 memory: 500Mi volumeMounts: - name: pkc mountPath: /pkc readOnly: true volumes: - name: pkc hostPath: path: /opt/kubernetes/ssl ```
验证部署成功 ``` root@k8s-master:~/testyaml/hpa# kubectl get pod -n kube-system -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES kube-hpa-84c884f994-gd5fl 1/1 Running 0 3d13h 192.168.0.5 192.168.0.5 ``` #### 2.2 创建 Sv and APIService 上面虽然部署成功了,但是apiserver还是访问不到。 ``` k8s-master:~/testyaml/hpa# kubectl get --raw "/apis/custom.metrics.k8s.io/v1 Error from server (NotFound): the server could not find the requested resource ``` 原因在于,apiserver不知道如何找到kube-hpa-84c884f994-gd5fl这个pod进行访问。所以需要创建下面的svc和apiserver。 ``` root@k8s-master:~/testyaml/hpa# cat tls.yaml apiVersion: v1 kind: Service metadata: name: kube-hpa namespace: kube-system spec: clusterIP: None ports: - name: https-hpa-dont-edit-it port: 9997 targetPort: 9997 selector: app: kube-hpa --- apiVersion: apiregistration.k8s.io/v1beta1 kind: APIService metadata: name: v1beta1.custom.metrics.k8s.io spec: service: name: kube-hpa namespace: kube-system port: 9997 group: custom.metrics.k8s.io version: v1beta1 insecureSkipTLSVerify: true groupPriorityMinimum: 100 versionPriority: 100 ``` **创建完成,验证是否成功:** ``` root@k8s-master:~/testyaml/hpa# kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" {"kind":"APIResourceList","apiVersion":"v1","groupVersion":"custom.metrics.k8s.io/v1beta1","resources":[{"name":"pods/pod_cpu_used_1m","singularName":"","namespaced":true,"kind":"MetricValueList","verbs":["get"]},{"name":"pods/pod_cpu_used_5m","singularName":"","namespaced":true,"kind":"MetricValueList","verbs":["get"]},{"name":"pods/container_cpu_used_1m","singularName":"","namespaced":true,"kind":"MetricValueList","verbs":...} ``` 如果报错。查看该apiserver哪里报错了 ``` root@k8s-master:~/testyaml/hpa# kubectl get APIService v1beta1.custom.metrics.k8s.io -oyaml apiVersion: apiregistration.k8s.io/v1 kind: APIService metadata: creationTimestamp: "2021-06-13T13:22:01Z" name: v1beta1.custom.metrics.k8s.io resourceVersion: "1590641" selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v1beta1.custom.metrics.k8s.io uid: d488d6a8-7e79-4311-a1e9-0b12e4591375 spec: group: custom.metrics.k8s.io groupPriorityMinimum: 100 insecureSkipTLSVerify: true service: name: kube-hpa namespace: kube-system port: 9997 version: v1beta1 versionPriority: 100 status: conditions: - lastTransitionTime: "2021-06-13T13:42:17Z" message: all checks passed reason: Passed status: "True" type: Available ``` 或者直接curl访问: ``` curl -k https://nodeip:9997/apis/custom.metrics.k8s.io/v1beta1 ```
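除了 kubectl get --raw 和 curl,也可以在代码里通过 client-go 的 RESTClient 访问这个聚合出来的 API,验证 APIService 是否生效。下面是一个示意程序(kubeconfig 路径只是示例;client-go 版本不同,DoRaw 是否需要 context 略有差异):

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// 加载 kubeconfig(路径仅为示例)
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// 等价于 kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"
	// 请求先打到 kube-apiserver,再由 aggregator 按 APIService 转发到自定义 metric server
	raw, err := clientset.Discovery().RESTClient().
		Get().
		AbsPath("/apis/custom.metrics.k8s.io/v1beta1").
		DoRaw(context.TODO())
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}
```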
#### 2.3 system:anonymous授权

如果没有出现类似问题,这一步直接跳过。

有时会出现如下的错误,或者上述的APIService没有运行成功,都是因为system:anonymous权限不够:

```
  annotations:
    autoscaling.alpha.kubernetes.io/conditions: '[{"type":"AbleToScale","status":"True","lastTransitionTime":"2021-06-13T13:33:12Z","reason":"SucceededGetScale","message":"the HPA controller was able to get the target''s current scale"},{"type":"ScalingActive","status":"False","lastTransitionTime":"2021-06-13T13:33:12Z","reason":"FailedGetPodsMetric","message":"the HPA was unable to compute the replica count: unable to get metric pod_cpu_usage_for_limit_1m: unable to fetch metrics from custom metrics API: pods.custom.metrics.k8s.io \"*\" is forbidden: User \"system:anonymous\" cannot get resource \"pods/pod_cpu_usage_for_limit_1m\" in API group \"custom.metrics.k8s.io\" in the namespace \"default\""}]'
    autoscaling.alpha.kubernetes.io/metrics: '[{"type":"Pods","pods":{"metricName":"pod_cpu_usage_for_limit_1m","targetAverageValue":"60"}}]'
    metric-containerName: zx-hpa
  creationTimestamp: "2021-06-13T13:32:56Z"
  name: nginx-hpa-zx-1
  namespace: default
  resourceVersion: "1589301"
  selfLink: /apis/autoscaling/v1/namespaces/default/horizontalpodautoscalers/
```

这时可以直接绑定clusterrole,参考 https://github.com/kubernetes-sigs/metrics-server/issues/81

我这里是直接给了 cluster-admin 权限,实际情况可以按照需求赋权。

```
kubectl create clusterrolebinding anonymous-role-binding --clusterrole=cluster-admin --user=system:anonymous
```
### 3. 创建hpa验证是否成功 可以看出来都是10 ``` root@k8s-master:~/testyaml/hpa# kubectl get hpa NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE nginx-hpa-zx-1 Deployment/zx-hpa 10/60 1 3 3 9m55s root@k8s-master:~/testyaml/hpa# kubectl get hpa NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE nginx-hpa-zx-1 Deployment/zx-hpa 10/60 1 3 3 9m57s ```
### 4. 追踪整个过程 **第一步** Kcm(hpa controller)发送的请求。 ``` I0613 23:12:36.498740 9879 httplog.go:90] GET /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/pod_aa_100m?labelSelector=app%3Dzx-hpa-test: (35.302304ms) 200 [kube-controller-manager/v1.17.4 (linux/amd64) kubernetes/8d8aa39/horizontal-pod-autoscaler 192.168.0.4:42750] ``` 要在url里使用不安全字符,就需要使用转义。 %2A = * %3D = =(等号)
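这一点可以用 Go 标准库 net/url 验证一下(示意代码):

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// pod 名用 "*" 表示所有 pod,放进 url 的 path 段时需要转义成 %2A
	fmt.Println(url.PathEscape("*")) // 输出: %2A

	// labelSelector=app=zx-hpa-test 中的 "=" 放进 query 时转义成 %3D
	fmt.Println(url.QueryEscape("app=zx-hpa-test")) // 输出: app%3Dzx-hpa-test
}
```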
**第二步** apiserver进行了url转换。

kcm访问的是: /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/pod_aa_100m?labelSelector=app%3Dzx-hpa-test

但是由于2.2节创建了 Svc 和 APIService,所以访问这个url会被转换为: https://192.168.0.5:9997/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/pod_aa_100m?labelSelector=app%3Dzx-hpa-test

192.168.0.5是pod kube-hpa-84c884f994-gd5fl 所在的节点ip,也是pod ip(hostNetwork模式)。9997是定义的端口。
**第三步:** 访问 https://192.168.0.5:9997/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/pod_aa_100m?labelSelector=app%3Dzx-hpa-test 直接在master节点上(masterip=192.168.0.4)通过curl模拟 ``` root@k8s-master:~/testyaml/hpa# curl -k https://192.168.0.5:9997/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/pod_aa_100m?labelSelector=app%3Dzx-hpa-test { "kind": "MetricValueList", "apiVersion": "custom.metrics.k8s.io/v1beta1", "metadata": { "selfLink": "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/pod_aa_100m" }, "items": [ { "describedObject": { "kind": "Pod", "namespace": "default", "name": "zx-hpa-7b56cddd95-5j6r4|", "apiVersion": "/v1" }, "metricName": "pod_aa_100m", "timestamp": "1970-01-01T00:00:10Z", "value": "10", "selector": null }, { "describedObject": { "kind": "Pod", "namespace": "default", "name": "zx-hpa-7b56cddd95-lthbz|", "apiVersion": "/v1" }, "metricName": "pod_aa_100m", "timestamp": "1970-01-01T00:00:10Z", "value": "10", "selector": null }, { "describedObject": { "kind": "Pod", "namespace": "default", "name": "zx-hpa-7b56cddd95-n9ft9|", "apiVersion": "/v1" }, "metricName": "pod_aa_100m", "timestamp": "1970-01-01T00:00:10Z", "value": "10", "selector": null } ] } ```
### 5. 总结 (1)如何定制自己的metric-server,包括代码编写和环境搭建 (2)Kubernetes 里的 Custom Metrics 机制,也是借助 Aggregator APIServer 扩展机制来实现的。这里的具体原理是,当你把 Custom Metrics APIServer 启动之后,Kubernetes 里就会出现一个叫作custom.metrics.k8s.io的 API。而当你访问这个 URL 时,Aggregator 就会把你的请求转发给 Custom Metrics APIServer 。 这里一定要注意: kube-apiserver启动参数一定要包含: -enable-swagger-ui=true (3)ListAllMetrics()并没有将metric注册到apiserver。 apiserver并没有对metric进行验证。上文中,我metric server的ListAllMetrics()并没有注册 pod_aa_100m这个metric,但是可以正常使用。 原因:apiserver并没有进行验证,apiserver只进行url转发,如果有返回数据,apiserver就认为这个metric是正确的。所以这一点可以用来自定义metric。 ================================================ FILE: k8s/kcm/4-hpa源码分析.md ================================================ Table of Contents ================= * [1. hpa介绍](#1-hpa介绍) * [1.1 hpa是什么](#11-hpa是什么) * [1.2 hpa如何用起来](#12-hpa如何用起来) * [2. hpa 源码分析](#2-hpa-源码分析) * [2.1 启动参数介绍](#21-启动参数介绍) * [2.2 启动流程](#22-启动流程) * [2.3 核心计算逻辑](#23-核心计算逻辑) * [2.4 计算期望副本数量](#24--计算期望副本数量) * [2.4.1 GetRawMetric-具体的metric值](#241-getrawmetric-具体的metric值) * [2.4.2 calcPlainMetricReplicas-计算期望副本值](#242-calcplainmetricreplicas-计算期望副本值) * [3. 举例说明计算过程](#3-举例说明计算过程) * [3.1 hpa扩容计算逻辑](#31-hpa扩容计算逻辑) * [3.2 场景1](#32-场景1) * [3.3 场景2](#33-场景2) * [4. 总结](#4-总结) **本章重点:** 从源码角度分析hpa的计算逻辑 ### 1. hpa介绍 #### 1.1 hpa是什么 hpa指的是 Pod 水平自动扩缩,全名是Horizontal Pod Autoscaler简称HPA。它可以基于 CPU 利用率或其他指标自动扩缩 ReplicationController、Deployment 和 ReplicaSet 中的 Pod 数量。 **用处:** 用户可以通过设置hpa,实现deploy pod数量的自动扩缩容。比如流量大的时候,pod数量多一些。流量小的时候,Pod数量降下来,避免资源浪费。 ![image-20210521145344510](../images/hap-1.png)
#### 1.2 hpa如何用起来 (1)需要一个deploy/svc等,可以参考社区 (2)需要对应的hpa 举例: (1) 创建1个deploy。这里只有1个副本 ``` apiVersion: apps/v1 kind: Deployment metadata: labels: app: zx-hpa-test name: zx-hpa spec: strategy: type: RollingUpdate rollingUpdate: maxSurge: 1 replicas: 2 selector: matchLabels: app: zx-hpa-test template: metadata: labels: app: zx-hpa-test name: zx-hpa-test spec: terminationGracePeriodSeconds: 5 containers: - name: busybox image: busybox:latest imagePullPolicy: IfNotPresent command: - sleep - "3600" ``` (2)创建对应的hpa。 ``` apiVersion: autoscaling/v2beta1 kind: HorizontalPodAutoscaler metadata: name: nginx-hpa-zx-1 annotations: metric-containerName: zx-hpa spec: scaleTargetRef: apiVersion: apps/v1 // 这里必须指定需要监控那个对象 kind: Deployment name: zx-hpa minReplicas: 1 // deploy最小的Pod数量 maxReplicas: 3 // deploy最大的Pod数量 metrics: - type: Pods pods: metricName: pod_cpu_1m targetAverageValue: 60 ``` hpa是从同命名空间下,找对应的deploy。所以yaml中指定deploy的时候不要指定namespaces。这也就要求,hpa 和deploy必须在同一命名空间。
这里我使用的是 pod_cpu_1m 这个指标,这是一个自定义指标。创建好之后,观察hpa,当deploy的cpu利用率变化时,deploy的副本数会随之改变。接下来就从源码角度分析这个过程。
### 2. hpa 源码分析 #### 2.1 启动参数介绍 hpa controller随controller manager的初始化而启动,hpa controller将以下flag添加到controller manager的flag中,通过controller manager的CLI端暴露给用户: ``` // AddFlags adds flags related to HPAController for controller manager to the specified FlagSet. func (o *HPAControllerOptions) AddFlags(fs *pflag.FlagSet) { if o == nil { return } fs.DurationVar(&o.HorizontalPodAutoscalerSyncPeriod.Duration, "horizontal-pod-autoscaler-sync-period", o.HorizontalPodAutoscalerSyncPeriod.Duration, "The period for syncing the number of pods in horizontal pod autoscaler.") fs.DurationVar(&o.HorizontalPodAutoscalerUpscaleForbiddenWindow.Duration, "horizontal-pod-autoscaler-upscale-delay", o.HorizontalPodAutoscalerUpscaleForbiddenWindow.Duration, "The period since last upscale, before another upscale can be performed in horizontal pod autoscaler.") fs.MarkDeprecated("horizontal-pod-autoscaler-upscale-delay", "This flag is currently no-op and will be deleted.") fs.DurationVar(&o.HorizontalPodAutoscalerDownscaleStabilizationWindow.Duration, "horizontal-pod-autoscaler-downscale-stabilization", o.HorizontalPodAutoscalerDownscaleStabilizationWindow.Duration, "The period for which autoscaler will look backwards and not scale down below any recommendation it made during that period.") fs.DurationVar(&o.HorizontalPodAutoscalerDownscaleForbiddenWindow.Duration, "horizontal-pod-autoscaler-downscale-delay", o.HorizontalPodAutoscalerDownscaleForbiddenWindow.Duration, "The period since last downscale, before another downscale can be performed in horizontal pod autoscaler.") fs.MarkDeprecated("horizontal-pod-autoscaler-downscale-delay", "This flag is currently no-op and will be deleted.") fs.Float64Var(&o.HorizontalPodAutoscalerTolerance, "horizontal-pod-autoscaler-tolerance", o.HorizontalPodAutoscalerTolerance, "The minimum change (from 1.0) in the desired-to-actual metrics ratio for the horizontal pod autoscaler to consider scaling.") fs.BoolVar(&o.HorizontalPodAutoscalerUseRESTClients, "horizontal-pod-autoscaler-use-rest-clients", o.HorizontalPodAutoscalerUseRESTClients, "If set to true, causes the horizontal pod autoscaler controller to use REST clients through the kube-aggregator, instead of using the legacy metrics client through the API server proxy. 
This is required for custom metrics support in the horizontal pod autoscaler.") fs.DurationVar(&o.HorizontalPodAutoscalerCPUInitializationPeriod.Duration, "horizontal-pod-autoscaler-cpu-initialization-period", o.HorizontalPodAutoscalerCPUInitializationPeriod.Duration, "The period after pod start when CPU samples might be skipped.") fs.MarkDeprecated("horizontal-pod-autoscaler-use-rest-clients", "Heapster is no longer supported as a source for Horizontal Pod Autoscaler metrics.") fs.DurationVar(&o.HorizontalPodAutoscalerInitialReadinessDelay.Duration, "horizontal-pod-autoscaler-initial-readiness-delay", o.HorizontalPodAutoscalerInitialReadinessDelay.Duration, "The period after pod start during which readiness changes will be treated as initial readiness.") } ``` | 参数 | 默认 | 说明 | | :-------------------------------------------------- | :--- | :----------------------------------------------------------- | | horizontal-pod-autoscaler-sync-period | 15s | controller同步HPA信息的同步周期 | | horizontal-pod-autoscaler-downscale-stabilization | 5m | 缩容稳定窗口,缩容间隔时间(v1.12支持) | | horizontal-pod-autoscaler-tolerance | 0.1 | 最小缩放容忍度:计算出的期望值和实际值的比率<最小容忍比率,则不进行扩缩容 | | horizontal-pod-autoscaler-cpu-initialization-period | 5m | pod刚启动时,一定时间内的CPU使用率数据不参与计算。 | | horizontal-pod-autoscaler-initial-readiness-delay | 30s | 扩容等待pod ready的时间(无法得知pod何时就绪) | kcm中需要设置这个,才能启动自定义的rest-clients。 --horizontal-pod-autoscaler-use-rest-clients=true
#### 2.2 启动流程

**代码流程:** startHPAController -> startHPAControllerWithMetricsClient -> Run -> worker -> processNextWorkItem -> reconcileKey -> reconcileAutoscaler

```
func (a *HorizontalController) reconcileKey(key string) (deleted bool, err error) {
	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return true, err
	}

	hpa, err := a.hpaLister.HorizontalPodAutoscalers(namespace).Get(name)
	if errors.IsNotFound(err) {
		klog.Infof("Horizontal Pod Autoscaler %s has been deleted in %s", name, namespace)
		delete(a.recommendations, key)
		return true, nil
	}

	return false, a.reconcileAutoscaler(hpa, key)
}
```
#### 2.3 核心计算逻辑 **metric的定义类型分为3种,resource、pods和external,这里只分析pods类型的metric。** reconcileAutoscaler函数就是hpa的核心函数。该函数主要逻辑如下: * 1.做一些类型转换,用于接下来的Hpa计算 * 2.计算hpa 的期望副本数量。 * 3.根据计算的结果判断是否需要改变副本数,需要改变的话,调用接口修改,然后做错误处理。 ```go func (a *HorizontalController) reconcileAutoscaler(hpav1Shared *autoscalingv1.HorizontalPodAutoscaler, key string) error { // 1. 调用client向apiserver发送请求,scale是返回的hpa实体,然后做各种数据类型转换,然后通过一个client向apiserver获取scale,以及当然还有一些backup、把错误写入hpa event的操作 。。。。代码省略 // 2. 判断是否需要计算副本数,如果需要,就调用computeReplicasForMetrics函数计算当前hpa的副本数。 desiredReplicas := int32(0) rescaleReason := "" var minReplicas int32 if hpa.Spec.MinReplicas != nil { minReplicas = *hpa.Spec.MinReplicas } else { // Default value minReplicas = 1 } rescale := true if scale.Spec.Replicas == 0 && minReplicas != 0 { // Autoscaling is disabled for this resource desiredReplicas = 0 rescale = false setCondition(hpa, autoscalingv2.ScalingActive, v1.ConditionFalse, "ScalingDisabled", "scaling is disabled since the replica count of the target is zero") } else if currentReplicas > hpa.Spec.MaxReplicas { rescaleReason = "Current number of replicas above Spec.MaxReplicas" desiredReplicas = hpa.Spec.MaxReplicas } else if currentReplicas < minReplicas { rescaleReason = "Current number of replicas below Spec.MinReplicas" desiredReplicas = minReplicas } else { var metricTimestamp time.Time metricDesiredReplicas, metricName, metricStatuses, metricTimestamp, err = a.computeReplicasForMetrics(hpa, scale, hpa.Spec.Metrics) if err != nil { a.setCurrentReplicasInStatus(hpa, currentReplicas) if err := a.updateStatusIfNeeded(hpaStatusOriginal, hpa); err != nil { utilruntime.HandleError(err) } a.eventRecorder.Event(hpa, v1.EventTypeWarning, "FailedComputeMetricsReplicas", err.Error()) return fmt.Errorf("failed to compute desired number of replicas based on listed metrics for %s: %v", reference, err) } klog.V(4).Infof("proposing %v desired replicas (based on %s from %s) for %s", metricDesiredReplicas, metricName, metricTimestamp, reference) rescaleMetric := "" if metricDesiredReplicas > desiredReplicas { desiredReplicas = metricDesiredReplicas rescaleMetric = metricName } if desiredReplicas > currentReplicas { rescaleReason = fmt.Sprintf("%s above target", rescaleMetric) } if desiredReplicas < currentReplicas { rescaleReason = "All metrics below target" } desiredReplicas = a.normalizeDesiredReplicas(hpa, key, currentReplicas, desiredReplicas, minReplicas) rescale = desiredReplicas != currentReplicas } // 3.进行扩缩容,并进行错误处理。 if rescale { scale.Spec.Replicas = desiredReplicas _, err = a.scaleNamespacer.Scales(hpa.Namespace).Update(targetGR, scale) if err != nil { a.eventRecorder.Eventf(hpa, v1.EventTypeWarning, "FailedRescale", "New size: %d; reason: %s; error: %v", desiredReplicas, rescaleReason, err.Error()) setCondition(hpa, autoscalingv2.AbleToScale, v1.ConditionFalse, "FailedUpdateScale", "the HPA controller was unable to update the target scale: %v", err) a.setCurrentReplicasInStatus(hpa, currentReplicas) if err := a.updateStatusIfNeeded(hpaStatusOriginal, hpa); err != nil { utilruntime.HandleError(err) } return fmt.Errorf("failed to rescale %s: %v", reference, err) } setCondition(hpa, autoscalingv2.AbleToScale, v1.ConditionTrue, "SucceededRescale", "the HPA controller was able to update the target scale to %d", desiredReplicas) a.eventRecorder.Eventf(hpa, v1.EventTypeNormal, "SuccessfulRescale", "New size: %d; reason: %s", desiredReplicas, rescaleReason) klog.Infof("Successful rescale of %s, old size: %d, new size: %d, reason: %s", hpa.Name, 
currentReplicas, desiredReplicas, rescaleReason) } else { klog.V(4).Infof("decided not to scale %s to %v (last scale time was %s)", reference, desiredReplicas, hpa.Status.LastScaleTime) desiredReplicas = currentReplicas } a.setStatus(hpa, currentReplicas, desiredReplicas, metricStatuses, rescale) return a.updateStatusIfNeeded(hpaStatusOriginal, hpa) } ```
**这里主要关心第二个步骤:hpa如何计算期望副本数量** #### 2.4 计算期望副本数量 概念: 最小值:minReplicas。 这个是用户在hpa里面的yaml设置的。这个是可选的,如果不设置,默认是1。 最大值:MaxReplicas。 这个是用户在hpa里面的yaml设置的。这个必填的,如果不设置,会报错, 如下。 当前值:currentReplicas。这个是hpa获得的当前deploy的副本数量。 期望值:desiredReplicas。 这个是hpa希望deploy的副本数量。 ``` error: error validating "nginx-deployment-hpa-test.yaml": error validating data: ValidationError(HorizontalPodAutoscaler.spec): missing required field "maxReplicas" in io.k8s.api.autoscaling.v2beta1.HorizontalPodAutoscalerSpec; if you choose to ignore these errors, turn validation off with --validate=false ``` 计算逻辑分为两部分,第一种情况是不需要算,就可以直接得出期望值。 第二种情况需要调用函数计算。 **情况1:不需要计算** (1)当前值等于0。 期望值=0. 不扩容, (2)当前值 > 最大值。 没必要计算期望值。 期望值=最大值,需要扩缩容。 (3)当前值 < 最小值。 没必要计算期望值。 期望值=最小值,需要扩缩容。
**情况2:** 最小值 <= 当前值 <= 最大值。 需要调用函数计算 期望值。 这里的调用链为 computeReplicasForMetrics -> computeReplicasForMetric -> GetMetricReplicas 这里computeReplicasForMetrics有一个需要注意的点就是。这里可以处理了多个metric的情况。例如:这里一个hpa有多个指标。 ```text - type: Resource resource: name: cpu # Utilization类型的目标值,Resource类型的指标只支持Utilization和AverageValue类型的目标值 target: type: Utilization averageUtilization: 50 # Pods类型的指标 - type: Pods pods: metric: name: packets-per-second # AverageValue类型的目标值,Pods指标类型下只支持AverageValue类型的目标值 target: type: AverageValue averageValue: 1k ``` 这里hpa的逻辑是,谁最大取谁。例如, 通过cpu.Utilization hpa算出来应该需要 4个pod。 但是packets-per-second算出来需要5个。这个时候就已5个为准。见下面代码: ``` // computeReplicasForMetrics computes the desired number of replicas for the metric specifications listed in the HPA, // returning the maximum of the computed replica counts, a description of the associated metric, and the statuses of // all metrics computed. func (a *HorizontalController) computeReplicasForMetrics(hpa *autoscalingv2.HorizontalPodAutoscaler, scale *autoscalingv1.Scale, metricSpecs []autoscalingv2.MetricSpec) (replicas int32, metric string, statuses []autoscalingv2.MetricStatus, timestamp time.Time, err error) { for i, metricSpec := range metricSpecs { replicaCountProposal, metricNameProposal, timestampProposal, condition, err := a.computeReplicasForMetric(hpa, metricSpec, specReplicas, statusReplicas, selector, &statuses[i]) if err != nil { if invalidMetricsCount <= 0 { invalidMetricCondition = condition invalidMetricError = err } invalidMetricsCount++ } if err == nil && (replicas == 0 || replicaCountProposal > replicas) { timestamp = timestampProposal replicas = replicaCountProposal metric = metricNameProposal } } // If all metrics are invalid return error and set condition on hpa based on first invalid metric. if invalidMetricsCount >= len(metricSpecs) { setCondition(hpa, invalidMetricCondition.Type, invalidMetricCondition.Status, invalidMetricCondition.Reason, invalidMetricCondition.Message) return 0, "", statuses, time.Time{}, fmt.Errorf("invalid metrics (%v invalid out of %v), first error is: %v", invalidMetricsCount, len(metricSpecs), invalidMetricError) } setCondition(hpa, autoscalingv2.ScalingActive, v1.ConditionTrue, "ValidMetricFound", "the HPA was able to successfully calculate a replica count from %s", metric) return replicas, metric, statuses, timestamp, nil } ```
针对具体某个metric指标。计算分为俩步: (1)GetRawMetric函数: 得到 具体的metric值 (2)calcPlainMetricReplicas :计算期望副本值 这里需要注意一点就是targetUtilization进行了数据转换。乘以了10^3。 ``` // GetMetricReplicas calculates the desired replica count based on a target metric utilization // (as a milli-value) for pods matching the given selector in the given namespace, and the // current replica count func (c *ReplicaCalculator) GetMetricReplicas(currentReplicas int32, targetUtilization int64, metricName string, namespace string, selector labels.Selector, metricSelector labels.Selector) (replicaCount int32, utilization int64, timestamp time.Time, err error) { metrics, timestamp, err := c.metricsClient.GetRawMetric(metricName, namespace, selector, metricSelector) if err != nil { return 0, 0, time.Time{}, fmt.Errorf("unable to get metric %s: %v", metricName, err) } replicaCount, utilization, err = c.calcPlainMetricReplicas(metrics, currentReplicas, targetUtilization, namespace, selector, v1.ResourceName("")) return replicaCount, utilization, timestamp, err } ```
##### 2.4.1 GetRawMetric-具体的metric值 ``` // GetRawMetric gets the given metric (and an associated oldest timestamp) // for all pods matching the specified selector in the given namespace func (c *customMetricsClient) GetRawMetric(metricName string, namespace string, selector labels.Selector, metricSelector labels.Selector) (PodMetricsInfo, time.Time, error) { // 1.这里直接调用 GetForObjects,发送restful请求获取数据 metrics, err := c.client.NamespacedMetrics(namespace).GetForObjects(schema.GroupKind{Kind: "Pod"}, selector, metricName, metricSelector) if err != nil { return nil, time.Time{}, fmt.Errorf("unable to fetch metrics from custom metrics API: %v", err) } if len(metrics.Items) == 0 { return nil, time.Time{}, fmt.Errorf("no metrics returned from custom metrics API") } // 2. 对获取的数据进行处理。这里看起来是乘以了 10^3 res := make(PodMetricsInfo, len(metrics.Items)) for _, m := range metrics.Items { window := metricServerDefaultMetricWindow if m.WindowSeconds != nil { window = time.Duration(*m.WindowSeconds) * time.Second } res[m.DescribedObject.Name] = PodMetric{ Timestamp: m.Timestamp.Time, Window: window, Value: int64(m.Value.MilliValue()), } m.Value.MilliValue() } timestamp := metrics.Items[0].Timestamp.Time return res, timestamp, nil } ```
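这里的 MilliValue() 就是把 Quantity 转成以 milli(千分之一)为单位的整数,相当于乘以1000,和上一篇 metric server 里用 NewMilliQuantity 构造返回值正好对应。可以用下面的小例子验证(示意代码,用到 k8s.io/apimachinery 的 resource 包):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// metric server 返回 10 时,实际构造的是 10*1000 的 milli 值
	q := resource.NewMilliQuantity(10*1000, resource.DecimalSI)
	fmt.Println(q.String())     // "10"
	fmt.Println(q.MilliValue()) // 10000,hpa 端 GetRawMetric 拿到的就是这个数

	// 阈值 targetAverageValue: 60 同样会被转成 60000 参与计算,因此两边相互抵消
	target := resource.MustParse("60")
	fmt.Println(target.MilliValue()) // 60000
}
```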
##### 2.4.2 calcPlainMetricReplicas-计算期望副本值 这里代码省略,直接贴逻辑。 3.1 先从apiserver端拿到所有相关的pod,将这些pod分为三类: ``` a.missingPods用于记录处于running状态,但不提供该metric的pod b.ignoredPods 用于处理resource类型cpu相关metric的延迟(就是pod未就绪),这里不深入讨论 c.readyPodCount记录状态为running,且能提供该metric的pod ``` 3.2 调用GetMetricUtilizationRatio计算实际值与期望值的对比情况。计算时,对于所有可获取到metric的pod,取它们metric value的平均值得到:usageRatio=实际值/期望值;utilization=实际值(平均) 3.3 计算期望pod数量DesiredReplicas。对于missingPods为0,即所有target pod都处于running可获取metric value的情况: a.如果实际值与期望值的对比usageRatio处于可容忍范围内,不执行scale操作。默认情况下c.tolerance=0.1,即usageRatio处于 [0.9,1.1]时pod数量不变化 ``` if math.Abs(1.0-usageRatio) <= c.tolerance { // return the current replicas if the change would be too small return currentReplicas, utilization, nil } ``` b.实际值与期望值的对比usageRatio不在可容忍范围内,向上取整得到desiredReplicas `return int32(math.Ceil(usageRatio * float64(readyPodCount))), utilization, nil` 对于missingPods>0,即有target pod的metric value没有获取到的情况。 缩容时,对于找不到metric的pod,`视为`正好用了desired value ``` if usageRatio < 1.0 { // on a scale-down, treat missing pods as using 100% of the resource request for podName := range missingPods { metrics[podName] = metricsclient.PodMetric{Value: targetUtilization} } } ``` 扩容时,对于找不到metric的pod,`视为`该pod对指定metric的使用量为0 ``` for podName := range missingPods { metrics[podName] = metricsclient.PodMetric{Value: 0} } ``` 经过上面的处理后,重新计算实际值与期望值的对比newUsageRatio。 在下面两种情况下,不执行scale操作:新的实际值与期望值的对比newUsageRatio在容忍范围内; 赋值处理前后,一个需要scale up,另一个需要scale down。 其它情况下,同样地执行向上取整操作 ``` if math.Abs(1.0-newUsageRatio) <= c.tolerance || (usageRatio < 1.0 && newUsageRatio > 1.0) || (usageRatio > 1.0 && newUsageRatio < 1.0) { // return the current replicas if the change would be too small, // or if the new usage ratio would cause a change in scale direction return currentReplicas, utilization, nil } return int32(math.Ceil(newUsageRatio * float64(len(metrics)))), utilization, nil ```
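把上面的逻辑串起来,可以写成下面这个简化版的计算函数(示意代码,不是 k8s 源码,省略了 missingPods/ignoredPods 的分类细节,只体现 usageRatio、容忍度和向上取整这几步):

```go
package main

import (
	"fmt"
	"math"
)

// 简化版:metrics 是每个 ready pod 的 metric 值,target 是目标值,tolerance 是容忍度
func calcReplicas(metrics map[string]int64, currentReplicas int32, target int64, tolerance float64) int32 {
	readyPodCount := len(metrics)
	if readyPodCount == 0 {
		return currentReplicas
	}
	var sum int64
	for _, v := range metrics {
		sum += v
	}

	// usageRatio = 平均值 / 目标值
	usageRatio := float64(sum) / (float64(target) * float64(readyPodCount))

	// 在容忍度范围内([1-tolerance, 1+tolerance])不扩缩容
	if math.Abs(1.0-usageRatio) <= tolerance {
		return currentReplicas
	}

	// 否则向上取整:ceil(usageRatio * readyPodCount),等价于 ceil(sum/target)
	return int32(math.Ceil(usageRatio * float64(readyPodCount)))
}

func main() {
	// 例:两个 pod 的指标都是 90,target=60 -> ratio=1.5 -> 期望副本 3
	fmt.Println(calcReplicas(map[string]int64{"pod-1": 90, "pod-2": 90}, 2, 60, 0.1))
}
```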
最后,Hpa将desiredReplicas写到scale.Spec.Replicas,调用a.scaleNamespacer.Scales(hpa.Namespace).Update(targetGR, scale)向apiserver发送更新hpa的请求,对某个hpa的一轮更新操作就完成了。
### 3. 举例说明计算过程 #### 3.1 hpa扩容计算逻辑 **关键概念**:tolerance(hpa扩容容忍度), 默认为0.1。 Custom server: 自定义metric服务。这里是一个抽象,用于给hpa提供具体的metric值。Custom server具体可以是prometheus,或者其他的监控系统。下一篇文章会讲如何将Custom server和hpa联系起来。
#### 3.2 场景1 当前有deployA, 运行着俩个pod, A1和A2。 deploy设置了hpa,指标是内存使用量,并且规定,当平均使用量大于60就要扩容。 ![image-20210616210849063](../images/hpa-2.png) hpa扩容计算步骤: **第一步:** 往monitor-adaptor发送请求, 要求获得deployA下所有pod的metric值。 这里收到了 A1=50; A2=100 **第二步:** 补全metric值,给获取不到metric值的pod赋值。 这里hpa会查看集群状态,发现deployA 下有俩个pod,A1,A2。并且这两个pod的metric值都获取到了。 这个时候就不用补全。(下面例子就介绍需要补全metric的情况) **第三步:** 开始计算 (1)计算 平均pod metric值和 target的比例。也可以叫扩容比例系数 ratio = (A1+A2)/(2*target) = (50+100)/120 = 1.25 按理说不用再除target值,直接(50+100)/2=75,然后拿75和60比就行。 75比60大就应该扩容。 这里使用系数表示主要有俩个原因: * 有容忍度的概念,使用比例方便和计算是否超出了容忍度 * 用于扩缩容计算 (2)判断是否超过容忍度 这里 1.25-1 > 0.1(默认容忍度)。 因此这种情况是需要扩容的。 这里就体现了容忍度的作用。有了容忍度, 平均metric需要大于 66才会扩容(60*1.1) (3)计算真正的副本数量 向上取整: 扩容比例系数*当前的副本数 这里就是: 1.25*2 = 2.5 , 取整后就是3。
#### 3.3 场景2

和场景1不同在于:由于某种原因,monitor-adaptor往hpa发送数据的时候,只有 A1=2,A2的数据丢失。

![image-20210616211337218](../images/hpa-3.png)

hpa扩容计算步骤:

**第一步:** 往monitor-adaptor发送请求,要求获得deployA下所有pod的metric值。这里只收到了 A1=2。

**第二步:** 补全metric值,给获取不到metric值的pod赋值。

这里hpa会查看集群状态,发现deployA下有两个pod:A1、A2。但是这里只收到了A1的值,这个时候hpa就认为A2有数据,但是获取失败,所以会给A2自行赋值(0或target)。赋值逻辑如下:当 A1 > target 的时候,A2=0(扩容方向);当 A1 <= target 的时候,A2=target(缩容方向)。

这里由于 A1=2,比target(60)小,所以最终hpa计算时: A1=2; A2=60; target=60。

**第三步:** 开始计算

(1)计算平均pod metric值和target的比例,也可以叫扩缩容比例系数

ratio = (A1+A2)/(2*target) = (2+60)/120 ≈ 0.517

(2)判断是否超过容忍度

这里 1-0.517 > 0.1(默认容忍度),因此这种情况是需要缩容的。

(3)计算真正的副本数量

向上取整: 扩缩容比例系数*当前的副本数(这里就是metric数量,A1、A2)

对应就是: 0.517*2 = 1.034,向上取整后就是2,副本数保持不变。
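场景2里"给丢失metric的pod补值"这一步,对应源码中对 missingPods 的处理。下面用一小段示意代码复现这个计算过程(非源码,数值与上面场景一致;源码中补值后还会再做一次容忍度和方向是否反转的判断,这里省略):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const target = 60.0

	// 第一步:只拿到了 A1 的值
	metrics := map[string]float64{"A1": 2}
	allPods := []string{"A1", "A2"} // 集群里实际有两个 pod

	// 先按拿到的数据算一次方向
	sum := 0.0
	for _, v := range metrics {
		sum += v
	}
	usageRatio := sum / (target * float64(len(metrics)))

	// 第二步:补全丢失的 pod。缩容方向(usageRatio<1)补 target,扩容方向补 0
	for _, p := range allPods {
		if _, ok := metrics[p]; !ok {
			if usageRatio < 1.0 {
				metrics[p] = target
			} else {
				metrics[p] = 0
			}
		}
	}

	// 第三步:用补全后的数据重新计算比例并向上取整
	sum = 0
	for _, v := range metrics {
		sum += v
	}
	newRatio := sum / (target * float64(len(metrics)))
	desired := int(math.Ceil(newRatio * float64(len(metrics))))
	fmt.Printf("ratio=%.3f desired=%d\n", newRatio, desired) // ratio=0.517 desired=2
}
```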
### 4. 总结 (1)hpa可以设置多个metric。当有多个metric时,谁算出来的副本值最大,取谁的值 (2)针对具体的metric而言(这里是以pods这种为例),首先获得用户定义的hpa指标。比如最大值,最小值,阈值等。 这里有一个点在于。阈值乘以了1000用于计算。 (3)获取metric的值,这里是使用了自定义rest服务。hpa只要发送rest请求,就有数据。这种情况非常适用于公司使用自己的监控数据做扩缩容。 注意:这里每个值也乘以了1000。这样和阈值就是相互抵消了。 (4)利用公式计算期望值。 期望值*X <= 当前pod所有的metric值。X取小的正整数。具体逻辑可以看上文的计算过程。 ================================================ FILE: k8s/kcm/5-job controller-manager源码分析.md ================================================ Table of Contents ================= * [1. job简介](#1--job简介) * [2. job controller源码分析-初始化](#2-job-controller源码分析-初始化) * [2.1 startJobController](#21-startjobcontroller) * [2.2 NewJobController](#22-newjobcontroller) * [2.3 对Pod的监听事件](#23-对pod的监听事件) * [2.3.1 job的expectations机制](#231-job的expectations机制) * [2.3.2 addPod](#232-addpod) * [2.3.3 updatePod](#233-updatepod) * [2.3.4 deletePod](#234-deletepod) * [2.3.5 总结](#235-总结) * [3. 如何处理队列中的job](#3-如何处理队列中的job) * [3.1 sycnjob](#31-sycnjob) * [3.2 判断job是否完成的标准: completed, failed,c.Status == v1.ConditionTrue](#32-判断job是否完成的标准--completed--failedcstatus--v1conditiontrue) * [3.3 如何获得该job对应的pods](#33-如何获得该job对应的pods) * [3.4 jm.manageJob](#34--jmmanagejob) * [4.总结](#4总结) ### 1. job简介 job 在 kubernetes 中主要用来处理离线任务,job 直接管理 pod,可以创建一个或多个 pod 并会确保指定数量的 pod 运行完成。 job 的一个示例如下所示: ``` apiVersion: batch/v1 kind: Job metadata: labels: job-name: hello-1626526800 name: hello-1626526800 namespace: default spec: backoffLimit: 6 //标记为 failed 前的重试次数(运行多少个pod failed),默认为 6 completions: 4 //当成功的 Pod 个数达到 .spec.completions 时,Job 被视为完成 parallelism: 1 // 并行度。这里就是每次1个1个pod的运行,4个pod运行完后,job完成 selector: matchLabels: controller-uid: 52f8d25f-6bbf-4439-ab6d-02876c52baea template: metadata: creationTimestamp: null labels: job-name: hello-1626526800 spec: containers: - args: - /bin/sh - -c - date; echo "Hello, World!" image: busybox imagePullPolicy: Always name: hello ``` 更多关于job的描述,可以参考社区介绍:https://kubernetes.io/zh/docs/concepts/workloads/controllers/job/
### 2. job controller源码分析-初始化 #### 2.1 startJobController 这个就是 startControllers里面kcm启动时各个controller对应的init函数。 cmd\kube-controller-manager\app\batch.go ``` func startJobController(ctx ControllerContext) (http.Handler, bool, error) { if !ctx.AvailableResources[schema.GroupVersionResource{Group: "batch", Version: "v1", Resource: "jobs"}] { return nil, false, nil } go job.NewJobController( ctx.InformerFactory.Core().V1().Pods(), ctx.InformerFactory.Batch().V1().Jobs(), ctx.ClientBuilder.ClientOrDie("job-controller"), ).Run(int(ctx.ComponentConfig.JobController.ConcurrentJobSyncs), ctx.Stop) return nil, true, nil } ```
#### 2.2 NewJobController pkg\controller\job\job_controller.go 这里就是定义好 informer和处理函数。 可以看出来,job的add,delete, update最终都是入队列了。 ``` func NewJobController(podInformer coreinformers.PodInformer, jobInformer batchinformers.JobInformer, kubeClient clientset.Interface) *JobController { // 1.定义event上传 eventBroadcaster := record.NewBroadcaster() eventBroadcaster.StartLogging(glog.Infof) eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")}) if kubeClient != nil && kubeClient.CoreV1().RESTClient().GetRateLimiter() != nil { metrics.RegisterMetricAndTrackRateLimiterUsage("job_controller", kubeClient.CoreV1().RESTClient().GetRateLimiter()) } jm := &JobController{ kubeClient: kubeClient, podControl: controller.RealPodControl{ KubeClient: kubeClient, Recorder: eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "job-controller"}), }, expectations: controller.NewControllerExpectations(), queue: workqueue.NewNamedRateLimitingQueue(workqueue.NewItemExponentialFailureRateLimiter(DefaultJobBackOff, MaxJobBackOff), "job"), recorder: eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "job-controller"}), } jobInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ AddFunc: func(obj interface{}) { jm.enqueueController(obj, true) }, UpdateFunc: jm.updateJob, // 这个其实也是放入队列的,见下面的函数 DeleteFunc: func(obj interface{}) { jm.enqueueController(obj, true) }, }) jm.jobLister = jobInformer.Lister() jm.jobStoreSynced = jobInformer.Informer().HasSynced podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ AddFunc: jm.addPod, UpdateFunc: jm.updatePod, DeleteFunc: jm.deletePod, }) jm.podStore = podInformer.Lister() jm.podStoreSynced = podInformer.Informer().HasSynced jm.updateHandler = jm.updateJobStatus jm.syncHandler = jm.syncJob return jm } ``` updateJob进行了一些判断,最后还是入队列了。 ``` func (jm *JobController) updateJob(old, cur interface{}) { oldJob := old.(*batch.Job) curJob := cur.(*batch.Job) // never return error key, err := controller.KeyFunc(curJob) if err != nil { return } jm.enqueueController(curJob, true) // check if need to add a new rsync for ActiveDeadlineSeconds if curJob.Status.StartTime != nil { curADS := curJob.Spec.ActiveDeadlineSeconds if curADS == nil { return } oldADS := oldJob.Spec.ActiveDeadlineSeconds if oldADS == nil || *oldADS != *curADS { now := metav1.Now() start := curJob.Status.StartTime.Time passed := now.Time.Sub(start) total := time.Duration(*curADS) * time.Second // AddAfter will handle total < passed jm.queue.AddAfter(key, total-passed) glog.V(4).Infof("job ActiveDeadlineSeconds updated, will rsync after %d seconds", total-passed) } } } ```
#### 2.3 对Pod的监听事件 ##### 2.3.1 job的expectations机制 和rs的机制其实是一样的。更详细的可以参考rs那篇博客的介绍。 expectations可以理解为一个map。举例来说,这个map可以认为有四个关键字段。 key: 有rs的ns和 rs的name组成 Add: 表示这个rs还需要增加多少个rs del: 表示这个rs还需要删除多少个pod Time: 表示 | Key | Add | Del | Time | | ----------- | ---- | ---- | ------------------- | | Default/zx1 | 0 | 0 | 2021.07.04 16:00:00 | | zx/zx1 | 1 | 0 | 2021.07.04 16:00:00 |
**GetExpectations**: 输入key,输出该key对应的那条expectations记录(add/del/time);

**SatisfiedExpectations**: 输入key,输出bool;判断某个job是否符合预期。符合预期: add<=0 && del<=0,或者超过了同步周期,或者map中不存在这个key;其他情况都不符合预期。

**DeleteExpectations**: 输入key,无输出;从map(缓存)中删除这个key。

**SetExpectations**: 输入(key, add, del);在map中新增加一行。**这个会更新时间,将time赋值为time.Now**

**ExpectCreations**: 输入(key, add);覆盖map中的内容,del=0,add等于函数的参数。**这个会更新时间,将time赋值为time.Now**

**ExpectDeletions**: 输入(key, del);覆盖map中的内容,add=0,del等于函数的参数。**这个会更新时间,将time赋值为time.Now**

**CreationObserved**: 输入(key);map中对应行的 add-1

**DeletionObserved**: 输入(key);map中对应行的 del-1

**RaiseExpectations**: 输入(key, add, del);map中对应行的 Add+add,Del+del

**LowerExpectations**: 输入(key, add, del);map中对应行的 Add-add,Del-del
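为了帮助理解,下面给出一个极简的 expectations 结构示意(非 k8s 源码,真实实现在 pkg/controller/controller_utils.go 中,用的是 cache.Store 和原子操作,这里只示意 SetExpectations、CreationObserved、SatisfiedExpectations 三个方法的语义):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const expectationsTimeout = 5 * time.Minute

type expectation struct {
	add, del  int64
	timestamp time.Time
}

type simpleExpectations struct {
	mu sync.Mutex
	m  map[string]*expectation
}

// SetExpectations: 覆盖某个 key 的 add/del,并刷新时间
func (e *simpleExpectations) SetExpectations(key string, add, del int64) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.m[key] = &expectation{add: add, del: del, timestamp: time.Now()}
}

// CreationObserved: 观察到一个 pod 创建事件,add-1
func (e *simpleExpectations) CreationObserved(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if exp, ok := e.m[key]; ok {
		exp.add--
	}
}

// SatisfiedExpectations: add<=0 && del<=0,或者记录过期,或者key不存在,认为可以进行 sync
func (e *simpleExpectations) SatisfiedExpectations(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	exp, ok := e.m[key]
	if !ok {
		return true // 没有记录(新对象),直接 sync
	}
	if exp.add <= 0 && exp.del <= 0 {
		return true
	}
	return time.Since(exp.timestamp) > expectationsTimeout
}

func main() {
	e := &simpleExpectations{m: map[string]*expectation{}}
	e.SetExpectations("default/hello-job", 2, 0)              // 期望再创建 2 个 pod
	e.CreationObserved("default/hello-job")                   // 收到 1 个 addPod 事件
	fmt.Println(e.SatisfiedExpectations("default/hello-job")) // false,还差 1 个
	e.CreationObserved("default/hello-job")
	fmt.Println(e.SatisfiedExpectations("default/hello-job")) // true
}
```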
##### 2.3.2 addPod (1)如果pod要删除,deletePod最终会调用DeletionObserved函数,使得这个map中对应job的del-1 (2)有ower并且是job,就将这个job在对应map,add - 1 (3)如果这个pod是孤儿,这将这个pod之前有关联的job入队列,然后通过syncJob更新 ``` // When a pod is created, enqueue the controller that manages it and update it's expectations. func (jm *JobController) addPod(obj interface{}) { pod := obj.(*v1.Pod) if pod.DeletionTimestamp != nil { // on a restart of the controller controller, it's possible a new pod shows up in a state that // is already pending deletion. Prevent the pod from being a creation observation. // 1. deletePod最终会调用DeletionObserved函数。 jm.deletePod(pod) return } // If it has a ControllerRef, that's all that matters. if controllerRef := metav1.GetControllerOf(pod); controllerRef != nil { job := jm.resolveControllerRef(pod.Namespace, controllerRef) if job == nil { return } jobKey, err := controller.KeyFunc(job) if err != nil { return } // 2.有ower并且是job,就将这个job在对应map,add - 1 jm.expectations.CreationObserved(jobKey) jm.enqueueController(job, true) return } // Otherwise, it's an orphan. Get a list of all matching controllers and sync // them to see if anyone wants to adopt it. // DO NOT observe creation because no controller should be waiting for an // orphan. // 3.如果是孤儿,这将这个pod之前有关联的job入队列,然后通过syncJob更新 for _, job := range jm.getPodJobs(pod) { jm.enqueueController(job, true) } } ```
##### 2.3.3 updatePod (1)如果pod要删除,deletePod最终会调用DeletionObserved函数,使得这个map中对应job的del-1 (2) 如果pod更新了owner,先是旧job加入队列,因为进入队列的job都会同步 (3)如果新owner还是job,这个job入队列 (4) 如果这个pod是孤儿,这将这个pod之前有关联的job入队列,然后通过syncJob更新 ``` // When a pod is updated, figure out what job/s manage it and wake them up. // If the labels of the pod have changed we need to awaken both the old // and new job. old and cur must be *v1.Pod types. func (jm *JobController) updatePod(old, cur interface{}) { curPod := cur.(*v1.Pod) oldPod := old.(*v1.Pod) if curPod.ResourceVersion == oldPod.ResourceVersion { // Periodic resync will send update events for all known pods. // Two different versions of the same pod will always have different RVs. return } // 1.deletePod最终会调用DeletionObserved函数,使得这个map中对应job的del-1 if curPod.DeletionTimestamp != nil { // when a pod is deleted gracefully it's deletion timestamp is first modified to reflect a grace period, // and after such time has passed, the kubelet actually deletes it from the store. We receive an update // for modification of the deletion timestamp and expect an job to create more pods asap, not wait // until the kubelet actually deletes the pod. jm.deletePod(curPod) return } // the only time we want the backoff to kick-in, is when the pod failed immediate := curPod.Status.Phase != v1.PodFailed curControllerRef := metav1.GetControllerOf(curPod) oldControllerRef := metav1.GetControllerOf(oldPod) // 2. 如果pod更新了owner,先是旧job加入队列,因为进入队列的job都会同步 controllerRefChanged := !reflect.DeepEqual(curControllerRef, oldControllerRef) if controllerRefChanged && oldControllerRef != nil { // The ControllerRef was changed. Sync the old controller, if any. if job := jm.resolveControllerRef(oldPod.Namespace, oldControllerRef); job != nil { jm.enqueueController(job, immediate) } } // 3.如果新owner还是job,这个job入队列 // If it has a ControllerRef, that's all that matters. if curControllerRef != nil { job := jm.resolveControllerRef(curPod.Namespace, curControllerRef) if job == nil { return } jm.enqueueController(job, immediate) return } // 4. 如果这个pod是孤儿,这将这个pod之前有关联的job入队列,然后通过syncJob更新 // Otherwise, it's an orphan. If anything changed, sync matching controllers // to see if anyone wants to adopt it now. labelChanged := !reflect.DeepEqual(curPod.Labels, oldPod.Labels) if labelChanged || controllerRefChanged { for _, job := range jm.getPodJobs(curPod) { jm.enqueueController(job, immediate) } } } ```
##### 2.3.4 deletePod 这里又出现了tombstone。 deletepod的逻辑就更简单,将map中对应job的del-1 ``` // When a pod is deleted, enqueue the job that manages the pod and update its expectations. // obj could be an *v1.Pod, or a DeletionFinalStateUnknown marker item. func (jm *JobController) deletePod(obj interface{}) { pod, ok := obj.(*v1.Pod) // When a delete is dropped, the relist will notice a pod in the store not // in the list, leading to the insertion of a tombstone object which contains // the deleted key/value. Note that this value might be stale. If the pod // changed labels the new job will not be woken up till the periodic resync. if !ok { tombstone, ok := obj.(cache.DeletedFinalStateUnknown) if !ok { utilruntime.HandleError(fmt.Errorf("couldn't get object from tombstone %+v", obj)) return } pod, ok = tombstone.Obj.(*v1.Pod) if !ok { utilruntime.HandleError(fmt.Errorf("tombstone contained object that is not a pod %+v", obj)) return } } controllerRef := metav1.GetControllerOf(pod) if controllerRef == nil { // No controller should care about orphans being deleted. return } job := jm.resolveControllerRef(pod.Namespace, controllerRef) if job == nil { return } jobKey, err := controller.KeyFunc(job) if err != nil { return } jm.expectations.DeletionObserved(jobKey) jm.enqueueController(job, true) } ```
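上面 addPod/updatePod/deletePod 的核心,就是维护 expectations 里每个 job 的 add/del 计数。下面给一个极简的示意实现(自包含的小例子,并不是 k8s 的真实代码,真实实现还带 5 分钟超时等逻辑,这里只演示"计数归零才允许 sync"的思路):

```go
package main

import (
	"fmt"
	"sync"
)

// expectations 记录每个 key(namespace/name)还在等待观察到的创建/删除数量。
// 只有计数都 <= 0 时,对应的 job 才需要再次 sync。
type expectations struct {
	mu   sync.Mutex
	adds map[string]int
	dels map[string]int
}

func newExpectations() *expectations {
	return &expectations{adds: map[string]int{}, dels: map[string]int{}}
}

// ExpectCreations: manageJob 决定要创建 n 个 pod 之前先登记。
func (e *expectations) ExpectCreations(key string, n int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.adds[key] += n
}

// CreationObserved: addPod 回调里观察到一个 pod 创建,add-1。
func (e *expectations) CreationObserved(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.adds[key]--
}

// SatisfiedExpectations: 计数都归零(或为负)说明期望已满足,可以 sync。
func (e *expectations) SatisfiedExpectations(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.adds[key] <= 0 && e.dels[key] <= 0
}

func main() {
	exp := newExpectations()
	key := "default/pi" // 假设的 job key

	exp.ExpectCreations(key, 2)                 // 本轮 sync 打算创建 2 个 pod
	fmt.Println(exp.SatisfiedExpectations(key)) // false: pod 还没被 informer 观察到
	exp.CreationObserved(key)                   // addPod 观察到第 1 个 pod
	exp.CreationObserved(key)                   // addPod 观察到第 2 个 pod
	fmt.Println(exp.SatisfiedExpectations(key)) // true: 可以进行下一轮 sync
}
```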
##### 2.3.5 总结 * 对于pod的add, updated, del事件,核心就是维护 map中job的数据,然后就是将对应的job入队列 * 对于job的add, updated, del事件,最后都是扔进了队列 ### 3. 如何处理队列中的job ``` // Run the main goroutine responsible for watching and syncing jobs. func (jm *JobController) Run(workers int, stopCh <-chan struct{}) { defer utilruntime.HandleCrash() defer jm.queue.ShutDown() glog.Infof("Starting job controller") defer glog.Infof("Shutting down job controller") if !controller.WaitForCacheSync("job", stopCh, jm.podStoreSynced, jm.jobStoreSynced) { return } for i := 0; i < workers; i++ { go wait.Until(jm.worker, time.Second, stopCh) } <-stopCh } ``` ``` // worker runs a worker thread that just dequeues items, processes them, and marks them done. // It enforces that the syncHandler is never invoked concurrently with the same key. func (jm *JobController) worker() { for jm.processNextWorkItem() { } } func (jm *JobController) processNextWorkItem() bool { key, quit := jm.queue.Get() if quit { return false } defer jm.queue.Done(key) forget, err := jm.syncHandler(key.(string)) if err == nil { if forget { jm.queue.Forget(key) } return true } utilruntime.HandleError(fmt.Errorf("Error syncing job: %v", err)) jm.queue.AddRateLimited(key) return true } ``` 和所有控制器一样,流程为:Run->worker->processNextWorkItem->syncHandler。 而NewJobController的时候就指定了 jm.syncHandler = jm.syncJob
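补充一下事件处理函数和 syncHandler 是在哪里被关联起来的。下面是 NewJobController 中相关逻辑的节选示意(省略了无关参数和字段,具体请以源码为准):

```go
// NewJobController 中(节选示意):为 pod informer 注册回调,并指定 syncHandler
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    jm.addPod,
	UpdateFunc: jm.updatePod,
	DeleteFunc: jm.deletePod,
})

// 队列中的 job 最终由 syncJob 处理
jm.syncHandler = jm.syncJob
```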
#### 3.1 syncJob

(1)判断 job 是否已经执行完成,如果完成了,直接返回。

* 当 job 的 `.status.conditions` 中有 `Complete` 或 `Failed` 的 type 且对应的 status 为 true 时表示该 job 已经执行完成

(2)获得job的重试次数,以及通过expectations判断是否需要同步,以下三种情况需要同步

- 该 job 在 map 中的 adds 和 dels 都 <= 0
- 该 job 在 map 中已经超过 5min 没有更新了;
- 该 job 在 map 中不存在,即该对象是新创建的;

(3)获取 job 关联的所有 pods,然后分为三类:active、succeeded、failed

(4)如果这个job第一次启动,设置启动时间为Now,如果还设置了ActiveDeadlineSeconds值,则等ActiveDeadlineSeconds这个时间过去后再入队列

(5)判断job是否失败了。有两种情况:

* 一是 job 的重试次数达到了 `job.Spec.BackoffLimit`(默认是6次)
* 二是 job 的运行时间达到了 `job.Spec.ActiveDeadlineSeconds` 中设定的值

(6)如果job Failed, 删除所有的pod,然后发事件说这个job已经failed, 原因是XX

(7)如果job需要同步,并且没有DeletionTimestamp,通过manageJob调整active pod的数量,使其等于parallelism

(8)检查 `job.Spec.Completions` 判断 job 是否已经运行完成,如果 `job.Spec.Completions` 没有设置,那只要有一个pod运行成功,就表示该 job 完成。

(9)最后判断job的状态有无变化,如果有变化,更新到 apiserver;

```go
// syncJob will sync the job with the given key if it has had its expectations fulfilled, meaning
// it did not expect to see any more of its pods created or deleted. This function is not meant to be invoked
// concurrently with the same key.
func (jm *JobController) syncJob(key string) (bool, error) {
	// 1、用于统计本次 sync 的运行时间
	startTime := time.Now()
	defer func() {
		glog.V(4).Infof("Finished syncing job %q (%v)", key, time.Since(startTime))
	}()

	// 2、从 lister 中获取 job 对象
	ns, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return false, err
	}
	if len(ns) == 0 || len(name) == 0 {
		return false, fmt.Errorf("invalid job key %q: either namespace or name is missing", key)
	}
	sharedJob, err := jm.jobLister.Jobs(ns).Get(name)
	if err != nil {
		if errors.IsNotFound(err) {
			glog.V(4).Infof("Job has been deleted: %v", key)
			jm.expectations.DeleteExpectations(key)
			return true, nil
		}
		return false, err
	}
	job := *sharedJob

	// if job was finished previously, we don't want to redo the termination
	// 3、判断 job 是否已经执行完成,如果完成了,直接返回
	if IsJobFinished(&job) {
		return true, nil
	}

	// retrieve the previous number of retry
	// 4、获取 job 重试的次数,这个是队列自带的函数,workqueue自己就实现了
	previousRetry := jm.queue.NumRequeues(key)

	// Check the expectations of the job before counting active pods, otherwise a new pod can sneak in
	// and update the expectations after we've retrieved active pods from the store. If a new pod enters
	// the store after we've checked the expectation, the job sync is just deferred till the next relist.
	// 5、通过Expectations,判断 job 是否能进行 sync 操作
	jobNeedsSync := jm.expectations.SatisfiedExpectations(key)

	// 6、获取 job 关联的所有 pod
	pods, err := jm.getPodsForJob(&job)
	if err != nil {
		return false, err
	}

	// 7、获取active、succeeded、failed状态的 pod 数
	activePods := controller.FilterActivePods(pods)
	active := int32(len(activePods))
	succeeded, failed := getStatus(pods)
	conditions := len(job.Status.Conditions)

	// job first start
	// 8、判断 job 是否为首次启动
	if job.Status.StartTime == nil {
		now := metav1.Now()
		job.Status.StartTime = &now
		// enqueue a sync to check if job past ActiveDeadlineSeconds
		// 9、如果设定了 ActiveDeadlineSeconds值,等这个时间过去了再加入队列。
		if job.Spec.ActiveDeadlineSeconds != nil {
			glog.V(4).Infof("Job %s have ActiveDeadlineSeconds will sync after %d seconds", key, *job.Spec.ActiveDeadlineSeconds)
			jm.queue.AddAfter(key, time.Duration(*job.Spec.ActiveDeadlineSeconds)*time.Second)
		}
	}

	var manageJobErr error
	jobFailed := false
	var failureReason string
	var failureMessage string

	// 10、通过已经失败的pod数量,判断是否超过了job运行的上限,BackoffLimit。例子中设置的为6.
jobHaveNewFailure := failed > job.Status.Failed // new failures happen when status does not reflect the failures and active // is different than parallelism, otherwise the previous controller loop // failed updating status so even if we pick up failure it is not a new one // 因为有的pod可能在job完成前就删除了,所以需要previousRetry+1(这次错误)进行判断。 exceedsBackoffLimit := jobHaveNewFailure && (active != *job.Spec.Parallelism) && (int32(previousRetry)+1 > *job.Spec.BackoffLimit) if exceedsBackoffLimit || pastBackoffLimitOnFailure(&job, pods) { // check if the number of pod restart exceeds backoff (for restart OnFailure only) // OR if the number of failed jobs increased since the last syncJob jobFailed = true failureReason = "BackoffLimitExceeded" failureMessage = "Job has reached the specified backoff limit" } else if pastActiveDeadline(&job) { jobFailed = true failureReason = "DeadlineExceeded" failureMessage = "Job was active longer than specified deadline" } // 11、如果处于 failed 状态,则调用 jm.deleteJobPods 并发删除所有 active pods if jobFailed { errCh := make(chan error, active) jm.deleteJobPods(&job, activePods, errCh) select { case manageJobErr = <-errCh: if manageJobErr != nil { break } default: } // update status values accordingly failed += active active = 0 job.Status.Conditions = append(job.Status.Conditions, newCondition(batch.JobFailed, failureReason, failureMessage)) jm.recorder.Event(&job, v1.EventTypeWarning, failureReason, failureMessage) } else { // 12、若非 failed 状态,根据 jobNeedsSync 判断是否要进行同步 if jobNeedsSync && job.DeletionTimestamp == nil { active, manageJobErr = jm.manageJob(activePods, succeeded, &job) } // 13、检查 job.Spec.Completions 判断 job 是否已经运行完成 completions := succeeded complete := false if job.Spec.Completions == nil { // This type of job is complete when any pod exits with success. // Each pod is capable of // determining whether or not the entire Job is done. Subsequent pods are // not expected to fail, but if they do, the failure is ignored. Once any // pod succeeds, the controller waits for remaining pods to finish, and // then the job is complete. if succeeded > 0 && active == 0 { complete = true } } else { // Job specifies a number of completions. This type of job signals // success by having that number of successes. Since we do not // start more pods than there are remaining completions, there should // not be any remaining active pods once this count is reached. if completions >= *job.Spec.Completions { complete = true if active > 0 { jm.recorder.Event(&job, v1.EventTypeWarning, "TooManyActivePods", "Too many active pods running after completion count reached") } if completions > *job.Spec.Completions { jm.recorder.Event(&job, v1.EventTypeWarning, "TooManySucceededPods", "Too many succeeded pods running after completion count reached") } } } // 14、若 job 运行完成了,则更新 job.Status.Conditions 和 job.Status.CompletionTime 字段 if complete { job.Status.Conditions = append(job.Status.Conditions, newCondition(batch.JobComplete, "", "")) now := metav1.Now() job.Status.CompletionTime = &now } } forget := false // Check if the number of jobs succeeded increased since the last check. If yes "forget" should be true // This logic is linked to the issue: https://github.com/kubernetes/kubernetes/issues/56853 that aims to // improve the Job backoff policy when parallelism > 1 and few Jobs failed but others succeed. // In this case, we should clear the backoff delay. 
if job.Status.Succeeded < succeeded { forget = true } // 15、如果 job 的 status 有变化,将 job 的 status 更新到 apiserver // no need to update the job if the status hasn't changed since last time if job.Status.Active != active || job.Status.Succeeded != succeeded || job.Status.Failed != failed || len(job.Status.Conditions) != conditions { job.Status.Active = active job.Status.Succeeded = succeeded job.Status.Failed = failed if err := jm.updateHandler(&job); err != nil { return forget, err } if jobHaveNewFailure && !IsJobFinished(&job) { // returning an error will re-enqueue Job after the backoff period return forget, fmt.Errorf("failed pod(s) detected for job key %q", key) } forget = true } return forget, manageJobErr } ```
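processNextWorkItem 里 forget 的作用,可以结合 workqueue 的限速重试来理解:AddRateLimited 会按失败次数指数退避地重新入队,Forget 则清空该 key 的失败计数。下面是一个自包含的小例子(基于 client-go 的 workqueue,包路径和版本以实际依赖为准):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// DefaultControllerRateLimiter: 按条目指数退避 + 整体限速
	q := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "job-demo")
	key := "default/pi" // 假设的 job key

	// 模拟 syncJob 连续失败 3 次,每次都 AddRateLimited
	for i := 0; i < 3; i++ {
		q.AddRateLimited(key)
	}
	fmt.Println("失败次数:", q.NumRequeues(key)) // 3,对应 syncJob 里的 previousRetry

	// 某次 sync 成功且 forget=true 时调用 Forget,退避计数清零
	q.Forget(key)
	fmt.Println("Forget 后:", q.NumRequeues(key)) // 0
}
```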
#### 3.2 判断job是否完成的标准

condition 的 type 为 `Complete` 或 `Failed`,且对应的 `c.Status == v1.ConditionTrue`。

```
func IsJobFinished(j *batch.Job) bool {
	for _, c := range j.Status.Conditions {
		if (c.Type == batch.JobComplete || c.Type == batch.JobFailed) && c.Status == v1.ConditionTrue {
			return true
		}
	}
	return false
}
```

找两个 job 的 status 对比,可以看出 Complete 的 job 确实有一个 type 为 Complete、status 为 "True" 的 condition,正好满足 IsJobFinished 中的判断条件;而 running 的 job 没有任何 condition。

```
Complete的job status:
  completionTime: "2021-01-20T07:27:11Z"
  conditions:
  - lastProbeTime: "2021-01-20T07:27:11Z"
    lastTransitionTime: "2021-01-20T07:27:11Z"
    status: "True"
    type: Complete
  startTime: "2021-01-20T07:27:04Z"
  succeeded: 1
```

```
running的job status:
  active: 1
  startTime: "2021-01-20T07:32:06Z"
```
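日常排查时也可以直接用 kubectl 看 condition 来判断 job 是否结束(示例,假设 job 名为 pi):

```
# Complete 的 job 会输出 True,未完成则输出为空
kubectl get job pi -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}'

# Failed 的情况同理
kubectl get job pi -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}'
```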
#### 3.3 如何获得该job对应的pods ``` // getPodsForJob returns the set of pods that this Job should manage. // It also reconciles ControllerRef by adopting/orphaning. // Note that the returned Pods are pointers into the cache. func (jm *JobController) getPodsForJob(j *batch.Job) ([]*v1.Pod, error) { selector, err := metav1.LabelSelectorAsSelector(j.Spec.Selector) if err != nil { return nil, fmt.Errorf("couldn't convert Job selector: %v", err) } // List all pods to include those that don't match the selector anymore // but have a ControllerRef pointing to this controller. pods, err := jm.podStore.Pods(j.Namespace).List(labels.Everything()) if err != nil { return nil, err } // If any adoptions are attempted, we should first recheck for deletion // with an uncached quorum read sometime after listing Pods (see #42639). canAdoptFunc := controller.RecheckDeletionTimestamp(func() (metav1.Object, error) { fresh, err := jm.kubeClient.BatchV1().Jobs(j.Namespace).Get(j.Name, metav1.GetOptions{}) if err != nil { return nil, err } if fresh.UID != j.UID { return nil, fmt.Errorf("original Job %v/%v is gone: got uid %v, wanted %v", j.Namespace, j.Name, fresh.UID, j.UID) } return fresh, nil }) cm := controller.NewPodControllerRefManager(jm.podControl, j, selector, controllerKind, canAdoptFunc) return cm.ClaimPods(pods) } ```
``` // NewPodControllerRefManager returns a PodControllerRefManager that exposes // methods to manage the controllerRef of pods. // // The CanAdopt() function can be used to perform a potentially expensive check // (such as a live GET from the API server) prior to the first adoption. // It will only be called (at most once) if an adoption is actually attempted. // If CanAdopt() returns a non-nil error, all adoptions will fail. // // NOTE: Once CanAdopt() is called, it will not be called again by the same // PodControllerRefManager instance. Create a new instance if it makes // sense to check CanAdopt() again (e.g. in a different sync pass). func NewPodControllerRefManager( podControl PodControlInterface, controller metav1.Object, selector labels.Selector, controllerKind schema.GroupVersionKind, canAdopt func() error, ) *PodControllerRefManager { return &PodControllerRefManager{ BaseControllerRefManager: BaseControllerRefManager{ Controller: controller, Selector: selector, CanAdoptFunc: canAdopt, }, controllerKind: controllerKind, podControl: podControl, } } ``` ``` 最终还是通过 labels 匹配 // If the error is nil, either the reconciliation succeeded, or no // reconciliation was necessary. The list of Pods that you now own is returned. func (m *PodControllerRefManager) ClaimPods(pods []*v1.Pod, filters ...func(*v1.Pod) bool) ([]*v1.Pod, error) { var claimed []*v1.Pod var errlist []error match := func(obj metav1.Object) bool { pod := obj.(*v1.Pod) // Check selector first so filters only run on potentially matching Pods. if !m.Selector.Matches(labels.Set(pod.Labels)) { return false } for _, filter := range filters { if !filter(pod) { return false } } return true } adopt := func(obj metav1.Object) error { return m.AdoptPod(obj.(*v1.Pod)) } release := func(obj metav1.Object) error { return m.ReleasePod(obj.(*v1.Pod)) } for _, pod := range pods { ok, err := m.ClaimObject(pod, match, adopt, release) if err != nil { errlist = append(errlist, err) continue } if ok { claimed = append(claimed, pod) } } return claimed, utilerrors.NewAggregate(errlist) } ```
这是 job zx-testip1-1611142680产生的一个pod. ``` kind: Pod metadata: labels: controller-uid: ecff8cf1-7523-4d90-9559-22c9e994f726 //这个是job的 uuid job-name: zx-testip1-1611142680 name: zx-testip1-1611142680-4s9z8 namespace: zx ```
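反过来,排查时也可以直接用这些 label 找到某个 job 的所有 pod(示例,job 名与上面一致):

```
kubectl -n zx get pods -l job-name=zx-testip1-1611142680

# 或者用 controller-uid(即 job 的 uid)精确匹配
kubectl -n zx get pods -l controller-uid=ecff8cf1-7523-4d90-9559-22c9e994f726
```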
#### 3.4 jm.manageJob 注意:这里进行了map的设置,初始化 `jm.manageJob` 核心工作就是根据 job的并发数来确认当前处于 active 的 pods 数量是否ok,如果不ok的话则进行调整。 具体为: - 如果active > parallelism,说明active的pod数量太多,需要删除一些。 删除pod的逻辑,rs那篇文章有,其实就是根据pod的运行时间,状态等信息判断pod优先级。 - 如果active < parallelism,说明active的pod数量太少,需要创建一些。 ``` // manageJob is the core method responsible for managing the number of running // pods according to what is specified in the job.Spec. // Does NOT modify . func (jm *JobController) manageJob(activePods []*v1.Pod, succeeded int32, job *batch.Job) (int32, error) { var activeLock sync.Mutex active := int32(len(activePods)) parallelism := *job.Spec.Parallelism jobKey, err := controller.KeyFunc(job) if err != nil { utilruntime.HandleError(fmt.Errorf("Couldn't get key for job %#v: %v", job, err)) return 0, nil } var errCh chan error if active > parallelism { diff := active - parallelism errCh = make(chan error, diff) // 注意这里进行了map的设置 jm.expectations.ExpectDeletions(jobKey, int(diff)) glog.V(4).Infof("Too many pods running job %q, need %d, deleting %d", jobKey, parallelism, diff) // Sort the pods in the order such that not-ready < ready, unscheduled // < scheduled, and pending < running. This ensures that we delete pods // in the earlier stages whenever possible. sort.Sort(controller.ActivePods(activePods)) active -= diff wait := sync.WaitGroup{} wait.Add(int(diff)) for i := int32(0); i < diff; i++ { go func(ix int32) { defer wait.Done() if err := jm.podControl.DeletePod(job.Namespace, activePods[ix].Name, job); err != nil { defer utilruntime.HandleError(err) // Decrement the expected number of deletes because the informer won't observe this deletion glog.V(2).Infof("Failed to delete %v, decrementing expectations for job %q/%q", activePods[ix].Name, job.Namespace, job.Name) jm.expectations.DeletionObserved(jobKey) activeLock.Lock() active++ activeLock.Unlock() errCh <- err } }(i) } wait.Wait() } else if active < parallelism { wantActive := int32(0) if job.Spec.Completions == nil { // Job does not specify a number of completions. Therefore, number active // should be equal to parallelism, unless the job has seen at least // once success, in which leave whatever is running, running. if succeeded > 0 { wantActive = active } else { wantActive = parallelism } } else { // Job specifies a specific number of completions. Therefore, number // active should not ever exceed number of remaining completions. wantActive = *job.Spec.Completions - succeeded if wantActive > parallelism { wantActive = parallelism } } diff := wantActive - active if diff < 0 { utilruntime.HandleError(fmt.Errorf("More active than wanted: job %q, want %d, have %d", jobKey, wantActive, active)) diff = 0 } jm.expectations.ExpectCreations(jobKey, int(diff)) errCh = make(chan error, diff) glog.V(4).Infof("Too few pods running job %q, need %d, creating %d", jobKey, wantActive, diff) active += diff wait := sync.WaitGroup{} // Batch the pod creates. Batch sizes start at SlowStartInitialBatchSize // and double with each successful iteration in a kind of "slow start". // This handles attempts to start large numbers of pods that would // likely all fail with the same error. For example a project with a // low quota that attempts to create a large number of pods will be // prevented from spamming the API service with the pod create requests // after one of its pods fails. Conveniently, this also prevents the // event spam that those failures would generate. 
for batchSize := int32(integer.IntMin(int(diff), controller.SlowStartInitialBatchSize)); diff > 0; batchSize = integer.Int32Min(2*batchSize, diff) { errorCount := len(errCh) wait.Add(int(batchSize)) for i := int32(0); i < batchSize; i++ { go func() { defer wait.Done() err := jm.podControl.CreatePodsWithControllerRef(job.Namespace, &job.Spec.Template, job, metav1.NewControllerRef(job, controllerKind)) if err != nil && errors.IsTimeout(err) { // Pod is created but its initialization has timed out. // If the initialization is successful eventually, the // controller will observe the creation via the informer. // If the initialization fails, or if the pod keeps // uninitialized for a long time, the informer will not // receive any update, and the controller will create a new // pod when the expectation expires. return } if err != nil { defer utilruntime.HandleError(err) // Decrement the expected number of creates because the informer won't observe this pod glog.V(2).Infof("Failed creation, decrementing expectations for job %q/%q", job.Namespace, job.Name) jm.expectations.CreationObserved(jobKey) activeLock.Lock() active-- activeLock.Unlock() errCh <- err } }() } wait.Wait() // any skipped pods that we never attempted to start shouldn't be expected. skippedPods := diff - batchSize if errorCount < len(errCh) && skippedPods > 0 { glog.V(2).Infof("Slow-start failure. Skipping creation of %d pods, decrementing expectations for job %q/%q", skippedPods, job.Namespace, job.Name) active -= skippedPods for i := int32(0); i < skippedPods; i++ { // Decrement the expected number of creates because the informer won't observe this pod jm.expectations.CreationObserved(jobKey) } // The skipped pods will be retried later. The next controller resync will // retry the slow start process. break } diff -= batchSize } } select { case err := <-errCh: // all errors have been reported before, we only need to inform the controller that there was an error and it should re-try this job once more next time. if err != nil { return active, err } default: } return active, nil } ```
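上面"慢启动"的批量创建逻辑可以单独抽出来看:批次大小从 SlowStartInitialBatchSize(1)开始,每成功一批就翻倍,直到把 diff 消化完。下面是一个自包含的小演示(只计算批次大小,不真正创建 pod):

```go
package main

import "fmt"

func min32(a, b int32) int32 {
	if a < b {
		return a
	}
	return b
}

func main() {
	const slowStartInitialBatchSize int32 = 1 // 对应 controller.SlowStartInitialBatchSize
	diff := int32(10)                         // 假设还需要创建 10 个 pod

	for batch := min32(diff, slowStartInitialBatchSize); diff > 0; batch = min32(2*batch, diff) {
		fmt.Println("本批创建:", batch)
		diff -= batch
	}
	// 输出: 1, 2, 4, 3 —— 前面的小批次都成功,才会放大后面的批次;
	// 一旦某批出错,剩余的 diff 会直接跳过(对应代码里的 break)。
}
```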
### 4.总结 (1)jobController也利用expectations机制,在每次同步计算当前active pod的数量时进行了设置。 (2)然后pod的add, update, del 对map进行了修改 举例来说,如果一个job completions=4, parallelism=2。那么当这个job创建的时候: (1)发现map(expectations)中没有这个job,那么需要同步。 (2)通过manageJob,设置map中 add=2, del=0 (3)然后在创建了2个pod (4)每pod创建,add-1, 创建完2个pod后,add=0, del=0,表示job需要同步了。 这个就是expectations的精髓,这2个pod没创建完之前,这个job根本不需要同步。 (5)然后同步发现,当前确实只能运行2个pod,所以等着2个pod运行完后,触发下一轮更新,再创建2个pod (6)最后job运行完成 ================================================ FILE: k8s/kcm/6-namespaces controller-manager源码分析.md ================================================ Table of Contents ================= * [1. startNamespaceController](#1-startnamespacecontroller) * [2. NewNamespaceController](#2-newnamespacecontroller) * [3. Run](#3-run) * [3.1 syncNamespaceFromKey](#31-syncnamespacefromkey) * [3.2. deleteAllContent](#32-deleteallcontent) * [4 总结](#4-总结) 和其他kcm控制器组件一样,nsController还是在NewControllerInitializers定义然后启动。 ### 1. startNamespaceController cmd\kube-controller-manager\app\core.go 这里有很多的startController函数。这个就是 startControllers里面各个controller对应的init函数。 ns也是用了令牌桶进行限速。 ``` func startNamespaceController(ctx ControllerContext) (http.Handler, bool, error) { // the namespace cleanup controller is very chatty. It makes lots of discovery calls and then it makes lots of delete calls // the ratelimiter negatively affects its speed. Deleting 100 total items in a namespace (that's only a few of each resource // including events), takes ~10 seconds by default. nsKubeconfig := ctx.ClientBuilder.ConfigOrDie("namespace-controller") nsKubeconfig.QPS *= 20 nsKubeconfig.Burst *= 100 namespaceKubeClient := clientset.NewForConfigOrDie(nsKubeconfig) return startModifiedNamespaceController(ctx, namespaceKubeClient, nsKubeconfig) } func startModifiedNamespaceController(ctx ControllerContext, namespaceKubeClient clientset.Interface, nsKubeconfig *restclient.Config) (http.Handler, bool, error) { metadataClient, err := metadata.NewForConfig(nsKubeconfig) if err != nil { return nil, true, err } discoverResourcesFn := namespaceKubeClient.Discovery().ServerPreferredNamespacedResources namespaceController := namespacecontroller.NewNamespaceController( namespaceKubeClient, metadataClient, discoverResourcesFn, ctx.InformerFactory.Core().V1().Namespaces(), ctx.ComponentConfig.NamespaceController.NamespaceSyncPeriod.Duration, v1.FinalizerKubernetes, ) go namespaceController.Run(int(ctx.ComponentConfig.NamespaceController.ConcurrentNamespaceSyncs), ctx.Stop) return nil, true, nil } ``` (1)NewNamespaceController (2)nsController.Run
### 2. NewNamespaceController

```
// NewNamespaceController creates a new NamespaceController
func NewNamespaceController(
	kubeClient clientset.Interface,
	metadataClient metadata.Interface,
	discoverResourcesFn func() ([]*metav1.APIResourceList, error),
	namespaceInformer coreinformers.NamespaceInformer,
	resyncPeriod time.Duration,
	finalizerToken v1.FinalizerName) *NamespaceController {

	// create the controller so we can inject the enqueue function
	namespaceController := &NamespaceController{
		queue:                      workqueue.NewNamedRateLimitingQueue(nsControllerRateLimiter(), "namespace"),
		namespacedResourcesDeleter: deletion.NewNamespacedResourcesDeleter(kubeClient.CoreV1().Namespaces(), metadataClient, kubeClient.CoreV1(), discoverResourcesFn, finalizerToken),
	}

	if kubeClient != nil && kubeClient.CoreV1().RESTClient().GetRateLimiter() != nil {
		ratelimiter.RegisterMetricAndTrackRateLimiterUsage("namespace_controller", kubeClient.CoreV1().RESTClient().GetRateLimiter())
	}

	// configure the namespace informer event handlers
	namespaceInformer.Informer().AddEventHandlerWithResyncPeriod(
		cache.ResourceEventHandlerFuncs{
			AddFunc: func(obj interface{}) {
				namespace := obj.(*v1.Namespace)
				namespaceController.enqueueNamespace(namespace)
			},
			UpdateFunc: func(oldObj, newObj interface{}) {
				namespace := newObj.(*v1.Namespace)
				namespaceController.enqueueNamespace(namespace)
			},
		},
		resyncPeriod,
	)
	namespaceController.lister = namespaceInformer.Lister()
	namespaceController.listerSynced = namespaceInformer.Informer().HasSynced

	return namespaceController
}
```

NewNamespaceController主要就是初始化限速队列和namespacedResourcesDeleter,然后为namespace informer注册事件处理函数。这里只注册了AddFunc和UpdateFunc。

Q:delete为啥不监听?

A:因为ns都delete掉了,再监听也没什么意义。实际上删除ns时会先给它打上DeletionTimestamp,对informer来说这是一个update事件;等nsController清理完资源、移除finalizer之后,ns才会真正消失。
### 3. Run 定义好控制器之后就开始运行了。 ```go func (nm *NamespaceController) Run(workers int, stopCh <-chan struct{}) { defer utilruntime.HandleCrash() defer nm.queue.ShutDown() klog.Infof("Starting namespace controller") defer klog.Infof("Shutting down namespace controller") if !cache.WaitForNamedCacheSync("namespace", stopCh, nm.listerSynced) { return } klog.V(5).Info("Starting workers of namespace controller") for i := 0; i < workers; i++ { go wait.Until(nm.worker, time.Second, stopCh) } <-stopCh } ``` 还是一样,run完之后调用 worker,并发处理队列中的namespace。这里并没有processNextItem() 可以看出来,work函数会一直循环处理一个ns。知道这个ns从队列中被移除。 ``` // worker processes the queue of namespace objects. // Each namespace can be in the queue at most once. // The system ensures that no two workers can process // the same namespace at the same time. func (nm *NamespaceController) worker() { workFunc := func() bool { key, quit := nm.queue.Get() if quit { return true } defer nm.queue.Done(key) err := nm.syncNamespaceFromKey(key.(string)) if err == nil { // no error, forget this entry and return nm.queue.Forget(key) return false } if estimate, ok := err.(*deletion.ResourcesRemainingError); ok { t := estimate.Estimate/2 + 1 klog.V(4).Infof("Content remaining in namespace %s, waiting %d seconds", key, t) nm.queue.AddAfter(key, time.Duration(t)*time.Second) } else { // rather than wait for a full resync, re-add the namespace to the queue to be processed nm.queue.AddRateLimited(key) utilruntime.HandleError(fmt.Errorf("deletion of namespace %v failed: %v", key, err)) } return false } for { quit := workFunc() if quit { return } } } ```
#### 3.1 syncNamespaceFromKey syncNamespaceFromKey主要调用了nm.namespacedResourcesDeleter.Delete,它们的逻辑如下: (1)如果namespace不存在,返回nil (2)如果namespace没有DeletionTimestamp字段,返回nil (3)可以删除的话,先删除namespaces下所以的资源,如果某一个资源删除需要等待,返回一个ResourcesRemainingError (4)所有的资源删除完后,删除namespace。 ``` // syncNamespaceFromKey looks for a namespace with the specified key in its store and synchronizes it func (nm *NamespaceController) syncNamespaceFromKey(key string) (err error) { startTime := time.Now() defer func() { glog.V(4).Infof("Finished syncing namespace %q (%v)", key, time.Since(startTime)) }() // 1.如果namespace不存在,返回nil namespace, err := nm.lister.Get(key) if errors.IsNotFound(err) { glog.Infof("Namespace has been deleted %v", key) return nil } if err != nil { utilruntime.HandleError(fmt.Errorf("Unable to retrieve namespace %v from store: %v", key, err)) return err } return nm.namespacedResourcesDeleter.Delete(namespace.Name) } ``` Delete函数的主要逻辑就是:如果ns不需要删除就返回,需要删除就先删除资源,再删除ns。 具体为: (1)ns没有DeletionTimestamp不做任何操作 (2)如果没有Finalizers,也不处理 (3) ``` // Delete deletes all resources in the given namespace. // Before deleting resources: // * It ensures that deletion timestamp is set on the // namespace (does nothing if deletion timestamp is missing). // * Verifies that the namespace is in the "terminating" phase // (updates the namespace phase if it is not yet marked terminating) // After deleting the resources: // * It removes finalizer token from the given namespace. // // Returns an error if any of those steps fail. // Returns ResourcesRemainingError if it deleted some resources but needs // to wait for them to go away. // Caller is expected to keep calling this until it succeeds. func (d *namespacedResourcesDeleter) Delete(nsName string) error { // Multiple controllers may edit a namespace during termination // first get the latest state of the namespace before proceeding // if the namespace was deleted already, don't do anything namespace, err := d.nsClient.Get(nsName, metav1.GetOptions{}) if err != nil { if errors.IsNotFound(err) { return nil } return err } // 1.ns没有DeletionTimestamp不做任何操作。 if namespace.DeletionTimestamp == nil { return nil } klog.V(5).Infof("namespace controller - syncNamespace - namespace: %s, finalizerToken: %s", namespace.Name, d.finalizerToken) // ensure that the status is up to date on the namespace // if we get a not found error, we assume the namespace is truly gone // 2. 对ns的状态进行修改。retryOnConflictError是一个通用的函数,updateNamespaceStatusFunc是实际修改的 // 函数,这里就是修改ns的 phase为Terminating namespace, err = d.retryOnConflictError(namespace, d.updateNamespaceStatusFunc) if err != nil { if errors.IsNotFound(err) { return nil } return err } // the latest view of the namespace asserts that namespace is no longer deleting.. if namespace.DeletionTimestamp.IsZero() { return nil } // 2.如果没有Finalizers,也不处理 // return if it is already finalized. 
if finalized(namespace) { return nil } // 3.开始删除ns下的所有资源,estimate表示有多少个资源删除不了 // there may still be content for us to remove estimate, err := d.deleteAllContent(namespace) if err != nil { return err } if estimate > 0 { return &ResourcesRemainingError{estimate} } // 移除finalize,然后apiserver能够删除 // we have removed content, so mark it finalized by us _, err = d.retryOnConflictError(namespace, d.finalizeNamespace) if err != nil { // in normal practice, this should not be possible, but if a deployment is running // two controllers to do namespace deletion that share a common finalizer token it's // possible that a not found could occur since the other controller would have finished the delete. if errors.IsNotFound(err) { return nil } return err } return nil } ```
#### 3.2. deleteAllContent 这里是用了 dynamic client 一个一个的删除所有的对象, 这里看起来就是并行的 ``` // deleteAllContent will use the dynamic client to delete each resource identified in groupVersionResources. // It returns an estimate of the time remaining before the remaining resources are deleted. // If estimate > 0, not all resources are guaranteed to be gone. func (d *namespacedResourcesDeleter) deleteAllContent(ns *v1.Namespace) (int64, error) { namespace := ns.Name namespaceDeletedAt := *ns.DeletionTimestamp var errs []error conditionUpdater := namespaceConditionUpdater{} estimate := int64(0) klog.V(4).Infof("namespace controller - deleteAllContent - namespace: %s", namespace) resources, err := d.discoverResourcesFn() if err != nil { // discovery errors are not fatal. We often have some set of resources we can operate against even if we don't have a complete list errs = append(errs, err) conditionUpdater.ProcessDiscoverResourcesErr(err) } // TODO(sttts): get rid of opCache and pass the verbs (especially "deletecollection") down into the deleter deletableResources := discovery.FilteredBy(discovery.SupportsAllVerbs{Verbs: []string{"delete"}}, resources) groupVersionResources, err := discovery.GroupVersionResources(deletableResources) if err != nil { // discovery errors are not fatal. We often have some set of resources we can operate against even if we don't have a complete list errs = append(errs, err) conditionUpdater.ProcessGroupVersionErr(err) } numRemainingTotals := allGVRDeletionMetadata{ gvrToNumRemaining: map[schema.GroupVersionResource]int{}, finalizersToNumRemaining: map[string]int{}, } for gvr := range groupVersionResources { gvrDeletionMetadata, err := d.deleteAllContentForGroupVersionResource(gvr, namespace, namespaceDeletedAt) if err != nil { // If there is an error, hold on to it but proceed with all the remaining // groupVersionResources. errs = append(errs, err) conditionUpdater.ProcessDeleteContentErr(err) } if gvrDeletionMetadata.finalizerEstimateSeconds > estimate { estimate = gvrDeletionMetadata.finalizerEstimateSeconds } if gvrDeletionMetadata.numRemaining > 0 { numRemainingTotals.gvrToNumRemaining[gvr] = gvrDeletionMetadata.numRemaining for finalizer, numRemaining := range gvrDeletionMetadata.finalizersToNumRemaining { if numRemaining == 0 { continue } numRemainingTotals.finalizersToNumRemaining[finalizer] = numRemainingTotals.finalizersToNumRemaining[finalizer] + numRemaining } } } conditionUpdater.ProcessContentTotals(numRemainingTotals) // we always want to update the conditions because if we have set a condition to "it worked" after it was previously, "it didn't work", // we need to reflect that information. Recall that additional finalizers can be set on namespaces, so this finalizer may clear itself and // NOT remove the resource instance. if hasChanged := conditionUpdater.Update(ns); hasChanged { if _, err = d.nsClient.UpdateStatus(ns); err != nil { utilruntime.HandleError(fmt.Errorf("couldn't update status condition for namespace %q: %v", namespace, err)) } } // if len(errs)==0, NewAggregate returns nil. klog.V(4).Infof("namespace controller - deleteAllContent - namespace: %s, estimate: %v, errors: %v", namespace, estimate, utilerrors.NewAggregate(errs)) return estimate, utilerrors.NewAggregate(errs) } ```
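排查某个 namespace 一直处于 Terminating 时,可以参照 deleteAllContent 的思路,手动遍历所有 namespaced 资源,看看还剩哪些没删掉(示例命令,假设 ns 为 zoux):

```
kubectl api-resources --verbs=list --namespaced -o name \
  | xargs -n 1 kubectl get -n zoux --ignore-not-found --show-kind
```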
### 4 总结

(1)nsController只处理add, update事件中的ns,而且只针对设置了deletionTimestamp的ns做实际操作。

(2)如果ns带有deletionTimestamp,nsController做的操作为:

* 第一,修改ns的状态(phase)为Terminating
* 第二,删除该ns下的所有资源(deleteAllContent)
* 第三,资源删除完后,移除ns的finalizer,之后apiserver才会真正删掉这个ns
为什么会这样,这就得补充一下ns的基本特征: (1)ns在删除之前,有一个finalizers:kubernetes。并且phase: Active finalizers:kubernetes的作用就是,删除ns的时候会卡住,得等nsController删除里该命名空间下的所有资源才会移除 ``` //删除前 root@k8s-master:~/testyaml# kubectl get ns zoux -oyaml -w apiVersion: v1 kind: Namespace metadata: creationTimestamp: "2021-07-20T03:18:44Z" name: zoux resourceVersion: "9449396" selfLink: /api/v1/namespaces/zoux uid: 12fab759-0cda-4d98-97db-330cbf407e15 spec: finalizers: - kubernetes status: phase: Active //删除中 apiVersion: v1 kind: Namespace metadata: creationTimestamp: "2021-07-20T03:18:44Z" deletionTimestamp: "2021-07-20T03:20:07Z" name: zoux resourceVersion: "9449629" selfLink: /api/v1/namespaces/zoux uid: 12fab759-0cda-4d98-97db-330cbf407e15 spec: finalizers: - kubernetes status: phase: Terminating ``` ================================================ FILE: k8s/kcm/9-kubernetes污点和容忍度概念介绍.md ================================================ ### 1. 概念介绍 **污点(Taint)** 应用于node身上,表示该节点有污点了,如果不能忍受这个污点的pod,你就不要调度/运行到这个节点上。如果是不能运行到这个节点上,那就是污点驱逐了。 **容忍度(Toleration)** 是应用于 Pod 上的。容忍度允许调度器调度带有对应污点的 Pod。或者允许这个pod继续运行到这个节点上。 可以看出来,污点和容忍度(Toleration)相互配合,可以用来避免 Pod 被分配/运行到不合适的节点上。 每个节点上都可以应用一个或多个污点,每个pod也是可以应用一个或多个容忍度。 ### 2. 污点详解 污点总共由4个字段组成: **key, value字段**:可以任意字符。这个可以自定义。 **Effect**:NoExecute,PreferNoSchedule,NoSchedule 三选一 * NoExecute表示不能运行污点,意思是如果该节点有这种污点,但是pod没有对应的容忍度,那么这个pod是会被驱逐的 * NoSchedule表示不能调度污点,意思是如果该节点有这种污点,pod没有对应的容忍度,那么在调度的时候,这个pod是不会考虑这个节点的 * PreferNoSchedule 是NoSchedule的软化版。意思是如果该节点有这种污点,pod没有对应的容忍度,那么在调度的时候,这个pod不会优先考虑这个节点,但是如果实在没有节点可用,它还是接受调度到该节点上的。 **TimeAdded** : 这个污点是什么时候加的 ``` // The node this Taint is attached to has the "effect" on // any pod that does not tolerate the Taint. type Taint struct { // Required. The taint key to be applied to a node. Key string `json:"key" protobuf:"bytes,1,opt,name=key"` // Required. The taint value corresponding to the taint key. // +optional Value string `json:"value,omitempty" protobuf:"bytes,2,opt,name=value"` // Required. The effect of the taint on pods // that do not tolerate the taint. // Valid effects are NoSchedule, PreferNoSchedule and NoExecute. Effect TaintEffect `json:"effect" protobuf:"bytes,3,opt,name=effect,casttype=TaintEffect"` // TimeAdded represents the time at which the taint was added. // It is only written for NoExecute taints. // +optional TimeAdded *metav1.Time `json:"timeAdded,omitempty" protobuf:"bytes,4,opt,name=timeAdded"` } ``` 添加污点的方式也很简单: ``` kubectl taint nodes node1 key1=value1:NoSchedule kubectl taint nodes node1 key1=value1:NoExecute ``` **k8s默认污点** - node.kubernetes.io/not-ready:节点未准备好,相当于节点状态Ready的值为False。 - node.kubernetes.io/unreachable:Node Controller访问不到节点,相当于节点状态Ready的值为Unknown - node.kubernetes.io/out-of-disk:节点磁盘耗尽 - node.kubernetes.io/memory-pressure:节点存在内存压力 - node.kubernetes.io/disk-pressure:节点存在磁盘压力 - node.kubernetes.io/network-unavailable:节点网络不可达 - node.kubernetes.io/unschedulable:节点不可调度 - node.cloudprovider.kubernetes.io/uninitialized:如果Kubelet启动时指定了一个外部的cloudprovider,它将给当前节点添加一个Taint将其标记为不可用。在cloud-controller-manager的一个controller初始化这个节点后,Kubelet将删除这个Taint
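补充两个和污点相关的常用操作(示例,节点名以 node1 为例):查看节点上的污点,以及去掉一个污点(在 effect 后面加 `-`):

```
# 查看节点污点
kubectl describe node node1 | grep -A 3 Taints

# 删除污点:在原来的污点后面加一个 "-"
kubectl taint nodes node1 key1=value1:NoSchedule-
```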
### 3. 容忍度详解

```
// Toleration represents the toleration object that can be attached to a pod.
// The pod this Toleration is attached to tolerates any taint that matches
// the triple <key,value,effect> using the matching operator <operator>.
type Toleration struct {
	// Key is the taint key that the toleration applies to. Empty means match all taint keys.
	// If the key is empty, operator must be Exists; this combination means to match all values and all keys.
	// +optional
	Key string
	// Operator represents a key's relationship to the value.
	// Valid operators are Exists and Equal. Defaults to Equal.
	// Exists is equivalent to wildcard for value, so that a pod can
	// tolerate all taints of a particular category.
	// +optional
	Operator TolerationOperator
	// Value is the taint value the toleration matches to.
	// If the operator is Exists, the value should be empty, otherwise just a regular string.
	// +optional
	Value string
	// Effect indicates the taint effect to match. Empty means match all taint effects.
	// When specified, allowed values are NoSchedule, PreferNoSchedule and NoExecute.
	// +optional
	Effect TaintEffect
	// TolerationSeconds represents the period of time the toleration (which must be
	// of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default,
	// it is not set, which means tolerate the taint forever (do not evict). Zero and
	// negative values will be treated as 0 (evict immediately) by the system.
	// +optional
	TolerationSeconds *int64
}
```

容忍度应用在pod身上,可以看出来,相比污点,多了2个字段:

**Operator**: string类型,Exists,Equal 二选一。`operator` 的默认值是 `Equal`。

一个容忍度和一个污点相"匹配"是指它们有一样的键名和效果,并且:

- 如果 `operator` 是 `Exists`(此时容忍度不能指定 `value`),例如这种

```
tolerations:
- key: "key1"
  operator: "Exists"
  effect: "NoSchedule"
```

- 如果 `operator` 是 `Equal`,则它们的 `value` 应该相等。例如这种

```
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoSchedule"
```

**TolerationSeconds**: 容忍时间。表示在被驱逐之前,pod还能在这个带污点的节点上继续运行多久。只针对 NoExecute 类型的污点生效,示例见下面的代码块。
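下面是一个带 tolerationSeconds 的示例(示意):表示 pod 能容忍 `key1=value1:NoExecute` 这个污点,但最多容忍 3600 秒,超时后会被驱逐:

```
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600
```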
**说明:** 存在两种特殊情况:

如果一个容忍度的 `key` 为空且 `operator` 为 `Exists`, 表示这个容忍度与任意的 key、value 和 effect 都匹配,即这个容忍度能容忍任何污点。

如果 `effect` 为空,则表示与键名为 `key1` 的污点的所有效果(effect)都匹配。

**TolerationSeconds:** 容忍时间。如果没有设置,默认表示永远容忍(不驱逐);设置为 0 或负数会被当作 0 处理(立即驱逐)。

### 4. 污点驱逐

污点驱逐:node在运行过程中,被设置了NoExecute的污点,但是上面运行的pod没有对应的容忍度,因此需要将这些pod删除。

kcm中是NodeLifecycleController控制污点驱逐的。默认是开启的,如下参数默认都是true。

```
--enable-taint-manager=true
--feature-gates=TaintBasedEvictions=true
```


================================================
FILE: k8s/kcm/kcm篇源码分析总结.md
================================================

目前为止,kcm篇源码分析共hpa, gc, deploy, rs, job, ns 6个主要的控制器。通过这些源码分析,总结下目前的工作:

(1)更了解kcm的机制。kcm就是一堆控制器的结合。每个控制器只干自己相关的事情,通过控制器的共同操作,让集群中的资源达到期望状态

(2)对以后问题的排查,或者需求的开发积累了经验。

* 例如,通过gc篇,以后k8s集群中gc资源出现了问题,可以马上定位修复
* rs的expectations机制,informer机制等都可以借鉴代码

目前kcm打算就分析到这里,原因在于:

* 通过这6个控制器已经了解了kcm控制器的主体运行逻辑,以后有需求,分析其他控制器的源码也非常容易,以后有需要再补充
* 将精力放到其他组件的源码分析上


================================================
FILE: k8s/kube-apiserver/0-apiserver笔记规划.md
================================================

本章节的目标就是弄懂kube-apiserver的实现细节。从本质来说,kube-apiserver就是一个go server服务器端。

假设我要实现kube-apiserver,我想到要考虑以下的事情:

(1)apiserver的启动流程是怎么样的

(2)k8s这么多资源,是怎么注册的,如何进行多版本的资源管理

(3)如何和etcd存储打通

(4)一个request,经历了哪些流程

(5)认证,授权,Admission是如何实现的

(6)apiserver是如何处理create, update, delete请求的
因此这章节的目标就是弄清楚上诉的问题 ================================================ FILE: k8s/kube-apiserver/1-v1.17 kube-apiserver启动参数介绍.md ================================================ 摘自:https://v1-17.docs.kubernetes.io/zh/docs/reference/command-line-tools-reference/kube-apiserver/ - –etcd-servers:etcd集群地址 - –bind-address:监听地址 - –secure-port:https安全端口 - –advertise-address:集群通告地址 - –allow-privileged:启用授权 - –service-cluster-ip-range:Service虚拟IP地址段 - –enable-admission-plugins:准入控制模块 - –authorization-mode:认证授权,启用RBAC授权和节点自管理 - –enable-bootstrap-token-auth:启用TLS bootstrap机制 - –token-auth-file:bootstrap token文件 - –service-node-port-range:Service nodeport类型默认分配端口范围 - –kubelet-client-xxx:apiserver访问kubelet客户端证书 - –tls-xxx-file:apiserver https证书 - –etcd-xxxfile:连接Etcd集群证书 - –audit-log-xxx:审计日志 ```f --admission-control stringSlice 控制资源进入集群的准入控制插件的顺序列表。逗号分隔的 NamespaceLifecycle 列表。(默认值 [AlwaysAdmit]) --admission-control-config-file string 包含准入控制配置的文件。 --advertise-address ip 向集群成员通知 apiserver 消息的 IP 地址。这个地址必须能够被集群中其他成员访问。如果 IP 地址为空,将会使用 --bind-address,如果未指定 --bind-address,将会使用主机的默认接口地址。 --allow-privileged 如果为 true, 将允许特权容器。 --anonymous-auth 启用到 API server 的安全端口的匿名请求。未被其他认证方法拒绝的请求被当做匿名请求。匿名请求的用户名为 system:anonymous,用户组名为 system:unauthenticated。(默认值 true) --apiserver-count int 集群中运行的 apiserver 数量,必须为正数。(默认值 1) --audit-log-maxage int 基于文件名中的时间戳,旧审计日志文件的最长保留天数。 --audit-log-maxbackup int 旧审计日志文件的最大保留个数。 --audit-log-maxsize int 审计日志被轮转前的最大兆字节数。 --audit-log-path string 如果设置该值,所有到 apiserver 的请求都将会被记录到这个文件。'-' 表示记录到标准输出。 --audit-policy-file string 定义审计策略配置的文件的路径。需要打开 'AdvancedAuditing' 特性开关。AdvancedAuditing 需要一个配置来启用审计功能。 --audit-webhook-config-file string 一个具有 kubeconfig 格式文件的路径,该文件定义了审计的 webhook 配置。需要打开 'AdvancedAuditing' 特性开关。 --audit-webhook-mode string 发送审计事件的策略。 Blocking 模式表示正在发送事件时应该阻塞服务器的响应。 Batch 模式使 webhook 异步缓存和发送事件。 Known 模式为 batch,blocking。(默认值 "batch") --authentication-token-webhook-cache-ttl duration 从 webhook 令牌认证者获取的响应的缓存时长。( 默认值 2m0s) --authentication-token-webhook-config-file string 包含 webhook 配置的文件,用于令牌认证,具有 kubeconfig 格式。API server 将查询远程服务来决定对 bearer 令牌的认证。 --authorization-mode string 在安全端口上进行权限验证的插件的顺序列表。以逗号分隔的列表,包括:AlwaysAllow,AlwaysDeny,ABAC,Webhook,RBAC,Node.(默认值 "AlwaysAllow") --authorization-policy-file string 包含权限验证策略的 csv 文件,和 --authorization-mode=ABAC 一起使用,作用在安全端口上。 --authorization-webhook-cache-authorized-ttl duration 从 webhook 授权者获得的 'authorized' 响应的缓存时长。(默认值 5m0s) --authorization-webhook-cache-unauthorized-ttl duration 从 webhook 授权者获得的 'unauthorized' 响应的缓存时长。(默认值 30s) --authorization-webhook-config-file string 包含 webhook 配置的 kubeconfig 格式文件,和 --authorization-mode=Webhook 一起使用。API server 将查询远程服务来决定对 API server 安全端口的访问。 --azure-container-registry-config string 包含 Azure 容器注册表配置信息的文件的路径。 --basic-auth-file string 如果设置该值,这个文件将会被用于准许通过 http 基本认证到 API server 安全端口的请求。 --bind-address ip 监听 --seure-port 的 IP 地址。被关联的接口必须能够被集群其它节点和 CLI/web 客户端访问。如果为空,则将使用所有接口(0.0.0.0)。(默认值 0.0.0.0) --cert-dir string 存放 TLS 证书的目录。如果提供了 --tls-cert-file 和 --tls-private-key-file 选项,该标志将被忽略。(默认值 "/var/run/kubernetes") --client-ca-file string 如果设置此标志,对于任何请求,如果存包含 client-ca-file 中的 authorities 签名的客户端证书,将会使用客户端证书中的 CommonName 对应的身份进行认证。 --cloud-config string 云服务提供商配置文件路径。空字符串表示无配置文件 . --cloud-provider string 云服务提供商,空字符串表示无提供商。 --contention-profiling 如果已经启用 profiling,则启用锁竞争 profiling。 --cors-allowed-origins stringSlice CORS 的域列表,以逗号分隔。合法的域可以是一个匹配子域名的正则表达式。如果这个列表为空则不会启用 CORS. 
--delete-collection-workers int 用于 DeleteCollection 调用的工作者数量。这被用于加速 namespace 的清理。( 默认值 1) --deserialization-cache-size int 在内存中缓存的反序列化 json 对象的数量。 --enable-aggregator-routing 打开到 endpoints IP 的 aggregator 路由请求,替换 cluster IP。 --enable-garbage-collector 启用通用垃圾回收器 . 必须与 kube-controller-manager 对应的标志保持同步。 (默认值 true) --enable-logs-handler 如果为 true,则为 apiserver 日志功能安装一个 /logs 处理器。(默认值 true) --enable-swagger-ui 在 apiserver 的 /swagger-ui 路径启用 swagger ui。 --etcd-cafile string 用于保护 etcd 通信的 SSL CA 文件。 --etcd-certfile string 用于保护 etcd 通信的的 SSL 证书文件。 --etcd-keyfile string 用于保护 etcd 通信的 SSL 密钥文件 . --etcd-prefix string 附加到所有 etcd 中资源路径的前缀。 (默认值 "/registry") --etcd-quorum-read 如果为 true, 启用 quorum 读。 --etcd-servers stringSlice 连接的 etcd 服务器列表 , 形式为(scheme://ip:port),使用逗号分隔。 --etcd-servers-overrides stringSlice 针对单个资源的 etcd 服务器覆盖配置 , 以逗号分隔。 单个配置覆盖格式为 : group/resource#servers, 其中 servers 形式为 http://ip:port, 以分号分隔。 --event-ttl duration 事件驻留时间。(默认值 1h0m0s) --enable-bootstrap-token-auth 启用此选项以允许 'kube-system' 命名空间中的 'bootstrap.kubernetes.io/token' 类型密钥可以被用于 TLS 的启动认证。 --experimental-encryption-provider-config string 包含加密提供程序的配置的文件,该加密提供程序被用于在 etcd 中保存密钥。 --external-hostname string 为此 master 生成外部 URL 时使用的主机名 ( 例如 Swagger API 文档 )。 --feature-gates mapStringBool 一个描述 alpha/experimental 特性开关的键值对列表。 选项包括 : Accelerators=true|false (ALPHA - default=false) AdvancedAuditing=true|false (ALPHA - default=false) AffinityInAnnotations=true|false (ALPHA - default=false) AllAlpha=true|false (ALPHA - default=false) AllowExtTrafficLocalEndpoints=true|false (default=true) AppArmor=true|false (BETA - default=true) DynamicKubeletConfig=true|false (ALPHA - default=false) DynamicVolumeProvisioning=true|false (ALPHA - default=true) ExperimentalCriticalPodAnnotation=true|false (ALPHA - default=false) ExperimentalHostUserNamespaceDefaulting=true|false (BETA - default=false) LocalStorageCapacityIsolation=true|false (ALPHA - default=false) PersistentLocalVolumes=true|false (ALPHA - default=false) RotateKubeletClientCertificate=true|false (ALPHA - default=false) RotateKubeletServerCertificate=true|false (ALPHA - default=false) StreamingProxyRedirects=true|false (BETA - default=true) TaintBasedEvictions=true|false (ALPHA - default=false) --google-json-key string 用于认证的 Google Cloud Platform 服务账号的 JSON 密钥。 --insecure-allow-any-token username/group1,group2 如果设置该值 , 你的服务将处于非安全状态。任何令牌都将会被允许,并将从令牌中把用户信息解析成为 username/group1,group2。 --insecure-bind-address ip 用于监听 --insecure-port 的 IP 地址 ( 设置成 0.0.0.0 表示监听所有接口 )。(默认值 127.0.0.1) --insecure-port int 用于监听不安全和为认证访问的端口。这个配置假设你已经设置了防火墙规则,使得这个端口不能从集群外访问。对集群的公共地址的 443 端口的访问将被代理到这个端口。默认设置中使用 nginx 实现。(默认值 8080) --kubelet-certificate-authority string 证书 authority 的文件路径。 --kubelet-client-certificate string 用于 TLS 的客户端证书文件路径。 --kubelet-client-key string 用于 TLS 的客户端证书密钥文件路径 . --kubelet-https 为 kubelet 启用 https。 (默认值 true) --kubelet-preferred-address-types stringSlice 用于 kubelet 连接的首选 NodeAddressTypes 列表。 ( 默认值[Hostname,InternalDNS,InternalIP,ExternalDNS,ExternalIP]) --kubelet-read-only-port uint 已废弃 : kubelet 端口 . 
(默认值 10255) --kubelet-timeout duration kubelet 操作超时时间。(默认值 5s) --kubernetes-service-node-port int 如果不为 0,Kubernetes master 服务(用于创建 / 管理 apiserver)将会使用 NodePort 类型,并将这个值作为端口号。如果为 0,Kubernetes master 服务将会使用 ClusterIP 类型。 --master-service-namespace string 已废弃 : 注入到 pod 中的 kubernetes master 服务的命名空间。(默认值 "default") --max-connection-bytes-per-sec int 如果不为 0,每个用户连接将会被限速为该值(bytes/sec)。当前只应用于长时间运行的请求。 --max-mutating-requests-inflight int 在给定时间内进行中可变请求的最大数量。当超过该值时,服务将拒绝所有请求。0 值表示没有限制。(默认值 200) --max-requests-inflight int 在给定时间内进行中不可变请求的最大数量。当超过该值时,服务将拒绝所有请求。0 值表示没有限制。(默认值 400) --min-request-timeout int 一个可选字段,表示一个 handler 在一个请求超时前,必须保持它处于打开状态的最小秒数。当前只对监听请求 handler 有效,它基于这个值选择一个随机数作为连接超时值,以达到分散负载的目的(默认值 1800)。 --oidc-ca-file string 如果设置该值,将会使用 oidc-ca-file 中的任意一个 authority 对 OpenID 服务的证书进行验证,否则将会使用主机的根 CA 对其进行验证。 --oidc-client-id string 使用 OpenID 连接的客户端的 ID,如果设置了 oidc-issuer-url,则必须设置这个值。 --oidc-groups-claim string 如果提供该值,这个自定义 OpenID 连接名将指定给特定的用户组。该声明值需要是一个字符串或字符串数组。此标志为实验性的,请查阅验证相关文档进一步了解详细信息。 --oidc-issuer-url string OpenID 颁发者 URL,只接受 HTTPS 方案。如果设置该值,它将被用于验证 OIDC JSON Web Token(JWT)。 --oidc-username-claim string 用作用户名的 OpenID 声明值。注意,不保证除默认 ('sub') 外的其他声明值的唯一性和不变性。此标志为实验性的,请查阅验证相关文档进一步了解详细信息。 --profiling 在 web 接口 host:port/debug/pprof/ 上启用 profiling。(默认值 true) --proxy-client-cert-file string 当必须调用外部程序时,用于证明 aggregator 或者 kube-apiserver 的身份的客户端证书。包括代理到用户 api-server 的请求和调用 webhook 准入控制插件的请求。它期望这个证书包含一个来自于 CA 中的 --requestheader-client-ca-file 标记的签名。该 CA 在 kube-system 命名空间的 'extension-apiserver-authentication' configmap 中发布。从 Kube-aggregator 收到调用的组件应该使用该 CA 进行他们部分的双向 TLS 验证。 --proxy-client-key-file string 当必须调用外部程序时,用于证明 aggregator 或者 kube-apiserver 的身份的客户端证书密钥。包括代理到用户 api-server 的请求和调用 webhook 准入控制插件的请求。 --repair-malformed-updates 如果为 true,服务将会尽力修复更新请求以通过验证,例如:将更新请求 UID 的当前值设置为空。在我们修复了所有发送错误格式请求的客户端后,可以关闭这个标志。 --requestheader-allowed-names stringSlice 使用 --requestheader-username-headers 指定的,允许在头部提供用户名的客户端证书通用名称列表。如果为空,任何通过 --requestheader-client-ca-file 中 authorities 验证的客户端证书都是被允许的。 --requestheader-client-ca-file string 在信任请求头中以 --requestheader-username-headers 指示的用户名之前,用于验证接入请求中客户端证书的根证书捆绑。 --requestheader-extra-headers-prefix stringSlice 用于检查的请求头的前缀列表。建议使用 X-Remote-Extra-。 --requestheader-group-headers stringSlice 用于检查群组的请求头列表。建议使用 X-Remote-Group. --requestheader-username-headers stringSlice 用于检查用户名的请求头列表。建议使用 X-Remote-User。 --runtime-config mapStringString 传递给 apiserver 用于描述运行时配置的键值对集合。 apis/ 键可以被用来打开 / 关闭特定的 api 版本。apis// 键被用来打开 / 关闭特定的资源 . api/all 和 api/legacy 键分别用于控制所有的和遗留的 api 版本 . --secure-port int 用于监听具有认证授权功能的 HTTPS 协议的端口。如果为 0,则不会监听 HTTPS 协议。 (默认值 6443) --service-account-key-file stringArray 包含 PEM 加密的 x509 RSA 或 ECDSA 私钥或公钥的文件,用于验证 ServiceAccount 令牌。如果设置该值,--tls-private-key-file 将会被使用。指定的文件可以包含多个密钥,并且这个标志可以和不同的文件一起多次使用。 --service-cluster-ip-range ipNet CIDR 表示的 IP 范围,服务的 cluster ip 将从中分配。 一定不要和分配给 nodes 和 pods 的 IP 范围产生重叠。 --ssh-keyfile string 如果不为空,在使用安全的 SSH 代理访问节点时,将这个文件作为用户密钥文件。 --storage-backend string 持久化存储后端。 选项为 : 'etcd3' ( 默认 ), 'etcd2'. --storage-media-type string 在存储中保存对象的媒体类型。某些资源或者存储后端可能仅支持特定的媒体类型,并且忽略该配置项。(默认值 "application/vnd.kubernetes.protobuf") --storage-versions string 按组划分资源存储的版本。 以 "group1/version1,group2/version2,..." 的格式指定。当对象从一组移动到另一组时 , 你可以指定 "group1=group2/v1beta1,group3/v1beta1,..." 
的格式。你只需要传入你希望从结果中改变的组的列表。默认为从 KUBE_API_VERSIONS 环境变量集成而来,所有注册组的首选版本列表。 (默认值 "admission.k8s.io/v1alpha1,admissionregistration.k8s.io/v1alpha1,apps/v1beta1,authentication.k8s.io/v1,authorization.k8s.io/v1,autoscaling/v1,batch/v1,certificates.k8s.io/v1beta1,componentconfig/v1alpha1,extensions/v1beta1,federation/v1beta1,imagepolicy.k8s.io/v1alpha1,networking.k8s.io/v1,policy/v1beta1,rbac.authorization.k8s.io/v1beta1,settings.k8s.io/v1alpha1,storage.k8s.io/v1,v1") --target-ram-mb int apiserver 内存限制,单位为 MB( 用于配置缓存大小等 )。 --tls-ca-file string 如果设置该值,这个证书 authority 将会被用于从 Admission Controllers 过来的安全访问。它必须是一个 PEM 加密的合法 CA 捆绑包。此外 , 该证书 authority 可以被添加到以 --tls-cert-file 提供的证书文件中 . --tls-cert-file string 包含用于 HTTPS 的默认 x509 证书的文件。(如果有 CA 证书,则附加于 server 证书之后)。如果启用了 HTTPS 服务,并且没有提供 --tls-cert-file 和 --tls-private-key-file,则将为公共地址生成一个自签名的证书和密钥并保存于 /var/run/kubernetes 目录。 --tls-private-key-file string 包含匹配 --tls-cert-file 的 x509 证书私钥的文件。 --tls-sni-cert-key namedCertKey 一对 x509 证书和私钥的文件路径 , 可以使用符合正式域名的域形式作为后缀。 如果没有提供域形式后缀 , 则将提取证书名。 非通配符版本优先于通配符版本 , 显示的域形式优先于证书中提取的名字。 对于多个密钥 / 证书对, 请多次使用 --tls-sni-cert-key。例如 : "example.crt,example.key" or "foo.crt,foo.key:*.foo.com,foo.com". (默认值[]) --token-auth-file string 如果设置该值,这个文件将被用于通过令牌认证来保护 API 服务的安全端口。 --version version[=true] 打印版本信息并退出。 --watch-cache 启用 apiserver 的监视缓存。(默认值 true) --watch-cache-sizes stringSlice 每种资源(pods, nodes 等)的监视缓存大小列表,以逗号分隔。每个缓存配置的形式为:resource#size,size 是一个数字。在 watch-cache 启用时生效。 ``` ================================================ FILE: k8s/kube-apiserver/10-kube-apiserver创建AggregatorServer.md ================================================ * [1\. kube\-apiserver 背景介绍](#1-kube-apiserver-背景介绍) * [2\. CreateAggregatorServer源码分析](#2-createaggregatorserver源码分析) * [2\.1 NewWithDelegate](#21-newwithdelegate) * [2\.1\.1 apiserviceRegistrationController\-处理APIService对象的增删改](#211-apiserviceregistrationcontroller-处理apiservice对象的增删改) * [2\.1\.2 availableController](#212-availablecontroller) * [2\.2 创建autoRegistrationController](#22-创建autoregistrationcontroller) * [2\.2\.1 checkAPIService](#221-checkapiservice) * [2\.2\.2 为什么需要这个](#222-为什么需要这个) * [2\.3 crdRegistrationController](#23-crdregistrationcontroller) * [2\.4 openAPIAggregationController](#24-openapiaggregationcontroller) * [3\. 总结](#3-总结) **本章重点:**分析第六个流程,创建APIExtensionsServer kube-apiserver整体启动流程如下: (1)资源注册。 (2)Cobra命令行参数解析 (3)创建APIServer通用配置 (4)创建APIExtensionsServer (5)创建KubeAPIServer (6)创建AggregatorServer (7)启动HTTP服务。 (8)启动HTTPS服务 ### 1. kube-apiserver 背景介绍 kube-apiserver其实是包含了3个server: aggregator、apiserver、apiExtensionsServer。通过聚合的方式,对外变成一个kube-apisever对外提供服务。 举例说明,如下图: ![image-20220703162508208](../images/apiserver-73.png) 当一个请求来的时候,首先经过的是aggregatorServer,aggregatorServer会判断这个服务是否是需要本地处理,如果需要本地处理,就放行到apiserver这一层,处理内置资源(pod, node, svc等等)。如果不是内置资源,那就是CRD资源,转到extensionsSever处理。 **怎么判断是否是本地服务呢?-通过APIService对象** K8s 有个资源对象叫做APIService, 这个资源就是表示当前支持的服务类型。 ``` root@cld-kmaster1-1051:/home/ngadm# kubectl explain APIService KIND: APIService VERSION: apiregistration.k8s.io/v1 DESCRIPTION: APIService represents a server for a particular GroupVersion. Name must be "version.group". FIELDS: apiVersion APIVersion defines the versioned schema of this representation of an object. Servers should convert recognized schemas to the latest internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources kind Kind is a string value representing the REST resource this object represents. 
Servers may infer this from the endpoint the client submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds metadata spec Spec contains information for locating and communicating with a server status Status contains derived information about an API server // local 表示本地, 非local表示sever root:/home/zoux# kubectl get APIService NAME SERVICE AVAILABLE AGE v1. Local True 369d v1.admissionregistration.k8s.io Local True 369d v1.apiextensions.k8s.io Local True 369d v1.apps Local True 369d v1.authentication.k8s.io Local True 369d v1.authorization.k8s.io Local True 369d v1.autoscaling Local True 369d v1.autoscaling.k8s.io Local True 35d v1.batch Local True 369d .... v1beta1.batch Local True 369d v1beta1.certificates.k8s.io Local True 369d v1beta1.coordination.k8s.io Local True 369d v1beta1.custom.metrics.k8s.io kube-system/kube-hpa True 369d ... root:/home/zoux# kubectl get APIService v2alpha1.batch -oyaml apiVersion: apiregistration.k8s.io/v1 kind: APIService metadata: creationTimestamp: "2021-06-28T10:02:30Z" labels: kube-aggregator.kubernetes.io/automanaged: onstart name: v2alpha1.batch resourceVersion: "34" selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v2alpha1.batch uid: 30aae086-ca97-4b90-8097-435561d1e56d spec: group: batch groupPriorityMinimum: 17400 service: null version: v2alpha1 versionPriority: 9 status: conditions: - lastTransitionTime: "2021-06-28T10:02:30Z" message: Local APIServices are always available reason: Local status: "True" type: Available ``` **所以访问batch这个group下的资源(就是job)就是本地访问,直接访问apisrver;访问v1beta1.custom.metrics.k8s.io(hpa)就是访问kube-system kube-hpa这个service**
所以从上面可以知道,AggregatorServer负责处理 `apiregistration.k8s.io` 组下的APIService资源请求,并且所有进入kube-apiserver的请求都会先经过它,再根据APIService决定是本地处理还是转发给对应的service。所以AggregatorServer是一个总的入口。

这也是为什么创建server的顺序为:apiExtensionsServer、kubeAPIServer、aggregatorServer(后创建的包在最外层)。非 Local 的 APIService 长什么样,可以看下面的示例。
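作为对比,非 Local 的 APIService 会在 spec.service 里指向一个集群内的 Service,aggregator 收到对应 group/version 的请求后就代理给它。以社区 metrics-server 常见的 APIService 为例(示意,具体字段取值以实际部署为准):

```
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  version: v1beta1
  service:
    name: metrics-server        # 该 group/version 的请求会被代理到 kube-system/metrics-server
    namespace: kube-system
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100
```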
### 2. CreateAggregatorServer源码分析 目标:通过源码分析弄清楚具体的流程。 CreateAggregatorServer核心是运行了一下的控制器 - `apiserviceRegistrationController`:负责 APIServices 中资源的注册与删除; - `availableConditionController`:维护 APIServices 的可用状态,包括其引用 Service 是否可用等; - `autoRegistrationController`:用于保持 API 中存在的一组特定的 APIServices; - `crdRegistrationController`:负责将 CRD GroupVersions 自动注册到 APIServices 中; - `openAPIAggregationController`:将 APIServices 资源的变化同步至提供的 OpenAPI 文档; ``` aggregatorServer, err := createAggregatorServer(aggregatorConfig, kubeAPIServer.GenericAPIServer, apiExtensionsServer.Informers) if err != nil { // we don't need special handling for innerStopCh because the aggregator server doesn't create any go routines return nil, err } func createAggregatorServer(aggregatorConfig *aggregatorapiserver.Config, delegateAPIServer genericapiserver.DelegationTarget, apiExtensionInformers apiextensionsinformers.SharedInformerFactory) (*aggregatorapiserver.APIAggregator, error) { // 1.创建aggregatorServer aggregatorServer, err := aggregatorConfig.Complete().NewWithDelegate(delegateAPIServer) if err != nil { return nil, err } // create controllers for auto-registration apiRegistrationClient, err := apiregistrationclient.NewForConfig(aggregatorConfig.GenericConfig.LoopbackClientConfig) if err != nil { return nil, err } // 2.创建autoRegistrationController autoRegistrationController := autoregister.NewAutoRegisterController(aggregatorServer.APIRegistrationInformers.Apiregistration().V1().APIServices(), apiRegistrationClient) apiServices := apiServicesToRegister(delegateAPIServer, autoRegistrationController) crdRegistrationController := crdregistration.NewCRDRegistrationController( apiExtensionInformers.Apiextensions().InternalVersion().CustomResourceDefinitions(), autoRegistrationController) err = aggregatorServer.GenericAPIServer.AddPostStartHook("kube-apiserver-autoregistration", func(context genericapiserver.PostStartHookContext) error { go crdRegistrationController.Run(5, context.StopCh) go func() { // let the CRD controller process the initial set of CRDs before starting the autoregistration controller. // this prevents the autoregistration controller's initial sync from deleting APIServices for CRDs that still exist. // we only need to do this if CRDs are enabled on this server. We can't use discovery because we are the source for discovery. if aggregatorConfig.GenericConfig.MergedResourceConfig.AnyVersionForGroupEnabled("apiextensions.k8s.io") { crdRegistrationController.WaitForInitialSync() } autoRegistrationController.Run(5, context.StopCh) }() return nil }) if err != nil { return nil, err } err = aggregatorServer.GenericAPIServer.AddBootSequenceHealthChecks( makeAPIServiceAvailableHealthCheck( "autoregister-completion", apiServices, aggregatorServer.APIRegistrationInformers.Apiregistration().V1().APIServices(), ), ) if err != nil { return nil, err } return aggregatorServer, nil } ``` #### 2.1 NewWithDelegate 核心逻辑如下: (1)利用apiserver生成genericServer,这个和apiserver利用extensionServer生成是一样的 (2)注册路由信息 (3)生成apiserviceRegistrationController,启动监听APIServiceRegistrationController请求 (4)运行availableController 可以看出来这里核心就是运行了apiserviceRegistrationController 和 availableController这2个控制器 ``` // NewWithDelegate returns a new instance of APIAggregator from the given config. func (c completedConfig) NewWithDelegate(delegationTarget genericapiserver.DelegationTarget) (*APIAggregator, error) { // Prevent generic API server to install OpenAPI handler. Aggregator server // has its own customized OpenAPI handler. 
openAPIConfig := c.GenericConfig.OpenAPIConfig c.GenericConfig.OpenAPIConfig = nil // 1. 利用apiserver生成genericServer genericServer, err := c.GenericConfig.New("kube-aggregator", delegationTarget) if err != nil { return nil, err } apiregistrationClient, err := clientset.NewForConfig(c.GenericConfig.LoopbackClientConfig) if err != nil { return nil, err } informerFactory := informers.NewSharedInformerFactory( apiregistrationClient, 5*time.Minute, // this is effectively used as a refresh interval right now. Might want to do something nicer later on. ) s := &APIAggregator{ GenericAPIServer: genericServer, delegateHandler: delegationTarget.UnprotectedHandler(), proxyClientCert: c.ExtraConfig.ProxyClientCert, proxyClientKey: c.ExtraConfig.ProxyClientKey, proxyTransport: c.ExtraConfig.ProxyTransport, proxyHandlers: map[string]*proxyHandler{}, handledGroups: sets.String{}, lister: informerFactory.Apiregistration().V1().APIServices().Lister(), APIRegistrationInformers: informerFactory, serviceResolver: c.ExtraConfig.ServiceResolver, openAPIConfig: openAPIConfig, } // 2.注册路由信息 apiGroupInfo := apiservicerest.NewRESTStorage(c.GenericConfig.MergedResourceConfig, c.GenericConfig.RESTOptionsGetter) if err := s.GenericAPIServer.InstallAPIGroup(&apiGroupInfo); err != nil { return nil, err } enabledVersions := sets.NewString() for v := range apiGroupInfo.VersionedResourcesStorageMap { enabledVersions.Insert(v) } if !enabledVersions.Has(v1.SchemeGroupVersion.Version) { return nil, fmt.Errorf("API group/version %s must be enabled", v1.SchemeGroupVersion.String()) } apisHandler := &apisHandler{ codecs: aggregatorscheme.Codecs, lister: s.lister, discoveryGroup: discoveryGroup(enabledVersions), } s.GenericAPIServer.Handler.NonGoRestfulMux.Handle("/apis", apisHandler) s.GenericAPIServer.Handler.NonGoRestfulMux.UnlistedHandle("/apis/", apisHandler) // 3.生成apiserviceRegistrationController,监听APIServiceRegistrationController请求 apiserviceRegistrationController := NewAPIServiceRegistrationController(informerFactory.Apiregistration().V1().APIServices(), s) availableController, err := statuscontrollers.NewAvailableConditionController( informerFactory.Apiregistration().V1().APIServices(), c.GenericConfig.SharedInformerFactory.Core().V1().Services(), c.GenericConfig.SharedInformerFactory.Core().V1().Endpoints(), apiregistrationClient.ApiregistrationV1(), c.ExtraConfig.ProxyTransport, c.ExtraConfig.ProxyClientCert, c.ExtraConfig.ProxyClientKey, s.serviceResolver, ) if err != nil { return nil, err } // 启动监听 s.GenericAPIServer.AddPostStartHookOrDie("start-kube-aggregator-informers", func(context genericapiserver.PostStartHookContext) error { informerFactory.Start(context.StopCh) c.GenericConfig.SharedInformerFactory.Start(context.StopCh) return nil }) // 4.运行apiserviceRegistrationController s.GenericAPIServer.AddPostStartHookOrDie("apiservice-registration-controller", func(context genericapiserver.PostStartHookContext) error { go apiserviceRegistrationController.Run(context.StopCh) return nil }) // 5. 运行availableController s.GenericAPIServer.AddPostStartHookOrDie("apiservice-status-available-controller", func(context genericapiserver.PostStartHookContext) error { // if we end up blocking for long periods of time, we may need to increase threadiness. go availableController.Run(5, context.StopCh) return nil }) return s, nil } // 监听APIService这个对象的add, update, delete // NewAPIServiceRegistrationController returns a new APIServiceRegistrationController. 
func NewAPIServiceRegistrationController(apiServiceInformer informers.APIServiceInformer, apiHandlerManager APIHandlerManager) *APIServiceRegistrationController {
	c := &APIServiceRegistrationController{
		apiHandlerManager: apiHandlerManager,
		apiServiceLister:  apiServiceInformer.Lister(),
		apiServiceSynced:  apiServiceInformer.Informer().HasSynced,
		queue:             workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "APIServiceRegistrationController"),
	}

	apiServiceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    c.addAPIService,
		UpdateFunc: c.updateAPIService,
		DeleteFunc: c.deleteAPIService,
	})

	c.syncFn = c.sync

	return c
}
```

##### 2.1.1 apiserviceRegistrationController-处理APIService对象的增删改

addAPIService、updateAPIService、deleteAPIService 都只是把对象的 key 放进队列,随后通过Run->runWorker->processNextWorkItem->sync处理。

```
func (c *APIServiceRegistrationController) sync(key string) error {
	// 如果APIService对象不存在,就删除
	apiService, err := c.apiServiceLister.Get(key)
	if apierrors.IsNotFound(err) {
		c.apiHandlerManager.RemoveAPIService(key)
		return nil
	}
	if err != nil {
		return err
	}

	// 核心就是AddAPIService函数
	return c.apiHandlerManager.AddAPIService(apiService)
}
```
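这里的 Run->runWorker->processNextWorkItem->sync 是 k8s 控制器消费 workqueue 的通用套路,后面几个控制器都是一样的。下面给出一个简化示意(非源码原文,省略了退出和日志处理):

```
// 简化示意:控制器消费workqueue的通用模式(非k8s源码原文)
func (c *APIServiceRegistrationController) runWorker() {
	// 不断从队列取key处理,直到队列被关闭
	for c.processNextWorkItem() {
	}
}

func (c *APIServiceRegistrationController) processNextWorkItem() bool {
	key, quit := c.queue.Get()
	if quit {
		return false
	}
	defer c.queue.Done(key)

	// syncFn即上面的sync,处理失败则按限速重新入队,成功则清掉限速记录
	if err := c.syncFn(key.(string)); err != nil {
		c.queue.AddRateLimited(key)
		return true
	}
	c.queue.Forget(key)
	return true
}
```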
AddAPIService函数的核心逻辑: (1)如果存在,说明路由信息不用修改,直接更新porxy就行 (2)如果不存在,要处理restful url和路由的对应关系 ``` // AddAPIService adds an API service. It is not thread-safe, so only call it on one thread at a time please. // It's a slow moving API, so its ok to run the controller on a single thread func (s *APIAggregator) AddAPIService(apiService *v1.APIService) error { // if the proxyHandler already exists, it needs to be updated. The aggregation bits do not // since they are wired against listers because they require multiple resources to respond // 1.如果存在,说明路由信息不用修改,直接更新porxy就行 if proxyHandler, exists := s.proxyHandlers[apiService.Name]; exists { proxyHandler.updateAPIService(apiService) if s.openAPIAggregationController != nil { s.openAPIAggregationController.UpdateAPIService(proxyHandler, apiService) } return nil } // 2.如果不存在,要处理restful url和路由的对应关系 proxyPath := "/apis/" + apiService.Spec.Group + "/" + apiService.Spec.Version // v1. is a special case for the legacy API. It proxies to a wider set of endpoints. if apiService.Name == legacyAPIServiceName { proxyPath = "/api" } // register the proxy handler proxyHandler := &proxyHandler{ localDelegate: s.delegateHandler, proxyClientCert: s.proxyClientCert, proxyClientKey: s.proxyClientKey, proxyTransport: s.proxyTransport, serviceResolver: s.serviceResolver, } proxyHandler.updateAPIService(apiService) if s.openAPIAggregationController != nil { s.openAPIAggregationController.AddAPIService(proxyHandler, apiService) } s.proxyHandlers[apiService.Name] = proxyHandler s.GenericAPIServer.Handler.NonGoRestfulMux.Handle(proxyPath, proxyHandler) s.GenericAPIServer.Handler.NonGoRestfulMux.UnlistedHandlePrefix(proxyPath+"/", proxyHandler) // if we're dealing with the legacy group, we're done here if apiService.Name == legacyAPIServiceName { return nil } // if we've already registered the path with the handler, we don't want to do it again. if s.handledGroups.Has(apiService.Spec.Group) { return nil } // it's time to register the group aggregation endpoint groupPath := "/apis/" + apiService.Spec.Group groupDiscoveryHandler := &apiGroupHandler{ codecs: aggregatorscheme.Codecs, groupName: apiService.Spec.Group, lister: s.lister, delegate: s.delegateHandler, } // aggregation is protected s.GenericAPIServer.Handler.NonGoRestfulMux.Handle(groupPath, groupDiscoveryHandler) s.GenericAPIServer.Handler.NonGoRestfulMux.UnlistedHandle(groupPath+"/", groupDiscoveryHandler) s.handledGroups.Insert(apiService.Spec.Group) return nil } ```
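举个例子:假设注册了一个名为 v1beta1.custom.metrics.k8s.io 的 APIService(group/version 为示例值),AddAPIService 实际注册的就是下面两类路径,可以用一小段可运行的示意代码表示拼接规则:

```
package main

import "fmt"

func main() {
	// 示例值:对应后文自定义metric server的APIService
	group, version := "custom.metrics.k8s.io", "v1beta1"

	// 由proxyHandler代理到APIService指向的service
	proxyPath := "/apis/" + group + "/" + version
	// 由apiGroupHandler负责该group的discovery信息
	groupPath := "/apis/" + group

	fmt.Println(proxyPath) // /apis/custom.metrics.k8s.io/v1beta1
	fmt.Println(groupPath) // /apis/custom.metrics.k8s.io
}
```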
updateAPIService的核心逻辑: (1)如果APIService对象是Local类型,不用设置代理 (2)否则设置路由代理,访问这个restful 服务,都由APIService对应的后端处理 ``` func (r *proxyHandler) updateAPIService(apiService *apiregistrationv1api.APIService) { if apiService.Spec.Service == nil { r.handlingInfo.Store(proxyHandlingInfo{local: true}) return } newInfo := proxyHandlingInfo{ name: apiService.Name, restConfig: &restclient.Config{ TLSClientConfig: restclient.TLSClientConfig{ Insecure: apiService.Spec.InsecureSkipTLSVerify, ServerName: apiService.Spec.Service.Name + "." + apiService.Spec.Service.Namespace + ".svc", CertData: r.proxyClientCert, KeyData: r.proxyClientKey, CAData: apiService.Spec.CABundle, }, }, serviceName: apiService.Spec.Service.Name, serviceNamespace: apiService.Spec.Service.Namespace, servicePort: *apiService.Spec.Service.Port, serviceAvailable: apiregistrationv1apihelper.IsAPIServiceConditionTrue(apiService, apiregistrationv1api.Available), } if r.proxyTransport != nil && r.proxyTransport.DialContext != nil { newInfo.restConfig.Dial = r.proxyTransport.DialContext } newInfo.proxyRoundTripper, newInfo.transportBuildingError = restclient.TransportFor(newInfo.restConfig) if newInfo.transportBuildingError != nil { klog.Warning(newInfo.transportBuildingError.Error()) } r.handlingInfo.Store(newInfo) } ```
这里大家可以参考我的一篇文章,就能更清楚理解了 [hpa-自定义metric-server](https://zoux86.github.io/post/2021-6-18-hpa-%E8%87%AA%E5%AE%9A%E4%B9%89metric-server/) ``` root@k8s-master:~/testyaml/hpa# cat tls.yaml apiVersion: v1 kind: Service metadata: name: kube-hpa namespace: kube-system spec: clusterIP: None ports: - name: https-hpa-dont-edit-it port: 9997 targetPort: 9997 selector: app: kube-hpa --- apiVersion: apiregistration.k8s.io/v1beta1 kind: APIService metadata: name: v1beta1.custom.metrics.k8s.io spec: service: name: kube-hpa namespace: kube-system port: 9997 group: custom.metrics.k8s.io version: v1beta1 insecureSkipTLSVerify: true groupPriorityMinimum: 100 versionPriority: 100 ``` ##### 2.1.2 availableController availableController 核心工作就是判断APIService对应的service是否能工作。所以处理监听APIService外,还要监听svc, ep资源。 ``` // NewAvailableConditionController returns a new AvailableConditionController. func NewAvailableConditionController( apiServiceInformer informers.APIServiceInformer, serviceInformer v1informers.ServiceInformer, endpointsInformer v1informers.EndpointsInformer, apiServiceClient apiregistrationclient.APIServicesGetter, proxyTransport *http.Transport, proxyClientCert []byte, proxyClientKey []byte, serviceResolver ServiceResolver, ) (*AvailableConditionController, error) { c := &AvailableConditionController{ apiServiceClient: apiServiceClient, apiServiceLister: apiServiceInformer.Lister(), apiServiceSynced: apiServiceInformer.Informer().HasSynced, serviceLister: serviceInformer.Lister(), servicesSynced: serviceInformer.Informer().HasSynced, endpointsLister: endpointsInformer.Lister(), endpointsSynced: endpointsInformer.Informer().HasSynced, serviceResolver: serviceResolver, queue: workqueue.NewNamedRateLimitingQueue( // We want a fairly tight requeue time. The controller listens to the API, but because it relies on the routability of the // service network, it is possible for an external, non-watchable factor to affect availability. This keeps // the maximum disruption time to a minimum, but it does prevent hot loops. workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 30*time.Second), "AvailableConditionController"), } // if a particular transport was specified, use that otherwise build one // construct an http client that will ignore TLS verification (if someone owns the network and messes with your status // that's not so bad) and sets a very short timeout. This is a best effort GET that provides no additional information restConfig := &rest.Config{ TLSClientConfig: rest.TLSClientConfig{ Insecure: true, CertData: proxyClientCert, KeyData: proxyClientKey, }, } if proxyTransport != nil && proxyTransport.DialContext != nil { restConfig.Dial = proxyTransport.DialContext } transport, err := rest.TransportFor(restConfig) if err != nil { return nil, err } c.discoveryClient = &http.Client{ Transport: transport, // the request should happen quickly. Timeout: 5 * time.Second, } // resync on this one because it is low cardinality and rechecking the actual discovery // allows us to detect health in a more timely fashion when network connectivity to // nodes is snipped, but the network still attempts to route there. 
See // https://github.com/openshift/origin/issues/17159#issuecomment-341798063 apiServiceInformer.Informer().AddEventHandlerWithResyncPeriod( cache.ResourceEventHandlerFuncs{ AddFunc: c.addAPIService, UpdateFunc: c.updateAPIService, DeleteFunc: c.deleteAPIService, }, 30*time.Second) serviceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ AddFunc: c.addService, UpdateFunc: c.updateService, DeleteFunc: c.deleteService, }) endpointsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ AddFunc: c.addEndpoints, UpdateFunc: c.updateEndpoints, DeleteFunc: c.deleteEndpoints, }) c.syncFn = c.sync return c, nil } ```
这里核心就是AvailableConditionController.sync,具体逻辑不展开了。它做的事情是:判断APIService对象引用的service是否可用,并把结果更新到APIService的status(Available condition)里。

```
func (c *AvailableConditionController) sync(key string) error {
	originalAPIService, err := c.apiServiceLister.Get(key)
```
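sync里"判断是否可用"的大致思路可以用下面的简化示意来理解(非源码原文,省略了endpoints检查、重试和事件上报,标识符以实际源码为准):

```
// 简化示意:AvailableConditionController判断可用性的大致思路(非源码原文)
func checkAvailable(c *AvailableConditionController, apiService *apiregistrationv1.APIService) apiregistrationv1.ConditionStatus {
	// Local APIService(Spec.Service == nil)直接认为可用
	if apiService.Spec.Service == nil {
		return apiregistrationv1.ConditionTrue
	}

	// 解析service地址,用前面构造的discoveryClient做一次短超时的GET探测
	u, err := c.serviceResolver.ResolveEndpoint(apiService.Spec.Service.Namespace, apiService.Spec.Service.Name, *apiService.Spec.Service.Port)
	if err != nil {
		return apiregistrationv1.ConditionFalse
	}
	resp, err := c.discoveryClient.Get(u.String())
	if err != nil {
		return apiregistrationv1.ConditionFalse
	}
	resp.Body.Close()

	// 探测成功,sync会把Available condition更新为True(对应下面kubectl看到的status)
	return apiregistrationv1.ConditionTrue
}
```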
``` root@k8s-master:~/testyaml/hpa# kubectl get APIService v1beta1.custom.metrics.k8s.io -oyaml apiVersion: apiregistration.k8s.io/v1 kind: APIService metadata: creationTimestamp: "2021-06-13T13:22:01Z" name: v1beta1.custom.metrics.k8s.io resourceVersion: "1590641" selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v1beta1.custom.metrics.k8s.io uid: d488d6a8-7e79-4311-a1e9-0b12e4591375 spec: group: custom.metrics.k8s.io groupPriorityMinimum: 100 insecureSkipTLSVerify: true service: name: kube-hpa namespace: kube-system port: 9997 version: v1beta1 versionPriority: 100 status: //就是这个 conditions: - lastTransitionTime: "2021-06-13T13:42:17Z" message: all checks passed reason: Passed status: "True" type: Available ``` #### 2.2 创建autoRegistrationController autoRegistrationController也监听了APIService。统一通过Run->runWorker->processNextWorkItem->checkAPIService处理。核心就是checkAPIService函数。 ``` // NewAutoRegisterController creates a new autoRegisterController. func NewAutoRegisterController(apiServiceInformer informers.APIServiceInformer, apiServiceClient apiregistrationclient.APIServicesGetter) *autoRegisterController { c := &autoRegisterController{ apiServiceLister: apiServiceInformer.Lister(), apiServiceSynced: apiServiceInformer.Informer().HasSynced, apiServiceClient: apiServiceClient, apiServicesToSync: map[string]*v1.APIService{}, apiServicesAtStart: map[string]bool{}, syncedSuccessfullyLock: &sync.RWMutex{}, syncedSuccessfully: map[string]bool{}, queue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "autoregister"), } c.syncHandler = c.checkAPIService apiServiceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ AddFunc: func(obj interface{}) { cast := obj.(*v1.APIService) c.queue.Add(cast.Name) }, UpdateFunc: func(_, obj interface{}) { cast := obj.(*v1.APIService) c.queue.Add(cast.Name) }, DeleteFunc: func(obj interface{}) { cast, ok := obj.(*v1.APIService) if !ok { tombstone, ok := obj.(cache.DeletedFinalStateUnknown) if !ok { klog.V(2).Infof("Couldn't get object from tombstone %#v", obj) return } cast, ok = tombstone.Obj.(*v1.APIService) if !ok { klog.V(2).Infof("Tombstone contained unexpected object: %#v", obj) return } } c.queue.Add(cast.Name) }, }) return c } ``` ##### 2.2.1 checkAPIService Apiservice按照同步类型分为2类manageOnStart,manageContinuously ,通过标签AutoRegisterManagedLabel标记。 ``` const ( // AutoRegisterManagedLabel is a label attached to the APIService that identifies how the APIService wants to be synced. AutoRegisterManagedLabel = "kube-aggregator.kubernetes.io/automanaged" // manageOnStart is a value for the AutoRegisterManagedLabel that indicates the APIService wants to be synced one time when the controller starts. manageOnStart = "onstart" // manageContinuously is a value for the AutoRegisterManagedLabel that indicates the APIService wants to be synced continuously. manageContinuously = "true" ) ``` checkAPIService这个函数就注释表就知道,该函数功能更加不同类型的不同动作做同步操作。
``` // checkAPIService syncs the current APIService against a list of desired APIService objects // // | A. desired: not found | B. desired: sync on start | C. desired: sync always // ------------------------------------------------|-----------------------|---------------------------|------------------------ // 1. current: lookup error | error | error | error // 2. current: not found | - | create once | create // 3. current: no sync | - | - | - // 4. current: sync on start, not present at start | - | - | - // 5. current: sync on start, present at start | delete once | update once | update once // 6. current: sync always | delete | update once | update func (c *autoRegisterController) checkAPIService(name string) (err error) { desired := c.GetAPIServiceToSync(name) curr, err := c.apiServiceLister.Get(name) // if we've never synced this service successfully, record a successful sync. hasSynced := c.hasSyncedSuccessfully(name) if !hasSynced { defer func() { if err == nil { c.setSyncedSuccessfully(name) } }() } switch { // we had a real error, just return it (1A,1B,1C) case err != nil && !apierrors.IsNotFound(err): return err // we don't have an entry and we don't want one (2A) case apierrors.IsNotFound(err) && desired == nil: return nil // the local object only wants to sync on start and has already synced (2B,5B,6B "once" enforcement) case isAutomanagedOnStart(desired) && hasSynced: return nil // we don't have an entry and we do want one (2B,2C) case apierrors.IsNotFound(err) && desired != nil: _, err := c.apiServiceClient.APIServices().Create(desired) if apierrors.IsAlreadyExists(err) { // created in the meantime, we'll get called again return nil } return err // we aren't trying to manage this APIService (3A,3B,3C) case !isAutomanaged(curr): return nil // the remote object only wants to sync on start, but was added after we started (4A,4B,4C) case isAutomanagedOnStart(curr) && !c.apiServicesAtStart[name]: return nil // the remote object only wants to sync on start and has already synced (5A,5B,5C "once" enforcement) case isAutomanagedOnStart(curr) && hasSynced: return nil // we have a spurious APIService that we're managing, delete it (5A,6A) case desired == nil: opts := &metav1.DeleteOptions{Preconditions: metav1.NewUIDPreconditions(string(curr.UID))} err := c.apiServiceClient.APIServices().Delete(curr.Name, opts) if apierrors.IsNotFound(err) || apierrors.IsConflict(err) { // deleted or changed in the meantime, we'll get called again return nil } return err // if the specs already match, nothing for us to do case reflect.DeepEqual(curr.Spec, desired.Spec): return nil } // we have an entry and we have a desired, now we deconflict. Only a few fields matter. (5B,5C,6B,6C) apiService := curr.DeepCopy() apiService.Spec = desired.Spec _, err = c.apiServiceClient.APIServices().Update(apiService) if apierrors.IsNotFound(err) || apierrors.IsConflict(err) { // deleted or changed in the meantime, we'll get called again return nil } return err } ``` ##### 2.2.2 为什么需要这个 作用:用于保持 API 中存在的一组特定的 APIServices 内置资源的APIService都会有标签`kube-aggregator.kubernetes.io/automanaged: onstart`,例如:v1.apps apiService。autoRegistrationController创建并维护这些列表中的APIService,也即我们看到的Local apiService; CRD资源则是automanaged=true,表示always 而自定义service类型的APIService是没有的这个标签,因为自己会更新路由。 ``` roo # kubectl get APIService --show-labels NAME SERVICE AVAILABLE AGE LABELS v1. 
Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1.admissionregistration.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1.apiextensions.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1.apps Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1.authentication.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1.authorization.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1.autoscaling Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1.autoscaling.k8s.io Local True 35d kube-aggregator.kubernetes.io/automanaged=true v1.batch Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1.coordination.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1.messaging.k8s.io Local True 44d kube-aggregator.kubernetes.io/automanaged=true v1.networking.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1.rbac.authorization.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1.schedular.istio.io Local True 41d kube-aggregator.kubernetes.io/automanaged=true v1.scheduling.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1.security.symphony.netease.com Local True 44d kube-aggregator.kubernetes.io/automanaged=true v1.storage.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1alpha1.argoproj.io Local True 41d kube-aggregator.kubernetes.io/automanaged=true v1alpha1.auditregistration.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1alpha1.authentication.istio.io Local True 41d kube-aggregator.kubernetes.io/automanaged=true v1alpha1.certmanager.k8s.io Local True 44d kube-aggregator.kubernetes.io/automanaged=true v1alpha1.crdlbcontroller.k8s.io Local True 44d kube-aggregator.kubernetes.io/automanaged=true v1alpha1.loadbalancer.k8s.io Local True 35d kube-aggregator.kubernetes.io/automanaged=true v1alpha1.multicluster.admiralty.io Local True 35d kube-aggregator.kubernetes.io/automanaged=true v1alpha1.networking.symphony.netease.com Local True 41d kube-aggregator.kubernetes.io/automanaged=true v1alpha1.node.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1alpha1.rbac.authorization.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1alpha1.rbac.istio.io Local True 35d kube-aggregator.kubernetes.io/automanaged=true v1alpha1.resources.symphony.netease.com Local True 44d kube-aggregator.kubernetes.io/automanaged=true v1alpha1.scheduling.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1alpha1.settings.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1alpha1.storage.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1alpha2.config.istio.io Local True 35d kube-aggregator.kubernetes.io/automanaged=true v1alpha3.networking.istio.io Local True 41d kube-aggregator.kubernetes.io/automanaged=true v1beta1.admissionregistration.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1beta1.apiextensions.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1beta1.apps Local True 172d kube-aggregator.kubernetes.io/automanaged=onstart v1beta1.authentication.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1beta1.authorization.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1beta1.batch Local True 369d 
kube-aggregator.kubernetes.io/automanaged=onstart v1beta1.certificates.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1beta1.coordination.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1beta1.custom.metrics.k8s.io kube-system/kube-hpa True 369d v1beta1.discovery.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1beta1.events.k8s.io Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1beta1.extensions Local True 369d kube-aggregator.kubernetes.io/automanaged=onstart v1beta1.kustomize.toolkit.fluxcd.io Local True 44d kube-aggregator.kubernetes.io/automanaged=true ``` #### 2.3 crdRegistrationController crdRegistrationController监听的是crd资源的增删改操作。也是Run->runWorker->processNextWorkItem->handleVersionUpdate。核心看handleVersionUpdate。 从这里可以看出来:APIService就是根据CRD资源的增删改,修改APIService对象。CRD资源则是automanaged=true,表示always ``` func (c *crdRegistrationController) handleVersionUpdate(groupVersion schema.GroupVersion) error { apiServiceName := groupVersion.Version + "." + groupVersion.Group // check all CRDs. There shouldn't that many, but if we have problems later we can index them crds, err := c.crdLister.List(labels.Everything()) if err != nil { return err } for _, crd := range crds { if crd.Spec.Group != groupVersion.Group { continue } for _, version := range crd.Spec.Versions { if version.Name != groupVersion.Version || !version.Served { continue } c.apiServiceRegistration.AddAPIServiceToSync(&v1.APIService{ ObjectMeta: metav1.ObjectMeta{Name: apiServiceName}, Spec: v1.APIServiceSpec{ Group: groupVersion.Group, Version: groupVersion.Version, GroupPriorityMinimum: 1000, // CRDs should have relatively low priority VersionPriority: 100, // CRDs will be sorted by kube-like versions like any other APIService with the same VersionPriority }, }) return nil } } c.apiServiceRegistration.RemoveAPIServiceToSync(apiServiceName) return nil } // CRD表示要跟着资源变化一起同步的APIService // AddAPIServiceToSync registers an API service to sync continuously. func (c *autoRegisterController) AddAPIServiceToSync(in *v1.APIService) { c.addAPIServiceToSync(in, manageContinuously) } ``` #### 2.4 openAPIAggregationController openAPIAggregationController 是在PrepareRun中运行的,核心也是监听APIService对象。然后Run->runWorker->processNextWorkItem->sync,同步OpenAPI 文档。 ``` // PrepareRun prepares the aggregator to run, by setting up the OpenAPI spec and calling // the generic PrepareRun. 
func (s *APIAggregator) PrepareRun() (preparedAPIAggregator, error) { // add post start hook before generic PrepareRun in order to be before /healthz installation if s.openAPIConfig != nil { s.GenericAPIServer.AddPostStartHookOrDie("apiservice-openapi-controller", func(context genericapiserver.PostStartHookContext) error { go s.openAPIAggregationController.Run(context.StopCh) return nil }) } prepared := s.GenericAPIServer.PrepareRun() // delay OpenAPI setup until the delegate had a chance to setup their OpenAPI handlers if s.openAPIConfig != nil { specDownloader := openapiaggregator.NewDownloader() openAPIAggregator, err := openapiaggregator.BuildAndRegisterAggregator( &specDownloader, s.GenericAPIServer.NextDelegate(), s.GenericAPIServer.Handler.GoRestfulContainer.RegisteredWebServices(), s.openAPIConfig, s.GenericAPIServer.Handler.NonGoRestfulMux) if err != nil { return preparedAPIAggregator{}, err } s.openAPIAggregationController = openapicontroller.NewAggregationController(&specDownloader, openAPIAggregator) } return preparedAPIAggregator{APIAggregator: s, runnable: prepared}, nil } ``` ### 3. 总结 可以看出来AggregatorServer做了很多事情。kube-apiserver实现聚合的关键就是它,通过APIService资源扩展了api。 可以利用这个机制做很多事情,比如自定义mertic-server。比如可以通过添加APIService实现CRD的效果。 社区也有一个专业的工具,详见:https://github.com/kubernetes-sigs/apiserver-builder-alpha/ apiserver-builder-alpha是一系列工具和库的集合,它能够: 1. 为新的API资源创建Go类型、控制器(基于controller-runtime)、测试用例、文档 2. 构建、(独立、在Minikube或者在K8S中)运行扩展的控制平面组件(APIServer) 3. 让在控制器中watch/update资源更简单 4. 让创建新的资源/子资源更简单 5. 提供大部分合理的默认值 ================================================ FILE: k8s/kube-apiserver/11-kube-apiserver 启动http和https服务.md ================================================ * [1\. 启动http服务](#1-启动http服务) * [1\.1 链路流程](#11-链路流程) * [1\.2 insecureHandlerChain](#12-insecurehandlerchain) * [2\. 启动https服务](#2-启动https服务) * [2\.1 启动过程](#21-启动过程) * [2\.2 DefaultBuildHandlerChain](#22-defaultbuildhandlerchain) * [2\.3 调用链路](#23-调用链路) * [2\.3\.1\. NewConfig 指定了server\.Config\.BuildHandlerChainFunc=DefaultBuildHandlerChain](#231-newconfig-指定了serverconfigbuildhandlerchainfuncdefaultbuildhandlerchain) * [2\.3\.2\. completedConfig\.new 使用这个func](#232-completedconfignew-使用这个func) * [2\.3\.3\. createAggregatorServer调用了NewWithDelegate,调用了第二步的New函数](#233-createaggregatorserver调用了newwithdelegate调用了第二步的new函数) * [2\.3\.4\. Run函数调用了NonBlockingRun函数](#234-run函数调用了nonblockingrun函数) * [3 总结](#3-总结) **本章重点:**分析最后两个流程,启动HTTP,HTTPS服务 kube-apiserver整体启动流程如下: (1)资源注册。 (2)Cobra命令行参数解析 (3)创建APIServer通用配置 (4)创建APIExtensionsServer (5)创建KubeAPIServer (6)创建AggregatorServer (7)启动HTTP服务。 (8)启动HTTPS服务 ### 1. 启动http服务 #### 1.1 链路流程 Go语言提供的HTTP标准库非常强大,Kubernetes API Server在其基础上并没有过多的封装,因为它的功能和性能已经很完善了,可直接拿来用。在Go语言中开启HTTP服务有很多种方法,例如通过http.ListenAndServe函数可以直接启动HTTP服务,其内部实现了创建 Socket、监控端口等操作。下面看看Kubernetes APIServer通过自定义http.Server的方式创建HTTP服务的过程,代码示例如下: ``` if insecureServingInfo != nil { insecureHandlerChain := kubeserver.BuildInsecureHandlerChain(aggregatorServer.GenericAPIServer.UnprotectedHandler(), kubeAPIServerConfig.GenericConfig) if err := insecureServingInfo.Serve(insecureHandlerChain, kubeAPIServerConfig.GenericConfig.RequestTimeout, stopCh); err != nil { return nil, err } } // Serve starts an insecure http server with the given handler. It fails only if // the initial listen call fails. It does not block. 
func (s *DeprecatedInsecureServingInfo) Serve(handler http.Handler, shutdownTimeout time.Duration, stopCh <-chan struct{}) error { insecureServer := &http.Server{ Addr: s.Listener.Addr().String(), Handler: handler, MaxHeaderBytes: 1 << 20, } if len(s.Name) > 0 { klog.Infof("Serving %s insecurely on %s", s.Name, s.Listener.Addr()) } else { klog.Infof("Serving insecurely on %s", s.Listener.Addr()) } _, err := RunServer(insecureServer, s.Listener, shutdownTimeout, stopCh) // NOTE: we do not handle stoppedCh returned by RunServer for graceful termination here return err } // RunServer spawns a go-routine continuously serving until the stopCh is // closed. // It returns a stoppedCh that is closed when all non-hijacked active requests // have been processed. // This function does not block // TODO: make private when insecure serving is gone from the kube-apiserver func RunServer( server *http.Server, ln net.Listener, shutDownTimeout time.Duration, stopCh <-chan struct{}, ) (<-chan struct{}, error) { if ln == nil { return nil, fmt.Errorf("listener must not be nil") } // Shutdown server gracefully. stoppedCh := make(chan struct{}) go func() { defer close(stoppedCh) <-stopCh ctx, cancel := context.WithTimeout(context.Background(), shutDownTimeout) server.Shutdown(ctx) cancel() }() go func() { defer utilruntime.HandleCrash() var listener net.Listener listener = tcpKeepAliveListener{ln.(*net.TCPListener)} if server.TLSConfig != nil { listener = tls.NewListener(listener, server.TLSConfig) } err := server.Serve(listener) msg := fmt.Sprintf("Stopped listening on %s", ln.Addr().String()) select { case <-stopCh: klog.Info(msg) default: panic(fmt.Sprintf("%s due to error: %v", msg, err)) } }() return stoppedCh, nil } ``` 在RunServer函数中,通过Go语言标准库的serverServe监听listener,并在运行过程中为每个连接创建一个goroutine。goroutine读取请求,然后调用Handler函数来处理并响应请求。另外,在Kubernetes API Server的代码中还实现了平滑关闭HTTP服务的功能,利用Go语言标准库的HTTP Server.Shutdown函数可以在不干扰任何活跃连接的情况下关闭服务。其原理是,首先关闭所有的监听listener,然后 关闭所有的空闲连接,接着无限期地等待所有连接变成空闲状态并关闭。如果设置带有超时的Context,将在HTTP服务关闭之前返回Context超时错误。 #### 1.2 insecureHandlerChain 所以如果是http请求的话。处理函数的链路为: **WithMaxInFlightLimit:apiserver**限流策略,通过go chan实现限流。--max-requests-inflight=1000 --max-mutating-requests-inflight=1000指定了QPS。 **WithAudit**: 开启审计,日志以event格式输出 **WithAuthentication:** 进行认证,其实是为了方便审计 **WithCORS:** cors全称是--cors-allowed-origins, 通过kube-apiserver的cors-allowed-origins指定运行的cors. 例如: -cors-allowed-origins = http://www.example.com, https://*.example.com **WithTimeoutForNonLongRunningRequests:** 设置超时时间,默认是1min **WithRequestInfo**: 根据请求信息,补充完整requestInfo结构体信息 **WithCacheControl:** 给request设置Cache-Control信息 **WithPanicRecovery:** 如果一个请求给apiserver造成了panic, 设置http.StatusInternalServerError 函数介绍: ``` // BuildInsecureHandlerChain sets up the server to listen to http. Should be removed. 
func BuildInsecureHandlerChain(apiHandler http.Handler, c *server.Config) http.Handler { handler := apiHandler handler = genericfilters.WithMaxInFlightLimit(handler, c.MaxRequestsInFlight, c.MaxMutatingRequestsInFlight, c.LongRunningFunc, c.EventQpsRatio, c.RequestTimeout) handler = genericapifilters.WithAudit(handler, c.AuditBackend, c.AuditPolicyChecker, c.LongRunningFunc) handler = genericapifilters.WithAuthentication(handler, server.InsecureSuperuser{}, nil, nil) handler = genericfilters.WithCORS(handler, c.CorsAllowedOriginList, nil, nil, nil, "true") handler = genericfilters.WithTimeoutForNonLongRunningRequests(handler, c.LongRunningFunc, c.RequestTimeout) handler = genericfilters.WithWaitGroup(handler, c.LongRunningFunc, c.HandlerChainWaitGroup) handler = genericapifilters.WithRequestInfo(handler, server.NewRequestInfoResolver(c)) handler = genericapifilters.WithCacheControl(handler) handler = genericfilters.WithPanicRecovery(handler) return handler } ``` 请求是从下到上的,所以顺序为:Panic recovery -> TimeOut -> Authentication -> Audit -> MaxInFlightLimit
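为什么说"请求是从下到上"?因为handler是一层层包出来的:代码里越靠下包装的filter位于越外层,请求先进入最外层。下面用一个可独立运行的小例子演示这种链式包装(与k8s无关,仅说明模式):

```
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// with 返回一个包装后的handler:先打印自己的名字,再调用内层handler
func with(name string, inner http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Println("enter", name)
		inner.ServeHTTP(w, r)
	})
}

func main() {
	var handler http.Handler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Println("real api handler")
	})

	// 包装顺序和BuildInsecureHandlerChain一样:先包的在内层,后包的在外层
	handler = with("audit", handler)
	handler = with("authentication", handler)
	handler = with("panic-recovery", handler)

	req := httptest.NewRequest("GET", "/healthz", nil)
	handler.ServeHTTP(httptest.NewRecorder(), req)
	// 输出顺序:panic-recovery -> authentication -> audit -> real api handler
}
```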
### 2. 启动https服务 #### 2.1 启动过程 在NonBlockingRun函数,启动了https服务 ``` // NonBlockingRun spawns the secure http server. An error is // returned if the secure port cannot be listened on. func (s preparedGenericAPIServer) NonBlockingRun(stopCh <-chan struct{}) error { // Use an stop channel to allow graceful shutdown without dropping audit events // after http server shutdown. auditStopCh := make(chan struct{}) // Start the audit backend before any request comes in. This means we must call Backend.Run // before http server start serving. Otherwise the Backend.ProcessEvents call might block. if s.AuditBackend != nil { if err := s.AuditBackend.Run(auditStopCh); err != nil { return fmt.Errorf("failed to run the audit backend: %v", err) } } // 开启https服务 // Use an internal stop channel to allow cleanup of the listeners on error. internalStopCh := make(chan struct{}) var stoppedCh <-chan struct{} if s.SecureServingInfo != nil && s.Handler != nil { var err error stoppedCh, err = s.SecureServingInfo.Serve(s.Handler, s.ShutdownTimeout, internalStopCh) if err != nil { close(internalStopCh) close(auditStopCh) return err } } // Now that listener have bound successfully, it is the // responsibility of the caller to close the provided channel to // ensure cleanup. go func() { <-stopCh close(internalStopCh) if stoppedCh != nil { <-stoppedCh } s.HandlerChainWaitGroup.Wait() close(auditStopCh) }() s.RunPostStartHooks(stopCh) if _, err := systemd.SdNotify(true, "READY=1\n"); err != nil { klog.Errorf("Unable to send systemd daemon successful start message: %v\n", err) } return nil } ``` HTTPS服务在http.Server上增加了TLSConfig配置,TLSConfig用于配置相关证书,可以通过命令行相关参数(--client-ca-file、--tls-private-key-file、--tls-cert-file参数)进行配置。具体过程不再赘述。 #### 2.2 DefaultBuildHandlerChain 很多人在网上看见的都是这个图。左边的handler-chain其实就是https服务的handlers ![handler-chian](../images/handler-chian.jpg) 调用函数为**DefaultBuildHandlerChain**: DefaultBuildHandlerChain比insecureHandlerChain 多了结果授权的Handler, 其他基本一致。 ``` func DefaultBuildHandlerChain(apiHandler http.Handler, c *Config) http.Handler { handler := genericapifilters.WithAuthorization(apiHandler, c.Authorization.Authorizer, c.Serializer) handler = genericfilters.WithMaxInFlightLimit(handler, c.MaxRequestsInFlight, c.MaxMutatingRequestsInFlight, c.LongRunningFunc, c.EventQpsRatio, c.RequestTimeout) handler = genericapifilters.WithImpersonation(handler, c.Authorization.Authorizer, c.Serializer) handler = genericapifilters.WithAudit(handler, c.AuditBackend, c.AuditPolicyChecker, c.LongRunningFunc) failedHandler := genericapifilters.Unauthorized(c.Serializer, c.Authentication.SupportsBasicAuth) failedHandler = genericapifilters.WithFailedAuthenticationAudit(failedHandler, c.AuditBackend, c.AuditPolicyChecker) handler = genericapifilters.WithAuthentication(handler, c.Authentication.Authenticator, failedHandler, c.Authentication.APIAudiences) handler = genericfilters.WithCORS(handler, c.CorsAllowedOriginList, nil, nil, nil, "true") handler = genericfilters.WithTimeoutForNonLongRunningRequests(handler, c.LongRunningFunc, c.RequestTimeout) handler = genericfilters.WithWaitGroup(handler, c.LongRunningFunc, c.HandlerChainWaitGroup) handler = genericapifilters.WithRequestInfo(handler, c.RequestInfoResolver) handler = genericfilters.WithPanicRecovery(handler) return handler } ``` #### 2.3 调用链路 **调用链路**: createAggregatorServer -> NewWithDelegate -> NewConfig -> DefaultBuildHandlerChain ##### 2.3.1. 
NewConfig 指定了server.Config.BuildHandlerChainFunc=DefaultBuildHandlerChain ``` // NewConfig returns a Config struct with the default values func NewConfig(codecs serializer.CodecFactory) *Config { defaultHealthChecks := []healthz.HealthChecker{healthz.PingHealthz, healthz.LogHealthz} return &Config{ Serializer: codecs, BuildHandlerChainFunc: DefaultBuildHandlerChain, ``` ##### 2.3.2. completedConfig.new 使用这个func 最终APIServerHandler = DefaultBuildHandlerChain 最终GenericAPIServer.Handler = DefaultBuildHandlerChain ``` // New creates a new server which logically combines the handling chain with the passed server. // name is used to differentiate for logging. The handler chain in particular can be difficult as it starts delgating. // delegationTarget may not be nil. func (c completedConfig) New(name string, delegationTarget DelegationTarget) (*GenericAPIServer, error) { if c.Serializer == nil { return nil, fmt.Errorf("Genericapiserver.New() called with config.Serializer == nil") } if c.LoopbackClientConfig == nil { return nil, fmt.Errorf("Genericapiserver.New() called with config.LoopbackClientConfig == nil") } if c.EquivalentResourceRegistry == nil { return nil, fmt.Errorf("Genericapiserver.New() called with config.EquivalentResourceRegistry == nil") } // handlerChainBuilder := func(handler http.Handler) http.Handler { return c.BuildHandlerChainFunc(handler, c.Config) } apiServerHandler := NewAPIServerHandler(name, c.Serializer, handlerChainBuilder, delegationTarget.UnprotectedHandler()) } func NewAPIServerHandler(name string, s runtime.NegotiatedSerializer, handlerChainBuilder HandlerChainBuilderFn, notFoundHandler http.Handler) *APIServerHandler { nonGoRestfulMux := mux.NewPathRecorderMux(name) if notFoundHandler != nil { nonGoRestfulMux.NotFoundHandler(notFoundHandler) } gorestfulContainer := restful.NewContainer() gorestfulContainer.ServeMux = http.NewServeMux() gorestfulContainer.Router(restful.CurlyRouter{}) // e.g. for proxy/{kind}/{name}/{*} gorestfulContainer.RecoverHandler(func(panicReason interface{}, httpWriter http.ResponseWriter) { logStackOnRecover(s, panicReason, httpWriter) }) gorestfulContainer.ServiceErrorHandler(func(serviceErr restful.ServiceError, request *restful.Request, response *restful.Response) { serviceErrorHandler(s, serviceErr, request, response) }) director := director{ name: name, goRestfulContainer: gorestfulContainer, nonGoRestfulMux: nonGoRestfulMux, } return &APIServerHandler{ FullHandlerChain: handlerChainBuilder(director), GoRestfulContainer: gorestfulContainer, NonGoRestfulMux: nonGoRestfulMux, Director: director, } } ``` ##### 2.3.3. createAggregatorServer调用了NewWithDelegate,调用了第二步的New函数 ``` func createAggregatorServer(aggregatorConfig *aggregatorapiserver.Config, delegateAPIServer genericapiserver.DelegationTarget, apiExtensionInformers apiextensionsinformers.SharedInformerFactory) (*aggregatorapiserver.APIAggregator, error) { aggregatorServer, err := aggregatorConfig.Complete().NewWithDelegate(delegateAPIServer) if err != nil { return nil, err } // create controller ``` 所以APIAggregator=DefaultBuildHandlerChain
最终还调用了server.GenericAPIServer.PrepareRun().Run(stopCh) ``` // RunAggregator runs the API Aggregator. func (o AggregatorOptions) RunAggregator(stopCh <-chan struct{}) error { server, err := config.Complete().NewWithDelegate(genericapiserver.NewEmptyDelegate()) if err != nil { return err } return server.GenericAPIServer.PrepareRun().Run(stopCh) } // NewWithDelegate returns a new instance of APIAggregator from the given config. func (c completedConfig) NewWithDelegate(delegationTarget genericapiserver.DelegationTarget) (*APIAggregator, error) { // Prevent generic API server to install OpenAPI handler. Aggregator server // has its own customized OpenAPI handler. openAPIConfig := c.GenericConfig.OpenAPIConfig c.GenericConfig.OpenAPIConfig = nil genericServer, err := c.GenericConfig.New("kube-aggregator", delegationTarget) if err != nil { return nil, err } ``` ##### 2.3.4. Run函数调用了NonBlockingRun函数 NonBlockingRun 调用了SecureServingInfo.Serve。handler是s.Handler,就是preparedGenericAPIServer ``` if s.SecureServingInfo != nil && s.Handler != nil { var err error stoppedCh, err = s.SecureServingInfo.Serve(s.Handler, s.ShutdownTimeout, internalStopCh) if err != nil { close(internalStopCh) close(auditStopCh) return err } } ``` ### 3 总结 到这里, kube-apiserver 对一个请求的处理就非常清楚了。 (1)先是通过统一的handler chain处理。(http, https是不同的chain, https多了授权相关的处理) (2) 然后看是否是aggregated server需要出现的请求(APIService) ​ (3) 如果是内置资源或者CRD资源,则通过kube-apiserver处理(MUX后面的流程,接下来进行分析) ![handler-chian](../images/handler-chian.jpg) ================================================ FILE: k8s/kube-apiserver/12-k8s之Authentication.md ================================================ Table of Contents ================= * [1. 简介](#1-简介) * [2. 认证器的生成](#2-认证器的生成) * [2.1 调用链路](#21-调用链路) * [2.2 BuildAuthenticator](#22-buildauthenticator) * [2.3 ToAuthenticationConfig](#23-toauthenticationconfig) * [2.4 New](#24-new) * [3. 具体的认证过程](#3-具体的认证过程) * [3.1 调用链路](#31-调用链路) * [3.2 t.handler到底是谁](#32-thandler到底是谁) * [3.3 DefaultBuildHandlerChain](#33-defaultbuildhandlerchain) * [3.4 WithAuthentication](#34-withauthentication) * [4. 9种认证方式介绍](#4-9种认证方式介绍) * [4.1 BasicAuth认证](#41-basicauth认证) * [4.2 ClientCA认证](#42-clientca认证) * [4.3 TokenAuth认证](#43-tokenauth认证) * [4.4 BootstrapToken认证](#44-bootstraptoken认证) * [4.5 RequestHeader认证](#45--requestheader认证) * [4.6 WebhookTokenAuth认证](#46-webhooktokenauth认证) * [4.7 Anonymous认证](#47-anonymous认证) * [4.8 OIDC认证](#48-oidc认证) * [4.9 ServiceAccountAuth认证](#49-serviceaccountauth认证) * [4.10 总结](#410-总结) * [5.参考链接:](#5参考链接) ### 1. 简介 kube-apiserver作为一个服务器端。每次请求到来时都需要经过认证授权,以及一系列的访问控制。k8s1.17版本中,共提供9中认证方式:Anonymous,BootstrapToken,ClientCert,OIDC,PasswordFile,RequestHeader,ServiceAccounts,TokenFile,WebHook。 认证和授权的区别: 假设apiserver收到了一个请求,一个名叫张三的用户想删除 namespaceA下的一个pod。 **认证:** apiserver判断你到底是不是张三 **授权:** 张三到底有没有删除这个pod的权限
### 2. 认证器的生成 #### 2.1 调用链路 run函数经过一系列的调用,最终调用BuildAuthenticator函数来生成认证器。 cmd/kube-apiserver/app/server.go Run -> CreateServerChain -> CreateKubeAPIServerConfig -> buildGenericConfig ->BuildAuthenticator ``` 以下函数只显示关键的代码 // Run runs the specified APIServer. This should never exit. func Run(completeOptions completedServerRunOptions, stopCh <-chan struct{}) error { server, err := CreateServerChain(completeOptions, stopCh) } // CreateServerChain creates the apiservers connected via delegation. func CreateServerChain(completedOptions completedServerRunOptions, stopCh <-chan struct{}) (*aggregatorapiserver.APIAggregator, error) { kubeAPIServerConfig, insecureServingInfo, serviceResolver, pluginInitializer, err := CreateKubeAPIServerConfig(completedOptions, nodeTunneler, proxyTransport) } // CreateKubeAPIServerConfig creates all the resources for running the API server, but runs none of them func CreateKubeAPIServerConfig( s completedServerRunOptions, nodeTunneler tunneler.Tunneler, proxyTransport *http.Transport, ) ( *master.Config, *genericapiserver.DeprecatedInsecureServingInfo, aggregatorapiserver.ServiceResolver, []admission.PluginInitializer, error, ) { genericConfig, versionedInformers, insecureServingInfo, serviceResolver, pluginInitializers, admissionPostStartHook, storageFactory, err := buildGenericConfig(s.ServerRunOptions, proxyTransport) } // BuildGenericConfig takes the master server options and produces the genericapiserver.Config associated with it func buildGenericConfig( s *options.ServerRunOptions, proxyTransport *http.Transport, ) ( genericConfig *genericapiserver.Config, versionedInformers clientgoinformers.SharedInformerFactory, insecureServingInfo *genericapiserver.DeprecatedInsecureServingInfo, serviceResolver aggregatorapiserver.ServiceResolver, pluginInitializers []admission.PluginInitializer, admissionPostStartHook genericapiserver.PostStartHookFunc, storageFactory *serverstorage.DefaultStorageFactory, lastErr error, ) { // 认证 genericConfig.Authentication.Authenticator, genericConfig.OpenAPIConfig.SecurityDefinitions, err = BuildAuthenticator(s, clientgoExternalClient, versionedInformers) if err != nil { lastErr = fmt.Errorf("invalid authentication config: %v", err) return } //授权 genericConfig.Authorization.Authorizer, genericConfig.RuleResolver, err =BuildAuthorizer(s, versionedInformers) } ```
#### 2.2 BuildAuthenticator

BuildAuthenticator是生成认证器的关键函数,从这里开始进行代码分析。

```
// BuildAuthenticator constructs the authenticator
func BuildAuthenticator(s *options.ServerRunOptions, extclient clientgoclientset.Interface, versionedInformer clientgoinformers.SharedInformerFactory) (authenticator.Request, *spec.SecurityDefinitions, error) {
	// 1.生成config
	authenticatorConfig, err := s.Authentication.ToAuthenticationConfig()
	if err != nil {
		return nil, nil, err
	}

	if s.Authentication.ServiceAccounts.Lookup || utilfeature.DefaultFeatureGate.Enabled(features.TokenRequest) {
		authenticatorConfig.ServiceAccountTokenGetter = serviceaccountcontroller.NewGetterFromClient(
			extclient,
			versionedInformer.Core().V1().Secrets().Lister(),
			versionedInformer.Core().V1().ServiceAccounts().Lister(),
			versionedInformer.Core().V1().Pods().Lister(),
		)
	}
	authenticatorConfig.BootstrapTokenAuthenticator = bootstrap.NewTokenAuthenticator(
		versionedInformer.Core().V1().Secrets().Lister().Secrets(v1.NamespaceSystem),
	)

	// 2.根据config,生成每个认证方式的handler
	return authenticatorConfig.New()
}
```
#### 2.3 ToAuthenticationConfig

ToAuthenticationConfig的作用是根据启动参数,判断上面说的九种认证方式中哪些需要生成对应的config。逻辑比较直观,直接看代码即可。
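为了方便理解,这里给一个简化示意(非源码原文,字段名以实际源码为准):它做的事情基本就是把命令行选项"搬"进认证器的Config,没有配置的认证方式对应字段为空,后面的New()就不会为它生成handler。

```
// 简化示意(非源码原文):ToAuthenticationConfig把启动参数映射成认证器的Config
func (s *BuiltInAuthenticationOptions) ToAuthenticationConfig() (kubeauthenticator.Config, error) {
	ret := kubeauthenticator.Config{}

	if s.PasswordFile != nil {
		ret.BasicAuthFile = s.PasswordFile.BasicAuthFile // --basic-auth-file
	}
	if s.TokenFile != nil {
		ret.TokenAuthFile = s.TokenFile.TokenFile // --token-auth-file
	}
	if s.BootstrapToken != nil {
		ret.BootstrapToken = s.BootstrapToken.Enable // --enable-bootstrap-token-auth
	}
	// ...... OIDC、ServiceAccount、Webhook、RequestHeader、Anonymous等同理

	return ret, nil
}
```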
#### 2.4 New (1)New函数根据认证的配置信息,针对9中认证方法,生成对应的handler。具体做法就是将各种认证生成authenticator,加入authenticators数组 (2)将authenticators数组生成一个union handler (3)最终得认证器AuthenticatedGroupAdder ``` // New returns an authenticator.Request or an error that supports the standard // Kubernetes authentication mechanisms. func (config Config) New() (authenticator.Request, *spec.SecurityDefinitions, error) { var authenticators []authenticator.Request var tokenAuthenticators []authenticator.Token securityDefinitions := spec.SecurityDefinitions{} // front-proxy, BasicAuth methods, local first, then remote // Add the front proxy authenticator if requested if config.RequestHeaderConfig != nil { requestHeaderAuthenticator := headerrequest.NewDynamicVerifyOptionsSecure( config.RequestHeaderConfig.CAContentProvider.VerifyOptions, config.RequestHeaderConfig.AllowedClientNames, config.RequestHeaderConfig.UsernameHeaders, config.RequestHeaderConfig.GroupHeaders, config.RequestHeaderConfig.ExtraHeaderPrefixes, ) authenticators = append(authenticators, authenticator.WrapAudienceAgnosticRequest(config.APIAudiences, requestHeaderAuthenticator)) } // 1.将各种认证生成authenticator,加入authenticators数组 // basic auth if len(config.BasicAuthFile) > 0 { basicAuth, err := newAuthenticatorFromBasicAuthFile(config.BasicAuthFile) if err != nil { return nil, nil, err } authenticators = append(authenticators, authenticator.WrapAudienceAgnosticRequest(config.APIAudiences, basicAuth)) securityDefinitions["HTTPBasic"] = &spec.SecurityScheme{ SecuritySchemeProps: spec.SecuritySchemeProps{ Type: "basic", Description: "HTTP Basic authentication", }, } } // X509 methods if config.ClientCAContentProvider != nil { certAuth := x509.NewDynamic(config.ClientCAContentProvider.VerifyOptions, x509.CommonNameUserConversion) authenticators = append(authenticators, certAuth) } // Bearer token methods, local first, then remote if len(config.TokenAuthFile) > 0 { tokenAuth, err := newAuthenticatorFromTokenFile(config.TokenAuthFile) if err != nil { return nil, nil, err } tokenAuthenticators = append(tokenAuthenticators, authenticator.WrapAudienceAgnosticToken(config.APIAudiences, tokenAuth)) } if len(config.ServiceAccountKeyFiles) > 0 { serviceAccountAuth, err := newLegacyServiceAccountAuthenticator(config.ServiceAccountKeyFiles, config.ServiceAccountLookup, config.APIAudiences, config.ServiceAccountTokenGetter) if err != nil { return nil, nil, err } tokenAuthenticators = append(tokenAuthenticators, serviceAccountAuth) } if utilfeature.DefaultFeatureGate.Enabled(features.TokenRequest) && config.ServiceAccountIssuer != "" { serviceAccountAuth, err := newServiceAccountAuthenticator(config.ServiceAccountIssuer, config.ServiceAccountKeyFiles, config.APIAudiences, config.ServiceAccountTokenGetter) if err != nil { return nil, nil, err } tokenAuthenticators = append(tokenAuthenticators, serviceAccountAuth) } if config.BootstrapToken { if config.BootstrapTokenAuthenticator != nil { // TODO: This can sometimes be nil because of tokenAuthenticators = append(tokenAuthenticators, authenticator.WrapAudienceAgnosticToken(config.APIAudiences, config.BootstrapTokenAuthenticator)) } } // NOTE(ericchiang): Keep the OpenID Connect after Service Accounts. // // Because both plugins verify JWTs whichever comes first in the union experiences // cache misses for all requests using the other. While the service account plugin // simply returns an error, the OpenID Connect plugin may query the provider to // update the keys, causing performance hits. 
if len(config.OIDCIssuerURL) > 0 && len(config.OIDCClientID) > 0 { oidcAuth, err := newAuthenticatorFromOIDCIssuerURL(oidc.Options{ IssuerURL: config.OIDCIssuerURL, ClientID: config.OIDCClientID, APIAudiences: config.APIAudiences, CAFile: config.OIDCCAFile, UsernameClaim: config.OIDCUsernameClaim, UsernamePrefix: config.OIDCUsernamePrefix, GroupsClaim: config.OIDCGroupsClaim, GroupsPrefix: config.OIDCGroupsPrefix, SupportedSigningAlgs: config.OIDCSigningAlgs, RequiredClaims: config.OIDCRequiredClaims, }) if err != nil { return nil, nil, err } tokenAuthenticators = append(tokenAuthenticators, oidcAuth) } if len(config.WebhookTokenAuthnConfigFile) > 0 { webhookTokenAuth, err := newWebhookTokenAuthenticator(config.WebhookTokenAuthnConfigFile, config.WebhookTokenAuthnVersion, config.WebhookTokenAuthnCacheTTL, config.APIAudiences) if err != nil { return nil, nil, err } tokenAuthenticators = append(tokenAuthenticators, webhookTokenAuth) } if len(tokenAuthenticators) > 0 { // Union the token authenticators tokenAuth := tokenunion.New(tokenAuthenticators...) // Optionally cache authentication results if config.TokenSuccessCacheTTL > 0 || config.TokenFailureCacheTTL > 0 { tokenAuth = tokencache.New(tokenAuth, true, config.TokenSuccessCacheTTL, config.TokenFailureCacheTTL) } authenticators = append(authenticators, bearertoken.New(tokenAuth), websocket.NewProtocolAuthenticator(tokenAuth)) securityDefinitions["BearerToken"] = &spec.SecurityScheme{ SecuritySchemeProps: spec.SecuritySchemeProps{ Type: "apiKey", Name: "authorization", In: "header", Description: "Bearer Token authentication", }, } } if len(authenticators) == 0 { if config.Anonymous { return anonymous.NewAuthenticator(), &securityDefinitions, nil } return nil, &securityDefinitions, nil } // 2. 生成一个union handler authenticator := union.New(authenticators...) // 3.最终得认证器AuthenticatedGroupAdder authenticator = group.NewAuthenticatedGroupAdder(authenticator) if config.Anonymous { // If the authenticator chain returns an error, return an error (don't consider a bad bearer token // or invalid username/password combination anonymous). authenticator = union.NewFailOnError(authenticator, anonymous.NewAuthenticator()) } return authenticator, &securityDefinitions, nil } // union.New函数。 // New returns a request authenticator that validates credentials using a chain of authenticator.Request objects. // The entire chain is tried until one succeeds. If all fail, an aggregate error is returned. func New(authRequestHandlers ...authenticator.Request) authenticator.Request { if len(authRequestHandlers) == 1 { return authRequestHandlers[0] } return &unionAuthRequestHandler{Handlers: authRequestHandlers, FailOnError: false} } ``` 为什么要弄成一个unionAuthRequestHandler,原因在于unionAuthRequestHandler有一个这样的函数AuthenticateRequest。 从这里可以看出来,unionAuthRequestHandler分别调用各种认证方法的handler,如果有一种方法认证成功,则成功,返回相应的用户信息。 ``` // AuthenticateRequest authenticates the request using a chain of authenticator.Request objects. func (authHandler *unionAuthRequestHandler) AuthenticateRequest(req *http.Request) (*authenticator.Response, bool, error) { var errlist []error for _, currAuthRequestHandler := range authHandler.Handlers { resp, ok, err := currAuthRequestHandler.AuthenticateRequest(req) if err != nil { if authHandler.FailOnError { return resp, ok, err } errlist = append(errlist, err) continue } if ok { return resp, ok, err } } return nil, false, utilerrors.NewAggregate(errlist) } ```
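union的好处是扩展很自然:任何实现了authenticator.Request接口的对象都可以挂进链里。下面是一个假设的小例子(非k8s内置认证方式,仅演示接口和union的用法):

```
// 假设的示例:一个只认 "Authorization: Demo <用户名>" 请求头的认证器
type demoAuthenticator struct{}

func (d *demoAuthenticator) AuthenticateRequest(req *http.Request) (*authenticator.Response, bool, error) {
	const prefix = "Demo "
	v := req.Header.Get("Authorization")
	if !strings.HasPrefix(v, prefix) {
		// 不是自己能处理的凭证,返回(nil, false, nil),union会继续尝试下一个认证器
		return nil, false, nil
	}
	return &authenticator.Response{
		User: &user.DefaultInfo{
			Name:   strings.TrimPrefix(v, prefix),
			Groups: []string{"system:authenticated"},
		},
	}, true, nil
}

// 使用方式:和其他认证器一起union即可
// authenticator := union.New(certAuth, bearertoken.New(tokenAuth), &demoAuthenticator{})
```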
### 3. 具体的认证过程 Authenticator步骤的输入是整个HTTP请求,但是,它通常只是检查HTTP Headers and/or client certificate。 可以指定多个Authenticator模块,在这种情况下,每个认证模块都按顺序尝试,直到其中一个成功即可。 如果认证成功,则用户的`username`会传入授权模块做进一步授权验证;而对于认证失败的请求则返回HTTP 401。 Kubernetes使用client certificates, bearer tokens, an authenticating proxy, or HTTP basic auth, 通过身份验证插件对API请求进行身份验证。 当向API服务器发出一个HTTP请求,Authentication plugin会尝试将以下属性与请求关联: - Username: 标识终端用户的字符串, 常用值可能是kube-admin或[jane@example.com](mailto:jane@example.com)。 - UID: 标识终端用户的字符串,比Username更具有唯一性。 - Groups: a set of strings which associate users with a set of commonly grouped users. - Extra fields: 可能有用的额外信息 系统中把这4个属性封装成一个type DefaultInfo struct ,见/pkg/auth/user/user.go。 ``` // DefaultInfo provides a simple user information exchange object // for components that implement the UserInfo interface. type DefaultInfo struct { Name string UID string Groups []string Extra map[string][]string } ``` #### 3.1 调用链路 staging/src/k8s.io/apiserver/pkg/server/filters/timeout.go ServeHTTP -> ServeHTTP -> ``` func (t *timeoutHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) { ... go func() { defer func() { err := recover() // do not wrap the sentinel ErrAbortHandler panic value if err != nil && err != http.ErrAbortHandler { // Same as stdlib http server code. Manually allocate stack // trace buffer size to prevent excessively large logs const size = 64 << 10 buf := make([]byte, size) buf = buf[:runtime.Stack(buf, false)] err = fmt.Sprintf("%v\n%s", err, buf) } resultCh <- err }() t.handler.ServeHTTP(tw, r) }() } ServeHTTP是一个接口,所以就看t.handler是谁 // ServeHTTP calls f(w, r). func (f HandlerFunc) ServeHTTP(w ResponseWriter, r *Request) { f(w, r) } ```
#### 3.2 t.handler到底是谁 staging/src/k8s.io/apiserver/pkg/server/config.go Apiserver的config中定义了handler函数。这是一串链式handler函数。 ``` // NewConfig returns a Config struct with the default values func NewConfig(codecs serializer.CodecFactory) *Config { defaultHealthChecks := []healthz.HealthChecker{healthz.PingHealthz, healthz.LogHealthz} return &Config{ Serializer: codecs, BuildHandlerChainFunc: DefaultBuildHandlerChain, } ```
#### 3.3 DefaultBuildHandlerChain DefaultBuildHandlerChain定义了链式handle。其中认证的就是 WithAuthentication函数 ``` func DefaultBuildHandlerChain(apiHandler http.Handler, c *Config) http.Handler { handler := genericapifilters.WithAuthorization(apiHandler, c.Authorization.Authorizer, c.Serializer) handler = genericfilters.WithMaxInFlightLimit(handler, c.MaxRequestsInFlight, c.MaxMutatingRequestsInFlight, c.LongRunningFunc) handler = genericapifilters.WithImpersonation(handler, c.Authorization.Authorizer, c.Serializer) handler = genericapifilters.WithAudit(handler, c.AuditBackend, c.AuditPolicyChecker, c.LongRunningFunc) failedHandler := genericapifilters.Unauthorized(c.Serializer, c.Authentication.SupportsBasicAuth) // 认证的handler failedHandler = genericapifilters.WithFailedAuthenticationAudit(failedHandler, c.AuditBackend, c.AuditPolicyChecker) handler = genericapifilters.WithAuthentication(handler, c.Authentication.Authenticator, failedHandler, c.Authentication.APIAudiences) handler = genericfilters.WithCORS(handler, c.CorsAllowedOriginList, nil, nil, nil, "true") handler = genericfilters.WithTimeoutForNonLongRunningRequests(handler, c.LongRunningFunc, c.RequestTimeout) handler = genericfilters.WithWaitGroup(handler, c.LongRunningFunc, c.HandlerChainWaitGroup) handler = genericapifilters.WithRequestInfo(handler, c.RequestInfoResolver) handler = genericfilters.WithPanicRecovery(handler) return handler } ```
#### 3.4 WithAuthentication WithAuthentication主要干了两件事: (1)调用AuthenticateRequest进行了认证。这里实际就是之前的unionAuthRequestHandler.AuthenticateRequest unionAuthRequestHandler.AuthenticateRequest会遍历所有的认证handler,然后有一个认证成功,就返回ok。 (2)如果认证失败,调用failed.ServeHTTP(w, req)进行处理 (3)如果成功, req.Header.Del("Authorization")删除头部的Authorization, 表示认证通过了 ``` // WithAuthentication creates an http handler that tries to authenticate the given request as a user, and then // stores any such user found onto the provided context for the request. If authentication fails or returns an error // the failed handler is used. On success, "Authorization" header is removed from the request and handler // is invoked to serve the request. func WithAuthentication(handler http.Handler, auth authenticator.Request, failed http.Handler, apiAuds authenticator.Audiences) http.Handler { if auth == nil { klog.Warningf("Authentication is disabled") return handler } return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) { authenticationStart := time.Now() if len(apiAuds) > 0 { req = req.WithContext(authenticator.WithAudiences(req.Context(), apiAuds)) } // 这里调用了AuthenticateRequest进行认证 resp, ok, err := auth.AuthenticateRequest(req) if err != nil || !ok { if err != nil { klog.Errorf("Unable to authenticate the request due to an error: %v", err) authenticatedAttemptsCounter.WithLabelValues(errorLabel).Inc() authenticationLatency.WithLabelValues(errorLabel).Observe(time.Since(authenticationStart).Seconds()) } else if !ok { authenticatedAttemptsCounter.WithLabelValues(failureLabel).Inc() authenticationLatency.WithLabelValues(failureLabel).Observe(time.Since(authenticationStart).Seconds()) } failed.ServeHTTP(w, req) return } if len(apiAuds) > 0 && len(resp.Audiences) > 0 && len(authenticator.Audiences(apiAuds).Intersect(resp.Audiences)) == 0 { klog.Errorf("Unable to match the audience: %v , accepted: %v", resp.Audiences, apiAuds) failed.ServeHTTP(w, req) return } // authorization header is not required anymore in case of a successful authentication. req.Header.Del("Authorization") req = req.WithContext(genericapirequest.WithUser(req.Context(), resp.User)) authenticatedUserCounter.WithLabelValues(compressUsername(resp.User.GetName())).Inc() authenticatedAttemptsCounter.WithLabelValues(successLabel).Inc() authenticationLatency.WithLabelValues(successLabel).Observe(time.Since(authenticationStart).Seconds()) handler.ServeHTTP(w, req) }) } ``` ### 4. 9种认证方式介绍 #### 4.1 BasicAuth认证 BasicAuth是一种简单的HTTP协议上的认证机制,客户端将用户、密码写入请求头中,HTTP服务端尝试从请求头中验证用户、密码信息,从而实现身份验证。客户端发送的请求头示例如下: ``` Authorization: Basic BASE64ENCODED(USER:PASSWORD) ``` 请求头的key为Authorization,value为Basic BASE64ENCODED(USER:PASSWORD),其中用户名及密码是通过Base64编码后的字符串。
**启用BasicAuth认证:** kube-apiserver通过指定--basic-auth-file参数启用BasicAuth认证。--basic-auth-file指定的认证文件(即AUTH_FILE)是一个CSV文件,每个用户在CSV中的表现形式为password、username、uid,文件内容示例如下:

```
a0d175cf548f665938498,derk,1
```
**认证函数:** staging/src/k8s.io/apiserver/plugin/pkg/authenticator/request/basicauth/basicauth.go ``` // AuthenticateRequest authenticates the request using the "Authorization: Basic" header in the request func (a *Authenticator) AuthenticateRequest(req *http.Request) (*authenticator.Response, bool, error) { username, password, found := req.BasicAuth() if !found { return nil, false, nil } resp, ok, err := a.auth.AuthenticatePassword(req.Context(), username, password) // If the password authenticator didn't error, provide a default error if !ok && err == nil { err = errInvalidAuth } return resp, ok, err } ```
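作为补充,下面用一个极简的 Go 客户端演示 BasicAuth 请求头是如何构造的(其中的地址、用户名、密码均为示例;net/http 的 SetBasicAuth 会自动生成 `Authorization: Basic base64(user:password)` 请求头):

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

func main() {
	// 地址、用户名、密码均为示例
	req, err := http.NewRequest("GET", "https://127.0.0.1:6443/api/v1/namespaces", nil)
	if err != nil {
		panic(err)
	}
	// SetBasicAuth 会写入 "Authorization: Basic base64(user:password)" 请求头
	req.SetBasicAuth("derk", "a0d175cf548f665938498")

	client := &http.Client{
		// 仅为演示而跳过服务端证书校验,生产环境应配置 CA
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```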
#### 4.2 ClientCA认证 ClientCA认证,也被称为TLS双向认证,即服务端与客户端互相验证证书的正确性。使用ClientCA认证的时候,只要是CA签名过的证书都可以通过验证。1.启用ClientCA认证kube-apiserver通过指定--client-ca-file参数启用ClientCA认证。这个目前比较常用。 ClientCA认证接口定义了AuthenticateRequest方法,该方法接收客户端请求。若验证失败,bool值会为false;若验证成功,bool值会为true,并返回*authenticator.Response,*authenticator.Response中携带了身份验证用户的信息,例如Name、UID、Groups、Extra等信息。 staging/src/k8s.io/apiserver/pkg/authentication/request/x509/x509.go ``` // AuthenticateRequest authenticates the request using presented client certificates func (a *Authenticator) AuthenticateRequest(req *http.Request) (*authenticator.Response, bool, error) { if req.TLS == nil || len(req.TLS.PeerCertificates) == 0 { return nil, false, nil } // Use intermediates, if provided optsCopy, ok := a.verifyOptionsFn() // if there are intentionally no verify options, then we cannot authenticate this request if !ok { return nil, false, nil } if optsCopy.Intermediates == nil && len(req.TLS.PeerCertificates) > 1 { optsCopy.Intermediates = x509.NewCertPool() for _, intermediate := range req.TLS.PeerCertificates[1:] { optsCopy.Intermediates.AddCert(intermediate) } } remaining := req.TLS.PeerCertificates[0].NotAfter.Sub(time.Now()) clientCertificateExpirationHistogram.Observe(remaining.Seconds()) chains, err := req.TLS.PeerCertificates[0].Verify(optsCopy) if err != nil { return nil, false, err } var errlist []error for _, chain := range chains { user, ok, err := a.user.User(chain) if err != nil { errlist = append(errlist, err) continue } if ok { return user, ok, err } } return nil, false, utilerrors.NewAggregate(errlist) } ``` 在进行ClientCA认证时,通过req.TLS.PeerCertifcates[0].Verify验证证书,如果是CA签名过的证书,都可以通过验证,认证失败会返回false,而认证成功会返回true。
#### 4.3 TokenAuth认证

Token也被称为令牌,服务端为了验证客户端的身份,需要客户端向服务端提供一个可靠的验证信息,这个验证信息就是Token。TokenAuth是基于Token的认证,Token一般是一个字符串。

**启用TokenAuth认证**

kube-apiserver通过指定--token-auth-file参数启用TokenAuth认证。--token-auth-file指定的文件(即TOKEN_FILE)是一个CSV文件,每个用户在CSV中的表现形式为token、user、uid、group(group为可选),文件内容示例如下:

```
a0d73844190894384102943,kubelet-bootstrap,1001,"system:kubelet-bootstrap"
```

Token认证接口定义了AuthenticateToken方法,该方法接收token字符串。若验证失败,bool值会为false;若验证成功,bool值会为true,并返回*authenticator.Response,*authenticator.Response中携带了身份验证用户的信息,例如Name、UID、Groups、Extra等信息。

```
func (a *TokenAuthenticator) AuthenticateToken(ctx context.Context, value string) (*authenticator.Response, bool, error) {
	user, ok := a.tokens[value]
	if !ok {
		return nil, false, nil
	}
	return &authenticator.Response{User: user}, true, nil
}
```
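上面的 a.tokens 就是启动时从 token 文件解析出来的映射表。下面是一个脱离 k8s 的极简示意(parseTokenFile、authenticateToken 均为虚构函数,真实实现使用 encoding/csv 并处理更多边界情况),演示"解析 CSV → 查表认证"的过程:

```go
package main

import (
	"fmt"
	"strings"
)

// userInfo 只保留演示需要的字段
type userInfo struct {
	Name   string
	UID    string
	Groups []string
}

// parseTokenFile 把 --token-auth-file 风格的内容解析成 token -> user 的映射,
// 每行格式:token,user,uid[,"group1,group2"]
func parseTokenFile(content string) map[string]userInfo {
	tokens := map[string]userInfo{}
	for _, line := range strings.Split(strings.TrimSpace(content), "\n") {
		fields := strings.SplitN(line, ",", 4)
		if len(fields) < 3 {
			continue
		}
		u := userInfo{Name: fields[1], UID: fields[2]}
		if len(fields) == 4 {
			u.Groups = strings.Split(strings.Trim(fields[3], `"`), ",")
		}
		tokens[fields[0]] = u
	}
	return tokens
}

// authenticateToken 模拟 AuthenticateToken:查表,查不到就返回 false
func authenticateToken(tokens map[string]userInfo, token string) (userInfo, bool) {
	u, ok := tokens[token]
	return u, ok
}

func main() {
	tokens := parseTokenFile(`a0d73844190894384102943,kubelet-bootstrap,1001,"system:kubelet-bootstrap"`)
	if u, ok := authenticateToken(tokens, "a0d73844190894384102943"); ok {
		fmt.Printf("authenticated as %s (groups=%v)\n", u.Name, u.Groups)
	} else {
		fmt.Println("invalid token")
	}
}
```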
#### 4.4 BootstrapToken认证 当Kubernetes集群中有非常多的节点时,手动为每个节点配置TLS认证比较烦琐,为此Kubernetes提供了BootstrapToken认证,其也被称为引导Token。客户端的Token信息与服务端的Token相匹配,则认证通过,自动为节点颁发证书,这是一种引导Token的机制。客户端发送的请求头示例如下: ``` Authorization: Bearer 07410b.f2355rejewrql ``` 请求头的key为Authorization,value为Bearer,其中TOKENS的表现形式为[a-z0-9]{6}.[a-z0-9]{16}。第一个组是Token ID,第二个组是TokenSecret。 **启用BootstrapToken认证** kube-apiserver通过指定--enable-bootstrap-token-auth参数启用BootstrapToken认证。 这个在安装kubelet的时候使用过。 BootstrapToken认证接口定义了AuthenticateToken方法,该方法接收token字符串。若验证失败,bool值会为false;若验证成功,bool值会为true,并返回*authenticator.Response,*authenticator.Response中携带了身份验证用户的信息,例如Name、UID、Groups、Extra等信息。 plugin/pkg/auth/authenticator/token/bootstrap/bootstrap.go ``` func (t *TokenAuthenticator) AuthenticateToken(ctx context.Context, token string) (*authenticator.Response, bool, error) { tokenID, tokenSecret, err := bootstraptokenutil.ParseToken(token) if err != nil { // Token isn't of the correct form, ignore it. return nil, false, nil } secretName := bootstrapapi.BootstrapTokenSecretPrefix + tokenID secret, err := t.lister.Get(secretName) if err != nil { if errors.IsNotFound(err) { klog.V(3).Infof("No secret of name %s to match bootstrap bearer token", secretName) return nil, false, nil } return nil, false, err } if secret.DeletionTimestamp != nil { tokenErrorf(secret, "is deleted and awaiting removal") return nil, false, nil } if string(secret.Type) != string(bootstrapapi.SecretTypeBootstrapToken) || secret.Data == nil { tokenErrorf(secret, "has invalid type, expected %s.", bootstrapapi.SecretTypeBootstrapToken) return nil, false, nil } ts := bootstrapsecretutil.GetData(secret, bootstrapapi.BootstrapTokenSecretKey) if subtle.ConstantTimeCompare([]byte(ts), []byte(tokenSecret)) != 1 { tokenErrorf(secret, "has invalid value for key %s, expected %s.", bootstrapapi.BootstrapTokenSecretKey, tokenSecret) return nil, false, nil } id := bootstrapsecretutil.GetData(secret, bootstrapapi.BootstrapTokenIDKey) if id != tokenID { tokenErrorf(secret, "has invalid value for key %s, expected %s.", bootstrapapi.BootstrapTokenIDKey, tokenID) return nil, false, nil } if bootstrapsecretutil.HasExpired(secret, time.Now()) { // logging done in isSecretExpired method. 
return nil, false, nil } if bootstrapsecretutil.GetData(secret, bootstrapapi.BootstrapTokenUsageAuthentication) != "true" { tokenErrorf(secret, "not marked %s=true.", bootstrapapi.BootstrapTokenUsageAuthentication) return nil, false, nil } groups, err := bootstrapsecretutil.GetGroups(secret) if err != nil { tokenErrorf(secret, "has invalid value for key %s: %v.", bootstrapapi.BootstrapTokenExtraGroupsKey, err) return nil, false, nil } return &authenticator.Response{ User: &user.DefaultInfo{ Name: bootstrapapi.BootstrapUserPrefix + string(id), Groups: groups, }, }, true, nil } ``` 在进行BootstrapToken认证时,通过paseToken函数解析出Token ID和TokenSecret,验证Token Secret中的Expire(过期)、Data、Type等,认证失败会返回false,而认证成功会返回true。 #### 4.5 RequestHeader认证 Kubernetes可以设置一个认证代理,客户端发送的认证请求可以通过认证代理将验证信息发送给kube-apiserver组件。RequestHeader认证使用的就是这种代理方式,它使用请求头将用户名和组信息发送给kube-apiserver。 RequestHeader认证有几个列表,分别介绍如下。 ● 用户名列表。建议使用X-Remote-User,如果启用RequestHeader认证,该参数必选。 ● 组列表。建议使用X-Remote-Group,如果启用RequestHeader认证,该参数可选。 ● 额外列表。建议使用X-Remote-Extra-,如果启用RequestHeader认证,该参数可选。 当客户端发送认证请求时,kube-apiserver根据Header Values中的用户名列表来识别用户,例如返回X-Remote-User:Bob则表示验证成功。 **启用RequestHeader认证** kube-apiserver通过指定如下参数启用RequestHeader认证。 ●--requestheader-client-ca-file:指定有效的客户端CA证书。 ●--requestheader-allowed-names:指定通用名称(CommonName)。 ●--requestheader-extra-headers-prefix:指定额外列表。 ●--requestheader-group-headers:指定组列表。 ●--requestheader-username-headers:指定用户名列表。 kube-apiserver收到客户端验证请求后,会先通过--requestheader-client-ca-file参数对客户端证书进行验证。 --requestheader-username-headers参数指定了Header中包含的用户名,这一参数中的列表确定了有效的用户名列表,如果该列表为空,则所有通过--requestheader-client-ca-file参数校验的请求都允许通过。 ``` func (a *requestHeaderAuthRequestHandler) AuthenticateRequest(req *http.Request) (*authenticator.Response, bool, error) { name := headerValue(req.Header, a.nameHeaders.Value()) if len(name) == 0 { return nil, false, nil } groups := allHeaderValues(req.Header, a.groupHeaders.Value()) extra := newExtra(req.Header, a.extraHeaderPrefixes.Value()) // clear headers used for authentication for _, headerName := range a.nameHeaders.Value() { req.Header.Del(headerName) } for _, headerName := range a.groupHeaders.Value() { req.Header.Del(headerName) } for k := range extra { for _, prefix := range a.extraHeaderPrefixes.Value() { req.Header.Del(prefix + k) } } return &authenticator.Response{ User: &user.DefaultInfo{ Name: name, Groups: groups, Extra: extra, }, }, true, nil } ``` 在进行RequestHeader认证时,通过headerValue函数从请求头中读取所有的用户信息,通过allHeaderValues函数读取所有组的信息,通过newExtra函数读取所有额外的信息。当用户名无法匹配时,则认证失败返回false,反之则认证成功返回true。
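下面用一小段独立的 Go 代码示意"从请求头提取用户名、组和额外信息"的过程(X-Remote-User、X-Remote-Group、X-Remote-Extra- 这些 header 名是社区建议值;userFromRequestHeader 为虚构函数,真实实现还会先校验代理的客户端证书,并在认证通过后删除这些 header):

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// userFromRequestHeader 从请求头中提取用户名、组和额外信息
func userFromRequestHeader(h http.Header) (name string, groups []string, extra map[string][]string) {
	name = h.Get("X-Remote-User")
	groups = h.Values("X-Remote-Group")
	extra = map[string][]string{}
	for key, values := range h {
		if strings.HasPrefix(key, "X-Remote-Extra-") {
			extra[strings.TrimPrefix(key, "X-Remote-Extra-")] = values
		}
	}
	return name, groups, extra
}

func main() {
	h := http.Header{}
	h.Set("X-Remote-User", "bob")
	h.Add("X-Remote-Group", "dev")
	h.Add("X-Remote-Group", "qa")
	h.Set("X-Remote-Extra-Scopes", "view")

	name, groups, extra := userFromRequestHeader(h)
	fmt.Println(name, groups, extra)
}
```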
#### 4.6 WebhookTokenAuth认证 Webhook也被称为钩子,是一种基于HTTP协议的回调机制,当客户端发送的认证请求到达kube-apiserver时,kube-apiserver回调钩子方法,将验证信息发送给远程的Webhook服务器进行认证,然后根据Webhook服务器返回的状态码来判断是否认证成功。 **启用WebhookTokenAuth认证** kube-apiserver通过指定如下参数启用WebhookTokenAuth认证。 ●--authentication-token-webhook-config-file:Webhook配置文件描述了如何访问远程Webhook服务。 ●--authentication-token-webhook-cache-ttl:缓存认证时间,默认值为2分钟。
WebhookTokenAuth认证接口定义了AuthenticateToken方法,该方法接收token字符串。若验证失败,bool值会为false;若验证成功,bool值会为true,并返回*authenticator.Response,*authenticator.Response中携带了身份验证用户的信息,例如Name、UID、Groups、Extra等信息。
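结合上面的描述,kube-apiserver 会把携带 token 的 TokenReview 对象 POST 给远程 Webhook 服务,由远程服务在 status 中写回认证结果。下面是一个极简的远程认证服务示意(结构体字段做了大量裁剪,token 校验逻辑、/authenticate 路径和监听端口均为示例,实际必须使用 TLS):

```go
package main

import (
	"encoding/json"
	"net/http"
)

// tokenReview 只保留演示需要的字段,真实的 TokenReview 定义在 authentication.k8s.io API 组中
type tokenReview struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
	Spec       struct {
		Token string `json:"token"`
	} `json:"spec"`
	Status struct {
		Authenticated bool `json:"authenticated"`
		User          struct {
			Username string   `json:"username,omitempty"`
			Groups   []string `json:"groups,omitempty"`
		} `json:"user"`
	} `json:"status"`
}

func main() {
	http.HandleFunc("/authenticate", func(w http.ResponseWriter, r *http.Request) {
		var review tokenReview
		if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// 校验逻辑纯属演示:只认一个固定 token
		if review.Spec.Token == "my-demo-token" {
			review.Status.Authenticated = true
			review.Status.User.Username = "demo-user"
			review.Status.User.Groups = []string{"developers"}
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(review)
	})
	http.ListenAndServe(":8443", nil)
}
```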
#### 4.7 Anonymous认证

Anonymous认证就是匿名认证,未被其他认证器拒绝的请求都可视为匿名请求。kube-apiserver默认开启Anonymous(匿名)认证。

**启用Anonymous认证**

kube-apiserver通过指定--anonymous-auth参数启用Anonymous认证,默认该参数值为true。

Anonymous认证接口定义了AuthenticateRequest方法,该方法接收客户端请求。若验证失败,bool值会为false;若验证成功,bool值会为true,并返回*authenticator.Response,*authenticator.Response中携带了身份验证用户的信息,例如Name、UID、Groups、Extra等信息。

在进行Anonymous认证时,直接验证成功,返回true。

#### 4.8 OIDC认证

OIDC(OpenID Connect)是一套基于OAuth 2.0协议的轻量级认证规范,其提供了通过API进行身份交互的框架。OIDC认证除了认证请求外,还会标明请求的用户身份(ID Token)。其中的Token被称为ID Token,此ID Token是JSON Web Token(JWT),具有由服务器签名的相关字段。

OIDC认证流程介绍如下。

(1)Kubernetes用户想访问Kubernetes API Server,先通过认证服务(Auth Server,例如Google Accounts服务)认证自己,得到access_token、id_token和refresh_token。

(2)Kubernetes用户把access_token、id_token和refresh_token配置到客户端应用程序(如kubectl或dashboard工具等)中。

(3)Kubernetes客户端使用Token以用户的身份访问Kubernetes API Server。Kubernetes API Server和Auth Server并没有直接进行交互,而是鉴定客户端发送的Token是否为合法Token。

**启用OIDC认证**

kube-apiserver通过指定如下参数启用OIDC认证。

●--oidc-ca-file:签署身份提供商的CA证书的路径,默认值为主机的根CA证书的路径(即/etc/kubernetes/ssl/kc-ca.pem)。

●--oidc-client-id:颁发所有Token的Client ID。

●--oidc-groups-claim:JWT(JSON Web Token)声明的用户组名称。

●--oidc-groups-prefix:组名前缀,所有组都将以此值为前缀,以避免与其他身份验证策略发生冲突。

●--oidc-issuer-url:Auth Server服务的URL地址,例如使用Google Accounts服务。

●--oidc-required-claim:该参数是键值对,用于描述ID Token中的必要声明。如果设置该参数,则验证声明是否以匹配值存在于ID Token中。重复指定该参数可以设置多个声明。

●--oidc-signing-algs:JOSE非对称签名算法列表,算法以逗号分隔。如果以alg开头的JWT请求不在此列表中,请求会被拒绝(默认值为[RS256])。

●--oidc-username-claim:JWT(JSON Web Token)声明的用户名称(默认值为sub)。

●--oidc-username-prefix:用户名前缀,所有用户名都将以此值为前缀,以避免与其他身份验证策略发生冲突。如果要跳过任何前缀,请设置该参数值为-。

#### 4.9 ServiceAccountAuth认证

ServiceAccountAuth是一种特殊的认证机制,其他认证机制都是处于Kubernetes集群外部而希望访问kube-apiserver组件,而ServiceAccountAuth认证是从Pod资源内部访问kube-apiserver组件,提供给运行在Pod资源中的进程使用,它为Pod资源中的进程提供必要的身份证明,从而获取集群的信息。ServiceAccountAuth认证通过Kubernetes资源的Service Account实现。

具体使用就是在创建pod的时候,定义使用ServiceAccount。

#### 4.10 总结

这一部分基本都是摘抄《Kubernetes源码剖析》一书中的内容。目的就是先了解一下具体有哪些认证,以后有需要再深入了解一下。
### 5.参考链接:

https://www.jianshu.com/p/daa4ff387a78

书籍:《Kubernetes源码剖析》,郑东旭

================================================
FILE: k8s/kube-apiserver/13-k8s之Authorization.md
================================================

Table of Contents
=================

* [1. Authorization简介](#1-authorization简介)
* [2. 6种授权机制](#2-6种授权机制)
* [2.1 AlwaysAllow](#21-alwaysallow)
* [2.2 AlwaysDeny授权](#22-alwaysdeny授权)
* [2.3 ABAC授权](#23-abac授权)
* [2.4 Webhook授权](#24-webhook授权)
* [2.5 RBAC授权](#25-rbac授权)
* [2.6 Node授权](#26-node授权)
* [3. 总结](#3-总结)
* [4. 参考](#4-参考)

kube-apiserver中与权限相关的主要有三种机制,即认证、鉴权和准入控制。这里主要记录鉴权相关的笔记。

### 1. Authorization简介

客户端请求到达apiserver端后,首先是认证,然后就是授权。apiserver同样也支持多种授权机制,并支持同时开启多个授权功能;如果开启多个授权功能,则按照顺序执行授权器,排在前面的授权器具有更高的优先级来允许或拒绝请求。客户端发起一个请求,在经过授权阶段后,**只要有一个授权器通过则授权成功**。
kube-apiserver目前提供了6种授权机制,分别是AlwaysAllow、AlwaysDeny、ABAC、Webhook、RBAC、Node,可通过kube-apiserver的启动参数--authorization-mode设置授权机制。

目前比较常用的就是RBAC和Webhook,例如:

--authorization-mode=RBAC,Webhook
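前面提到,开启多个授权器时会按顺序执行,只要有一个授权器给出明确的允许或拒绝就会直接结束。下面用一段极简的 Go 代码示意这种"按序遍历授权器"的逻辑(decision、authorizerFunc 等类型都是演示用的简化定义,真实实现还会返回 reason 和 error):

```go
package main

import "fmt"

type decision int

const (
	decisionNoOpinion decision = iota
	decisionAllow
	decisionDeny
)

// authorizerFunc 代表一个授权器,请求属性用 map 简化表示
type authorizerFunc func(attrs map[string]string) decision

// unionAuthorize 按顺序执行授权器:任意一个返回 Allow 或 Deny 就直接结束,
// 全部 NoOpinion 则最终视为拒绝。
func unionAuthorize(authorizers []authorizerFunc, attrs map[string]string) decision {
	for _, a := range authorizers {
		if d := a(attrs); d != decisionNoOpinion {
			return d
		}
	}
	return decisionDeny
}

func main() {
	rbac := func(attrs map[string]string) decision {
		if attrs["user"] == "admin" {
			return decisionAllow
		}
		return decisionNoOpinion
	}
	webhook := func(attrs map[string]string) decision { return decisionNoOpinion }

	fmt.Println(unionAuthorize([]authorizerFunc{rbac, webhook}, map[string]string{"user": "admin"})) // Allow
	fmt.Println(unionAuthorize([]authorizerFunc{rbac, webhook}, map[string]string{"user": "bob"}))   // 最终拒绝
}
```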
### 2. 6种授权机制 #### 2.1 AlwaysAllow 在进行AlwaysAllow授权时,直接授权成功,返回DecisionAllow决策状态。另外,AlwaysAllow的规则解析器会将资源类型的规则列表(ResourceRuleInfo)和非资源类型的规则列表(NonResourceRuleInfo)都设置为通配符(*)匹配所有资源版本、资源及资源操作方法。代码示例如下: ``` staging/src/k8s.io/apiserver/pkg/authorization/authorizerfactory/builtin.go func (alwaysAllowAuthorizer) RulesFor(user user.Info, namespace string) ([]authorizer.ResourceRuleInfo, []authorizer.NonResourceRuleInfo, bool, error) { return []authorizer.ResourceRuleInfo{ &authorizer.DefaultResourceRuleInfo{ Verbs: []string{"*"}, APIGroups: []string{"*"}, Resources: []string{"*"}, }, }, []authorizer.NonResourceRuleInfo{ &authorizer.DefaultNonResourceRuleInfo{ Verbs: []string{"*"}, NonResourceURLs: []string{"*"}, }, }, false, nil } ``` #### 2.2 AlwaysDeny授权 AlwaysDeny授权器会阻止所有请求,该授权器很少单独使用,一般会结合其他授权器一起使用。它的应用场景是先拒绝所有请求,再允许授权过的用户请求。 --authorization-mode=AlwaysDeny,Webhook 。所以这样就做到了,只允许Webhook授权的用户通过。
在进行AlwaysDeny授权时,直接返回DecisionNoOpinion决策状态。如果存在下一个授权器,会继续执行下一个授权器;如果不存在下一个授权器,则会拒绝所有请求,这就是kube-apiserver使用AlwaysDeny的应用场景。另外,AlwaysDeny的规则解析器会将资源类型的规则列表(ResourceRuleInfo)和非资源类型的规则列表(NonResourceRuleInfo)都设置为空,代码示例如下:

```
staging/src/k8s.io/apiserver/pkg/authorization/authorizerfactory/builtin.go

func (alwaysDenyAuthorizer) Authorize(ctx context.Context, a authorizer.Attributes) (decision authorizer.Decision, reason string, err error) {
	return authorizer.DecisionNoOpinion, "Everything is forbidden.", nil
}

func (alwaysDenyAuthorizer) RulesFor(user user.Info, namespace string) ([]authorizer.ResourceRuleInfo, []authorizer.NonResourceRuleInfo, bool, error) {
	return []authorizer.ResourceRuleInfo{}, []authorizer.NonResourceRuleInfo{}, false, nil
}
```

#### 2.3 ABAC授权

ABAC授权器实现了基于属性的访问控制(Attribute-Based Access Control,ABAC),它定义了一种访问控制范例:通过将属性组合在一起的策略来向用户授予操作权限。

kube-apiserver通过指定如下参数启用ABAC授权。

●--authorization-mode=ABAC:启用ABAC授权器。

●--authorization-policy-file:基于ABAC模式,指定策略文件。该文件使用JSON格式进行描述,每一行都是一个策略对象。例如,下面的策略表示alice可以对somenamespace下的所有资源做任何事情:

```
{"apiVersion": "abac.authorization.kubernetes.io/v1beta1", "kind": "Policy", "spec": {"user": "alice", "namespace": "somenamespace", "resource": "*", "apiGroup": "*"}}
```

**参考**:https://kubernetes.io/zh/docs/reference/access-authn-authz/abac/

在进行ABAC授权时,遍历所有的策略,通过matches函数进行匹配,如果授权成功,返回DecisionAllow决策状态。另外,ABAC的规则解析器会根据每一个策略将资源类型的规则列表(ResourceRuleInfo)和非资源类型的规则列表(NonResourceRuleInfo)都设置为该用户有权限操作的资源版本、资源及资源操作方法。代码示例如下:

```
pkg/auth/authorizer/abac/abac.go

// Authorize implements authorizer.Authorize
func (pl PolicyList) Authorize(ctx context.Context, a authorizer.Attributes) (authorizer.Decision, string, error) {
	for _, p := range pl {
		if matches(*p, a) {
			return authorizer.DecisionAllow, "", nil
		}
	}
	return authorizer.DecisionNoOpinion, "No policy matched.", nil
	// TODO: Benchmark how much time policy matching takes with a medium size
	// policy file, compared to other steps such as encoding/decoding.
	// Then, add Caching only if needed.
}

// RulesFor returns rules for the given user and namespace.
func (pl PolicyList) RulesFor(user user.Info, namespace string) ([]authorizer.ResourceRuleInfo, []authorizer.NonResourceRuleInfo, bool, error) {
	var (
		resourceRules    []authorizer.ResourceRuleInfo
		nonResourceRules []authorizer.NonResourceRuleInfo
	)

	for _, p := range pl {
		if subjectMatches(*p, user) {
			if p.Spec.Namespace == "*" || p.Spec.Namespace == namespace {
				if len(p.Spec.Resource) > 0 {
					r := authorizer.DefaultResourceRuleInfo{
						Verbs:     getVerbs(p.Spec.Readonly),
						APIGroups: []string{p.Spec.APIGroup},
						Resources: []string{p.Spec.Resource},
					}
					var resourceRule authorizer.ResourceRuleInfo = &r
					resourceRules = append(resourceRules, resourceRule)
				}
				if len(p.Spec.NonResourcePath) > 0 {
					r := authorizer.DefaultNonResourceRuleInfo{
						Verbs:           getVerbs(p.Spec.Readonly),
						NonResourceURLs: []string{p.Spec.NonResourcePath},
					}
					var nonResourceRule authorizer.NonResourceRuleInfo = &r
					nonResourceRules = append(nonResourceRules, nonResourceRule)
				}
			}
		}
	}
	return resourceRules, nonResourceRules, false, nil
}
```

**缺点:** 每次更新策略的时候,需要修改策略文件,并且重启kube-apiserver。

#### 2.4 Webhook授权

Webhook授权器拥有基于HTTP协议回调的机制,当用户授权时,kube-apiserver组件会查询外部的Webhook服务。该过程与WebhookTokenAuth认证相似,但其中确认用户身份的机制不一样。当客户端发送的请求到达kube-apiserver时,kube-apiserver回调钩子方法,将授权信息发送给远程的Webhook服务器进行授权查询,并根据Webhook服务器返回的状态来判断是否授权成功。
kube-apiserver通过指定如下参数启用Webhook授权。 ●--authorization-mode=Webhook:启用Webhook授权器。 ●--authorization-webhook-config-file:使用kubeconfig格式的Webhook配置文件。Webhook授权器配置文件定义如下: ``` # Kubernetes API 版本 apiVersion: v1 # API 对象种类 kind: Config # clusters 代表远程服务。 clusters: - name: name-of-remote-authz-service cluster: # 对远程服务进行身份认证的 CA。 certificate-authority: /path/to/ca.pem # 远程服务的查询 URL。必须使用 'https'。 server: https://authz.example.com/authorize # users 代表 API 服务器的 webhook 配置 users: - name: name-of-api-server user: client-certificate: /path/to/cert.pem # webhook plugin 使用 cert client-key: /path/to/key.pem # cert 所对应的 key # kubeconfig 文件必须有 context。需要提供一个给 API 服务器。 current-context: webhook contexts: - context: cluster: name-of-remote-authz-service user: name-of-api-server name: webhook ``` 如上配置,文件使用kubeconfig格式。在该配置文件中,users指的是kube-apiserver本身,clusters指的是远程Webhook服务。 **参考:**https://kubernetes.io/zh/docs/reference/access-authn-authz/webhook/
在进行Webhook授权时,首先通过w.responseCache.Get函数从缓存中查找是否已有缓存的授权,如果有则直接使用该状态 (Status),如果没有则通过w.subjectAccessReview.Create(RESTClient)从远程的Webhook服务器获取授权验证,该函数发送Post 请求,并在请求体(Body)中携带授权信息。在验证Webhook服务器授权之后,返回的Status.Allowed字段为true,表示授权成功并返 回DecisionAllow决策状态。另外,Webhook的规则解析器不支持规则列表解析,因为规则是由远程的Webhook服务端进行授权的。所 以Webhook的规则解析器的资源类型的规则列表(ResourceRuleInfo)和非资源类型的规则列表(NonResourceRuleInfo)都会被设置 为空。代码示例如下: ``` staging/src/k8s.io/apiserver/plugin/pkg/authorizer/webhook/webhook.go // Authorize makes a REST request to the remote service describing the attempted action as a JSON // serialized api.authorization.v1beta1.SubjectAccessReview object. An example request body is // provided below. // // { // "apiVersion": "authorization.k8s.io/v1beta1", // "kind": "SubjectAccessReview", // "spec": { // "resourceAttributes": { // "namespace": "kittensandponies", // "verb": "GET", // "group": "group3", // "resource": "pods" // }, // "user": "jane", // "group": [ // "group1", // "group2" // ] // } // } // // The remote service is expected to fill the SubjectAccessReviewStatus field to either allow or // disallow access. A permissive response would return: // // { // "apiVersion": "authorization.k8s.io/v1beta1", // "kind": "SubjectAccessReview", // "status": { // "allowed": true // } // } // // To disallow access, the remote service would return: // // { // "apiVersion": "authorization.k8s.io/v1beta1", // "kind": "SubjectAccessReview", // "status": { // "allowed": false, // "reason": "user does not have read access to the namespace" // } // } // // TODO(mikedanese): We should eventually support failing closed when we // encounter an error. We are failing open now to preserve backwards compatible // behavior. func (w *WebhookAuthorizer) Authorize(ctx context.Context, attr authorizer.Attributes) (decision authorizer.Decision, reason string, err error) { r := &authorizationv1.SubjectAccessReview{} if user := attr.GetUser(); user != nil { r.Spec = authorizationv1.SubjectAccessReviewSpec{ User: user.GetName(), UID: user.GetUID(), Groups: user.GetGroups(), Extra: convertToSARExtra(user.GetExtra()), } } if attr.IsResourceRequest() { r.Spec.ResourceAttributes = &authorizationv1.ResourceAttributes{ Namespace: attr.GetNamespace(), Verb: attr.GetVerb(), Group: attr.GetAPIGroup(), Version: attr.GetAPIVersion(), Resource: attr.GetResource(), Subresource: attr.GetSubresource(), Name: attr.GetName(), } } else { r.Spec.NonResourceAttributes = &authorizationv1.NonResourceAttributes{ Path: attr.GetPath(), Verb: attr.GetVerb(), } } key, err := json.Marshal(r.Spec) if err != nil { return w.decisionOnError, "", err } // 先使用缓存 if entry, ok := w.responseCache.Get(string(key)); ok { r.Status = entry.(authorizationv1.SubjectAccessReviewStatus) } else { var ( result *authorizationv1.SubjectAccessReview err error ) webhook.WithExponentialBackoff(ctx, w.initialBackoff, func() error { result, err = w.subjectAccessReview.CreateContext(ctx, r) return err }, webhook.DefaultShouldRetry) if err != nil { // An error here indicates bad configuration or an outage. Log for debugging. 
klog.Errorf("Failed to make webhook authorizer request: %v", err) return w.decisionOnError, "", err } r.Status = result.Status if shouldCache(attr) { if r.Status.Allowed { w.responseCache.Add(string(key), r.Status, w.authorizedTTL) } else { w.responseCache.Add(string(key), r.Status, w.unauthorizedTTL) } } } switch { case r.Status.Denied && r.Status.Allowed: return authorizer.DecisionDeny, r.Status.Reason, fmt.Errorf("webhook subject access review returned both allow and deny response") case r.Status.Denied: return authorizer.DecisionDeny, r.Status.Reason, nil case r.Status.Allowed: return authorizer.DecisionAllow, r.Status.Reason, nil default: return authorizer.DecisionNoOpinion, r.Status.Reason, nil } } //TODO: need to finish the method to get the rules when using webhook mode func (w *WebhookAuthorizer) RulesFor(user user.Info, namespace string) ([]authorizer.ResourceRuleInfo, []authorizer.NonResourceRuleInfo, bool, error) { var ( resourceRules []authorizer.ResourceRuleInfo nonResourceRules []authorizer.NonResourceRuleInfo ) incomplete := true return resourceRules, nonResourceRules, incomplete, fmt.Errorf("webhook authorizer does not support user rule resolution") } ```
#### 2.5 RBAC授权

RBAC授权器实现了基于角色的权限访问控制(Role-Based Access Control),这也是目前使用最为广泛的授权模型。在RBAC授权器中,权限与角色相关联,形成了用户—角色—权限的授权模型。用户通过加入某些角色从而得到这些角色的操作权限,这极大地简化了权限管理。

在kube-apiserver设计的RBAC授权器中,新增了角色与集群绑定的概念,也就是说,kube-apiserver提供了4种数据类型来表达基于角色的授权,它们分别是角色(Role)、集群角色(ClusterRole)、角色绑定(RoleBinding)及集群角色绑定(ClusterRoleBinding),这4种数据类型定义在vendor/k8s.io/api/rbac/v1/types.go中。

Role <-> RoleBinding:角色定义了一组规则(权限),且只能被授予某一个命名空间内的权限。

ClusterRole <-> ClusterRoleBinding:集群角色同样定义了一组规则,并通过集群角色绑定与用户(或用户组、ServiceAccount)关联。集群角色能够被授予集群范围的权限,例如节点、非资源类型的服务端点(Endpoint)、跨所有命名空间的权限等。
``` // Role is a namespaced, logical grouping of PolicyRules that can be referenced as a unit by a RoleBinding. type Role struct { metav1.TypeMeta `json:",inline"` // Standard object's metadata. // +optional metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"` // Rules holds all the PolicyRules for this Role // +optional Rules []PolicyRule `json:"rules" protobuf:"bytes,2,rep,name=rules"` } // +genclient // +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object // RoleBinding references a role, but does not contain it. It can reference a Role in the same namespace or a ClusterRole in the global namespace. // It adds who information via Subjects and namespace information by which namespace it exists in. RoleBindings in a given // namespace only have effect in that namespace. type RoleBinding struct { metav1.TypeMeta `json:",inline"` // Standard object's metadata. // +optional metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"` // Subjects holds references to the objects the role applies to. // +optional Subjects []Subject `json:"subjects,omitempty" protobuf:"bytes,2,rep,name=subjects"` // RoleRef can reference a Role in the current namespace or a ClusterRole in the global namespace. // If the RoleRef cannot be resolved, the Authorizer must return an error. RoleRef RoleRef `json:"roleRef" protobuf:"bytes,3,opt,name=roleRef"` } ``` **Role**就是相当于定义了一些规则列表。具体就是。举个例子,这个就是定义了一个 haimaxy-role 的角色。这个角色可以对 extensions.apps组下面的deploy, rs, pod执行 "get", "list", "watch", "create", "update", "patch", "delete"操作。 ``` apiVersion: rbac.authorization.k8s.io/v1 kind: Role metadata: name: haimaxy-role namespace: kube-system rules: - apiGroups: ["", "extensions", "apps"] resources: ["deployments", "replicasets", "pods"] verbs: ["get", "list", "watch", "create", "update", "patch", "delete"] # 也可以使用['*'] ```
而RoleBinding就是将某个用户和角色进行绑定。 这里就是 User.haimaxy绑定了上面的haimaxy-role。这样haimaxy这个用户,就可以对 extensions.apps组下面的deploy, rs, pod执行 "get", "list", "watch", "create", "update", "patch", "delete"操作。 ``` apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: name: haimaxy-rolebinding namespace: kube-system subjects: - kind: User name: haimaxy apiGroup: "" roleRef: kind: Role name: haimaxy-role apiGroup: "" ``` 更多的使用教程,可以参考这个博客,手动创建一波就清楚了:https://www.qikqiak.com/post/use-rbac-in-k8s/
RBAC的鉴权流程如下:

1. 通过`Request`获取`Attributes`,包括用户、资源和对应的操作
2. `Authorize`调用`VisitRulesFor`进行具体的鉴权
3. 获取所有的ClusterRoleBinding,并对其进行遍历
4. 根据请求的User信息,判断该用户是否被绑定在当前ClusterRoleBinding中
5. 若在其中,则通过函数`GetRoleReferenceRules()`获取所绑定的Role/ClusterRole定义的规则(可访问的资源)
6. 将这些规则与从API请求中提取出的资源信息进行比对,若比对成功,则说明API请求的调用者有权访问相关资源
7. 若遍历完所有ClusterRoleBinding都没有鉴权成功,则判断提取出的信息中是否包含namespace信息;若包含,则获取该namespace下的所有RoleBinding,按与ClusterRoleBinding类似的方式处理
8. 若遍历了所有ClusterRoleBinding及该namespace下的所有RoleBinding之后,仍没有比对成功,则可判断该API请求的调用者没有权限访问相关资源,鉴权失败

这里没有细看,参考了文章:https://qingwave.github.io/kube-apiserver-authorization-code/
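为了更直观,下面用一段极简的 Go 代码示意上述"遍历绑定 → 匹配用户 → 用规则比对请求"的过程(rule、binding、rbacAllow 等类型和函数都是演示用的简化定义,省略了 APIGroup、namespace 等维度):

```go
package main

import "fmt"

// rule 是对 PolicyRule 的极度简化
type rule struct {
	Verbs     []string
	Resources []string
}

// binding 同时代表 (Cluster)RoleBinding 及其引用的 Role 规则
type binding struct {
	Subjects []string // 被绑定的用户名
	Rules    []rule   // 绑定的 Role/ClusterRole 所包含的规则
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s || v == "*" {
			return true
		}
	}
	return false
}

// rbacAllow 遍历所有绑定:用户匹配后再逐条规则比对 verb 和 resource
func rbacAllow(bindings []binding, user, verb, resource string) bool {
	for _, b := range bindings {
		if !contains(b.Subjects, user) {
			continue
		}
		for _, r := range b.Rules {
			if contains(r.Verbs, verb) && contains(r.Resources, resource) {
				return true
			}
		}
	}
	return false
}

func main() {
	bindings := []binding{
		{
			Subjects: []string{"haimaxy"},
			Rules:    []rule{{Verbs: []string{"get", "list", "watch"}, Resources: []string{"pods", "deployments"}}},
		},
	}
	fmt.Println(rbacAllow(bindings, "haimaxy", "list", "pods"))   // true
	fmt.Println(rbacAllow(bindings, "haimaxy", "delete", "pods")) // false
	fmt.Println(rbacAllow(bindings, "alice", "get", "pods"))      // false
}
```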
#### 2.6 Node授权 Node授权器也被称为节点授权,是一种特殊用途的授权机制,专门授权由kubelet组件发出的API请求。 Node授权器基于RBAC授权机制实现,对kubelet组件进行基于system:node内置角色的权限控制。 system:node内置角色的权限定义在NodeRules函数中,代码示例如下: NodeRules函数定义了system:node内置角色的权限,它拥有许多资源的操作权限,例如Configmap、Secret、Service、Pod等资源。例如,在上面的代码中,针对Pod资源的get、list、watch、create、delete等操作权限。 ``` const ( legacyGroup = "" appsGroup = "apps" authenticationGroup = "authentication.k8s.io" authorizationGroup = "authorization.k8s.io" autoscalingGroup = "autoscaling" batchGroup = "batch" certificatesGroup = "certificates.k8s.io" coordinationGroup = "coordination.k8s.io" discoveryGroup = "discovery.k8s.io" extensionsGroup = "extensions" policyGroup = "policy" rbacGroup = "rbac.authorization.k8s.io" storageGroup = "storage.k8s.io" resMetricsGroup = "metrics.k8s.io" customMetricsGroup = "custom.metrics.k8s.io" networkingGroup = "networking.k8s.io" eventsGroup = "events.k8s.io" ) func NodeRules() []rbacv1.PolicyRule { nodePolicyRules := []rbacv1.PolicyRule{ // Needed to check API access. These creates are non-mutating rbacv1helpers.NewRule("create").Groups(authenticationGroup).Resources("tokenreviews").RuleOrDie(), rbacv1helpers.NewRule("create").Groups(authorizationGroup).Resources("subjectaccessreviews", "localsubjectaccessreviews").RuleOrDie(), // Needed to build serviceLister, to populate env vars for services rbacv1helpers.NewRule(Read...).Groups(legacyGroup).Resources("services").RuleOrDie(), // Nodes can register Node API objects and report status. // Use the NodeRestriction admission plugin to limit a node to creating/updating its own API object. rbacv1helpers.NewRule("create", "get", "list", "watch").Groups(legacyGroup).Resources("nodes").RuleOrDie(), rbacv1helpers.NewRule("update", "patch").Groups(legacyGroup).Resources("nodes/status").RuleOrDie(), rbacv1helpers.NewRule("update", "patch").Groups(legacyGroup).Resources("nodes").RuleOrDie(), // TODO: restrict to the bound node as creator in the NodeRestrictions admission plugin rbacv1helpers.NewRule("create", "update", "patch").Groups(legacyGroup).Resources("events").RuleOrDie(), // TODO: restrict to pods scheduled on the bound node once field selectors are supported by list/watch authorization rbacv1helpers.NewRule(Read...).Groups(legacyGroup).Resources("pods").RuleOrDie(), // Needed for the node to create/delete mirror pods. // Use the NodeRestriction admission plugin to limit a node to creating/deleting mirror pods bound to itself. rbacv1helpers.NewRule("create", "delete").Groups(legacyGroup).Resources("pods").RuleOrDie(), // Needed for the node to report status of pods it is running. // Use the NodeRestriction admission plugin to limit a node to updating status of pods bound to itself. rbacv1helpers.NewRule("update", "patch").Groups(legacyGroup).Resources("pods/status").RuleOrDie(), // Needed for the node to create pod evictions. // Use the NodeRestriction admission plugin to limit a node to creating evictions for pods bound to itself. rbacv1helpers.NewRule("create").Groups(legacyGroup).Resources("pods/eviction").RuleOrDie(), // Needed for imagepullsecrets, rbd/ceph and secret volumes, and secrets in envs // Needed for configmap volume and envs // Use the Node authorization mode to limit a node to get secrets/configmaps referenced by pods bound to itself. rbacv1helpers.NewRule("get", "list", "watch").Groups(legacyGroup).Resources("secrets", "configmaps").RuleOrDie(), // Needed for persistent volumes // Use the Node authorization mode to limit a node to get pv/pvc objects referenced by pods bound to itself. 
rbacv1helpers.NewRule("get").Groups(legacyGroup).Resources("persistentvolumeclaims", "persistentvolumes").RuleOrDie(), // TODO: add to the Node authorizer and restrict to endpoints referenced by pods or PVs bound to the node // Needed for glusterfs volumes rbacv1helpers.NewRule("get").Groups(legacyGroup).Resources("endpoints").RuleOrDie(), // Used to create a certificatesigningrequest for a node-specific client certificate, and watch // for it to be signed. This allows the kubelet to rotate it's own certificate. rbacv1helpers.NewRule("create", "get", "list", "watch").Groups(certificatesGroup).Resources("certificatesigningrequests").RuleOrDie(), // Leases rbacv1helpers.NewRule("get", "create", "update", "patch", "delete").Groups("coordination.k8s.io").Resources("leases").RuleOrDie(), // CSI rbacv1helpers.NewRule("get").Groups(storageGroup).Resources("volumeattachments").RuleOrDie(), } if utilfeature.DefaultFeatureGate.Enabled(features.ExpandPersistentVolumes) { // Use the Node authorization mode to limit a node to update status of pvc objects referenced by pods bound to itself. // Use the NodeRestriction admission plugin to limit a node to just update the status stanza. pvcStatusPolicyRule := rbacv1helpers.NewRule("get", "update", "patch").Groups(legacyGroup).Resources("persistentvolumeclaims/status").RuleOrDie() nodePolicyRules = append(nodePolicyRules, pvcStatusPolicyRule) } if utilfeature.DefaultFeatureGate.Enabled(features.TokenRequest) { // Use the Node authorization to limit a node to create tokens for service accounts running on that node // Use the NodeRestriction admission plugin to limit a node to create tokens bound to pods on that node tokenRequestRule := rbacv1helpers.NewRule("create").Groups(legacyGroup).Resources("serviceaccounts/token").RuleOrDie() nodePolicyRules = append(nodePolicyRules, tokenRequestRule) } // CSI if utilfeature.DefaultFeatureGate.Enabled(features.CSIDriverRegistry) { csiDriverRule := rbacv1helpers.NewRule("get", "watch", "list").Groups("storage.k8s.io").Resources("csidrivers").RuleOrDie() nodePolicyRules = append(nodePolicyRules, csiDriverRule) } if utilfeature.DefaultFeatureGate.Enabled(features.CSINodeInfo) { csiNodeInfoRule := rbacv1helpers.NewRule("get", "create", "update", "patch", "delete").Groups("storage.k8s.io").Resources("csinodes").RuleOrDie() nodePolicyRules = append(nodePolicyRules, csiNodeInfoRule) } // RuntimeClass if utilfeature.DefaultFeatureGate.Enabled(features.RuntimeClass) { nodePolicyRules = append(nodePolicyRules, rbacv1helpers.NewRule("get", "list", "watch").Groups("node.k8s.io").Resources("runtimeclasses").RuleOrDie()) } return nodePolicyRules } ```
在进行Node授权时,通过r.identifier.NodeIdentity函数获取角色信息,并验证其是否为system:node内置角色,nodeName的表现形式为system:node:。通过rbac.RulesAllow函数进行RBAC授权,如果授权成功,返回DecisionAllow决策状态。 ``` func (r *NodeAuthorizer) Authorize(ctx context.Context, attrs authorizer.Attributes) (authorizer.Decision, string, error) { nodeName, isNode := r.identifier.NodeIdentity(attrs.GetUser()) if !isNode { // reject requests from non-nodes return authorizer.DecisionNoOpinion, "", nil } if len(nodeName) == 0 { // reject requests from unidentifiable nodes klog.V(2).Infof("NODE DENY: unknown node for user %q", attrs.GetUser().GetName()) return authorizer.DecisionNoOpinion, fmt.Sprintf("unknown node for user %q", attrs.GetUser().GetName()), nil } // subdivide access to specific resources if attrs.IsResourceRequest() { requestResource := schema.GroupResource{Group: attrs.GetAPIGroup(), Resource: attrs.GetResource()} switch requestResource { case secretResource: return r.authorizeReadNamespacedObject(nodeName, secretVertexType, attrs) case configMapResource: return r.authorizeReadNamespacedObject(nodeName, configMapVertexType, attrs) case pvcResource: if r.features.Enabled(features.ExpandPersistentVolumes) { if attrs.GetSubresource() == "status" { return r.authorizeStatusUpdate(nodeName, pvcVertexType, attrs) } } return r.authorizeGet(nodeName, pvcVertexType, attrs) case pvResource: return r.authorizeGet(nodeName, pvVertexType, attrs) case vaResource: return r.authorizeGet(nodeName, vaVertexType, attrs) case svcAcctResource: if r.features.Enabled(features.TokenRequest) { return r.authorizeCreateToken(nodeName, serviceAccountVertexType, attrs) } return authorizer.DecisionNoOpinion, fmt.Sprintf("disabled by feature gate %s", features.TokenRequest), nil case leaseResource: return r.authorizeLease(nodeName, attrs) case csiNodeResource: if r.features.Enabled(features.CSINodeInfo) { return r.authorizeCSINode(nodeName, attrs) } return authorizer.DecisionNoOpinion, fmt.Sprintf("disabled by feature gates %s", features.CSINodeInfo), nil } } // Access to other resources is not subdivided, so just evaluate against the statically defined node rules if rbac.RulesAllow(attrs, r.nodeRules...) { return authorizer.DecisionAllow, "", nil } return authorizer.DecisionNoOpinion, "", nil } ```
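上面代码中 NodeIdentity 的作用是从请求的用户信息中识别出节点身份:节点用户名的约定形式为 system:node:<nodeName>,并且该用户属于 system:nodes 组。下面是一个脱离 k8s 的极简示意(nodeIdentity 为虚构函数,省略了真实实现中的其他校验):

```go
package main

import (
	"fmt"
	"strings"
)

// nodeIdentity 从用户名和组信息中识别节点身份
func nodeIdentity(username string, groups []string) (nodeName string, isNode bool) {
	const prefix = "system:node:"
	if !strings.HasPrefix(username, prefix) {
		return "", false
	}
	inNodesGroup := false
	for _, g := range groups {
		if g == "system:nodes" {
			inNodesGroup = true
			break
		}
	}
	if !inNodesGroup {
		return "", false
	}
	return strings.TrimPrefix(username, prefix), true
}

func main() {
	name, ok := nodeIdentity("system:node:node-1", []string{"system:nodes"})
	fmt.Println(name, ok) // node-1 true

	name, ok = nodeIdentity("alice", []string{"developers"})
	fmt.Println(name, ok) // "" false
}
```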
### 3. 总结

(1)本文主要参考《Kubernetes源码剖析》这本书,并结合一些自身的使用经验,记录一下k8s授权方面的知识,日后需要相关开发或者更深入了解时,有一定的知识基础。

(2)针对这6种授权模式,个人认为的优缺点如下:

| 模式 | 优点 | 缺点 |
| ----------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| AlwaysAllow | 简单,适用于自己搭建集群实践,可以省一些搭建环境的事情 | 非常不安全 |
| AlwaysDeny | 基本是配合使用的。AlwaysDeny放第一个,先拒绝所有请求,后面再加Webhook等授权器,好处是只允许Webhook授权过的请求通过 | |
| ABAC | | 每次需要修改策略文件,并且重启apiserver。而正式集群中,重启apiserver的风险是非常大的 |
| Webhook | 可以通过webhook使用一套定制化的权限管理系统 | |
| RBAC | 是目前比较流行的授权思路,优点:(1)对集群中的资源和非资源均拥有完整的覆盖;(2)整个RBAC完全由几个API对象完成,同其他API对象一样,可以用kubectl或API进行操作;(3)可以在运行时进行操作,无需重启API Server | |
| Node | 专门针对kubelet的授权,事先为kubelet定义好了一组权限 | |

个人感觉目前常用的就是:RBAC、Webhook、Node这三种。

### 4. 参考

书籍:《Kubernetes源码剖析》,郑东旭

https://kubernetes.io/zh/docs/reference/access-authn-authz/abac/

https://kubernetes.io/zh/docs/reference/access-authn-authz/webhook/

https://www.qikqiak.com/post/use-rbac-in-k8s/

https://qingwave.github.io/kube-apiserver-authorization-code/
================================================ FILE: k8s/kube-apiserver/14-k8s之admission分析.md ================================================ Table of Contents ================= * [1. 背景](#1-背景) * [2. 分析流程](#2-分析流程) * [2.1 Admission的注册](#21-admission的注册) * [2.2 admission的调用](#22-admission的调用) * [2.3 validatingwebhook, mutatingwebhook的调用](#23--validatingwebhook-mutatingwebhook的调用) * [2.3.1 ValidatingAdmissionWebhook调用](#231-validatingadmissionwebhook调用) * [2.3.2 MutatingAdmissionWebhook调用](#232-mutatingadmissionwebhook调用) * [2.4 动态更新webhook的原理](#24-动态更新webhook的原理) * [3. 总结](#3-总结) * [4.参考链接:](#4参考链接) ### 1. 背景 api Request -> 认证 -> 授权 -> admission -> etcd. 和initializer不同,webhook是在保存在etcd之前工作的。 经过了认证,授权之后,接下来就到了webhook这个环节了。 这篇笔记主要就是分析 `MutatingAdmissionWebhook` 和 `ValidatingAdmissionWebhook` 如何工作的。
### 2. 分析流程 #### 2.1 Admission的注册 kube-apiserver在调用NewServerRunOptions函数初始化options的时候,调用了NewAdmissionOptions去初始化了AdmissionOptions,并注册了内置的 admission插件和webhook admission插件。 ``` // NewServerRunOptions creates a new ServerRunOptions object with default parameters func NewServerRunOptions() *ServerRunOptions { s := ServerRunOptions{ // 省略... // 初始化AdmissionOptions Admission: kubeoptions.NewAdmissionOptions(), Authentication: kubeoptions.NewBuiltInAuthenticationOptions().WithAll(), Authorization: kubeoptions.NewBuiltInAuthorizationOptions(), // 省略... } // ... return &s } ```
**AdmissionOptions的一些基础概念** ``` options.AdmissionOptions // AdmissionOptions holds the admission options. // It is a wrap of generic AdmissionOptions. type AdmissionOptions struct { // GenericAdmission holds the generic admission options. GenericAdmission *genericoptions.AdmissionOptions // DEPRECATED flag, should use EnabledAdmissionPlugins and DisabledAdmissionPlugins. // They are mutually exclusive, specify both will lead to an error. PluginNames []string } genericoptions.AdmissionOptions // AdmissionOptions holds the admission options type AdmissionOptions struct { // 有序的推荐插件列表集合 RecommendedPluginOrder []string // 默认禁止的插件 DefaultOffPlugins sets.String // 开启的插件列表,通过kube-apiserver 启动参数设置--enable-admission-plugins 选项 EnablePlugins []string // 禁止的插件列表,通过kube-apiserver 启动参数设置 --disable-admission-plugins 选项 DisablePlugins []string // ConfigFile is the file path with admission control configuration. ConfigFile string // 代表了所有已经注册的插件 Plugins *admission.Plugins } ```
**options. NewAdmissionOptions()** NewAdmissionOptions里面先是调用genericoptions.NewAdmissionOptions创建一个AdmissionOptions,NewAdmissionOptions同时也注册了lifecycle、validatingwebhook、mutatingwebhook这三个插件。然后再调用RegisterAllAdmissionPlugins注册内置的其他admission。 ``` options. NewAdmissionOptions() // NewAdmissionOptions creates a new instance of AdmissionOptions // Note: // In addition it calls RegisterAllAdmissionPlugins to register // all kube-apiserver admission plugins. // // Provides the list of RecommendedPluginOrder that holds sane values // that can be used by servers that don't care about admission chain. // Servers that do care can overwrite/append that field after creation. func NewAdmissionOptions() *AdmissionOptions { // 这里注册了 lifecycle, initialization,validatingwebhook,mutatingwebhook 四个admission。(2.2.2 中mutating的注册函数就是这个时候调用的) options := genericoptions.NewAdmissionOptions() // 这里注册了所有的 admission, 没有上面四个 admission // register all admission plugins RegisterAllAdmissionPlugins(options.Plugins) // set RecommendedPluginOrder options.RecommendedPluginOrder = AllOrderedPlugins // 确定了admission-plugin的相对顺序。 // set DefaultOffPlugins // 设置默认的停用插件 options.DefaultOffPlugins = DefaultOffAdmissionPlugins() return &AdmissionOptions{ GenericAdmission: options, } } genericoptions.NewAdmissionOptions() // NewAdmissionOptions creates a new instance of AdmissionOptions // Note: // In addition it calls RegisterAllAdmissionPlugins to register // all generic admission plugins. // // Provides the list of RecommendedPluginOrder that holds sane values // that can be used by servers that don't care about admission chain. // Servers that do care can overwrite/append that field after creation. func NewAdmissionOptions() *AdmissionOptions { options := &AdmissionOptions{ Plugins: admission.NewPlugins(), // This list is mix of mutating admission plugins and validating // admission plugins. The apiserver always runs the validating ones // after all the mutating ones, so their relative order in this list // doesn't matter. RecommendedPluginOrder: []string{lifecycle.PluginName, initialization.PluginName, mutatingwebhook.PluginName, validatingwebhook.PluginName}, DefaultOffPlugins: sets.NewString(initialization.PluginName), } // 注册了lifecycle、validatingwebhook、mutatingwebhook server.RegisterAllAdmissionPlugins(options.Plugins) return options } // validatingwebhook, mutatingwebhook 是动态的,这里应该就是注册一个总体的概念,而不是一个一个的实体。 // RegisterAllAdmissionPlugins registers all admission plugins func RegisterAllAdmissionPlugins(plugins *admission.Plugins) { lifecycle.Register(plugins) initialization.Register(plugins) validatingwebhook.Register(plugins) mutatingwebhook.Register(plugins) } ``` **AllOrderedPlugins** ``` // AllOrderedPlugins is the list of all the plugins in order. 
var AllOrderedPlugins = []string{ admit.PluginName, // AlwaysAdmit autoprovision.PluginName, // NamespaceAutoProvision lifecycle.PluginName, // NamespaceLifecycle exists.PluginName, // NamespaceExists scdeny.PluginName, // SecurityContextDeny antiaffinity.PluginName, // LimitPodHardAntiAffinityTopology podpreset.PluginName, // PodPreset limitranger.PluginName, // LimitRanger serviceaccount.PluginName, // ServiceAccount noderestriction.PluginName, // NodeRestriction alwayspullimages.PluginName, // AlwaysPullImages imagepolicy.PluginName, // ImagePolicyWebhook podsecuritypolicy.PluginName, // PodSecurityPolicy podnodeselector.PluginName, // PodNodeSelector podpriority.PluginName, // Priority defaulttolerationseconds.PluginName, // DefaultTolerationSeconds podtolerationrestriction.PluginName, // PodTolerationRestriction exec.DenyEscalatingExec, // DenyEscalatingExec exec.DenyExecOnPrivileged, // DenyExecOnPrivileged eventratelimit.PluginName, // EventRateLimit extendedresourcetoleration.PluginName, // ExtendedResourceToleration label.PluginName, // PersistentVolumeLabel setdefault.PluginName, // DefaultStorageClass storageobjectinuseprotection.PluginName, // StorageObjectInUseProtection gc.PluginName, // OwnerReferencesPermissionEnforcement resize.PluginName, // PersistentVolumeClaimResize mutatingwebhook.PluginName, // MutatingAdmissionWebhook initialization.PluginName, // Initializers validatingwebhook.PluginName, // ValidatingAdmissionWebhook resourcequota.PluginName, // ResourceQuota deny.PluginName, // AlwaysDeny } ```
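为了理解"先注册插件工厂、再按推荐顺序实例化成 admission chain"的思路,下面给出一个与 k8s 无关的极简示意(admissionPlugin、plugins 等类型均为演示用的简化定义;文中真实的注册入口是 admission.Plugins 的 Register,链的构造在后面提到的 NewFromPlugins 中完成):

```go
package main

import "fmt"

// admissionPlugin 是演示用的最小接口,真实接口是 admission.Interface(包含 Admit/Validate 等方法)
type admissionPlugin interface {
	Name() string
}

// factory 负责创建一个插件实例
type factory func() admissionPlugin

// plugins 按名字保存插件工厂,需要时再实例化
type plugins struct {
	registry map[string]factory
}

func (p *plugins) Register(name string, f factory) {
	p.registry[name] = f
}

// newFromPlugins 按给定顺序实例化已启用的插件,得到一条 admission chain
func (p *plugins) newFromPlugins(enabled []string) []admissionPlugin {
	var chain []admissionPlugin
	for _, name := range enabled {
		if f, ok := p.registry[name]; ok {
			chain = append(chain, f())
		}
	}
	return chain
}

type namedPlugin struct{ name string }

func (n namedPlugin) Name() string { return n.name }

func main() {
	p := &plugins{registry: map[string]factory{}}
	// 注册若干插件,名字仅为示例
	for _, name := range []string{"NamespaceLifecycle", "MutatingAdmissionWebhook", "ValidatingAdmissionWebhook"} {
		n := name
		p.Register(n, func() admissionPlugin { return namedPlugin{name: n} })
	}

	// 按顺序生成 chain,请求到来时依次调用每个插件即可
	for _, plugin := range p.newFromPlugins([]string{"NamespaceLifecycle", "MutatingAdmissionWebhook", "ValidatingAdmissionWebhook"}) {
		fmt.Println(plugin.Name())
	}
}
```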
#### 2.2 admission的调用 前面已经分析AdmissionPlugin注册到ServerRunOptions的过程, buildGenericConfig中会调用ServerRunOptions.Admission.ApplyTo生成admission chain设置到GenericConfig里面。把所有的admission plugin生成chainAdmissionHandler对象,其实就是plugin数组,这个类的Admit、Validate等方法会遍历调用每个plugin的Admit、Validate方法 ``` buildGenericConfig(){ err = s.Admission.ApplyTo( genericConfig, versionedInformers, kubeClientConfig, feature.DefaultFeatureGate, pluginInitializers...) } ``` GenericConfig.AdmissionControl 又会赋值给GenericAPIServer.admissionControl ``` func (a *AdmissionOptions) ApplyTo( c *server.Config, informers informers.SharedInformerFactory, kubeAPIServerClientConfig *rest.Config, features featuregate.FeatureGate, pluginInitializers ...admission.PluginInitializer, ) error { // 省略 ... // 找到所有启用的plugin pluginNames := a.enabledPluginNames() pluginsConfigProvider, err := admission.ReadAdmissionConfiguration(pluginNames, a.ConfigFile, configScheme) if err != nil { return fmt.Errorf("failed to read plugin config: %v", err) } clientset, err := kubernetes.NewForConfig(kubeAPIServerClientConfig) if err != nil { return err } genericInitializer := initializer.New(clientset, informers, c.Authorization.Authorizer, features) initializersChain := admission.PluginInitializers{} pluginInitializers = append(pluginInitializers, genericInitializer) initializersChain = append(initializersChain, pluginInitializers...) // 把所有的admission plugin生成admissionChain,实际是个plugin数组 admissionChain, err := a.Plugins.NewFromPlugins(pluginNames, pluginsConfigProvider, initializersChain, a.Decorators) if err != nil { return err } // 把admissionChain设置给GenericConfig.AdmissionControl c.AdmissionControl = admissionmetrics.WithStepMetrics(admissionChain) return nil } ``` Admission Plugin是在kube-apiserver处理完前面的handler之后,在调用RESTStorage的Get、Create、Update、Delete等函数前会调用Admission Plugin。 kube-apiserver有很多的handler组成了handler链,这写handler链的最内层,是使用gorestful框架注册的WebService。每个WebService都对应一种资源的RESTStorage,比如NodeStorage(pkg/registry/core/node/storage/storage.go ),installAPIResources初始化WebService时,会把RESTStorage的Get、Create、Update等函数分别封装成Get、POST、PUT等http方法的handler注册到WebService中。 比如把Update函数封装成http handler 作为PUT方法的handler,而在这个hanlder调用Update函数之前,会先调用Admission Plugin的Admit、Validate等函数。下面看个PUT方法的例子。
a.group.Admit是从GenericAPIServer.admissionControl取的值,就是前面ApplyTo函数生成的admissionChain。admit、updater作为参数调用restfulUpdateResource函数生成的handler a.group.Admit是从GenericAPIServer.admissionControl取的值,就是前面ApplyTo函数生成的admissionChain。admit、updater作为参数调用restfulUpdateResource函数生成的handler ``` // staging/src/k8s.io/apiserver/pkg/endpoints/installer.go func (a *APIInstaller) registerResourceHandlers(path string, storage rest.Storage, ws *restful.WebService) (*metav1.APIResource, error) { admit := a.group.Admit // 省略 ... updater, isUpdater := storage.(rest.Updater) // 省略 ... switch action.Verb { case "GET": ... case "PUT": // Update a resource. doc := "replace the specified " + kind if isSubresource { doc = "replace " + subresource + " of the specified " + kind } // admit、updater作为参数调用restfulUpdateResource函数生成的handler handler := metrics.InstrumentRouteFunc(action.Verb, group, version, resource, subresource, requestScope, metrics.APIServerComponent, restfulUpdateResource(updater, reqScope, admit)) route := ws.PUT(action.Path).To(handler). Doc(doc). Param(ws.QueryParameter("pretty", "If 'true', then the output is pretty printed.")). Operation("replace"+namespaced+kind+strings.Title(subresource)+operationSuffix). Produces(append(storageMeta.ProducesMIMETypes(action.Verb), mediaTypes...)...). Returns(http.StatusOK, "OK", producedObject). // TODO: in some cases, the API may return a v1.Status instead of the versioned object // but currently go-restful can't handle multiple different objects being returned. Returns(http.StatusCreated, "Created", producedObject). Reads(defaultVersionedObject). Writes(producedObject) if err := AddObjectParams(ws, route, versionedUpdateOptions); err != nil { return nil, err } addParams(route, action.Params) routes = append(routes, route) case "PARTCH": ... // 省略 .... } } restfulUpdateResource调用了 handlers.UpdateResource。 func restfulUpdateResource(r rest.Updater, scope handlers.RequestScope, admit admission.Interface) restful.RouteFunction { return func(req *restful.Request, res *restful.Response) { handlers.UpdateResource(r, &scope, admit)(res.ResponseWriter, req.Request) } } ``` 看handlers.UpdateResource的代码实现,会先判断如果传入的admission.Interface参数是MutationInterface类型,就调用Admit,也就是调用admissionChain的Admit,最终会遍历调用每个Admission Plugin的Admit方法。而Webhook Admission是众多admission中的一个。 执行完Admission,后面的requestFunc 才会调用RESTStorage的Update函数。每个资源的RESTStorage最终都是要调用ETCD3Storage的Get、Update等函数。 ``` // staging/src/k8s.io/apiserver/pkg/endpoints/handlers/update.go func UpdateResource(r rest.Updater, scope *RequestScope, admit admission.Interface) http.HandlerFunc { return func(w http.ResponseWriter, req *http.Request) { // 省略 ... 
ae := request.AuditEventFrom(ctx) audit.LogRequestObject(ae, obj, scope.Resource, scope.Subresource, scope.Serializer) admit = admission.WithAudit(admit, ae) // 如果admit是MutationInterface类型的,就调用其Admit函数,也就是admissionChain的Admit if mutatingAdmission, ok := admit.(admission.MutationInterface); ok { transformers = append(transformers, func(ctx context.Context, newObj, oldObj runtime.Object) (runtime.Object, error) { isNotZeroObject, err := hasUID(oldObj) if err != nil { return nil, fmt.Errorf("unexpected error when extracting UID from oldObj: %v", err.Error()) } else if !isNotZeroObject { if mutatingAdmission.Handles(admission.Create) { return newObj, mutatingAdmission.Admit(ctx, admission.NewAttributesRecord(newObj, nil, scope.Kind, namespace, name, scope.Resource, scope.Subresource, admission.Create, updateToCreateOptions(options), dryrun.IsDryRun(options.DryRun), userInfo), scope) } } else { if mutatingAdmission.Handles(admission.Update) { return newObj, mutatingAdmission.Admit(ctx, admission.NewAttributesRecord(newObj, oldObj, scope.Kind, namespace, name, scope.Resource, scope.Subresource, admission.Update, options, dryrun.IsDryRun(options.DryRun), userInfo), scope) } } return newObj, nil }) } // 省略 ... // 执行完MutationInterface类型的admission,这里先会执行validatingAdmission,然后才调用RESTStorage的Update函数 requestFunc := func() (runtime.Object, error) { obj, created, err := r.Update( ctx, name, rest.DefaultUpdatedObjectInfo(obj, transformers...), withAuthorization(rest.AdmissionToValidateObjectFunc( admit, admission.NewAttributesRecord(nil, nil, scope.Kind, namespace, name, scope.Resource, scope.Subresource, admission.Create, updateToCreateOptions(options), dryrun.IsDryRun(options.DryRun), userInfo), scope), scope.Authorizer, createAuthorizerAttributes), // 这里调用了validatingAdmission.Validate函数 rest.AdmissionToValidateObjectUpdateFunc( admit, admission.NewAttributesRecord(nil, nil, scope.Kind, namespace, name, scope.Resource, scope.Subresource, admission.Update, options, dryrun.IsDryRun(options.DryRun), userInfo), scope), false, options, ) wasCreated = created return obj, err } result, err := finishRequest(timeout, func() (runtime.Object, error) { result, err := requestFunc() // 省略 ... return result, err }) // ... 
transformResponseObject(ctx, scope, trace, req, w, status, outputMediaType, result) } } // 这里调用了validatingAdmission.Validate函数 // AdmissionToValidateObjectUpdateFunc converts validating admission to a rest validate object update func func AdmissionToValidateObjectUpdateFunc(admit admission.Interface, staticAttributes admission.Attributes, o admission.ObjectInterfaces) ValidateObjectUpdateFunc { validatingAdmission, ok := admit.(admission.ValidationInterface) if !ok { return func(ctx context.Context, obj, old runtime.Object) error { return nil } } return func(ctx context.Context, obj, old runtime.Object) error { finalAttributes := admission.NewAttributesRecord( obj, old, staticAttributes.GetKind(), staticAttributes.GetNamespace(), staticAttributes.GetName(), staticAttributes.GetResource(), staticAttributes.GetSubresource(), staticAttributes.GetOperation(), staticAttributes.GetOperationOptions(), staticAttributes.IsDryRun(), staticAttributes.GetUserInfo(), ) if !validatingAdmission.Handles(finalAttributes.GetOperation()) { return nil } return validatingAdmission.Validate(ctx, finalAttributes, o) } } ``` 以上是PUT方法的例子,里面调用了MutationInterface和ValidationInterface。其他的方法比如POST、DELETE等也是类似。但是GET方法不会调用Admission Plugin。 #### 2.3 validatingwebhook, mutatingwebhook的调用 validatingwebhook和mutatingwebhook分别位于staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/validating/plugin.go,staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/mutating/plugin.go两个文件中。 ##### 2.3.1 ValidatingAdmissionWebhook调用 (1) ValidatingAdmissionWebhook的Validate()函数实现了ValidationInterface接口,有请求到来时kube-apiserver会调用所有admission 的Validate()方法。ValidatingAdmissionWebhook持有了一个Webhook对象,Validate()会调用Webhook.Dispatch()。 (2)Webhook.Dispatch()又调用了其持有的dispatcher的Dispatch()方法。dispatcher时通过dispatcherFactory创建的,dispatcherFactory是ValidatingAdmissionWebhook创建generic.Webhook时候传入的newValidatingDispatcher函数。调用dispatcherFactory函数创建的实际上是validatingDispatcher对象,也就是Webhook.Dispatch()调用的是validatingDispatcher.Dispatch()。 (3)validatingDispatcher.Dispatch()会逐个远程调用注册的webhook plugin NewValidatingAdmissionWebhook初始化了ValidatingAdmissionWebhook对象,内部持有了一个generic.Webhook对象,generic.Webhook是一个Validate和mutate公用的框架,创建generic.Webhook时需要一个dispatcherFactory函数,用这个函数生成dispatcher对象。 ``` // staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/validating/plugin.go // NewValidatingAdmissionWebhook returns a generic admission webhook plugin. func NewValidatingAdmissionWebhook(configFile io.Reader) (*Plugin, error) { handler := admission.NewHandler(admission.Connect, admission.Create, admission.Delete, admission.Update) p := &Plugin{} var err error p.Webhook, err = generic.NewWebhook(handler, configFile, configuration.NewValidatingWebhookConfigurationManager, newValidatingDispatcher(p)) if err != nil { return nil, err } return p, nil } // Validate makes an admission decision based on the request attributes. func (a *Plugin) Validate(ctx context.Context, attr admission.Attributes, o admission.ObjectInterfaces) error { return a.Webhook.Dispatch(ctx, attr, o) } ``` 调用generic.Webhook.Dispatch()时会调用dispatcher对象的Dispatch。 ``` // Dispatch is called by the downstream Validate or Admit methods. 
func (a *Webhook) Dispatch(ctx context.Context, attr admission.Attributes, o admission.ObjectInterfaces) error { if rules.IsWebhookConfigurationResource(attr) { return nil } if !a.WaitForReady() { return admission.NewForbidden(attr, fmt.Errorf("not yet ready to handle request")) } hooks := a.hookSource.Webhooks() return a.dispatcher.Dispatch(ctx, attr, o, hooks) } ``` validatingDispatcher.Dispatch遍历所有的hooks ,找到相关的webhooks,然后执行callHooks调用外部注册进来的 ```go func (d *validatingDispatcher) Dispatch(ctx context.Context, attr admission.Attributes, o admission.ObjectInterfaces, hooks []webhook.WebhookAccessor) error { var relevantHooks []*generic.WebhookInvocation // Construct all the versions we need to call our webhooks versionedAttrs := map[schema.GroupVersionKind]*generic.VersionedAttributes{} for _, hook := range hooks { // 遍历所有的webhooks,根据ValidatingWebhookConfiguration中的rules是否匹配找到所有相关的hooks invocation, statusError := d.plugin.ShouldCallHook(hook, attr, o) if statusError != nil { return statusError } if invocation == nil { continue } relevantHooks = append(relevantHooks, invocation) // If we already have this version, continue if _, ok := versionedAttrs[invocation.Kind]; ok { continue } versionedAttr, err := generic.NewVersionedAttributes(attr, invocation.Kind, o) if err != nil { return apierrors.NewInternalError(err) } versionedAttrs[invocation.Kind] = versionedAttr } if len(relevantHooks) == 0 { // no matching hooks return nil } // Check if the request has already timed out before spawning remote calls select { case <-ctx.Done(): // parent context is canceled or timed out, no point in continuing return apierrors.NewTimeoutError("request did not complete within requested timeout", 0) default: } wg := sync.WaitGroup{} errCh := make(chan error, len(relevantHooks)) wg.Add(len(relevantHooks)) for i := range relevantHooks { go func(invocation *generic.WebhookInvocation) { defer wg.Done() hook, ok := invocation.Webhook.GetValidatingWebhook() if !ok { utilruntime.HandleError(fmt.Errorf("validating webhook dispatch requires v1.ValidatingWebhook, but got %T", hook)) return } versionedAttr := versionedAttrs[invocation.Kind] t := time.Now() // 启动多个go routine 并行调用注册进来的webhook plugin err := d.callHook(ctx, hook, invocation, versionedAttr) ignoreClientCallFailures := hook.FailurePolicy != nil && *hook.FailurePolicy == v1.Ignore rejected := false if err != nil { switch err := err.(type) { case *webhookutil.ErrCallingWebhook: if !ignoreClientCallFailures { rejected = true admissionmetrics.Metrics.ObserveWebhookRejection(hook.Name, "validating", string(versionedAttr.Attributes.GetOperation()), admissionmetrics.WebhookRejectionCallingWebhookError, 0) } case *webhookutil.ErrWebhookRejection: rejected = true admissionmetrics.Metrics.ObserveWebhookRejection(hook.Name, "validating", string(versionedAttr.Attributes.GetOperation()), admissionmetrics.WebhookRejectionNoError, int(err.Status.ErrStatus.Code)) default: rejected = true admissionmetrics.Metrics.ObserveWebhookRejection(hook.Name, "validating", string(versionedAttr.Attributes.GetOperation()), admissionmetrics.WebhookRejectionAPIServerInternalError, 0) } } admissionmetrics.Metrics.ObserveWebhook(time.Since(t), rejected, versionedAttr.Attributes, "validating", hook.Name) if err == nil { return } if callErr, ok := err.(*webhookutil.ErrCallingWebhook); ok { if ignoreClientCallFailures { klog.Warningf("Failed calling webhook, failing open %v: %v", hook.Name, callErr) utilruntime.HandleError(callErr) return } klog.Warningf("Failed calling webhook, failing closed %v: 
%v", hook.Name, err) errCh <- apierrors.NewInternalError(err) return } if rejectionErr, ok := err.(*webhookutil.ErrWebhookRejection); ok { err = rejectionErr.Status } klog.Warningf("rejected by webhook %q: %#v", hook.Name, err) errCh <- err }(relevantHooks[i]) } // 等待多个goroutine 执行完成 wg.Wait() close(errCh) var errs []error for e := range errCh { errs = append(errs, e) } if len(errs) == 0 { return nil } if len(errs) > 1 { for i := 1; i < len(errs); i++ { // TODO: merge status errors; until then, just return the first one. utilruntime.HandleError(errs[i]) } } return errs[0] } ``` ##### 2.3.2 MutatingAdmissionWebhook调用 看MutatingWebhook的构造函数就可以看到,MutatingWebhook和ValidatingWebhook的代码架构是一样的,只不过在创建generic.Webhook的时候传入的dispatcherFactory函数是newMutatingDispatcher,所以Webhook.Dispatch()最终调用的就是mutatingDispatcher.Dispatch(),这个和validatingDispatcher.Dispatch的实现逻辑基本是一样的,也是根据WebhookConfiguration中的rules是否匹配找到相关的webhooks,然后逐个调用。 ``` // staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/mutating/plugin.go // NewMutatingWebhook returns a generic admission webhook plugin. func NewMutatingWebhook(configFile io.Reader) (*Plugin, error) { handler := admission.NewHandler(admission.Connect, admission.Create, admission.Delete, admission.Update) p := &Plugin{} var err error p.Webhook, err = generic.NewWebhook(handler, configFile, configuration.NewMutatingWebhookConfigurationManager, newMutatingDispatcher(p)) if err != nil { return nil, err } return p, nil } // ValidateInitialization implements the InitializationValidator interface. func (a *Plugin) ValidateInitialization() error { if err := a.Webhook.ValidateInitialization(); err != nil { return err } return nil } // Admit makes an admission decision based on the request attributes. func (a *Plugin) Admit(ctx context.Context, attr admission.Attributes, o admission.ObjectInterfaces) error { return a.Webhook.Dispatch(ctx, attr, o) } func (a *mutatingDispatcher) Dispatch(ctx context.Context, attr admission.Attributes, o admission.ObjectInterfaces, hooks []webhook.WebhookAccessor) error { reinvokeCtx := attr.GetReinvocationContext() var webhookReinvokeCtx *webhookReinvokeContext if v := reinvokeCtx.Value(PluginName); v != nil { webhookReinvokeCtx = v.(*webhookReinvokeContext) } else { webhookReinvokeCtx = &webhookReinvokeContext{} reinvokeCtx.SetValue(PluginName, webhookReinvokeCtx) } if reinvokeCtx.IsReinvoke() && webhookReinvokeCtx.IsOutputChangedSinceLastWebhookInvocation(attr.GetObject()) { // If the object has changed, we know the in-tree plugin re-invocations have mutated the object, // and we need to reinvoke all eligible webhooks. webhookReinvokeCtx.RequireReinvokingPreviouslyInvokedPlugins() } defer func() { webhookReinvokeCtx.SetLastWebhookInvocationOutput(attr.GetObject()) }() var versionedAttr *generic.VersionedAttributes //是一个一个执行的 for i, hook := range hooks { attrForCheck := attr if versionedAttr != nil { attrForCheck = versionedAttr } invocation, statusErr := a.plugin.ShouldCallHook(hook, attrForCheck, o) if statusErr != nil { return statusErr } if invocation == nil { continue } hook, ok := invocation.Webhook.GetMutatingWebhook() if !ok { return fmt.Errorf("mutating webhook dispatch requires v1.MutatingWebhook, but got %T", hook) } // This means that during reinvocation, a webhook will not be // called for the first time. For example, if the webhook is // skipped in the first round because of mismatching labels, // even if the labels become matching, the webhook does not // get called during reinvocation. 
if reinvokeCtx.IsReinvoke() && !webhookReinvokeCtx.ShouldReinvokeWebhook(invocation.Webhook.GetUID()) { continue } return nil } ``` #### 2.4 动态更新webhook的原理 我们使用的时候都是通过创建类似于创建这样的来增加一个webhook。那如果我增加了一个这个webhook, 是如何生效的呢。 ``` apiVersion: admissionregistration.k8s.io/v1beta1 kind: ValidatingWebhookConfiguration metadata: name: validation-webhook-example-cfg labels: app: admission-webhook-example webhooks: - name: required-labels.banzaicloud.com clientConfig: service: name: admission-webhook-example-webhook-svc namespace: default path: "/validate" caBundle: ${CA_BUNDLE} rules: - operations: [ "CREATE" ] apiGroups: ["apps", ""] apiVersions: ["v1"] resources: ["deployments","services"] namespaceSelector: matchLabels: admission-webhook-example: enabled ```
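补充一点:上面 clientConfig 里的 service + path 指向的就是我们自己实现的 webhook 服务端,apiserver 会把 AdmissionReview 对象以 POST 方式发到这个地址。下面是一个极简的服务端示意(假设的代码,并非 k8s 源码;handler 名称、证书路径均为虚构,真实实现还需要解码具体对象并做校验逻辑;这里沿用 v1beta1 的 AdmissionReview,若配置为 v1 则换成 k8s.io/api/admission/v1):

```
package main

import (
	"encoding/json"
	"net/http"

	admissionv1beta1 "k8s.io/api/admission/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// validate 处理 apiserver 发来的 AdmissionReview 请求,返回是否放行
func validate(w http.ResponseWriter, r *http.Request) {
	var review admissionv1beta1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "invalid AdmissionReview", http.StatusBadRequest)
		return
	}
	// 这里省略了真正的校验逻辑(例如检查 required labels),直接放行
	review.Response = &admissionv1beta1.AdmissionResponse{
		UID:     review.Request.UID,
		Allowed: true,
		Result:  &metav1.Status{Message: "ok"},
	}
	_ = json.NewEncoder(w).Encode(&review)
}

func main() {
	http.HandleFunc("/validate", validate)
	// 证书需要与 ValidatingWebhookConfiguration 中的 caBundle 对应,路径仅为示意
	_ = http.ListenAndServeTLS(":8443", "/etc/webhook/certs/tls.crt", "/etc/webhook/certs/tls.key", nil)
}
```

回到 apiserver 这一侧,再看新创建的 ValidatingWebhookConfiguration 是如何被动态加载、在 Dispatch 时拿到最新配置的: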
``` // Dispatch is called by the downstream Validate or Admit methods. func (a *Webhook) Dispatch(ctx context.Context, attr admission.Attributes, o admission.ObjectInterfaces) error { if rules.IsWebhookConfigurationResource(attr) { return nil } if !a.WaitForReady() { return admission.NewForbidden(attr, fmt.Errorf("not yet ready to handle request")) } //这里获取了所有的webhook,然后再调用的Dispatch函数 hooks := a.hookSource.Webhooks() return a.dispatcher.Dispatch(ctx, attr, o, hooks) } // 获得所有的validatingWebhookConfiguration // Webhooks returns the merged ValidatingWebhookConfiguration. func (v *validatingWebhookConfigurationManager) Webhooks() []webhook.WebhookAccessor { return v.configuration.Load().([]webhook.WebhookAccessor) } ``` ValidatingWebhookConfigurationManager会维护所有的validatingWebhookConfiguration,一旦有ValidatingWebhookConfigurationManager的add, update, del都会调用updateConfiguration更新 ``` pkg/admission/configuration/validating_webhook_manager.go func NewValidatingWebhookConfigurationManager(f informers.SharedInformerFactory) generic.Source { informer := f.Admissionregistration().V1().ValidatingWebhookConfigurations() manager := &validatingWebhookConfigurationManager{ configuration: &atomic.Value{}, lister: informer.Lister(), hasSynced: informer.Informer().HasSynced, } // Start with an empty list manager.configuration.Store([]webhook.WebhookAccessor{}) // On any change, rebuild the config informer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ AddFunc: func(_ interface{}) { manager.updateConfiguration() }, UpdateFunc: func(_, _ interface{}) { manager.updateConfiguration() }, DeleteFunc: func(_ interface{}) { manager.updateConfiguration() }, }) return manager } //然后上面Load的时候就回获得最新的webhook func mergeValidatingWebhookConfigurations(configurations []*v1.ValidatingWebhookConfiguration) []webhook.WebhookAccessor { sort.SliceStable(configurations, ValidatingWebhookConfigurationSorter(configurations).ByName) accessors := []webhook.WebhookAccessor{} for _, c := range configurations { // webhook names are not validated for uniqueness, so we check for duplicates and // add a int suffix to distinguish between them names := map[string]int{} for i := range c.Webhooks { n := c.Webhooks[i].Name uid := fmt.Sprintf("%s/%s/%d", c.Name, n, names[n]) names[n]++ accessors = append(accessors, webhook.NewValidatingWebhookAccessor(uid, c.Name, &c.Webhooks[i])) } } return accessors } ``` ### 3. 总结 (1)webhook是通过插入在 apiserver的处理链条中,存入etcd之前生效的 (2)mutatingwebhook,ValidatingAdmission都是有对应的manager来实时更新的 (3)ValidatingAdmission,mutatingwebhook的不同在于 * 所有的请求先经过mutatingwebhook,在经过ValidatingAdmission。这也很好理解,因为ValidatingAdmission不会修改对象 * ValidatingAdmission是并行处理的,都满足后放行(可以设置超时跳过改webhook的策略) * mutatingwebhook是一个一个串行操作的 ### 4.参考链接: https://blog.csdn.net/u014152978/article/details/107170600 ================================================ FILE: k8s/kube-apiserver/15-k8s之etcd存储实现.md ================================================ Table of Contents ================= * [1. etcd 配置](#1-etcd-配置) * [2. Apiserver定义etcd的config](#2-apiserver定义etcd的config) * [2.1 DefaultAPIResourceConfigSource](#21-defaultapiresourceconfigsource) * [2.2 初始化 storageFactory](#22-初始化-storagefactory) * [3. 以pod为例, apiserver是如何add/del/update etcd资源的](#3-以pod为例-apiserver是如何adddelupdate-etcd资源的) * [3.1 NewStorage](#31-newstorage) * [3.2 pod.Strategy](#32-podstrategy) * [3.3 CompleteWithOptions](#33-completewithoptions) * [3.4 总结](#34-总结) * [4. 
总结](#4-总结) * [5.参考链接:](#5参考链接) 本节介绍apiserver是如何使用etcd进行存储的。在apiserver的启动流程下中分析到了,不同资源的url注册最终依赖于一个Storage的东西。接下来就分析Storage到底是什么。 ### 1. etcd 配置 这里就是指定一些etcd的参数,比如EnableWatchCache,EtcdPathPrefix,数据格式等等。 ``` // NewServerRunOptions creates a new ServerRunOptions object with default parameters func NewServerRunOptions() *ServerRunOptions { s := ServerRunOptions{ ... // 资源信息存储路径前缀缺省为:DefaultEtcdPathPrefix = "registry"。但是这个参数我们可以在运行时指定参数覆盖,具体的参数配置为:etcd-prefix Etcd: genericoptions.NewEtcdOptions(storagebackend.NewDefaultConfig(kubeoptions.DefaultEtcdPathPrefix, nil)), ... } // 指定etcd的数据格式为protobuf s.Etcd.DefaultStorageMediaType = "application/vnd.kubernetes.protobuf" } func NewEtcdOptions(backendConfig *storagebackend.Config) *EtcdOptions { options := &EtcdOptions{ StorageConfig: *backendConfig, DefaultStorageMediaType: "application/json", DeleteCollectionWorkers: 1, EnableGarbageCollection: true, EnableWatchCache: true, DefaultWatchCacheSize: 100, } options.StorageConfig.CountMetricPollPeriod = time.Minute return options } func NewDefaultConfig(prefix string, codec runtime.Codec) *Config { return &Config{ Paging: true, Prefix: prefix, Codec: codec, CompactionInterval: DefaultCompactInterval, } } ``` ### 2. Apiserver定义etcd的config cmd/kube-apiserver/app/server.go buildGenericConfig函数回生存很多config,其中就有etcd的config。 buildGenericConfig关于etcd做的事情如下: - 1、调用 `master.DefaultAPIResourceConfigSource` 加载需要启用的 API Resource - 2、初始化,并补全StorageFactory的配置。s.Etcd就是上面定义的etcd配置 ``` ... // 1.加载默认支持的资源 genericConfig.MergedResourceConfig = master.DefaultAPIResourceConfigSource() ... storageFactoryConfig := kubeapiserver.NewStorageFactoryConfig() storageFactoryConfig.APIResourceConfig = genericConfig.MergedResourceConfig completedStorageFactoryConfig, err := storageFactoryConfig.Complete(s.Etcd) if err != nil { lastErr = err return } storageFactory, lastErr = completedStorageFactoryConfig.New() if lastErr != nil { return } if genericConfig.EgressSelector != nil { storageFactory.StorageConfig.Transport.EgressLookup = genericConfig.EgressSelector.Lookup } if lastErr = s.Etcd.ApplyWithStorageFactoryTo(storageFactory, genericConfig); lastErr != nil { return } ```
#### 2.1 DefaultAPIResourceConfigSource 可以看出来DefaultAPIResourceConfigSource函数就是返回当前集群支持哪些默认的版本和资源。 ``` // DefaultAPIResourceConfigSource returns default configuration for an APIResource. func DefaultAPIResourceConfigSource() *serverstorage.ResourceConfig { ret := serverstorage.NewResourceConfig() // NOTE: GroupVersions listed here will be enabled by default. Don't put alpha versions in the list. ret.EnableVersions( admissionregistrationv1.SchemeGroupVersion, admissionregistrationv1beta1.SchemeGroupVersion, apiv1.SchemeGroupVersion, appsv1.SchemeGroupVersion, authenticationv1.SchemeGroupVersion, authenticationv1beta1.SchemeGroupVersion, authorizationapiv1.SchemeGroupVersion, authorizationapiv1beta1.SchemeGroupVersion, autoscalingapiv1.SchemeGroupVersion, autoscalingapiv2beta1.SchemeGroupVersion, autoscalingapiv2beta2.SchemeGroupVersion, batchapiv1.SchemeGroupVersion, batchapiv1beta1.SchemeGroupVersion, certificatesapiv1beta1.SchemeGroupVersion, coordinationapiv1.SchemeGroupVersion, coordinationapiv1beta1.SchemeGroupVersion, discoveryv1beta1.SchemeGroupVersion, eventsv1beta1.SchemeGroupVersion, extensionsapiv1beta1.SchemeGroupVersion, networkingapiv1.SchemeGroupVersion, networkingapiv1beta1.SchemeGroupVersion, nodev1beta1.SchemeGroupVersion, policyapiv1beta1.SchemeGroupVersion, rbacv1.SchemeGroupVersion, rbacv1beta1.SchemeGroupVersion, storageapiv1.SchemeGroupVersion, storageapiv1beta1.SchemeGroupVersion, schedulingapiv1beta1.SchemeGroupVersion, schedulingapiv1.SchemeGroupVersion, ) // enable non-deprecated beta resources in extensions/v1beta1 explicitly so we have a full list of what's possible to serve ret.EnableResources( extensionsapiv1beta1.SchemeGroupVersion.WithResource("ingresses"), ) // disable deprecated beta resources in extensions/v1beta1 explicitly so we have a full list of what's possible to serve ret.DisableResources( extensionsapiv1beta1.SchemeGroupVersion.WithResource("daemonsets"), extensionsapiv1beta1.SchemeGroupVersion.WithResource("deployments"), extensionsapiv1beta1.SchemeGroupVersion.WithResource("networkpolicies"), extensionsapiv1beta1.SchemeGroupVersion.WithResource("podsecuritypolicies"), extensionsapiv1beta1.SchemeGroupVersion.WithResource("replicasets"), extensionsapiv1beta1.SchemeGroupVersion.WithResource("replicationcontrollers"), ) // disable deprecated beta versions explicitly so we have a full list of what's possible to serve ret.DisableVersions( appsv1beta1.SchemeGroupVersion, appsv1beta2.SchemeGroupVersion, ) // disable alpha versions explicitly so we have a full list of what's possible to serve ret.DisableVersions( auditregistrationv1alpha1.SchemeGroupVersion, batchapiv2alpha1.SchemeGroupVersion, nodev1alpha1.SchemeGroupVersion, rbacv1alpha1.SchemeGroupVersion, schedulingv1alpha1.SchemeGroupVersion, settingsv1alpha1.SchemeGroupVersion, storageapiv1alpha1.SchemeGroupVersion, flowcontrolv1alpha1.SchemeGroupVersion, ) return ret } ``` #### 2.2 初始化 storageFactory 这里分为了三步。 第一步:NewStorageFactoryConfig。 第二步:storageFactory, lastErr = completedStorageFactoryConfig.New() 第三步:s.Etcd.ApplyWithStorageFactoryTo(storageFactory, genericConfig)
**第一步**,NewStorageFactoryConfig就是定义了一些编码解码方式,以及需要覆盖的资源 ``` // NewStorageFactoryConfig returns a new StorageFactoryConfig set up with necessary resource overrides. func NewStorageFactoryConfig() *StorageFactoryConfig { resources := []schema.GroupVersionResource{ batch.Resource("cronjobs").WithVersion("v1beta1"), networking.Resource("ingresses").WithVersion("v1beta1"), // TODO #83513 csinodes override can be removed in 1.18 apisstorage.Resource("csinodes").WithVersion("v1beta1"), apisstorage.Resource("csidrivers").WithVersion("v1beta1"), } return &StorageFactoryConfig{ Serializer: legacyscheme.Codecs, //传统的编码解码 DefaultResourceEncoding: serverstorage.NewDefaultResourceEncodingConfig(legacyscheme.Scheme), ResourceEncodingOverrides: resources, } } ```
**第二步**,New 就是初始化了一个NewDefaultStorageFactory结构体。 * 描述了如何创建到底层存储的连接,包含了各种存储接口storage.Interface实现的认证信息。 例如默认使用etcd3,编码转换等方式。 ``` // Config is configuration for creating a storage backend. type Config struct { // Type defines the type of storage backend. Default ("") is "etcd3". Type string // Prefix is the prefix to all keys passed to storage.Interface methods. Prefix string // Transport holds all connection related info, i.e. equal TransportConfig means equal servers we talk to. Transport TransportConfig // Paging indicates whether the server implementation should allow paging (if it is // supported). This is generally configured by feature gating, or by a specific // resource type not wishing to allow paging, and is not intended for end users to // set. Paging bool Codec runtime.Codec // EncodeVersioner is the same groupVersioner used to build the // storage encoder. Given a list of kinds the input object might belong // to, the EncodeVersioner outputs the gvk the object will be // converted to before persisted in etcd. EncodeVersioner runtime.GroupVersioner // Transformer allows the value to be transformed prior to persisting into etcd. Transformer value.Transformer // CompactionInterval is an interval of requesting compaction from apiserver. // If the value is 0, no compaction will be issued. CompactionInterval time.Duration // CountMetricPollPeriod specifies how often should count metric be updated CountMetricPollPeriod time.Duration } ``` 以及其他的参数如下: ``` // New returns a new storage factory created from the completed storage factory configuration. func (c *completedStorageFactoryConfig) New() (*serverstorage.DefaultStorageFactory, error) { resourceEncodingConfig := resourceconfig.MergeResourceEncodingConfigs(c.DefaultResourceEncoding, c.ResourceEncodingOverrides) storageFactory := serverstorage.NewDefaultStorageFactory( c.StorageConfig, //描述了如何创建到底层存储的连接,包含了各种存储接口storage.Interface实现的认证信息。 c.DefaultStorageMediaType, //数据格式,缺省存储媒介类型,application/json c.Serializer, //缺省序列化实例,legacyscheme.Codecs resourceEncodingConfig, // 资源编码配置 c.APIResourceConfig, //API启用的资源版本 SpecialDefaultResourcePrefixes) //前缀 // 同居资源绑定,约定了同居资源的查找顺序 storageFactory.AddCohabitatingResources(networking.Resource("networkpolicies"), extensions.Resource("networkpolicies")) storageFactory.AddCohabitatingResources(apps.Resource("deployments"), extensions.Resource("deployments")) storageFactory.AddCohabitatingResources(apps.Resource("daemonsets"), extensions.Resource("daemonsets")) storageFactory.AddCohabitatingResources(apps.Resource("replicasets"), extensions.Resource("replicasets")) storageFactory.AddCohabitatingResources(api.Resource("events"), events.Resource("events")) storageFactory.AddCohabitatingResources(api.Resource("replicationcontrollers"), extensions.Resource("replicationcontrollers")) // to make scale subresources equivalent storageFactory.AddCohabitatingResources(policy.Resource("podsecuritypolicies"), extensions.Resource("podsecuritypolicies")) storageFactory.AddCohabitatingResources(networking.Resource("ingresses"), extensions.Resource("ingresses")) for _, override := range c.EtcdServersOverrides { tokens := strings.Split(override, "#") apiresource := strings.Split(tokens[0], "/") group := apiresource[0] resource := apiresource[1] groupResource := schema.GroupResource{Group: group, Resource: resource} servers := strings.Split(tokens[1], ";") storageFactory.SetEtcdLocation(groupResource, servers) } if len(c.EncryptionProviderConfigFilepath) != 0 { transformerOverrides, err := 
encryptionconfig.GetTransformerOverrides(c.EncryptionProviderConfigFilepath) if err != nil { return nil, err } for groupResource, transformer := range transformerOverrides { storageFactory.SetTransformer(groupResource, transformer) } } return storageFactory, nil } func NewDefaultStorageFactory( config storagebackend.Config, defaultMediaType string, // 从EtcdOptions参数中传入的,缺省为 application/json,见NewEtcdOptions方法 defaultSerializer runtime.StorageSerializer, // 具体的值:legacyscheme.Codecs resourceEncodingConfig ResourceEncodingConfig, // 资源编码配置情况,并不是所有的资源都按照指定的Group来存放,有些特例。另外也可以指定存储在不同etcd、不同的prefix、甚至于不同的编码存储。 resourceConfig APIResourceConfigSource, // 启用的资源版本的API情况 specialDefaultResourcePrefixes map[schema.GroupResource]string, // 见:SpecialDefaultResourcePrefixes ) *DefaultStorageFactory { config.Paging = utilfeature.DefaultFeatureGate.Enabled(features.APIListChunking) if len(defaultMediaType) == 0 { defaultMediaType = runtime.ContentTypeJSON } return &DefaultStorageFactory{ StorageConfig: config, // 描述了如何创建到底层存储的连接,包含了各种存储接口storage.Interface实现的认证信息。 Overrides: map[schema.GroupResource]groupResourceOverrides{}, // 特殊资源处理 DefaultMediaType: defaultMediaType, // 缺省存储媒介类型,application/json DefaultSerializer: defaultSerializer, // 缺省序列化实例,legacyscheme.Codecs ResourceEncodingConfig: resourceEncodingConfig, // 资源编码配置 APIResourceConfigSource: resourceConfig, // API启用的资源版本 DefaultResourcePrefixes: specialDefaultResourcePrefixes, // 特殊资源prefix newStorageCodecFn: NewStorageCodec, // 为提供的存储媒介类型、序列化和请求的存储与内存版本组装一个存储codec } } ```
**第三步:** 初始化 RESTOptionsGetter,后期根据其获取操作 Etcd 的句柄,同时添加 etcd 的健康检查方法

```
func (s *EtcdOptions) ApplyWithStorageFactoryTo(factory serverstorage.StorageFactory, c *server.Config) error {
	if err := s.addEtcdHealthEndpoint(c); err != nil {
		return err
	}
	c.RESTOptionsGetter = &StorageFactoryRestOptionsFactory{Options: *s, StorageFactory: factory}
	return nil
}
```
最终构建好的 DefaultStorageFactory 会被包装进 StorageFactoryRestOptionsFactory,并保存在 genericapiserver.Config 的 RESTOptionsGetter 成员中,代码如上所示。
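为了更直观一点,这里补充一个简化的使用示意(函数名、变量名为假设,省略 import):每个资源的 Store 在初始化时,都会通过这个 RESTOptionsGetter 拿到属于自己的存储配置,3.3 节 CompleteWithOptions 中可以看到真实的调用点。

```
// 仅为示意:restOptionsGetter 即上面存入 genericapiserver.Config 的 RESTOptionsGetter
func exampleGetPodRESTOptions(restOptionsGetter generic.RESTOptionsGetter) error {
	opts, err := restOptionsGetter.GetRESTOptions(schema.GroupResource{Group: "", Resource: "pods"})
	if err != nil {
		return err
	}
	// opts.StorageConfig   pods 对应的 etcd 配置(地址、prefix、codec 等)
	// opts.ResourcePrefix  例如 "pods",最终的 key 形如 /registry/pods/<namespace>/<name>
	// opts.Decorator       开启 watch cache 时为 StorageWithCacher,否则为 UndecoratedStorage
	_ = opts
	return nil
}
```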
### 3. 以pod为例, apiserver是如何add/del/update etcd资源的 在创建KubeAPIServer的过程中,会调用InstallLegacyAPI注册api资源。其中就有一个NewLegacyRESTStorage的函数 ``` func (c LegacyRESTStorageProvider) NewLegacyRESTStorage(restOptionsGetter generic.RESTOptionsGetter) (LegacyRESTStorage, genericapiserver.APIGroupInfo, error) { 。。。 podStorage, err := podstore.NewStorage( restOptionsGetter, //这个就是之前的restOptionsGetter,里面有etcd的各种配置 nodeStorage.KubeletConnectionInfo, c.ProxyTransport, podDisruptionClient, ) serviceRest, serviceRestProxy := servicestore.NewREST(serviceRESTStorage, endpointsStorage, podStorage.Pod, serviceClusterIPAllocator, secondaryServiceClusterIPAllocator, serviceNodePortAllocator, c.ProxyTransport) restStorageMap := map[string]rest.Storage{ "pods": podStorage.Pod, "pods/attach": podStorage.Attach, "pods/status": podStorage.Status, "pods/log": podStorage.Log, "pods/exec": podStorage.Exec, "pods/portforward": podStorage.PortForward, "pods/proxy": podStorage.Proxy, "pods/binding": podStorage.Binding, "bindings": podStorage.LegacyBinding, "podTemplates": podTemplateStorage, "replicationControllers": controllerStorage.Controller, "replicationControllers/status": controllerStorage.Status, "services": serviceRest, "services/proxy": serviceRestProxy, "services/status": serviceStatusStorage, "endpoints": endpointsStorage, "nodes": nodeStorage.Node, "nodes/status": nodeStorage.Status, "nodes/proxy": nodeStorage.Proxy, "events": eventStorage, "limitRanges": limitRangeStorage, "resourceQuotas": resourceQuotaStorage, "resourceQuotas/status": resourceQuotaStatusStorage, "namespaces": namespaceStorage, "namespaces/status": namespaceStatusStorage, "namespaces/finalize": namespaceFinalizeStorage, "secrets": secretStorage, "serviceAccounts": serviceAccountStorage, "persistentVolumes": persistentVolumeStorage, "persistentVolumes/status": persistentVolumeStatusStorage, "persistentVolumeClaims": persistentVolumeClaimStorage, "persistentVolumeClaims/status": persistentVolumeClaimStatusStorage, "configMaps": configMapStorage, "componentStatuses": componentstatus.NewStorage(componentStatusStorage{c.StorageFactory}.serversToValidate), } if legacyscheme.Scheme.IsVersionRegistered(schema.GroupVersion{Group: "autoscaling", Version: "v1"}) { restStorageMap["replicationControllers/scale"] = controllerStorage.Scale } if legacyscheme.Scheme.IsVersionRegistered(schema.GroupVersion{Group: "policy", Version: "v1beta1"}) { restStorageMap["pods/eviction"] = podStorage.Eviction } if serviceAccountStorage.Token != nil { restStorageMap["serviceaccounts/token"] = serviceAccountStorage.Token } if utilfeature.DefaultFeatureGate.Enabled(features.EphemeralContainers) { restStorageMap["pods/ephemeralcontainers"] = podStorage.EphemeralContainers } apiGroupInfo.VersionedResourcesStorageMap["v1"] = restStorageMap return restStorage, apiGroupInfo, nil } ```
#### 3.1 NewStorage ``` // NewStorage returns a RESTStorage object that will work against pods. func NewStorage(optsGetter generic.RESTOptionsGetter, k client.ConnectionInfoGetter, proxyTransport http.RoundTripper, podDisruptionBudgetClient policyclient.PodDisruptionBudgetsGetter) (PodStorage, error) { store := &genericregistry.Store{ NewFunc: func() runtime.Object { return &api.Pod{} }, //NewFunc用于构建一个Pod实例 NewListFunc: func() runtime.Object { return &api.PodList{} }, PredicateFunc: pod.MatchPod, DefaultQualifiedResource: api.Resource("pods"), // 关键点1,pod.Strategy CreateStrategy: pod.Strategy, UpdateStrategy: pod.Strategy, DeleteStrategy: pod.Strategy, ReturnDeletedObject: true, TableConvertor: printerstorage.TableConvertor{TableGenerator: printers.NewTableGenerator().With(printersinternal.AddHandlers)}, } options := &generic.StoreOptions{ RESTOptions: optsGetter, AttrFunc: pod.GetAttrs, TriggerFunc: map[string]storage.IndexerFunc{"spec.nodeName": pod.NodeNameTriggerFunc}, } // 关键点2,CompleteWithOptions if err := store.CompleteWithOptions(options); err != nil { return PodStorage{}, err } statusStore := *store statusStore.UpdateStrategy = pod.StatusStrategy ephemeralContainersStore := *store ephemeralContainersStore.UpdateStrategy = pod.EphemeralContainersStrategy bindingREST := &BindingREST{store: store} return PodStorage{ Pod: &REST{store, proxyTransport}, Binding: &BindingREST{store: store}, LegacyBinding: &LegacyBindingREST{bindingREST}, Eviction: newEvictionStorage(store, podDisruptionBudgetClient), Status: &StatusREST{store: &statusStore}, EphemeralContainers: &EphemeralContainersREST{store: &ephemeralContainersStore}, Log: &podrest.LogREST{Store: store, KubeletConn: k}, Proxy: &podrest.ProxyREST{Store: store, ProxyTransport: proxyTransport}, Exec: &podrest.ExecREST{Store: store, KubeletConn: k}, Attach: &podrest.AttachREST{Store: store, KubeletConn: k}, PortForward: &podrest.PortForwardREST{Store: store, KubeletConn: k}, }, nil } ``` 上面代码的关键,就是store对象的创建,store.Storage的类型为:storage.Interface接口。 这里有两个关键点,pod.Strategy和CompleteWithOptions函数。 #### 3.2 pod.Strategy 可以看出来,podStrategy就是每一个storage独特的地方。比如NamespaceScoped函数表示了这个资源是否有namespaces这个概念。 这决定了了url中是否有namespace前缀。 PrepareForCreate,对接受的pod进行了status的初始化。这样通过kubectl create pod的话。obj包含了template信息。PrepareForCreate函数进行了status的初始化。 ``` // podStrategy implements behavior for Pods type podStrategy struct { runtime.ObjectTyper names.NameGenerator } // Strategy is the default logic that applies when creating and updating Pod // objects via the REST API. var Strategy = podStrategy{legacyscheme.Scheme, names.SimpleNameGenerator} // NamespaceScoped is true for pods. func (podStrategy) NamespaceScoped() bool { return true } // PrepareForCreate clears fields that are not allowed to be set by end users on creation. func (podStrategy) PrepareForCreate(ctx context.Context, obj runtime.Object) { pod := obj.(*api.Pod) pod.Status = api.PodStatus{ Phase: api.PodPending, QOSClass: qos.GetPodQOS(pod), } podutil.DropDisabledPodFields(pod, nil) } // PrepareForUpdate clears fields that are not allowed to be set by end users on update. func (podStrategy) PrepareForUpdate(ctx context.Context, obj, old runtime.Object) { newPod := obj.(*api.Pod) oldPod := old.(*api.Pod) newPod.Status = oldPod.Status podutil.DropDisabledPodFields(newPod, oldPod) } // Validate validates a new pod. 
func (podStrategy) Validate(ctx context.Context, obj runtime.Object) field.ErrorList { pod := obj.(*api.Pod) allErrs := validation.ValidatePodCreate(pod) allErrs = append(allErrs, validation.ValidateConditionalPod(pod, nil, field.NewPath(""))...) return allErrs } // Canonicalize normalizes the object after validation. func (podStrategy) Canonicalize(obj runtime.Object) { } 。。。 ```
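可以看到,每种资源的"个性"都集中在自己的 Strategy 里。下面给出一个虚构资源的最小 Strategy 示意(fooStrategy 是假设的名字,并非 k8s 源码),任何想复用 genericregistry.Store 的资源,大体都要提供这几个方法:

```
import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/util/validation/field"
	"k8s.io/apiserver/pkg/storage/names"
	"k8s.io/kubernetes/pkg/api/legacyscheme"
)

// fooStrategy 是一个虚构的例子,演示自定义资源接入 Store 时需要提供的最小行为
type fooStrategy struct {
	runtime.ObjectTyper
	names.NameGenerator
}

// 和 pod.Strategy 一样,通常直接复用 Scheme 和 SimpleNameGenerator
var FooStrategy = fooStrategy{legacyscheme.Scheme, names.SimpleNameGenerator}

// NamespaceScoped 决定该资源的 URL 中是否带 namespaces/<ns> 前缀
func (fooStrategy) NamespaceScoped() bool { return true }

// PrepareForCreate 在创建前清理/初始化用户不允许设置的字段,例如清空 status
func (fooStrategy) PrepareForCreate(ctx context.Context, obj runtime.Object) {}

// Validate 在落库前做校验
func (fooStrategy) Validate(ctx context.Context, obj runtime.Object) field.ErrorList { return nil }

// Canonicalize 在校验通过后、写入存储前做规范化
func (fooStrategy) Canonicalize(obj runtime.Object) {}
```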
#### 3.3 CompleteWithOptions `store.CompleteWithOptions` 主要功能是为 store 中的配置设置一些默认的值以及根据提供的 options 更新 store,其中最主要的就是初始化 store 的后端存储实例。CompleteWithOptions函数如下: ``` // CompleteWithOptions updates the store with the provided options and // defaults common fields. func (e *Store) CompleteWithOptions(options *generic.StoreOptions) error { 。。。省略了一些检查代码。。。 attrFunc := options.AttrFunc if attrFunc == nil { if isNamespaced { attrFunc = storage.DefaultNamespaceScopedAttr } else { attrFunc = storage.DefaultClusterScopedAttr } } if e.PredicateFunc == nil { e.PredicateFunc = func(label labels.Selector, field fields.Selector) storage.SelectionPredicate { return storage.SelectionPredicate{ Label: label, Field: field, GetAttrs: attrFunc, } } } // GetRESTOptions对etcd进行了初始化。 opts, err := options.RESTOptions.GetRESTOptions(e.DefaultQualifiedResource) if err != nil { return err } // ResourcePrefix must come from the underlying factory prefix := opts.ResourcePrefix if !strings.HasPrefix(prefix, "/") { prefix = "/" + prefix } if prefix == "/" { return fmt.Errorf("store for %s has an invalid prefix %q", e.DefaultQualifiedResource.String(), opts.ResourcePrefix) } // Set the default behavior for storage key generation if e.KeyRootFunc == nil && e.KeyFunc == nil { if isNamespaced { e.KeyRootFunc = func(ctx context.Context) string { return NamespaceKeyRootFunc(ctx, prefix) } e.KeyFunc = func(ctx context.Context, name string) (string, error) { return NamespaceKeyFunc(ctx, prefix, name) } } else { e.KeyRootFunc = func(ctx context.Context) string { return prefix } e.KeyFunc = func(ctx context.Context, name string) (string, error) { return NoNamespaceKeyFunc(ctx, prefix, name) } } } // We adapt the store's keyFunc so that we can use it with the StorageDecorator // without making any assumptions about where objects are stored in etcd keyFunc := func(obj runtime.Object) (string, error) { accessor, err := meta.Accessor(obj) if err != nil { return "", err } if isNamespaced { return e.KeyFunc(genericapirequest.WithNamespace(genericapirequest.NewContext(), accessor.GetNamespace()), accessor.GetName()) } return e.KeyFunc(genericapirequest.NewContext(), accessor.GetName()) } if e.DeleteCollectionWorkers == 0 { e.DeleteCollectionWorkers = opts.DeleteCollectionWorkers } e.EnableGarbageCollection = opts.EnableGarbageCollection if e.ObjectNameFunc == nil { e.ObjectNameFunc = func(obj runtime.Object) (string, error) { accessor, err := meta.Accessor(obj) if err != nil { return "", err } return accessor.GetName(), nil } } if e.Storage.Storage == nil { e.Storage.Codec = opts.StorageConfig.Codec var err error e.Storage.Storage, e.DestroyFunc, err = opts.Decorator( opts.StorageConfig, prefix, keyFunc, e.NewFunc, e.NewListFunc, attrFunc, options.TriggerFunc, ) if err != nil { return err } e.StorageVersioner = opts.StorageConfig.EncodeVersioner if opts.CountMetricPollPeriod > 0 { stopFunc := e.startObservingCount(opts.CountMetricPollPeriod) previousDestroy := e.DestroyFunc e.DestroyFunc = func() { stopFunc() if previousDestroy != nil { previousDestroy() } } } } return nil } ``` 在`CompleteWithOptions`方法内,调用了`options.RESTOptions.GetRESTOptions` 方法,其最终返回`generic.RESTOptions` 对象,`generic.RESTOptions` 对象中包含对 etcd 初始化的一些配置、数据序列化方法以及对 etcd 操作的 storage.Interface 对象。其会依次调用`StorageWithCacher-->NewRawStorage-->Create`方法创建最终依赖的后端存储。 ``` func (f *StorageFactoryRestOptionsFactory) GetRESTOptions(resource schema.GroupResource) (generic.RESTOptions, error) { storageConfig, err := f.StorageFactory.NewConfig(resource) if err != nil { return 
generic.RESTOptions{}, fmt.Errorf("unable to find storage destination for %v, due to %v", resource, err.Error()) } ret := generic.RESTOptions{ StorageConfig: storageConfig, Decorator: generic.UndecoratedStorage, DeleteCollectionWorkers: f.Options.DeleteCollectionWorkers, EnableGarbageCollection: f.Options.EnableGarbageCollection, ResourcePrefix: f.StorageFactory.ResourcePrefix(resource), CountMetricPollPeriod: f.Options.StorageConfig.CountMetricPollPeriod, } if f.Options.EnableWatchCache { sizes, err := ParseWatchCacheSizes(f.Options.WatchCacheSizes) if err != nil { return generic.RESTOptions{}, err } cacheSize, ok := sizes[resource] if !ok { cacheSize = f.Options.DefaultWatchCacheSize } // depending on cache size this might return an undecorated storage ret.Decorator = genericregistry.StorageWithCacher(cacheSize) } return ret, nil } // NewRawStorage creates the low level kv storage. This is a work-around for current // two layer of same storage interface. // TODO: Once cacher is enabled on all registries (event registry is special), we will remove this method. func NewRawStorage(config *storagebackend.Config) (storage.Interface, factory.DestroyFunc, error) { return factory.Create(*config) } // Create creates a storage backend based on given config. func Create(c storagebackend.Config) (storage.Interface, DestroyFunc, error) { switch c.Type { case "etcd2": return nil, nil, fmt.Errorf("%v is no longer a supported storage backend", c.Type) case storagebackend.StorageTypeUnset, storagebackend.StorageTypeETCD3: return newETCD3Storage(c) default: return nil, nil, fmt.Errorf("unknown storage type: %s", c.Type) } } ```
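到这一步,Store 里的 e.Storage.Storage 就是一个针对 etcd3 的 storage.Interface 实现了。下面用一个简化的示意(key、变量均为假设,省略变量定义)说明最终落到 etcd 上的调用长什么样,后面分析创建/删除流程时会看到真实的调用点:

```
// 仅为示意:etcdStorage 即通过 factory.Create 得到的 storage.Interface(etcd3 实现)
// key 由 KeyFunc 生成,namespace 级资源形如 <ResourcePrefix>/<namespace>/<name>
key := "/registry/pods/default/nginx"
out := &api.Pod{}
// Create 会把 obj 按配置的 codec 编码(protobuf/json)后写入 etcd,并把写入结果解码到 out
if err := etcdStorage.Create(ctx, key, pod, out, 0 /* ttl,0 表示不过期 */); err != nil {
	return err
}
```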
#### 3.4 总结

Pod 对象需要定义好 NewStorage,在 NewStorage 中指定了 Strategy,其中包含了创建、更新前要做的各种操作。这样的好处是:Storage 这一层封装掉了每个资源的差异,下层的 etcd 存储只需要负责通用的增删改查。同时 NewStorage 中还调用了 CompleteWithOptions 函数,完成与 etcd 后端存储的打通。
### 4. 总结

k8s 中完整的 etcd 存储框架如下图所示。这里先分析到 storage.Interface,了解一个大体流程,后面有需要再深入。

![etcd struct](../images/etcd struct.png)

### 5.参考链接:

https://www.jianshu.com/p/daa4ff387a78

书籍:《Kubernetes源码剖析》,郑东旭


================================================
FILE: k8s/kube-apiserver/16. 创建更新删除资源时apiserver做了什么工作.md
================================================

* [1\. 简介](#1-简介)
* [2\. 流程介绍](#2-流程介绍)
* [3\. pod创建](#3-pod创建)
  * [3\.1 pod create 前端逻辑](#31-pod-create-前端逻辑)
  * [3\.2 pod创建\-后端逻辑](#32-pod创建-后端逻辑)
    * [3\.2\.1 BeforeCreate函数](#321-beforecreate函数)
    * [3\.2\.2 Create函数](#322-create函数)
  * [3\.3 总结](#33-总结)
* [4\. Pod 删除](#4-pod-删除)
  * [4\.1 Delete](#41-delete)
  * [4\.2 BeforeDelete](#42-beforedelete)
  * [4\.3 updateForGracefulDeletionAndFinalizers](#43-updateforgracefuldeletionandfinalizers)
  * [4\.4 总结](#44-总结)
* [5\.参考](#5参考)

### 1. 简介

到目前为止,只剩下请求经过 kube-apiserver 的 webhook 之后、存入 etcd 之前的这段处理还没有分析,这里以 pod 为例介绍一下。

同时本文也可以用来快速定位:创建、删除、更新、get 某个资源时,apiserver 做了哪些操作。
再次回顾之前的apiserver初始化逻辑。可以看看之前的文章,回顾一下。 在之前的分析中: InstallLegacyAPI函数的执行过程分为两步: **第一步:**通过legacyRESTStorageProvider.NewLegacyRESTStorage函数实例化APIGroupInfo,APIGroupInfo对象用于描述资源组信 息,该对象的VersionedResourcesStorageMap字段用于存储资源与资源存储对象的映射关系,其表现形式为map[string]map[string]rest.Storage (即<资源版本>/<资源>/<资源存储对象>), 例如Pod资源与资源存储对象的映射关系是v1/pods/PodStorage。使Core Groups/v1下的资源与资源存储对象相互映射,代码路径:pkg/registry/core/rest/storage_core.go ``` // storage就是将ulr和处理函数进行了绑定 restStorageMap := map[string]rest.Storage{ "pods": podStorage.Pod, "pods/attach": podStorage.Attach, "pods/status": podStorage.Status, "pods/log": podStorage.Log, "pods/exec": podStorage.Exec, "pods/portforward": podStorage.PortForward, "pods/proxy": podStorage.Proxy, "pods/binding": podStorage.Binding, "bindings": podStorage.LegacyBinding, "podTemplates": podTemplateStorage, "replicationControllers": controllerStorage.Controller, "replicationControllers/status": controllerStorage.Status, "services": serviceRest, "services/proxy": serviceRestProxy, "services/status": serviceStatusStorage, "endpoints": endpointsStorage, "nodes": nodeStorage.Node, "nodes/status": nodeStorage.Status, "nodes/proxy": nodeStorage.Proxy, "events": eventStorage, "limitRanges": limitRangeStorage, "resourceQuotas": resourceQuotaStorage, "resourceQuotas/status": resourceQuotaStatusStorage, "namespaces": namespaceStorage, "namespaces/status": namespaceStatusStorage, "namespaces/finalize": namespaceFinalizeStorage, "secrets": secretStorage, "serviceAccounts": serviceAccountStorage, "persistentVolumes": persistentVolumeStorage, "persistentVolumes/status": persistentVolumeStatusStorage, "persistentVolumeClaims": persistentVolumeClaimStorage, "persistentVolumeClaims/status": persistentVolumeClaimStatusStorage, "configMaps": configMapStorage, "componentStatuses": componentstatus.NewStorage(componentStatusStorage{c.StorageFactory}.serversToValidate), } if legacyscheme.Scheme.IsVersionRegistered(schema.GroupVersion{Group: "autoscaling", Version: "v1"}) { restStorageMap["replicationControllers/scale"] = controllerStorage.Scale } if legacyscheme.Scheme.IsVersionRegistered(schema.GroupVersion{Group: "policy", Version: "v1beta1"}) { restStorageMap["pods/eviction"] = podStorage.Eviction } if serviceAccountStorage.Token != nil { restStorageMap["serviceaccounts/token"] = serviceAccountStorage.Token } if utilfeature.DefaultFeatureGate.Enabled(features.EphemeralContainers) { restStorageMap["pods/ephemeralcontainers"] = podStorage.EphemeralContainers } ```
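在进入第二步之前,先补充一个直观的对应关系示意(路径是笔者按照 core 组 /api/v1 的 URL 规则整理的,并非源码):restStorageMap 的 key 最终会被 registerResourceHandlers 拼成如下 REST 路径。

```
// 仅为示意:storage map 的 key 与最终注册出来的 URL 的对应关系
// (这里列的是针对单个对象的路径;list 操作则不带 {name})
var pathExamples = map[string]string{
	"pods":        "/api/v1/namespaces/{namespace}/pods/{name}",
	"pods/status": "/api/v1/namespaces/{namespace}/pods/{name}/status",
	"pods/exec":   "/api/v1/namespaces/{namespace}/pods/{name}/exec",
	"nodes":       "/api/v1/nodes/{name}", // nodes 是 cluster 级资源,URL 中没有 namespace
}
```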
**第二步:** 上面的资源与 storage 的映射构建完成之后,则开始进行路由的安装,执行`InstallLegacyAPIGroup`方法,主要调用链为`InstallLegacyAPIGroup-->installAPIResources-->InstallREST-->Install-->registerResourceHandlers`,最终核心的路由构造在`registerResourceHandlers`方法内。

```
// Install handlers for API resources.
func (a *APIInstaller) Install() ([]metav1.APIResource, *restful.WebService, []error) {
	var apiResources []metav1.APIResource
	var errors []error
	ws := a.newWebService()

	// Register the paths in a deterministic (sorted) order to get a deterministic swagger spec.
	paths := make([]string, len(a.group.Storage))
	var i int = 0
	for path := range a.group.Storage {
		paths[i] = path
		i++
	}
	sort.Strings(paths)
	for _, path := range paths {
		apiResource, err := a.registerResourceHandlers(path, a.group.Storage[path], ws)
		if err != nil {
			errors = append(errors, fmt.Errorf("error in registering resource: %s, %v", path, err))
		}
		if apiResource != nil {
			apiResources = append(apiResources, *apiResource)
		}
	}
	return apiResources, ws, errors
}
```

Install 方法先创建了一个 WebService,然后将所有的 api 路径都存入一个数组 paths,并对该数组排序(sort);接着用 for range 遍历数组的所有元素,调用 registerResourceHandlers 方法对每个 api 路径进行注册,也就是和对应的 storage 以及 WebService 绑定。

这里的 storage 指的是后端 etcd 的存储。storage 变量是个 map,Key 是 REST API 的 path,Value 是 rest.Storage 接口,该接口就是一个通用的、符合 RESTful 要求的资源存储接口。
注意每个函数都会调用registerResourceHandlers registerResourceHandlers 函数很长。定义在:staging/src/k8s.io/apiserver/pkg/endpoints/installer.go 代码不贴出来了。具体逻辑为: (1) 首先对资源的后端存储storage(etcd的存储)进行验证,判断那些方法是storage所支持的。然后将所有支持的方法存入action数组中。比如判断是否支持,create, list, get, list, watch, patch等等动作。 ``` creater, isCreater := storage.(rest.Creater) namedCreater, isNamedCreater := storage.(rest.NamedCreater) lister, isLister := storage.(rest.Lister) getter, isGetter := storage.(rest.Getter) getterWithOptions, isGetterWithOptions := storage.(rest.GetterWithOptions) gracefulDeleter, isGracefulDeleter := storage.(rest.GracefulDeleter) collectionDeleter, isCollectionDeleter := storage.(rest.CollectionDeleter) updater, isUpdater := storage.(rest.Updater) patcher, isPatcher := storage.(rest.Patcher) watcher, isWatcher := storage.(rest.Watcher) connecter, isConnecter := storage.(rest.Connecter) storageMeta, isMetadata := storage.(rest.StorageMetadata) ``` (2)然后,遍历actions数组,在一个switch语句中,为所有元素定义路由。如贴出的case "GET"这一块,首先创建并包装一个handler对象,然后调用WebService的一系列方法,创建一个route对象,将handler绑定到这个route上。后面还有case "PUT"、case "DELETE"等一系列case,不一一贴出。最后,将route加入routes数组中。 ``` { case "GET": // Get a resource. ... case "LIST": // List all resources of a kind. ... } ``` ### 2. 流程介绍 上面是先注册了Storage,然后再实例化路由。这样每个资源的增删改查,就和路径对应上了。 然后根据registerResourceHandlers函数为每个资源的增删改查绑定 后端处理函数。 注意上面的case "GET", case "LIST"等都是通用的rest入口,最终会调用每个对象storage的处理函数。具体某个对象的storage处理逻辑如下: ![image-20220517170229750](../images/apiserver-14.png) 接下里以pod为例来说明 ### 3. pod创建 #### 3.1 pod create 前端逻辑 create 对应的是post方法,可以看到核心函数就是createHandler(staging/src/k8s.io/apiserver/pkg/endpoints/handlers/create.go)。函数逻辑如下: (1)如果是dryRun,并且不支持dryRun就退出 (2)经历decode,admission,validation以及encode的流程 (3)调用 r.Create 完成某个资源对象storage处理,这一步是到后端和etcd交互的处理了。之前1,2都是apiserver自己的逻辑处理。 ``` case "POST": // Create a resource. var handler restful.RouteFunction if isNamedCreater { handler = restfulCreateNamedResource(namedCreater, reqScope, admit) } else { handler = restfulCreateResource(creater, reqScope, admit) } handler = metrics.InstrumentRouteFunc(action.Verb, group, version, resource, subresource, requestScope, metrics.APIServerComponent, handler) article := GetArticleForNoun(kind, " ") doc := "create" + article + kind if isSubresource { doc = "create " + subresource + " of" + article + kind } route := ws.POST(action.Path).To(handler). Doc(doc). Param(ws.QueryParameter("pretty", "If 'true', then the output is pretty printed.")). Operation("create"+namespaced+kind+strings.Title(subresource)+operationSuffix). Produces(append(storageMeta.ProducesMIMETypes(action.Verb), mediaTypes...)...). Returns(http.StatusOK, "OK", producedObject). // TODO: in some cases, the API may return a v1.Status instead of the versioned object // but currently go-restful can't handle multiple different objects being returned. Returns(http.StatusCreated, "Created", producedObject). Returns(http.StatusAccepted, "Accepted", producedObject). Reads(defaultVersionedObject). Writes(producedObject) if err := AddObjectParams(ws, route, versionedCreateOptions); err != nil { return nil, err } addParams(route, action.Params) routes = append(routes, route) func restfulCreateNamedResource(r rest.NamedCreater, scope handlers.RequestScope, admit admission.Interface) restful.RouteFunction { return func(req *restful.Request, res *restful.Response) { handlers.CreateNamedResource(r, &scope, admit)(res.ResponseWriter, req.Request) } } // CreateNamedResource returns a function that will handle a resource creation with name. 
func CreateNamedResource(r rest.NamedCreater, scope *RequestScope, admission admission.Interface) http.HandlerFunc { return createHandler(r, scope, admission, true) } // 核心函数createHandler func createHandler(r rest.NamedCreater, scope *RequestScope, admit admission.Interface, includeName bool) http.HandlerFunc { return func(w http.ResponseWriter, req *http.Request) { // For performance tracking purposes. trace := utiltrace.New("Create", utiltrace.Field{Key: "url", Value: req.URL.Path}, utiltrace.Field{Key: "user-agent", Value: &lazyTruncatedUserAgent{req}}, utiltrace.Field{Key: "client", Value: &lazyClientIP{req}}) defer trace.LogIfLong(500 * time.Millisecond) if isDryRun(req.URL) && !utilfeature.DefaultFeatureGate.Enabled(features.DryRun) { scope.err(errors.NewBadRequest("the dryRun alpha feature is disabled"), w, req) return } // TODO: we either want to remove timeout or document it (if we document, move timeout out of this function and declare it in api_installer) timeout := parseTimeout(req.URL.Query().Get("timeout")) namespace, name, err := scope.Namer.Name(req) if err != nil { if includeName { // name was required, return scope.err(err, w, req) return } // otherwise attempt to look up the namespace namespace, err = scope.Namer.Namespace(req) if err != nil { scope.err(err, w, req) return } } ctx, cancel := context.WithTimeout(req.Context(), timeout) defer cancel() ctx = request.WithNamespace(ctx, namespace) outputMediaType, _, err := negotiation.NegotiateOutputMediaType(req, scope.Serializer, scope) if err != nil { scope.err(err, w, req) return } gv := scope.Kind.GroupVersion() s, err := negotiation.NegotiateInputSerializer(req, false, scope.Serializer) if err != nil { scope.err(err, w, req) return } decoder := scope.Serializer.DecoderToVersion(s.Serializer, scope.HubGroupVersion) body, err := limitedReadBody(req, scope.MaxRequestBodyBytes) if err != nil { scope.err(err, w, req) return } options := &metav1.CreateOptions{} values := req.URL.Query() if err := metainternalversionscheme.ParameterCodec.DecodeParameters(values, scope.MetaGroupVersion, options); err != nil { err = errors.NewBadRequest(err.Error()) scope.err(err, w, req) return } if errs := validation.ValidateCreateOptions(options); len(errs) > 0 { err := errors.NewInvalid(schema.GroupKind{Group: metav1.GroupName, Kind: "CreateOptions"}, "", errs) scope.err(err, w, req) return } options.TypeMeta.SetGroupVersionKind(metav1.SchemeGroupVersion.WithKind("CreateOptions")) defaultGVK := scope.Kind original := r.New() trace.Step("About to convert to expected version") obj, gvk, err := decoder.Decode(body, &defaultGVK, original) if err != nil { err = transformDecodeError(scope.Typer, err, original, gvk, body) scope.err(err, w, req) return } if gvk.GroupVersion() != gv { err = errors.NewBadRequest(fmt.Sprintf("the API version in the data (%s) does not match the expected API version (%v)", gvk.GroupVersion().String(), gv.String())) scope.err(err, w, req) return } trace.Step("Conversion done") ae := request.AuditEventFrom(ctx) admit = admission.WithAudit(admit, ae) audit.LogRequestObject(ae, obj, scope.Resource, scope.Subresource, scope.Serializer) userInfo, _ := request.UserFrom(ctx) // On create, get name from new object if unset if len(name) == 0 { _, name, _ = scope.Namer.ObjectName(obj) } admissionAttributes := admission.NewAttributesRecord(obj, nil, scope.Kind, namespace, name, scope.Resource, scope.Subresource, admission.Create, options, dryrun.IsDryRun(options.DryRun), userInfo) if mutatingAdmission, ok := 
admit.(admission.MutationInterface); ok && mutatingAdmission.Handles(admission.Create) { err = mutatingAdmission.Admit(ctx, admissionAttributes, scope) if err != nil { scope.err(err, w, req) return } } if scope.FieldManager != nil { liveObj, err := scope.Creater.New(scope.Kind) if err != nil { scope.err(fmt.Errorf("failed to create new object (Create for %v): %v", scope.Kind, err), w, req) return } obj, err = scope.FieldManager.Update(liveObj, obj, managerOrUserAgent(options.FieldManager, req.UserAgent())) if err != nil { scope.err(fmt.Errorf("failed to update object (Create for %v) managed fields: %v", scope.Kind, err), w, req) return } } trace.Step("About to store object in database") result, err := finishRequest(timeout, func() (runtime.Object, error) { return r.Create( ctx, name, obj, rest.AdmissionToValidateObjectFunc(admit, admissionAttributes, scope), options, ) }) if err != nil { scope.err(err, w, req) return } trace.Step("Object stored in database") code := http.StatusCreated status, ok := result.(*metav1.Status) if ok && err == nil && status.Code == 0 { status.Code = int32(code) } transformResponseObject(ctx, scope, trace, req, w, code, outputMediaType, result) } } ```
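顺便补充一个触发这条链路的客户端示意(假设已经构造好 clientset,省略 import;Create 的函数签名随 client-go 版本略有不同,这里以带 context 的较新签名为例):无论 kubectl 还是 client-go,创建 pod 最终都是向 /api/v1/namespaces/&lt;ns&gt;/pods 发 POST 请求,进入上面的 createHandler。

```
pod := &corev1.Pod{
	ObjectMeta: metav1.ObjectMeta{Name: "nginx", Namespace: "default"},
	Spec: corev1.PodSpec{
		Containers: []corev1.Container{{Name: "nginx", Image: "nginx:latest"}},
	},
}
// 对应 POST /api/v1/namespaces/default/pods
created, err := clientset.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{})
if err != nil {
	return err
}
fmt.Println("created pod:", created.Name)
```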
#### 3.2 pod创建-后端逻辑 **创建pod特有逻辑**: r.Create,从之前的调用链可以看出来。当资源为pod时,e.CreateStrategy=podStrategy。 Create 这里的逻辑是: (1)调用BeforeCreate做创建之前的工作,详见3.2.1 (2)得到对象的名字以及key,这个也是对象特有的 (3)调用Storage.Create开始创建对象 (4)创建对象后,如果这对象实现了AfterCreate, 再走AfterCreate逻辑,pod没有实现 (5)创建对象后,如果这对象实现了Decorator装饰, 再走AfterCreate逻辑,pod没有实现 ``` // Create inserts a new item according to the unique key from the object. func (e *Store) Create(ctx context.Context, obj runtime.Object, createValidation rest.ValidateObjectFunc, options *metav1.CreateOptions) (runtime.Object, error) { // 1。调用BeforeCreate做创建之前的工作,详见3.2.1 if err := rest.BeforeCreate(e.CreateStrategy, ctx, obj); err != nil { return nil, err } // at this point we have a fully formed object. It is time to call the validators that the apiserver // handling chain wants to enforce. if createValidation != nil { if err := createValidation(ctx, obj.DeepCopyObject()); err != nil { return nil, err } } // 2.得到对象的名字以及key,这个也是对象特有的 name, err := e.ObjectNameFunc(obj) if err != nil { return nil, err } key, err := e.KeyFunc(ctx, name) if err != nil { return nil, err } qualifiedResource := e.qualifiedResourceFromContext(ctx) ttl, err := e.calculateTTL(obj, 0, false) if err != nil { return nil, err } // 3. 调用Storage.Create开始创建对象 out := e.NewFunc() if err := e.Storage.Create(ctx, key, obj, out, ttl, dryrun.IsDryRun(options.DryRun)); err != nil { err = storeerr.InterpretCreateError(err, qualifiedResource, name) err = rest.CheckGeneratedNameError(e.CreateStrategy, err, obj) if !kubeerr.IsAlreadyExists(err) { return nil, err } if errGet := e.Storage.Get(ctx, key, "", out, false); errGet != nil { return nil, err } accessor, errGetAcc := meta.Accessor(out) if errGetAcc != nil { return nil, err } if accessor.GetDeletionTimestamp() != nil { msg := &err.(*kubeerr.StatusError).ErrStatus.Message *msg = fmt.Sprintf("object is being deleted: %s", *msg) } return nil, err } // 4.创建对象后,如果这对象实现了AfterCreate, 再走AfterCreate逻辑,pod没有实现 if e.AfterCreate != nil { if err := e.AfterCreate(out); err != nil { return nil, err } } // 5.创建对象后,如果这对象实现了Decorator装饰, 再走AfterCreate逻辑,pod没有实现 if e.Decorator != nil { if err := e.Decorator(out); err != nil { return nil, err } } return out, nil } ``` ##### 3.2.1 BeforeCreate函数 注意这里strategy是Pod 函数逻辑如下: (1)获取objectMeta, kind, namespaces等信息 (2)设置DeletionTimestamp,DeletionGracePeriodSeconds等所有对象通用的字段 (3)设置pod资源特有的字段,这里是podStrategy (4)做一下验证,以及Canonicalize,这个也是不同对象特有的 ``` // BeforeCreate ensures that common operations for all resources are performed on creation. It only returns // errors that can be converted to api.Status. It invokes PrepareForCreate, then GenerateName, then Validate. // It returns nil if the object should be created. func BeforeCreate(strategy RESTCreateStrategy, ctx context.Context, obj runtime.Object) error { // 1.获取objectMeta, kind, namespaces等信息 objectMeta, kind, kerr := objectMetaAndKind(strategy, obj) if kerr != nil { return kerr } if strategy.NamespaceScoped() { if !ValidNamespace(ctx, objectMeta) { return errors.NewBadRequest("the namespace of the provided object does not match the namespace sent on the request") } } else if len(objectMeta.GetNamespace()) > 0 { objectMeta.SetNamespace(metav1.NamespaceNone) } // 2. 设置DeletionTimestamp,DeletionGracePeriodSeconds等所有对象通用的字段 objectMeta.SetDeletionTimestamp(nil) objectMeta.SetDeletionGracePeriodSeconds(nil) // 3. 
设置pod资源特有的字段,这里是podStrategy strategy.PrepareForCreate(ctx, obj) FillObjectMetaSystemFields(objectMeta) if len(objectMeta.GetGenerateName()) > 0 && len(objectMeta.GetName()) == 0 { objectMeta.SetName(strategy.GenerateName(objectMeta.GetGenerateName())) } // Ensure managedFields is not set unless the feature is enabled if !utilfeature.DefaultFeatureGate.Enabled(features.ServerSideApply) { objectMeta.SetManagedFields(nil) } // ClusterName is ignored and should not be saved if len(objectMeta.GetClusterName()) > 0 { objectMeta.SetClusterName("") } // 4.做一下验证,以及Canonicalize,这个也是不同对象特有的 if errs := strategy.Validate(ctx, obj); len(errs) > 0 { return errors.NewInvalid(kind.GroupKind(), objectMeta.GetName(), errs) } // Custom validation (including name validation) passed // Now run common validation on object meta // Do this *after* custom validation so that specific error messages are shown whenever possible if errs := genericvalidation.ValidateObjectMetaAccessor(objectMeta, strategy.NamespaceScoped(), path.ValidatePathSegmentName, field.NewPath("metadata")); len(errs) > 0 { return errors.NewInvalid(kind.GroupKind(), objectMeta.GetName(), errs) } strategy.Canonicalize(obj) return nil } ``` 以pod为例。podStrategy定义在: pkg/registry/core/pod/strategy.go, pod和其他资源对象一样实现了这样的接口。 ``` type RESTCreateStrategy interface { runtime.ObjectTyper // The name generator is used when the standard GenerateName field is set. // The NameGenerator will be invoked prior to validation. names.NameGenerator // NamespaceScoped returns true if the object must be within a namespace. NamespaceScoped() bool // PrepareForCreate is invoked on create before validation to normalize // the object. For example: remove fields that are not to be persisted, // sort order-insensitive list fields, etc. This should not remove fields // whose presence would be considered a validation error. // // Often implemented as a type check and an initailization or clearing of // status. Clear the status because status changes are internal. External // callers of an api (users) should not be setting an initial status on // newly created objects. PrepareForCreate(ctx context.Context, obj runtime.Object) // Validate returns an ErrorList with validation errors or nil. Validate // is invoked after default fields in the object have been filled in // before the object is persisted. This method should not mutate the // object. Validate(ctx context.Context, obj runtime.Object) field.ErrorList // Canonicalize allows an object to be mutated into a canonical form. This // ensures that code that operates on these objects can rely on the common // form for things like comparison. Canonicalize is invoked after // validation has succeeded but before the object has been persisted. // This method may mutate the object. Often implemented as a type check or // empty method. Canonicalize(obj runtime.Object) } ``` 这里就看看PrepareForCreate。可以看出来这里就是设置了pod.Status=Pending ``` // PrepareForCreate clears fields that are not allowed to be set by end users on creation. 
func (podStrategy) PrepareForCreate(ctx context.Context, obj runtime.Object) { pod := obj.(*api.Pod) pod.Status = api.PodStatus{ Phase: api.PodPending, QOSClass: qos.GetPodQOS(pod), } podutil.DropDisabledPodFields(pod, nil) } ``` ##### 3.2.2 Create函数 可以看出来Create是通用的,不要每个对象都实现,就是调用etcd3接口操作数据库了 staging/src/k8s.io/apiserver/pkg/registry/generic/registry/dryrun.go ``` func (s *DryRunnableStorage) Create(ctx context.Context, key string, obj, out runtime.Object, ttl uint64, dryRun bool) error { if dryRun { if err := s.Storage.Get(ctx, key, "", out, false); err == nil { return storage.NewKeyExistsError(key, 0) } s.copyInto(obj, out) return nil } return s.Storage.Create(ctx, key, obj, out, ttl) } s.Storage.Create 实现在k8s.io/apiserver/pkg/storage/etcd3/store.go ``` #### 3.3 总结 pod创建的逻辑如下: (1)经过apiserver通用的前端流程,就是判断是post接口,就走Create流程 (2)然后走通用逻辑,beforeCreate -> Create -> AfterCreate等等逻辑 比如: Create流程会先执行通用的部分,比如设置deletionStampTion等字段;然后再执行对象特有的,比如当一个pod对象创建时,需要设置pod.Status=Pending。 ### 4. Pod 删除 同样还是回到前端逻辑,这里 ``` case "DELETE": // Delete a resource. article := GetArticleForNoun(kind, " ") doc := "delete" + article + kind if isSubresource { doc = "delete " + subresource + " of" + article + kind } handler := metrics.InstrumentRouteFunc(action.Verb, group, version, resource, subresource, requestScope, metrics.APIServerComponent, restfulDeleteResource(gracefulDeleter, isGracefulDeleter, reqScope, admit)) route := ws.DELETE(action.Path).To(handler). Doc(doc). Param(ws.QueryParameter("pretty", "If 'true', then the output is pretty printed.")). Operation("delete"+namespaced+kind+strings.Title(subresource)+operationSuffix). Produces(append(storageMeta.ProducesMIMETypes(action.Verb), mediaTypes...)...). Writes(versionedStatus). Returns(http.StatusOK, "OK", versionedStatus). Returns(http.StatusAccepted, "Accepted", versionedStatus) if isGracefulDeleter { route.Reads(versionedDeleterObject) route.ParameterNamed("body").Required(false) if err := AddObjectParams(ws, route, versionedDeleteOptions); err != nil { return nil, err } } addParams(route, action.Params) routes = append(routes, route) ```
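在看后端逻辑之前,先补充一个删除请求所携带的 DeleteOptions 示意(变量为假设,podUID 是 types.UID 类型的假设变量):GracePeriodSeconds 决定优雅删除的宽限时间,Preconditions 用于防止删错对象(例如同名对象被删除后又重建),这两个字段在下面的 Delete / BeforeDelete 中都会用到。

```
gracePeriod := int64(30)
opts := &metav1.DeleteOptions{
	// 宽限期:pod 在这段时间内处于 Terminating,由 kubelet 负责清理容器
	GracePeriodSeconds: &gracePeriod,
	Preconditions: &metav1.Preconditions{
		// 只有 UID 匹配时才允许删除,防止误删重建后的同名对象
		UID: &podUID,
	},
}
_ = opts
```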
前端逻辑和之前其实都一样,这里直接分析后端逻辑Delete ``` // DeleteResource returns a function that will handle a resource deletion // TODO admission here becomes solely validating admission func DeleteResource(r rest.GracefulDeleter, allowsOptions bool, scope *RequestScope, admit admission.Interface) http.HandlerFunc { return func(w http.ResponseWriter, req *http.Request) { ... trace.Step("About to delete object from database") wasDeleted := true userInfo, _ := request.UserFrom(ctx) staticAdmissionAttrs := admission.NewAttributesRecord(nil, nil, scope.Kind, namespace, name, scope.Resource, scope.Subresource, admission.Delete, options, dryrun.IsDryRun(options.DryRun), userInfo) result, err := finishRequest(timeout, func() (runtime.Object, error) { obj, deleted, err := r.Delete(ctx, name, rest.AdmissionToValidateObjectDeleteFunc(admit, staticAdmissionAttrs, scope), options) wasDeleted = deleted return obj, err }) if err != nil { scope.err(err, w, req) return } trace.Step("Object deleted from database") transformResponseObject(ctx, scope, trace, req, w, status, outputMediaType, result) } } ``` #### 4.1 Delete 核心逻辑如下: (1)如果delete options指定了UID,ResourceVersion。需要进行对比确定,防止删错。可能会出现反复创建删除的时候删除错 (2)调用BeforeDelete,判断是否要优雅删除,是否正在优雅删除,BeforeDelete的核心逻辑见4.2,注意只有pod在这里会判断为优雅删除 (3)判断是否有finalizers (4)如果需要优雅删除,或者有finalizers,则执行updateForGracefulDeletionAndFinalizers函数。这个函数会返回当前对象是不是可以立马删除 (5)如果不可以立马删除,返回 (6)如果可以立马删除,删除etcd中的数据 ``` // Delete removes the item from storage. func (e *Store) Delete(ctx context.Context, name string, deleteValidation rest.ValidateObjectFunc, options *metav1.DeleteOptions) (runtime.Object, bool, error) { key, err := e.KeyFunc(ctx, name) if err != nil { return nil, false, err } obj := e.NewFunc() qualifiedResource := e.qualifiedResourceFromContext(ctx) if err = e.Storage.Get(ctx, key, "", obj, false); err != nil { return nil, false, storeerr.InterpretDeleteError(err, qualifiedResource, name) } // support older consumers of delete by treating "nil" as delete immediately if options == nil { options = metav1.NewDeleteOptions(0) } // 1. 如果delete options指定了UID,ResourceVersion。需要进行对比确定,防止删错。可能会出现反复创建删除的时候删除错 var preconditions storage.Preconditions if options.Preconditions != nil { preconditions.UID = options.Preconditions.UID preconditions.ResourceVersion = options.Preconditions.ResourceVersion } // 2.开始BeforeDelete graceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, obj, options) if err != nil { return nil, false, err } // this means finalizers cannot be updated via DeleteOptions if a deletion is already pending if pendingGraceful { out, err := e.finalizeDelete(ctx, obj, false) return out, false, err } // check if obj has pending finalizers accessor, err := meta.Accessor(obj) if err != nil { return nil, false, kubeerr.NewInternalError(err) } // 3.判断是否有finalizers pendingFinalizers := len(accessor.GetFinalizers()) != 0 var ignoreNotFound bool var deleteImmediately bool = true var lastExisting, out runtime.Object // Handle combinations of graceful deletion and finalization by issuing // the correct updates. shouldUpdateFinalizers, _ := deletionFinalizersForGarbageCollection(ctx, e, accessor, options) // TODO: remove the check, because we support no-op updates now. // 4. 如果需要优雅删除,或者有finalizers,则执行updateForGracefulDeletionAndFinalizers函数 if graceful || pendingFinalizers || shouldUpdateFinalizers { err, ignoreNotFound, deleteImmediately, out, lastExisting = e.updateForGracefulDeletionAndFinalizers(ctx, name, key, options, preconditions, deleteValidation, obj) } // 5. 
如果不能立马删除,就返回。所以第一次pod删除就会到这里 // !deleteImmediately covers all cases where err != nil. We keep both to be future-proof. if !deleteImmediately || err != nil { return out, false, err } // Going further in this function is not useful when we are // performing a dry-run request. Worse, it will actually // override "out" with the version of the object in database // that doesn't have the finalizer and deletiontimestamp set // (because the update above was dry-run too). If we already // have that version available, let's just return it now, // otherwise, we can call dry-run delete that will get us the // latest version of the object. if dryrun.IsDryRun(options.DryRun) && out != nil { return out, true, nil } // 第二次就到这里了,直接删除数据库数据了 // delete immediately, or no graceful deletion supported klog.V(6).Infof("going to delete %s from registry: ", name) out = e.NewFunc() if err := e.Storage.Delete(ctx, key, out, &preconditions, storage.ValidateObjectFunc(deleteValidation), dryrun.IsDryRun(options.DryRun)); err != nil { // Please refer to the place where we set ignoreNotFound for the reason // why we ignore the NotFound error . if storage.IsNotFound(err) && ignoreNotFound && lastExisting != nil { // The lastExisting object may not be the last state of the object // before its deletion, but it's the best approximation. out, err := e.finalizeDelete(ctx, lastExisting, true) return out, true, err } return nil, false, storeerr.InterpretDeleteError(err, qualifiedResource, name) } out, err = e.finalizeDelete(ctx, out, true) return out, true, err } ``` #### 4.2 BeforeDelete 函数逻辑如下: (1)进行DeleteOptions的校验,并且如果指定了uuid,也进行判断 (2)判断是否支持优雅删除,核心是是否实现了RESTGracefulDeleteStrategy接口。这个接口只有Pod实现,所以对应Pod而言是优雅删除的;如果不支持直接返回 (3)如果deletionTime不为空,表示正在优雅删除了 (4)设置deleteTime,和GracePeriodSecond ``` // BeforeDelete tests whether the object can be gracefully deleted. // If graceful is set, the object should be gracefully deleted. If gracefulPending // is set, the object has already been gracefully deleted (and the provided grace // period is longer than the time to deletion). An error is returned if the // condition cannot be checked or the gracePeriodSeconds is invalid. The options // argument may be updated with default values if graceful is true. Second place // where we set deletionTimestamp is pkg/registry/generic/registry/store.go. // This function is responsible for setting deletionTimestamp during gracefulDeletion, // other one for cascading deletions. func BeforeDelete(strategy RESTDeleteStrategy, ctx context.Context, obj runtime.Object, options *metav1.DeleteOptions) (graceful, gracefulPending bool, err error) { objectMeta, gvk, kerr := objectMetaAndKind(strategy, obj) if kerr != nil { return false, false, kerr } // 1.进行DeleteOptions的校验,并且如果指定了uuid,也进行判断 if errs := validation.ValidateDeleteOptions(options); len(errs) > 0 { return false, false, errors.NewInvalid(schema.GroupKind{Group: metav1.GroupName, Kind: "DeleteOptions"}, "", errs) } // Checking the Preconditions here to fail early. They'll be enforced later on when we actually do the deletion, too. if options.Preconditions != nil { if options.Preconditions.UID != nil && *options.Preconditions.UID != objectMeta.GetUID() { return false, false, errors.NewConflict(schema.GroupResource{Group: gvk.Group, Resource: gvk.Kind}, objectMeta.GetName(), fmt.Errorf("the UID in the precondition (%s) does not match the UID in record (%s). 
The object might have been deleted and then recreated", *options.Preconditions.UID, objectMeta.GetUID())) } if options.Preconditions.ResourceVersion != nil && *options.Preconditions.ResourceVersion != objectMeta.GetResourceVersion() { return false, false, errors.NewConflict(schema.GroupResource{Group: gvk.Group, Resource: gvk.Kind}, objectMeta.GetName(), fmt.Errorf("the ResourceVersion in the precondition (%s) does not match the ResourceVersion in record (%s). The object might have been modified", *options.Preconditions.ResourceVersion, objectMeta.GetResourceVersion())) } } // 2. 判断是否支持优雅删除 gracefulStrategy, ok := strategy.(RESTGracefulDeleteStrategy) if !ok { // If we're not deleting gracefully there's no point in updating Generation, as we won't update // the obcject before deleting it. return false, false, nil } // 3.如果deletionTime不为空,所以正在优雅删除了 // if the object is already being deleted, no need to update generation. if objectMeta.GetDeletionTimestamp() != nil { // if we are already being deleted, we may only shorten the deletion grace period // this means the object was gracefully deleted previously but deletionGracePeriodSeconds was not set, // so we force deletion immediately // IMPORTANT: // The deletion operation happens in two phases. // 1. Update to set DeletionGracePeriodSeconds and DeletionTimestamp // 2. Delete the object from storage. // If the update succeeds, but the delete fails (network error, internal storage error, etc.), // a resource was previously left in a state that was non-recoverable. We // check if the existing stored resource has a grace period as 0 and if so // attempt to delete immediately in order to recover from this scenario. if objectMeta.GetDeletionGracePeriodSeconds() == nil || *objectMeta.GetDeletionGracePeriodSeconds() == 0 { return false, false, nil } // only a shorter grace period may be provided by a user if options.GracePeriodSeconds != nil { period := int64(*options.GracePeriodSeconds) if period >= *objectMeta.GetDeletionGracePeriodSeconds() { return false, true, nil } newDeletionTimestamp := metav1.NewTime( objectMeta.GetDeletionTimestamp().Add(-time.Second * time.Duration(*objectMeta.GetDeletionGracePeriodSeconds())). Add(time.Second * time.Duration(*options.GracePeriodSeconds))) objectMeta.SetDeletionTimestamp(&newDeletionTimestamp) objectMeta.SetDeletionGracePeriodSeconds(&period) return true, false, nil } // graceful deletion is pending, do nothing options.GracePeriodSeconds = objectMeta.GetDeletionGracePeriodSeconds() return false, true, nil } if !gracefulStrategy.CheckGracefulDelete(ctx, obj, options) { return false, false, nil } // 4. 设置deleteTime,和GracePeriodSecond now := metav1.NewTime(metav1.Now().Add(time.Second * time.Duration(*options.GracePeriodSeconds))) objectMeta.SetDeletionTimestamp(&now) objectMeta.SetDeletionGracePeriodSeconds(options.GracePeriodSeconds) // If it's the first graceful deletion we are going to set the DeletionTimestamp to non-nil. // Controllers of the object that's being deleted shouldn't take any nontrivial actions, hence its behavior changes. // Thus we need to bump object's Generation (if set). This handles generation bump during graceful deletion. // The bump for objects that don't support graceful deletion is handled in pkg/registry/generic/registry/store.go. 
if objectMeta.GetGeneration() > 0 { objectMeta.SetGeneration(objectMeta.GetGeneration() + 1) } return true, false, nil } 这是个接口,判断该资源是否可以优雅删除, 只有pod实现了这个接口。 type RESTGracefulDeleteStrategy interface { // CheckGracefulDelete should return true if the object can be gracefully deleted and set // any default values on the DeleteOptions. CheckGracefulDelete(ctx context.Context, obj runtime.Object, options *metav1.DeleteOptions) bool } ``` #### 4.3 updateForGracefulDeletionAndFinalizers 这里的一个核心就是,如果有finalizer的话,就调用markAsDeleting 函数,该函数也是设置deletionTimestamp和DeletionGracePeriodSeconds ``` // updateForGracefulDeletionAndFinalizers updates the given object for // graceful deletion and finalization by setting the deletion timestamp and // grace period seconds (graceful deletion) and updating the list of // finalizers (finalization); it returns: // // 1. an error // 2. a boolean indicating that the object was not found, but it should be // ignored // 3. a boolean indicating that the object's grace period is exhausted and it // should be deleted immediately // 4. a new output object with the state that was updated // 5. a copy of the last existing state of the object func (e *Store) updateForGracefulDeletionAndFinalizers(ctx context.Context, name, key string, options *metav1.DeleteOptions, preconditions storage.Preconditions, deleteValidation rest.ValidateObjectFunc, in runtime.Object) (err error, ignoreNotFound, deleteImmediately bool, out, lastExisting runtime.Object) { lastGraceful := int64(0) var pendingFinalizers bool out = e.NewFunc() err = e.Storage.GuaranteedUpdate( ctx, key, out, false, /* ignoreNotFound */ &preconditions, storage.SimpleUpdate(func(existing runtime.Object) (runtime.Object, error) { if err := deleteValidation(ctx, existing); err != nil { return nil, err } graceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, existing, options) if err != nil { return nil, err } if pendingGraceful { return nil, errAlreadyDeleting } // Add/remove the orphan finalizer as the options dictates. // Note that this occurs after checking pendingGraceufl, so // finalizers cannot be updated via DeleteOptions if deletion has // started. existingAccessor, err := meta.Accessor(existing) if err != nil { return nil, err } needsUpdate, newFinalizers := deletionFinalizersForGarbageCollection(ctx, e, existingAccessor, options) if needsUpdate { existingAccessor.SetFinalizers(newFinalizers) } pendingFinalizers = len(existingAccessor.GetFinalizers()) != 0 if !graceful { // set the DeleteGracePeriods to 0 if the object has pendingFinalizers but not supporting graceful deletion if pendingFinalizers { klog.V(6).Infof("update the DeletionTimestamp to \"now\" and GracePeriodSeconds to 0 for object %s, because it has pending finalizers", name) err = markAsDeleting(existing, time.Now()) if err != nil { return nil, err } return existing, nil } return nil, errDeleteNow } lastGraceful = *options.GracePeriodSeconds lastExisting = existing return existing, nil }), dryrun.IsDryRun(options.DryRun), ) markAsDeleting 函数也是设置deletionTimestamp和DeletionGracePeriodSeconds // markAsDeleting sets the obj's DeletionGracePeriodSeconds to 0, and sets the // DeletionTimestamp to "now" if there is no existing deletionTimestamp or if the existing // deletionTimestamp is further in future. Finalizers are watching for such updates and will // finalize the object if their IDs are present in the object's Finalizers list. 
func markAsDeleting(obj runtime.Object, now time.Time) (err error) { objectMeta, kerr := meta.Accessor(obj) if kerr != nil { return kerr } // This handles Generation bump for resources that don't support graceful // deletion. For resources that support graceful deletion is handle in // pkg/api/rest/delete.go if objectMeta.GetDeletionTimestamp() == nil && objectMeta.GetGeneration() > 0 { objectMeta.SetGeneration(objectMeta.GetGeneration() + 1) } existingDeletionTimestamp := objectMeta.GetDeletionTimestamp() if existingDeletionTimestamp == nil || existingDeletionTimestamp.After(now) { metaNow := metav1.NewTime(now) objectMeta.SetDeletionTimestamp(&metaNow) } var zero int64 = 0 objectMeta.SetDeletionGracePeriodSeconds(&zero) return nil } ```
#### 4.4 总结

(1)k8s 的这套机制,只需要自己写好对象的 strategy(beforeCreate、afterCreate 等)就行,不需要直接和数据库打交道,扩展性很强。

(2)k8s 中对象删除的基本流程如下:

- 客户端提交删除请求到 API Server
  - 可选传递 GracePeriodSeconds 参数
- API Server 做 Graceful Deletion 检查
  - 若对象实现了 RESTGracefulDeleteStrategy 接口,会调用对应的实现并返回是否需要进行 Graceful 删除
- API Server 检查 Finalizers,并结合是否需要进行 Graceful 删除,来决定是否立即删除对象
  - 若对象需要进行 Graceful 删除,则更新 metadata.DeletionGracePeriodSeconds 和 metadata.DeletionTimestamp 字段,不从存储中删除对象
  - 若对象不需要进行 Graceful 删除:
    - metadata.Finalizers 为空,直接删除
    - metadata.Finalizers 不为空,不删除,只更新 metadata.DeletionTimestamp

注:当前 k8s 内置资源中,只有 Pod 对象实现了 [RESTGracefulDeleteStrategy](https://github.com/kubernetes/kubernetes/blob/v1.18.0/staging/src/k8s.io/apiserver/pkg/registry/rest/delete.go#L55-L61) 接口。对于其他对象,都不会进入 Graceful 删除状态。
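下面是一个最小的验证示例(假设集群中已有一个名为 nginx 的 Pod,输出仅为示意):先发起优雅删除,在宽限期内仍然能 get 到该对象,只是 metadata 上多了删除标记:

```bash
# 发起优雅删除,宽限期 60 秒,不等待删除完成
kubectl delete pod nginx --grace-period=60 --wait=false

# 宽限期内再查看,对象仍在,只是被打上了删除标记
kubectl get pod nginx -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'
kubectl get pod nginx -o jsonpath='{.metadata.deletionGracePeriodSeconds}{"\n"}'

# 宽限期结束、kubelet 清理完容器后,会再发一次删除请求,
# 对象才真正从 etcd 中移除
kubectl get pod nginx   # 此时返回 Error from server (NotFound)
```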
所以 k8s 中删除资源其实分 2 步:第一步是设置 metadata.DeletionTimestamp 字段,第二步才是真正的删除。

Pod 走这个逻辑,是因为它实现了 RESTGracefulDeleteStrategy 接口;其他资源(比如 Deployment)也是这个两步逻辑,是因为 k8s 删除时默认按级联策略处理(后台删除,另外还有前台删除、孤儿删除),这类删除实际会给对象带上 finalizer,所以带 finalizer 的对象事实上也实现了"先打标记、再删除"的优雅删除。
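下面用 Deployment 做一个简单示意(假设集群中已有名为 nginx 的 Deployment;--cascade=foreground 需要较新版本的 kubectl,旧版本可以通过 DeleteOptions 的 propagationPolicy 指定):前台级联删除时,对象会先带上 foregroundDeletion finalizer 并设置 deletionTimestamp,等从属对象清理完才真正消失:

```bash
# 前台级联删除:等所有从属对象(ReplicaSet/Pod)删完后才删除 Deployment 本身
kubectl delete deployment nginx --cascade=foreground --wait=false

# 删除过程中查看,metadata.finalizers 中会出现 foregroundDeletion,
# 同时 deletionTimestamp 已被设置
kubectl get deployment nginx -o jsonpath='{.metadata.finalizers}{"\n"}'
kubectl get deployment nginx -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'
```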
当遇到对象删不掉的时候,可以:

- 删除 finalizers,让关联的清理逻辑不再阻塞删除
- kubectl delete --force --grace-period=0 直接强制删除

到这里删除流程就分析完了,pod 的 get、list、patch 等操作基本也是同样的思路。

### 5.参考

https://duyanghao.github.io/kubernetes-apiserver-overview/

https://blog.csdn.net/hahachenchen789/article/details/113880166

https://www.kubesre.com/archives/chuang-jian-yi-ge-pod-bei-hou-etcd-de-gu-shi

https://zhuanlan.zhihu.com/p/161072336

================================================
FILE: k8s/kube-apiserver/17-k8s之serviceaccount.md
================================================

Table of Contents
=================

* [1. 什么是serviceaccount](#1-什么是serviceaccount)
* [2、Service account与User account区别](#2service-account与user-account区别)
* [3、默认Service Account](#3默认service-account)
  * [3.1 默认sa的权限测试](#31-默认sa的权限测试)
  * [3.2 自定义sa的权限测试](#32-自定义sa的权限测试)
* [4. 如何通过client-go使用sa](#4-如何通过client-go使用sa)

### 1. 什么是serviceaccount

k8s 中提供了良好的多租户认证管理机制,如 RBAC、ServiceAccount 还有各种 Policy 等。

当用户访问集群(例如使用 kubectl 命令)时,apiserver 会将用户认证为一个特定的 User Account(目前通常是 admin,除非系统管理员自定义了集群配置)。Pod 容器中的进程也可以与 apiserver 联系,当它们联系 apiserver 的时候,会被认证为一个特定的 Service Account(例如 default)。
**使用场景**

Service Account 并不是给 kubernetes 集群的用户使用的,而是给 pod 里面的进程使用的,它为 pod 访问 apiserver 提供必要的身份认证。
Service Account 包含 3 个主要内容,分别介绍如下:

* Namespace: 指定了 Pod 所在的命名空间
* CA: kube-apiserver 组件的 CA 公钥证书,是 Pod 中的进程用来校验 kube-apiserver 身份的证书
* Token: 用作身份验证,由 kube-apiserver 私钥签发、经过 Base64 编码的 Bearer Token

### 2、Service account与User account区别

1. User account 是为人设计的,而 service account 则是为 Pod 中的进程调用 Kubernetes API 或其他外部服务而设计的
2. User account 是跨 namespace 的,而 service account 则仅局限于它所在的 namespace
3. 每个 namespace 都会自动创建一个 default service account
4. Token controller 检测 service account 的创建,并为它们创建 secret
5. 开启 ServiceAccount Admission Controller 后:
   - 每个 Pod 在创建后都会自动设置 spec.serviceAccount 为 default(除非指定了其他 ServiceAccount)
   - 验证 Pod 引用的 service account 已经存在,否则拒绝创建
   - 如果 Pod 没有指定 ImagePullSecrets,则把 service account 的 ImagePullSecrets 加到 Pod 中
   - 每个 container 启动后都会挂载该 service account 的 token 和 ca.crt 到 /var/run/secrets/kubernetes.io/serviceaccount/

```bash
# kubectl exec nginx-3137573019-md1u2 ls /run/secrets/kubernetes.io/serviceaccount
ca.crt
namespace
token
```

**查看系统的config配置**

这里用到的 token 就是被授权过的 ServiceAccount 账户的 token,集群利用 token 来使用 ServiceAccount 账户

```text
[root@master yaml]# cat /root/.kube/config
```

### 3、默认Service Account

默认情况下,pod 使用自动挂载的 service account 凭证来访问 API,如 Accessing the Cluster([https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#accessing-the-api-from-a-pod](https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#accessing-the-api-from-a-pod)) 中所描述。

当创建 pod 的时候,如果没有指定 service account,系统会自动在与该 pod 相同的 namespace 下为其指派一个 default service account,并使用默认的 Service Account 访问 API server。

例如:获取刚创建的 pod 的原始 json 或 yaml 信息,将看到 spec.serviceAccountName 字段已经被设置为 default。

```
root@k8s-master:~# kubectl get sa
NAME      SECRETS   AGE
default   1         2d4h
root@k8s-master:~# kubectl get sa default -oyaml
apiVersion: v1
kind: ServiceAccount
metadata:
  creationTimestamp: "2021-10-23T09:04:02Z"
  name: default
  namespace: default
  resourceVersion: "231"
  selfLink: /api/v1/namespaces/default/serviceaccounts/default
  uid: 5953ce17-9e38-4768-9d61-e7066f838b0d
secrets:
- name: default-token-f8snr
root@k8s-master:~#
root@k8s-master:~#
root@k8s-master:~#
root@k8s-master:~# kubectl get secret default-token-f8snr -oyaml
apiVersion: v1
data:
  ca.crt:
LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUR2akNDQXFhZ0F3SUJBZ0lVZVNKWlB2SmZGangyOVBrU2NHdmw1eEFOQ2lZd0RRWUpLb1pJaHZjTkFRRUwKQlFBd1pURUxNQWtHQTFVRUJoTUNRMDR4RURBT0JnTlZCQWdUQjBKbGFXcHBibWN4RURBT0JnTlZCQWNUQjBKbAphV3BwYm1jeEREQUtCZ05WQkFvVEEyczRjekVQTUEwR0ExVUVDeE1HVTNsemRHVnRNUk13RVFZRFZRUURFd3ByCmRXSmxjbTVsZEdWek1CNFhEVEl4TVRBeU16QTRNakl3TUZvWERUSTJNVEF5TWpBNE1qSXdNRm93WlRFTE1Ba0cKQTFVRUJoTUNRMDR4RURBT0JnTlZCQWdUQjBKbGFXcHBibWN4RURBT0JnTlZCQWNUQjBKbGFXcHBibWN4RERBSwpCZ05WQkFvVEEyczRjekVQTUEwR0ExVUVDeE1HVTNsemRHVnRNUk13RVFZRFZRUURFd3ByZFdKbGNtNWxkR1Z6Ck1JSUJJakFOQmdrcWhraUc5dzBCQVFFRkFBT0NBUThBTUlJQkNnS0NBUUVBdDVPTVlLUG4xS3ZOY3FoaGxqdVQKei9pUDFiTGdWOUhFNGhZVmV0VDkralNTVTQzd20wWExqWlliT0oxZktDWkV5NU14ZUlXb1c2bFVhMDRLc2VZNAovSFdGM255VGVQVmx2citBbm9kNFZ2TWZxRXpBcmplcS85aElOcGxZdFFOMDBSanNpdHA3bDRRT1licEhTWUFNCnhXSmFPZG5lK2FNbmQrUkFaM1d0bGV1aXd5REZzVXI0NUhqeGJoeGR1YUNURUQwanNPYy9zbEQwRTFGZTRHOWoKOXpjK0xMb2ZTWHQ1N1B3Z1g5MVlwbnJUNmtTRUs0SGpMcjczMzRYTmRYbjBkektBc1A0RURzNkdibDEyZ1JiUQpuV3g2cHpSUmpkUXlua1Z0dkMzTXMrUVIrcUswb3RMMDVMTStPdy9VY2M4cXBFTUtWUVBRVFkyWGljLzZsa3IvCkV3SURBUUFCbzJZd1pEQU9CZ05WSFE4QkFmOEVCQU1DQVFZd0VnWURWUjBUQVFIL0JBZ3dCZ0VCL3dJQkFqQWQKQmdOVkhRNEVGZ1FVSjNiTDE3UGlVd0g5WDNhekp2VFVNbU1iUlgwd0h3WURWUjBqQkJnd0ZvQVVKM2JMMTdQaQpVd0g5WDNhekp2VFVNbU1iUlgwd0RRWUpLb1pJaHZjTkFRRUxCUUFEZ2dFQkFLSXpSSXpSMmp3UG0vU25LSXRBCjIyMUJFdnJTWEh4UE13VTJQbjgybmhQWjBaOFc0K0x3ZjBFcExlZ0xWaVgzMEJrTU5INkRkTkNUbEdrSnRSZW4KSHdMNVNnZkVnaTA0V0tXenVpT25jd2dnWkNOTXpyZGhGcFFqLzNOOWhqWUM0V050UXZlaWVmYjlZOGtpbUUvVAp6STh1MXpZTFRreG5FU3pHTE8yUGNtZXQ2TmtCb0NBTU1vc3R0ZC92RlN0b250TVk1OXBiMlpnejN1MXZuZkt5CmlpbzZVM1VtbWt2NGMzdnYwbzEwTVlMVElLR2ZiRVllSkROdjFhZ3NvSWlBQklNbEhGeUh1TUZIZmp5RExiamkKOHo0TTBmKzFkNXdqc2NHVFNsQng5anJXTzk3WFNFeU9BdDNkbkE0OU5sNUJjTDZXZWlhbGlQT0F4QWVPcUROZAp2elE9Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K namespace: ZGVmYXVsdA== token: ZXlKaGJHY2lPaUpTVXpJMU5pSXNJbXRwWkNJNklrTXhUV2hJVVdKWWJtNU9OemRoTW5GV1FYVnpURjlWTkdSbmIzcDZNVVUxUTBGTlVGOTFlVFJ2UW5jaWZRLmV5SnBjM01pT2lKcmRXSmxjbTVsZEdWekwzTmxjblpwWTJWaFkyTnZkVzUwSWl3aWEzVmlaWEp1WlhSbGN5NXBieTl6WlhKMmFXTmxZV05qYjNWdWRDOXVZVzFsYzNCaFkyVWlPaUprWldaaGRXeDBJaXdpYTNWaVpYSnVaWFJsY3k1cGJ5OXpaWEoyYVdObFlXTmpiM1Z1ZEM5elpXTnlaWFF1Ym1GdFpTSTZJbVJsWm1GMWJIUXRkRzlyWlc0dFpqaHpibklpTENKcmRXSmxjbTVsZEdWekxtbHZMM05sY25acFkyVmhZMk52ZFc1MEwzTmxjblpwWTJVdFlXTmpiM1Z1ZEM1dVlXMWxJam9pWkdWbVlYVnNkQ0lzSW10MVltVnlibVYwWlhNdWFXOHZjMlZ5ZG1salpXRmpZMjkxYm5RdmMyVnlkbWxqWlMxaFkyTnZkVzUwTG5WcFpDSTZJalU1TlROalpURTNMVGxsTXpndE5EYzJPQzA1WkRZeExXVTNNRFkyWmpnek9HSXdaQ0lzSW5OMVlpSTZJbk41YzNSbGJUcHpaWEoyYVdObFlXTmpiM1Z1ZERwa1pXWmhkV3gwT21SbFptRjFiSFFpZlEub09hNkZ6SDVhTEIzRnZYWW9ZTHNRWUNTOHl2ZTRXdWRBbGtjNjFwTVd0UEFBRy1URUJ5WjNvN3FzSU0yRmNkTW9VbXFCOEFoakx0QlZjeVhOMVFfd0dDNE9oLUdRQnpJZ3JPZTRDUm5QWkpGX2F0ZW15LXlsazI1aldJSG9VOWU1azAxMHExYjhMU0RJekVwSFd0UzZlZC01ZkQxdG5lSHdtU09LYTJtdTZ2QWVsUW9ydmFoeHU3UWxHSWFUcWRQaVk3ZWRyUFpKSUFGWUNMeFAtMklFV0ZRbFJMUkRxcVN0ckpBbTFDUFFoeFh4ZUgtSFJoTzhnQnB4bHV0VUdSOU5LNFdoMnRFYWIyaGV1YUZUQkp0dVIxeTlJbVZFQzFpaTFlT2NGeGJRRi1zWnRlZGEwWFBTbE1rZ1BHYmNUT3VPOEdvZHBZTzA5TnFZRW5WR29pQWtn kind: Secret metadata: annotations: kubernetes.io/service-account.name: default kubernetes.io/service-account.uid: 5953ce17-9e38-4768-9d61-e7066f838b0d creationTimestamp: "2021-10-23T09:04:02Z" name: default-token-f8snr namespace: default resourceVersion: "229" selfLink: /api/v1/namespaces/default/secrets/default-token-f8snr uid: 11cfe3f0-ad48-458f-8959-fcc3adccacd3 type: kubernetes.io/service-account-token ```
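上面 secret 的 data 字段都是 base64 编码,可以自行解码确认其内容(secret 名称 default-token-f8snr 以实际集群生成的为准):

```bash
# namespace 字段解码后就是 "default"
echo ZGVmYXVsdA== | base64 -d

# 也可以直接从 secret 中取出 token / ca.crt 并解码
kubectl get secret default-token-f8snr -o jsonpath='{.data.token}' | base64 -d
kubectl get secret default-token-f8snr -o jsonpath='{.data.ca\.crt}' | base64 -d
```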
**默认的Sa作用:** 目前看起来就是给pod塞了一个sa,没有任何的权限绑定。 #### 3.1 默认sa的权限测试 (1)kubectl get role没看见有role和 default绑定 (2)进入一个pod后, 执行以下的命令发现这个sa没有权限 ``` / $ export CURL_CA_BUNDLE=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt / $ TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token) / $ / $ curl -H "Authorization: Bearer $TOKEN" https://kubernetes curl: (6) Could not resolve host: kubernetes // 这个就是没有权限 / $ curl -H "Authorization: Bearer $TOKEN" https://192.168.0.4:6443/api/v1/namespaces/default/pods { "kind": "Status", "apiVersion": "v1", "metadata": { }, "status": "Failure", "message": "forbidden: User \"system:serviceaccount:default:default\" cannot get path \"/\"", "reason": "Forbidden", "details": { }, "code": 403 }/ $ / $ // 这个就是没有权限 / $ curl -H "Authorization: Bearer $TOKEN" https://10.0.0.1:443/api/v1/namespaces/default/pods { "kind": "Status", "apiVersion": "v1", "metadata": { }, "status": "Failure", "message": "pods is forbidden: User \"system:serviceaccount:default:default\" cannot list resource \"pods\" in API group \"\" in the namespace \"default\"", "reason": "Forbidden", "details": { "kind": "pods" }, "code": 403 }/ $ / $ / $ exit ``` #### 3.2 自定义sa的权限测试 (1)创建sa ``` root@k8s-master:~# kubectl create serviceaccount sa-example serviceaccount/sa-example created root@k8s-master:~# kubectl get sa sa-example -oyaml apiVersion: v1 kind: ServiceAccount metadata: creationTimestamp: "2021-10-25T13:51:23Z" name: sa-example namespace: default resourceVersion: "434232" selfLink: /api/v1/namespaces/default/serviceaccounts/sa-example uid: 42654626-8b42-4c5e-83de-fb836acfc934 secrets: - name: sa-example-token-lchv2 ``` (2) 创建role ``` kind: Role apiVersion: rbac.authorization.k8s.io/v1 metadata: namespace: default # 命名空间 name: role-example rules: - apiGroups: [""] resources: ["pods"] # 可以访问pod verbs: ["get", "list"] # 可以执行GET、LIST操作 ``` (3) 创建rolebinding ``` kind: RoleBinding apiVersion: rbac.authorization.k8s.io/v1 metadata: name: rolebinding-example namespace: default subjects: - kind: User name: user-example apiGroup: rbac.authorization.k8s.io - kind: ServiceAccount name: sa-example namespace: default roleRef: kind: Role name: role-example apiGroup: rbac.authorization.k8s.io ``` (4) 将pod设置自定义sa ``` root@k8s-master:~# cat pod.yaml apiVersion: v1 kind: Pod metadata: name: nginx spec: serviceAccountName: sa-example nodeName: k8s-node containers: - name: nginx image: curlimages/curl:7.75.0 command: - sleep - "3600" ``` (5) 执行上诉命令 ``` root@k8s-master:~# kubectl get svc NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE kubernetes ClusterIP 10.0.0.1 443/TCP 2d5h ``` ``` root@k8s-master:~# kubectl exec -it nginx sh / $ export CURL_CA_BUNDLE=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt / $ TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token) / $ curl -H "Authorization: Bearer $TOKEN" https://192.168.0.4:6443 { "kind": "Status", "apiVersion": "v1", "metadata": { }, "status": "Failure", "message": "forbidden: User \"system:serviceaccount:default:sa-example\" cannot get path \"/\"", "reason": "Forbidden", "details": { }, "code": 403 }/ $ //有get pod的权限 / $ curl -H "Authorization: Bearer $TOKEN" https://10.0.0.1:443/api/v1/namespace s/default/pods { "kind": "PodList", "apiVersion": "v1", "metadata": { "selfLink": "/api/v1/namespaces/default/pods", "resourceVersion": "435185" }, "items": [ { "metadata": { "name": "nginx", "namespace": "default", "selfLink": "/api/v1/namespaces/default/pods/nginx", "uid": "0ceadb16-588f-40ae-a8c1-4d3cfb34df20", "resourceVersion": "435049", 
"creationTimestamp": "2021-10-25T13:57:11Z", "annotations": { "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"metadata\":{\"annotations\":{},\"name\":\"nginx\",\"namespace\":\"default\"},\"spec\":{\"containers\":[{\"command\":[\"sleep\",\"3600\"],\"image\":\"curlimages/curl:7.75.0\",\"name\":\"nginx\"}],\"nodeName\":\"k8s-node\",\"serviceAccountName\":\"sa-example\"}}\n" } }, "spec": { "volumes": [ { "name": "sa-example-token-lchv2", "secret": { "secretName": "sa-example-token-lchv2", "defaultMode": 420 } } ], "containers": [ { "name": "nginx", "image": "curlimages/curl:7.75.0", "command": [ "sleep", "3600" ], "resources": { }, "volumeMounts": [ { "name": "sa-example-token-lchv2", "readOnly": true, "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount" } ], "terminationMessagePath": "/dev/termination-log", "terminationMessagePolicy": "File", "imagePullPolicy": "IfNotPresent" } ], "restartPolicy": "Always", "terminationGracePeriodSeconds": 30, "dnsPolicy": "ClusterFirst", "serviceAccountName": "sa-example", "serviceAccount": "sa-example", "nodeName": "k8s-node", "securityContext": { }, "schedulerName": "default-scheduler", "tolerations": [ { "key": "node.kubernetes.io/not-ready", "operator": "Exists", "effect": "NoExecute", "tolerationSeconds": 300 }, { "key": "node.kubernetes.io/unreachable", "operator": "Exists", "effect": "NoExecute", "tolerationSeconds": 300 } ], "priority": 0, "enableServiceLinks": true }, "status": { "phase": "Running", "conditions": [ { "type": "Initialized", "status": "True", "lastProbeTime": null, "lastTransitionTime": "2021-10-25T13:57:11Z" }, { "type": "Ready", "status": "True", "lastProbeTime": null, "lastTransitionTime": "2021-10-25T13:57:13Z" }, { "type": "ContainersReady", "status": "True", "lastProbeTime": null, "lastTransitionTime": "2021-10-25T13:57:13Z" }, { "type": "PodScheduled", "status": "True", "lastProbeTime": null, "lastTransitionTime": "2021-10-25T13:57:11Z" } ], "hostIP": "192.168.0.5", "podIP": "10.244.1.7", "podIPs": [ { "ip": "10.244.1.7" } ], "startTime": "2021-10-25T13:57:11Z", "containerStatuses": [ { "name": "nginx", "state": { "running": { "startedAt": "2021-10-25T13:57:12Z" } }, "lastState": { }, "ready": true, "restartCount": 0, "image": "curlimages/curl:7.75.0", "imageID": "docker-pullable://curlimages/curl@sha256:28ec2dae8001949f657dbb36141508d65572f382dbd587f868289e2ceb0d47dd", "containerID": "docker://d6e4cc4acfa4b3093d3ee82286cf67da117f7f6ce23fd47254ee64a79d8ff29f", "started": true } ], "qosClass": "BestEffort" } } ] }/ $ //使用 apiserver的ip:端口也是可以的 / $ curl -H "Authorization: Bearer $TOKEN" https://192.168.0.4:6443/api/v1/names paces/default/pods { "kind": "PodList", "apiVersion": "v1", "metadata": { "selfLink": "/api/v1/namespaces/default/pods", "resourceVersion": "435286" }, "items": [ { "metadata": { "name": "nginx", "namespace": "default", "selfLink": "/api/v1/namespaces/default/pods/nginx", "uid": "0ceadb16-588f-40ae-a8c1-4d3cfb34df20", "resourceVersion": "435049", "creationTimestamp": "2021-10-25T13:57:11Z", "annotations": { "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"metadata\":{\"annotations\":{},\"name\":\"nginx\",\"namespace\":\"default\"},\"spec\":{\"containers\":[{\"command\":[\"sleep\",\"3600\"],\"image\":\"curlimages/curl:7.75.0\",\"name\":\"nginx\"}],\"nodeName\":\"k8s-node\",\"serviceAccountName\":\"sa-example\"}}\n" } }, "spec": { "volumes": [ { "name": "sa-example-token-lchv2", "secret": { "secretName": 
"sa-example-token-lchv2", "defaultMode": 420 } } ], "containers": [ { "name": "nginx", "image": "curlimages/curl:7.75.0", "command": [ "sleep", "3600" ], "resources": { }, "volumeMounts": [ { "name": "sa-example-token-lchv2", "readOnly": true, "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount" } ], "terminationMessagePath": "/dev/termination-log", "terminationMessagePolicy": "File", "imagePullPolicy": "IfNotPresent" } ], "restartPolicy": "Always", "terminationGracePeriodSeconds": 30, "dnsPolicy": "ClusterFirst", "serviceAccountName": "sa-example", "serviceAccount": "sa-example", "nodeName": "k8s-node", "securityContext": { }, "schedulerName": "default-scheduler", "tolerations": [ { "key": "node.kubernetes.io/not-ready", "operator": "Exists", "effect": "NoExecute", "tolerationSeconds": 300 }, { "key": "node.kubernetes.io/unreachable", "operator": "Exists", "effect": "NoExecute", "tolerationSeconds": 300 } ], "priority": 0, "enableServiceLinks": true }, "status": { "phase": "Running", "conditions": [ { "type": "Initialized", "status": "True", "lastProbeTime": null, "lastTransitionTime": "2021-10-25T13:57:11Z" }, { "type": "Ready", "status": "True", "lastProbeTime": null, "lastTransitionTime": "2021-10-25T13:57:13Z" }, { "type": "ContainersReady", "status": "True", "lastProbeTime": null, "lastTransitionTime": "2021-10-25T13:57:13Z" }, { "type": "PodScheduled", "status": "True", "lastProbeTime": null, "lastTransitionTime": "2021-10-25T13:57:11Z" } ], "hostIP": "192.168.0.5", "podIP": "10.244.1.7", "podIPs": [ { "ip": "10.244.1.7" } ], "startTime": "2021-10-25T13:57:11Z", "containerStatuses": [ { "name": "nginx", "state": { "running": { "startedAt": "2021-10-25T13:57:12Z" } }, "lastState": { }, "ready": true, "restartCount": 0, "image": "curlimages/curl:7.75.0", "imageID": "docker-pullable://curlimages/curl@sha256:28ec2dae8001949f657dbb36141508d65572f382dbd587f868289e2ceb0d47dd", "containerID": "docker://d6e4cc4acfa4b3093d3ee82286cf67da117f7f6ce23fd47254ee64a79d8ff29f", "started": true } ], "qosClass": "BestEffort" } } ] }/ $ ``` ### 4. 如何通过client-go使用sa 直接调用client-go/rest的InClusterConfig ``` // creates the in-cluster config config, err := rest.InClusterConfig() if err != nil { panic(err.Error()) } // creates the clientset clientset, err := kubernetes.NewForConfig(config) if err != nil { panic(err.Error()) } ``` InClusterConfig的源码分析,这里定义了tokenFile和rootCAFile ``` // InClusterConfig returns a config object which uses the service account // kubernetes gives to pods. It's intended for clients that expect to be // running inside a pod running on kubernetes. It will return ErrNotInCluster // if called from a process not running in a kubernetes environment. func InClusterConfig() (*Config, error) { const ( tokenFile = "/var/run/secrets/kubernetes.io/serviceaccount/token" rootCAFile = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt" ) host, port := os.Getenv("KUBERNETES_SERVICE_HOST"), os.Getenv("KUBERNETES_SERVICE_PORT") if len(host) == 0 || len(port) == 0 { return nil, ErrNotInCluster } token, err := ioutil.ReadFile(tokenFile) if err != nil { return nil, err } tlsClientConfig := TLSClientConfig{} if _, err := certutil.NewPool(rootCAFile); err != nil { klog.Errorf("Expected to load root CA config from %s, but got err: %v", rootCAFile, err) } else { tlsClientConfig.CAFile = rootCAFile } return &Config{ // TODO: switch to using cluster DNS. 
Host: "https://" + net.JoinHostPort(host, port), TLSClientConfig: tlsClientConfig, BearerToken: string(token), BearerTokenFile: tokenFile, }, nil } ``` ================================================ FILE: k8s/kube-apiserver/18 event的定义.md ================================================ k8s集群中,controller-manage、kube-proxy、kube-scheduler、kubelet等组件都会产生大量的event。这些event对查看集群对象状态或者监控告警等等都非常有用。本章写一下自己对k8s中event的理解。 ### 1. event的定义 event定义在:k8s.io/api/core/v1/types.go中 ``` type Event struct { metav1.TypeMeta `json:",inline"` metav1.ObjectMeta `json:"metadata" protobuf:"bytes,1,opt,name=metadata"` InvolvedObject ObjectReference `json:"involvedObject" protobuf:"bytes,2,opt,name=involvedObject"` Reason string `json:"reason,omitempty" protobuf:"bytes,3,opt,name=reason"` Message string `json:"message,omitempty" protobuf:"bytes,4,opt,name=message"` Source EventSource `json:"source,omitempty" protobuf:"bytes,5,opt,name=source"` FirstTimestamp metav1.Time `json:"firstTimestamp,omitempty" protobuf:"bytes,6,opt,name=firstTimestamp"` LastTimestamp metav1.Time `json:"lastTimestamp,omitempty" protobuf:"bytes,7,opt,name=lastTimestamp"` Count int32 `json:"count,omitempty" protobuf:"varint,8,opt,name=count"` Type string `json:"type,omitempty" protobuf:"bytes,9,opt,name=type"` EventTime metav1.MicroTime `json:"eventTime,omitempty" protobuf:"bytes,10,opt,name=eventTime"` Series *EventSeries `json:"series,omitempty" protobuf:"bytes,11,opt,name=series"` Action string `json:"action,omitempty" protobuf:"bytes,12,opt,name=action"` Related *ObjectReference `json:"related,omitempty" protobuf:"bytes,13,opt,name=related"` ReportingController string `json:"reportingComponent" protobuf:"bytes,14,opt,name=reportingComponent"` ReportingInstance string `json:"reportingInstance" protobuf:"bytes,15,opt,name=reportingInstance"` ReportingInstance string `json:"reportingInstance" protobuf:"bytes,15,opt,name=reportingInstance"` } ``` Count,firstTimestamp和lasteTimestamp 表示事件重复了多少次 Message 详细的事件信息 Reason 简单的事件原因 Type 目前只支持:Normal和Warning俩种 Source 事件发出的来源 InvolvedObject 引用的另一个Kubernetes对象,例如Pod或者Deployment
### 2. kubectl自定义输出k8s事件 - (该方法适用于所有对象) 通常我们是通过kubectl 查看事件,如下: ``` root@k8s-master:~# kubectl get event LAST SEEN TYPE REASON OBJECT MESSAGE 40m Normal Pulled pod/zx-hpa-7b56cddd95-5j6r4 Container image "busybox:latest" already present on machine 40m Normal Created pod/zx-hpa-7b56cddd95-5j6r4 Created container busybox 40m Normal Started pod/zx-hpa-7b56cddd95-5j6r4 Started container busybox 40m Normal Pulled pod/zx-hpa-7b56cddd95-lthbz Container image "busybox:latest" already present on machine 40m Normal Created pod/zx-hpa-7b56cddd95-lthbz Created container busybox 40m Normal Started pod/zx-hpa-7b56cddd95-lthbz Started container busybox 29m Normal Pulled pod/zx-hpa-7b56cddd95-n9ft9 Container image "busybox:latest" already present on machine 29m Normal Created pod/zx-hpa-7b56cddd95-n9ft9 Created container busybox 29m Normal Started pod/zx-hpa-7b56cddd95-n9ft9 Started container busybox ``` 补充俩点注意: (1)event也有ns,所以kubectl get event 没有找到预期的事件,看看是否加上了 ns (2)自定义event的输出 默认的kubectl get event只输出了五列,有时并没有我们想看到的内容,这个时候可以利用kubectl 的强大输出功能,输出自己想看到的信息。 ``` 根据 kubectl 操作,支持以下输出格式: Output format Description -o custom-columns= 使用逗号分隔的自定义列列表打印表。 -o custom-columns-file= 使用 文件中的自定义列模板打印表。 -o json 输出 JSON 格式的 API 对象 -o jsonpath=