[
  {
    "path": ".gitignore",
    "content": "*/.DS_Store\n.DS_Store\n**/.DS_Store\n.DS_Store?\n"
  },
  {
    "path": "README.md",
    "content": "# learning-k8s-source-code\n\nLearning how Kubernetes works, starting from the source code.\n\nThe current plan is to study the source of six core components: `kube-apiserver`, `kube-controller-manager`, `kube-scheduler`, `kubelet`, `kube-proxy`, and `kubectl`.\n\nAlong the way, these notes also record related knowledge that comes up in practice, such as etcd, Docker, and Linux.\n\nAnalysis notes on other Kubernetes components are published on the following blogs:\n\nhttps://www.jianshu.com/c/b097c5e7eb9b\n\nhttps://www.zhihu.com/column/c_1523054529113579520\n\nhttps://blog.csdn.net/zxyuliwuzhognzx11/category_11880534.html?spm=1001.2014.3001.5482\n"
  },
  {
    "path": "docker/0-docker章节介绍.md",
    "content": "Container technology is an important foundation of cloud computing, and Docker is currently one of the most popular container technologies.\n\nAt a high level, Docker is built on Linux cgroups, namespaces, chroot, and a union filesystem.\n\nThis chapter analyzes Docker from the source code. The version studied is: https://github.com/moby/moby/tree/v19.03.9\n\nChapter outline:\n\n(1) Understand how Linux namespaces, cgroups, chroot, and union filesystems work\n\n(2) Get familiar with the layout of the Docker source tree\n\n(3) Follow a typical command such as `docker run nginx ls` through the source to understand in detail what happens behind it\n"
  },
  {
    "path": "docker/1. linux namespaces 知识准备.md",
    "content": "* [1 Introduction to namespaces](#1-introduction-to-namespaces)\r\n* [2\\. pid namespaces](#2-pid-namespaces)\r\n  * [2\\.1 How to check a process's pid namespace](#21-how-to-check-a-processs-pid-namespace)\r\n  * [2\\.2 Child processes that do not share the parent's pid namespace](#22-child-processes-that-do-not-share-the-parents-pid-namespace)\r\n  * [2\\.3 How pid namespaces work](#23-how-pid-namespaces-work)\r\n  * [2\\.4 The task\\_struct structure](#24-the-task_struct-structure)\r\n* [3 Summary](#3-summary)\r\n* [4\\. References](#4-references)\r\n\r\n### 1 Introduction to namespaces\r\n\r\nA `namespace` is a kernel-level isolation mechanism provided by Linux. Many programming languages (for example C++ and Java) also have namespaces, which let a project use the same function or class name in different scopes. Linux namespaces do the same thing for system resources: the same identifier can exist in different namespaces, so `namespace A` can contain a process with pid 1 while `namespace B` contains another process that is also pid 1.\r\n\r\nNamespaces are enough to implement basic container functionality; the well-known `Docker` uses them for resource isolation.\r\n\r\nAs of the kernel version discussed here, Linux supports namespaces for 6 kinds of resources:\r\n\r\n| Type               | Parameter     | Linux Version |\r\n| ------------------ | ------------- | ------------- |\r\n| Mount namespaces   | CLONE_NEWNS   | Linux 2.4.19  |\r\n| UTS namespaces     | CLONE_NEWUTS  | Linux 2.6.19  |\r\n| IPC namespaces     | CLONE_NEWIPC  | Linux 2.6.19  |\r\n| PID namespaces     | CLONE_NEWPID  | Linux 2.6.24  |\r\n| Network namespaces | CLONE_NEWNET  | Linux 2.6.24  |\r\n| User namespaces    | CLONE_NEWUSER | Linux 2.6.23  |\r\n\r\n<br>\r\n\r\nIn short: namespaces isolate these six kinds of kernel resources (mount, uts, ipc, pid, network, user) on a per-process basis.\r\n\r\nThe rest of this article uses the pid namespace to show how namespaces take effect.\r\n\r\n<br>\r\n\r\n### 2. pid namespaces\r\n\r\n#### 2.1 How to check a process's pid namespace\r\n\r\nThe `/proc/<pid>/ns` directory shows the namespaces a process belongs to, including its pid namespace:\r\n\r\n```\r\n// ps ajxf shows a parent process and its child processes\r\n 4556  4574  4574  4574 ?           -1 Ss       0   0:00          \\_ nginx: master process nginx -g daemon off;\r\n 4574  4621  4574  4574 ?           -1 S      101   0:00              \\_ nginx: worker process\r\n 4574  4629  4574  4574 ?           -1 S      101   0:00              \\_ nginx: worker process\r\n\r\n\r\n// the parent process\r\nroot@k8s-master:/proc/170/ns# ls -l /proc/4574/ns\r\ntotal 0\r\nlrwxrwxrwx 1 root root 0 Dec  5 08:49 cgroup -> 'cgroup:[4026531835]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 08:49 ipc -> 'ipc:[4026532263]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 08:49 mnt -> 'mnt:[4026532331]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 08:49 net -> 'net:[4026532266]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 08:49 pid -> 'pid:[4026532333]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 08:49 pid_for_children -> 'pid:[4026532333]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 08:49 user -> 'user:[4026531837]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 08:49 uts -> 'uts:[4026532332]'\r\n\r\n// the child process has the same namespaces as its parent\r\nroot@k8s-master:/proc/170/ns# ls -l /proc/4621/ns\r\ntotal 0\r\nlrwxrwxrwx 1 systemd-timesync systemd-journal 0 Dec  5 08:49 cgroup -> 'cgroup:[4026531835]'\r\nlrwxrwxrwx 1 systemd-timesync systemd-journal 0 Dec  5 08:49 ipc -> 'ipc:[4026532263]'\r\nlrwxrwxrwx 1 systemd-timesync systemd-journal 0 Dec  5 08:49 mnt -> 'mnt:[4026532331]'\r\nlrwxrwxrwx 1 systemd-timesync systemd-journal 0 Dec  5 08:49 net -> 'net:[4026532266]'\r\nlrwxrwxrwx 1 systemd-timesync systemd-journal 0 Dec  5 08:49 pid -> 'pid:[4026532333]'\r\nlrwxrwxrwx 1 systemd-timesync systemd-journal 0 Dec  5 08:49 pid_for_children -> 'pid:[4026532333]'\r\nlrwxrwxrwx 1 systemd-timesync systemd-journal 0 Dec  5 08:49 user -> 'user:[4026531837]'\r\nlrwxrwxrwx 1 systemd-timesync systemd-journal 0 Dec  5 08:49 uts -> 'uts:[4026532332]'\r\n```\r\n\r\n<br>\r\n\r\n#### 2.2 Child processes that do not share the parent's pid namespace\r\n\r\nUse `unshare` to run a process in new pid and mount namespaces:\r\n\r\n```\r\nroot@k8s-master:~# unshare --fork --pid --mount-proc sleep 100\r\n```\r\n\r\n<br>\r\n\r\nThe process tree as seen from the parent namespace:\r\n\r\n```\r\n   1   701   701   701 ?           -1 Ss       0   1:28 /usr/sbin/sshd -D\r\n  701  4462  4462  4462 ?           -1 Ss       0   0:00  \\_ sshd: root@pts/0,pts/1\r\n 4462  4497  4497  4497 pts/0     3994 Ss       0   0:00      \\_ -bash\r\n 4497  3106  3106  4497 pts/0     3994 S        0   0:00      |   \\_ bash\r\n 3106  3994  3994  4497 pts/0     3994 S+       0   0:00      |       \\_ unshare --fork --pid --mount-\r\n 3994  3995  3994  4497 pts/0     3994 S+       0   0:00      |           \\_ sleep 100\r\n```\r\n\r\n<br>\r\n\r\n```\r\n// this is the sleep 100 process; any children it forks will share its pid namespace\r\nroot@k8s-master:~# ls -l /proc/3995/ns\r\ntotal 0\r\nlrwxrwxrwx 1 root root 0 Dec  5 10:00 cgroup -> 'cgroup:[4026531835]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 10:00 ipc -> 'ipc:[4026531839]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 10:00 mnt -> 'mnt:[4026532334]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 10:00 net -> 'net:[4026531992]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 10:00 pid -> 'pid:[4026532335]'                 // pid and pid_for_children are the same here\r\nlrwxrwxrwx 1 root root 0 Dec  5 10:00 pid_for_children -> 'pid:[4026532335]'    \r\nlrwxrwxrwx 1 root root 0 Dec  5 10:00 user -> 'user:[4026531837]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 10:00 uts -> 'uts:[4026531838]'\r\nroot@k8s-master:~# \r\n\r\n// this is the unshare process; because --pid was used, its pid_for_children differs from the parent's pid namespace\r\nroot@k8s-master:~# ls -l /proc/3994/ns\r\ntotal 0\r\nlrwxrwxrwx 1 root root 0 Dec  5 10:01 cgroup -> 'cgroup:[4026531835]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 10:01 ipc -> 'ipc:[4026531839]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 10:01 mnt -> 'mnt:[4026532334]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 10:01 net -> 'net:[4026531992]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 10:01 pid -> 'pid:[4026531836]'                  // different: unshare itself stays in the original pid namespace, only its children enter the new one\r\nlrwxrwxrwx 1 root root 0 Dec  5 10:01 pid_for_children -> 'pid:[4026532335]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 10:01 user -> 'user:[4026531837]'\r\nlrwxrwxrwx 1 root root 0 Dec  5 10:01 uts -> 'uts:[4026531838]'\r\n```\r\n\r\n#### 2.3 How pid namespaces work\r\n\r\nSo that every process can belong to namespaces, the Linux kernel adds a `struct nsproxy` pointer to the process descriptor:\r\n\r\n```\r\nstruct task_struct {\r\n    ...\r\n    /* namespaces */\r\n    struct nsproxy *nsproxy;\r\n    ...\r\n}\r\n\r\nstruct nsproxy {\r\n    atomic_t count;\r\n    struct uts_namespace  *uts_ns;\r\n    struct ipc_namespace  *ipc_ns;\r\n    struct mnt_namespace  *mnt_ns;\r\n    struct pid_namespace  *pid_ns;\r\n    struct user_namespace *user_ns;\r\n    struct net            *net_ns;\r\n};\r\n```\r\n\r\nAs the definition of `struct nsproxy` shows, the kernel defines a separate structure for each kind of namespaced resource; the pid namespace, for example, is managed by `struct pid_namespace`. Since namespaces cover quite a few resource types, this article focuses on the pid namespace.\r\n\r\nLet's first look at the definition of `struct pid_namespace`:\r\n\r\n```\r\nstruct pid_namespace {\r\n    struct kref kref;\r\n    struct pidmap pidmap[PIDMAP_ENTRIES];\r\n    int last_pid;\r\n    struct task_struct *child_reaper;\r\n    struct kmem_cache *pid_cachep;\r\n    unsigned int level;\r\n    struct pid_namespace *parent;\r\n#ifdef CONFIG_PROC_FS\r\n    struct vfsmount *proc_mnt;\r\n#endif\r\n};\r\n```\r\n\r\nBecause `struct pid_namespace` is mainly used to allocate free pids within the current pid namespace, its definition is fairly simple:\r\n\r\n- `kref` is a reference counter recording how many processes use this structure\r\n- `pidmap` is a bitmap used to quickly find a free pid\r\n- `last_pid` records the last allocated pid\r\n- `level` records the depth of this pid namespace in the hierarchy\r\n- `parent` points to the parent pid namespace\r\n\r\npid namespaces are hierarchical: when a new pid namespace is created, its parent is recorded in the `parent` field, so as pid namespaces are created the kernel builds a tree of them, as shown below (image source):\r\n\r\n![image-20220226155912906](./image/ns-1.png)\r\n\r\nThe level-0 pid namespace is the one the `init` process lives in. A process whose pid namespace sits at level `N` has a unique pid in every namespace from level 0 through level N. In other words, processes in a deeper (higher-level) pid namespace are visible from the shallower (lower-level) namespaces above it, while processes in shallower namespaces are not visible from deeper ones.\r\n\r\nSince a process at level N has a unique pid in each of levels 0 through N, the process descriptor records its pid at every level through the `pids` member:\r\n\r\n```\r\nstruct task_struct {\r\n    ...\r\n    struct pid_link pids[PIDTYPE_MAX];\r\n    ...\r\n}\r\n\r\nenum pid_type {\r\n    PIDTYPE_PID,\r\n    PIDTYPE_PGID,\r\n    PIDTYPE_SID,\r\n    PIDTYPE_MAX\r\n};\r\n\r\nstruct upid {\r\n    int nr;\r\n    struct pid_namespace *ns;\r\n    struct hlist_node pid_chain;\r\n};\r\n\r\nstruct pid {\r\n    atomic_t count;\r\n    struct hlist_head tasks[PIDTYPE_MAX];\r\n    struct rcu_head rcu;\r\n    unsigned int level;\r\n    struct upid numbers[1];\r\n};\r\n\r\nstruct pid_link {\r\n    struct hlist_node node;\r\n    struct pid *pid;\r\n};\r\n```\r\n\r\nThe relationship between these structures is shown below:\r\n\r\n![image-20220226160353792](./image/ns-2.png)\r\n\r\nWe mainly care about `struct pid`. Its member `numbers` is declared as a one-element array of `struct upid`, but it is really a dynamically sized array whose length matches `level`: when `level` is 5, `numbers` holds 5 elements. Each element records the pid in the corresponding level's pid namespace, and the `nr` field of `struct upid` stores that pid number.\r\n\r\nLet's look at how pids are allocated. In the kernel this is done by `alloc_pid()`:\r\n\r\n```\r\nstruct pid *alloc_pid(struct pid_namespace *ns)\r\n{\r\n    struct pid *pid;\r\n    enum pid_type type;\r\n    int i, nr;\r\n    struct pid_namespace *tmp;\r\n    struct upid *upid;\r\n\r\n    pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);\r\n    if (!pid)\r\n        goto out;\r\n\r\n    tmp = ns;\r\n    for (i = ns->level; i >= 0; i--) {\r\n        nr = alloc_pidmap(tmp);    // allocate a pid at each pid-namespace level the process belongs to\r\n        if (nr < 0)\r\n            goto out_free;\r\n\r\n        pid->numbers[i].nr = nr;   // the pid number at level i\r\n        pid->numbers[i].ns = tmp;  // the pid namespace at level i\r\n        tmp = tmp->parent;\r\n    }\r\n\r\n    get_pid_ns(ns);\r\n    pid->level = ns->level;\r\n    atomic_set(&pid->count, 1);\r\n    for (type = 0; type < PIDTYPE_MAX; ++type)\r\n        INIT_HLIST_HEAD(&pid->tasks[type]);\r\n\r\n    spin_lock_irq(&pidmap_lock);\r\n    for (i = ns->level; i >= 0; i--) {\r\n        upid = &pid->numbers[i];\r\n        // link the upid into the global pid hash table for fast lookup\r\n        hlist_add_head_rcu(&upid->pid_chain,\r\n                &pid_hash[pid_hashfn(upid->nr, upid->ns)]);\r\n    }\r\n    spin_unlock_irq(&pidmap_lock);\r\n\r\nout:\r\n    return pid;\r\n\r\n    ...\r\n}\r\n```\r\n\r\nIn the code above, the loop `for (i = ns->level; i >= 0; i--)` walks up through the `parent` pointers, allocating a unique pid in each level's pid namespace and saving it in the corresponding `nr` field. Each level's upid is also added to the global pid hash table, so that a process can later be found quickly by its pid number.\r\n\r\nNow let's see how a pid number is resolved back to a process. In the kernel, `find_get_pid()` looks up the `struct pid` for a pid number (the call chain is find_get_pid() -> find_vpid() -> find_pid_ns()):\r\n\r\n```\r\nstruct pid *find_get_pid(pid_t nr)\r\n{\r\n    struct pid *pid;\r\n\r\n    rcu_read_lock();\r\n    pid = get_pid(find_vpid(nr));\r\n    rcu_read_unlock();\r\n\r\n    return pid;\r\n}\r\n\r\nstruct pid *find_vpid(int nr)\r\n{\r\n    return find_pid_ns(nr, current->nsproxy->pid_ns);\r\n}\r\n\r\nstruct pid *find_pid_ns(int nr, struct pid_namespace *ns)\r\n{\r\n    struct hlist_node *elem;\r\n    struct upid *pnr;\r\n\r\n    hlist_for_each_entry_rcu(pnr, elem,\r\n            &pid_hash[pid_hashfn(nr, ns)], pid_chain)\r\n        if (pnr->nr == nr && pnr->ns == ns)\r\n            return container_of(pnr, struct pid,\r\n                    numbers[ns->level]);\r\n\r\n    return NULL;\r\n}\r\n```\r\n\r\nTo look up a `struct pid`, the pid number and the current process's pid namespace are passed to `find_pid_ns()`, which searches the global pid hash table. Once the `struct pid` is found, getting the corresponding process descriptor is easy, for example via `pid_task()`; that code is simple, so it is not analyzed here.\r\n\r\n#### 2.4 The task_struct structure\r\n\r\n```\r\nstruct task_struct \r\n{\r\n    /* \r\n    1. 
state: 进程执行时，它会根据具体情况改变状态。进程状态是进程调度和对换的依据。Linux中的进程主要有如下状态:\r\n        1) TASK_RUNNING: 可运行\r\n        处于这种状态的进程，只有两种状态:\r\n            1.1) 正在运行\r\n            正在运行的进程就是当前进程(由current所指向的进程)\r\n            1.2) 正准备运行\r\n            准备运行的进程只要得到CPU就可以立即投入运行，CPU是这些进程唯一等待的系统资源，系统中有一个运行队列(run_queue)，用来容纳所有处于可运行状态的进程，调度程序执行时，从中选择一个进程投入运行 \r\n        \r\n        2) TASK_INTERRUPTIBLE: 可中断的等待状态，是针对等待某事件或其他资源的睡眠进程设置的，在内核发送信号给该进程表明事件已经发生时，进程状态变为TASK_RUNNING，它只要调度器选中该进程即可恢复执行 \r\n        \r\n        3) TASK_UNINTERRUPTIBLE: 不可中断的等待状态\r\n        处于该状态的进程正在等待某个事件(event)或某个资源，它肯定位于系统中的某个等待队列(wait_queue)中，处于不可中断等待态的进程是因为硬件环境不能满足而等待，例如等待特定的系统资源，它任何情况下都不能被打断，只能用特定的方式来唤醒它，例如唤醒函数wake_up()等 \r\n　　　　　它们不能由外部信号唤醒，只能由内核亲自唤醒        \r\n\r\n        4) TASK_ZOMBIE: 僵死\r\n        进程虽然已经终止，但由于某种原因，父进程还没有执行wait()系统调用，终止进程的信息也还没有回收。顾名思义，处于该状态的进程就是死进程，这种进程实际上是系统中的垃圾，必须进行相应处理以释放其占用的资源。\r\n\r\n        5) TASK_STOPPED: 暂停\r\n        此时的进程暂时停止运行来接受某种特殊处理。通常当进程接收到SIGSTOP、SIGTSTP、SIGTTIN或 SIGTTOU信号后就处于这种状态。例如，正接受调试的进程就处于这种状态\r\n　　　　\r\n　　　　　6) TASK_TRACED\r\n　　　　　从本质上来说，这属于TASK_STOPPED状态，用于从停止的进程中，将当前被调试的进程与常规的进程区分开来\r\n　　　　　　\r\n　　　　　7) TASK_DEAD\r\n　　　　　父进程wait系统调用发出后，当子进程退出时，父进程负责回收子进程的全部资源，子进程进入TASK_DEAD状态\r\n\r\n        8) TASK_SWAPPING: 换入/换出\r\n    */\r\n    volatile long state;\r\n    \r\n    /*\r\n    2. stack\r\n    进程内核栈，进程通过alloc_thread_info函数分配它的内核栈，通过free_thread_info函数释放所分配的内核栈\r\n    */     \r\n    void *stack;\r\n    \r\n    /*\r\n    3. usage\r\n    进程描述符使用计数，被置为2时，表示进程描述符正在被使用而且其相应的进程处于活动状态\r\n    */\r\n    atomic_t usage;\r\n\r\n    /*\r\n    4. 
flags\r\n    flags是进程当前的状态标志(注意和运行状态区分)\r\n        1) #define PF_ALIGNWARN    0x00000001: 显示内存地址未对齐警告\r\n        2) #define PF_PTRACED    0x00000010: 标识是否是否调用了ptrace\r\n        3) #define PF_TRACESYS    0x00000020: 跟踪系统调用\r\n        4) #define PF_FORKNOEXEC 0x00000040: 已经完成fork，但还没有调用exec\r\n        5) #define PF_SUPERPRIV    0x00000100: 使用超级用户(root)权限\r\n        6) #define PF_DUMPCORE    0x00000200: dumped core  \r\n        7) #define PF_SIGNALED    0x00000400: 此进程由于其他进程发送相关信号而被杀死 \r\n        8) #define PF_STARTING    0x00000002: 当前进程正在被创建\r\n        9) #define PF_EXITING    0x00000004: 当前进程正在关闭\r\n        10) #define PF_USEDFPU    0x00100000: Process used the FPU this quantum(SMP only)  \r\n        #define PF_DTRACE    0x00200000: delayed trace (used on m68k)  \r\n    */\r\n    unsigned int flags;     \r\n\r\n    /*\r\n    5. ptrace\r\n    ptrace系统调用，成员ptrace被设置为0时表示不需要被跟踪，它的可能取值如下： \r\n    linux-2.6.38.8/include/linux/ptrace.h  \r\n        1) #define PT_PTRACED    0x00000001\r\n        2) #define PT_DTRACE    0x00000002: delayed trace (used on m68k, i386) \r\n        3) #define PT_TRACESYSGOOD    0x00000004\r\n        4) #define PT_PTRACE_CAP    0x00000008: ptracer can follow suid-exec \r\n        5) #define PT_TRACE_FORK    0x00000010\r\n        6) #define PT_TRACE_VFORK    0x00000020\r\n        7) #define PT_TRACE_CLONE    0x00000040\r\n        8) #define PT_TRACE_EXEC    0x00000080\r\n        9) #define PT_TRACE_VFORK_DONE    0x00000100\r\n        10) #define PT_TRACE_EXIT    0x00000200\r\n    */\r\n    unsigned int ptrace;\r\n    unsigned long ptrace_message;\r\n    siginfo_t *last_siginfo; \r\n\r\n    /*\r\n    6. lock_depth\r\n    用于表示获取大内核锁的次数，如果进程未获得过锁，则置为-1\r\n    */\r\n    int lock_depth;         \r\n\r\n    /*\r\n    7. oncpu\r\n    在SMP上帮助实现无加锁的进程切换(unlocked context switches)\r\n    */\r\n#ifdef CONFIG_SMP\r\n#ifdef __ARCH_WANT_UNLOCKED_CTXSW\r\n    int oncpu;\r\n#endif\r\n#endif\r\n\r\n    /*\r\n    8. 
进程调度\r\n        1) prio: 调度器考虑的优先级保存在prio，由于在某些情况下内核需要暂时提高进程的优先级，因此需要第三个成员来表示(除了static_prio、normal_prio之外)，由于这些改变不是持久的，因此静态(static_prio)和普通(normal_prio)优先级不受影响\r\n        2) static_prio: 用于保存进程的\"静态优先级\"，静态优先级是进程\"启动\"时分配的优先级，它可以用nice、sched_setscheduler系统调用修改，否则在进程运行期间会一直保持恒定\r\n        3) normal_prio: 表示基于进程的\"静态优先级\"和\"调度策略\"计算出的优先级，因此，即使普通进程和实时进程具有相同的静态优先级(static_prio)，其普通优先级(normal_prio)也是不同的。进程分支时(fork)，新创建的子进程会集成普通优先级   \r\n    */\r\n    int prio, static_prio, normal_prio;\r\n    /*\r\n        4) rt_priority: 表示实时进程的优先级，需要明白的是，\"实时进程优先级\"和\"普通进程优先级\"有两个独立的范畴，实时进程即使是最低优先级也高于普通进程，最低的实时优先级为0，最高的优先级为99，值越大，表明优先级越高\r\n    */\r\n    unsigned int rt_priority;\r\n    /*\r\n        5) sched_class: 该进程所属的调度类，目前内核中有实现以下四种： \r\n            5.1) static const struct sched_class fair_sched_class;\r\n            5.2) static const struct sched_class rt_sched_class;\r\n            5.3) static const struct sched_class idle_sched_class;\r\n            5.4) static const struct sched_class stop_sched_class;        \r\n    */\r\n    const struct sched_class *sched_class;\r\n    /*\r\n        6) se: 用于普通进程的调用实体 \r\n　　调度器不限于调度进程，还可以处理更大的实体，这可以实现\"组调度\"，可用的CPU时间可以首先在一般的进程组(例如所有进程可以按所有者分组)之间分配，接下来分配的时间在组内再次分配\r\n　　这种一般性要求调度器不直接操作进程，而是处理\"可调度实体\"，一个实体有sched_entity的一个实例标识\r\n　　在最简单的情况下，调度在各个进程上执行，由于调度器设计为处理可调度的实体，在调度器看来各个进程也必须也像这样的实体，因此se在task_struct中内嵌了一个sched_entity实例，调度器可据此操作各个task_struct\r\n    */\r\n    struct sched_entity se;\r\n    /*\r\n        7) rt: 用于实时进程的调用实体 \r\n    */\r\n    struct sched_rt_entity rt;\r\n\r\n#ifdef CONFIG_PREEMPT_NOTIFIERS \r\n    /*\r\n    9. preempt_notifier\r\n    preempt_notifiers结构体链表 \r\n    */\r\n    struct hlist_head preempt_notifiers;\r\n#endif\r\n \r\n     /*\r\n     10. fpu_counter\r\n     FPU使用计数 \r\n     */\r\n    unsigned char fpu_counter;\r\n\r\n#ifdef CONFIG_BLK_DEV_IO_TRACE\r\n    /*\r\n    11. btrace_seq\r\n    blktrace是一个针对Linux内核中块设备I/O层的跟踪工具\r\n    */\r\n    unsigned int btrace_seq;\r\n#endif\r\n\r\n    /*\r\n    12. 
policy\r\n    policy表示进程的调度策略，目前主要有以下五种：\r\n        1) #define SCHED_NORMAL        0: 用于普通进程，它们通过完全公平调度器来处理\r\n        2) #define SCHED_FIFO        1: 先来先服务调度，由实时调度类处理\r\n        3) #define SCHED_RR            2: 时间片轮转调度，由实时调度类处理\r\n        4) #define SCHED_BATCH        3: 用于非交互、CPU使用密集的批处理进程，通过完全公平调度器来处理，调度决策对此类进程给与\"冷处理\"，它们绝不会抢占CFS调度器处理的另一个进程，因此不会干扰交互式进程，如果不打算用nice降低进程的静态优先级，同时又不希望该进程影响系统的交互性，最适合用该调度策略\r\n        5) #define SCHED_IDLE        5: 可用于次要的进程，其相对权重总是最小的，也通过完全公平调度器来处理。要注意的是，SCHED_IDLE不负责调度空闲进程，空闲进程由内核提供单独的机制来处理\r\n    只有root用户能通过sched_setscheduler()系统调用来改变调度策略 \r\n    */\r\n    unsigned int policy;\r\n\r\n    /*\r\n    13. cpus_allowed\r\n    cpus_allowed是一个位域，在多处理器系统上使用，用于控制进程可以在哪里处理器上运行\r\n    */\r\n    cpumask_t cpus_allowed;\r\n\r\n    /*\r\n    14. RCU同步原语 \r\n    */\r\n#ifdef CONFIG_TREE_PREEMPT_RCU\r\n    int rcu_read_lock_nesting;\r\n    char rcu_read_unlock_special;\r\n    struct rcu_node *rcu_blocked_node;\r\n    struct list_head rcu_node_entry;\r\n#endif /* #ifdef CONFIG_TREE_PREEMPT_RCU */\r\n\r\n#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)\r\n    /*\r\n    15. sched_info\r\n    用于调度器统计进程的运行信息\r\n    */\r\n    struct sched_info sched_info;\r\n#endif\r\n\r\n    /*\r\n    16. tasks\r\n    通过list_head将当前进程的task_struct串联进内核的进程列表中，构建；linux进程链表\r\n    */\r\n    struct list_head tasks;\r\n\r\n    /*\r\n    17. pushable_tasks\r\n    limit pushing to one attempt \r\n    */\r\n    struct plist_node pushable_tasks;\r\n\r\n    /*\r\n    18. 进程地址空间 \r\n        1) mm: 指向进程所拥有的内存描述符 \r\n        2) active_mm: active_mm指向进程运行时所使用的内存描述符\r\n    对于普通进程而言，这两个指针变量的值相同。但是，内核线程不拥有任何内存描述符，所以它们的mm成员总是为NULL。当内核线程得以运行时，它的active_mm成员被初始化为前一个运行进程的active_mm值\r\n    */\r\n    struct mm_struct *mm, *active_mm;\r\n\r\n    /*\r\n    19. exit_state\r\n    进程退出状态码\r\n    */\r\n    int exit_state;\r\n\r\n    /*\r\n    20. 
判断标志\r\n        1) exit_code\r\n        exit_code用于设置进程的终止代号，这个值要么是_exit()或exit_group()系统调用参数(正常终止)，要么是由内核提供的一个错误代号(异常终止)\r\n        2) exit_signal\r\n        exit_signal被置为-1时表示是某个线程组中的一员。只有当线程组的最后一个成员终止时，才会产生一个信号，以通知线程组的领头进程的父进程\r\n    */\r\n    int exit_code, exit_signal; \r\n    /*\r\n        3) pdeath_signal\r\n        pdeath_signal用于判断父进程终止时发送信号\r\n    */\r\n    int pdeath_signal;   \r\n    /*\r\n        4)  personality用于处理不同的ABI，它的可能取值如下： \r\n            enum \r\n            {\r\n                PER_LINUX =        0x0000,\r\n                PER_LINUX_32BIT =    0x0000 | ADDR_LIMIT_32BIT,\r\n                PER_LINUX_FDPIC =    0x0000 | FDPIC_FUNCPTRS,\r\n                PER_SVR4 =        0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO,\r\n                PER_SVR3 =        0x0002 | STICKY_TIMEOUTS | SHORT_INODE,\r\n                PER_SCOSVR3 =        0x0003 | STICKY_TIMEOUTS |\r\n                                 WHOLE_SECONDS | SHORT_INODE,\r\n                PER_OSR5 =        0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS,\r\n                PER_WYSEV386 =        0x0004 | STICKY_TIMEOUTS | SHORT_INODE,\r\n                PER_ISCR4 =        0x0005 | STICKY_TIMEOUTS,\r\n                PER_BSD =        0x0006,\r\n                PER_SUNOS =        0x0006 | STICKY_TIMEOUTS,\r\n                PER_XENIX =        0x0007 | STICKY_TIMEOUTS | SHORT_INODE,\r\n                PER_LINUX32 =        0x0008,\r\n                PER_LINUX32_3GB =    0x0008 | ADDR_LIMIT_3GB,\r\n                PER_IRIX32 =        0x0009 | STICKY_TIMEOUTS, \r\n                PER_IRIXN32 =        0x000a | STICKY_TIMEOUTS, \r\n                PER_IRIX64 =        0x000b | STICKY_TIMEOUTS, \r\n                PER_RISCOS =        0x000c,\r\n                PER_SOLARIS =        0x000d | STICKY_TIMEOUTS,\r\n                PER_UW7 =        0x000e | STICKY_TIMEOUTS | MMAP_PAGE_ZERO,\r\n                PER_OSF4 =        0x000f,              \r\n                PER_HPUX =        0x0010,\r\n                
PER_MASK =        0x00ff,\r\n            };\r\n    */\r\n    unsigned int personality;\r\n    /*\r\n        5) did_exec\r\n        did_exec用于记录进程代码是否被execve()函数所执行\r\n    */\r\n    unsigned did_exec:1;\r\n    /*\r\n        6) in_execve\r\n        in_execve用于通知LSM是否被do_execve()函数所调用\r\n    */\r\n    unsigned in_execve:1;     \r\n    /*\r\n        7) in_iowait\r\n        in_iowait用于判断是否进行iowait计数\r\n    */\r\n    unsigned in_iowait:1;\r\n\r\n    /*\r\n        8) sched_reset_on_fork\r\n        sched_reset_on_fork用于判断是否恢复默认的优先级或调度策略\r\n    */\r\n    unsigned sched_reset_on_fork:1;\r\n\r\n    /*\r\n    21. 进程标识符(PID)\r\n    在CONFIG_BASE_SMALL配置为0的情况下，PID的取值范围是0到32767，即系统中的进程数最大为32768个\r\n    #define PID_MAX_DEFAULT (CONFIG_BASE_SMALL ? 0x1000 : 0x8000)  \r\n    在Linux系统中，一个线程组中的所有线程使用和该线程组的领头线程(该组中的第一个轻量级进程)相同的PID，并被存放在tgid成员中。只有线程组的领头线程的pid成员才会被设置为与tgid相同的值。注意，getpid()系统调用\r\n返回的是当前进程的tgid值而不是pid值。\r\n    */\r\n    pid_t pid;\r\n    pid_t tgid;\r\n\r\n#ifdef CONFIG_CC_STACKPROTECTOR \r\n    /*\r\n    22. stack_canary\r\n    防止内核堆栈溢出，在GCC编译内核时，需要加上-fstack-protector选项\r\n    */\r\n    unsigned long stack_canary;\r\n#endif\r\n \r\n     /*\r\n     23. 表示进程亲属关系的成员 \r\n         1) real_parent: 指向其父进程，如果创建它的父进程不再存在，则指向PID为1的init进程\r\n         2) parent: 指向其父进程，当它终止时，必须向它的父进程发送信号。它的值通常与real_parent相同 \r\n     */\r\n    struct task_struct *real_parent;  \r\n    struct task_struct *parent;   \r\n    /*\r\n        3) children: 表示链表的头部，链表中的所有元素都是它的子进程(子进程链表)\r\n        4) sibling: 用于把当前进程插入到兄弟链表中(连接到父进程的子进程链表(兄弟链表))\r\n        5) group_leader: 指向其所在进程组的领头进程\r\n    */\r\n    struct list_head children;     \r\n    struct list_head sibling;     \r\n    struct task_struct *group_leader;     \r\n     \r\n    struct list_head ptraced;\r\n    struct list_head ptrace_entry; \r\n    struct bts_context *bts;\r\n\r\n    /*\r\n    24. pids\r\n    PID散列表和链表  \r\n    */\r\n    struct pid_link pids[PIDTYPE_MAX];\r\n    /*\r\n    25. 
thread_group\r\n    线程组中所有进程的链表\r\n    */\r\n    struct list_head thread_group;\r\n\r\n    /*\r\n    26. do_fork函数 \r\n        1) vfork_done\r\n        在执行do_fork()时，如果给定特别标志，则vfork_done会指向一个特殊地址\r\n        2) set_child_tid、clear_child_tid\r\n        如果copy_process函数的clone_flags参数的值被置为CLONE_CHILD_SETTID或CLONE_CHILD_CLEARTID，则会把child_tidptr参数的值分别复制到set_child_tid和clear_child_tid成员。这些标志说明必须改变子\r\n进程用户态地址空间的child_tidptr所指向的变量的值。\r\n    */\r\n    struct completion *vfork_done;         \r\n    int __user *set_child_tid;         \r\n    int __user *clear_child_tid;         \r\n\r\n    /*\r\n    27. 记录进程的I/O计数(时间)\r\n        1) utime\r\n        用于记录进程在\"用户态\"下所经过的节拍数(定时器)\r\n        2) stime\r\n        用于记录进程在\"内核态\"下所经过的节拍数(定时器)\r\n        3) utimescaled\r\n        用于记录进程在\"用户态\"的运行时间，但它们以处理器的频率为刻度\r\n        4) stimescaled\r\n        用于记录进程在\"内核态\"的运行时间，但它们以处理器的频率为刻度\r\n    */\r\n    cputime_t utime, stime, utimescaled, stimescaled;\r\n    /*\r\n        5) gtime\r\n        以节拍计数的虚拟机运行时间(guest time)\r\n    */\r\n    cputime_t gtime;\r\n    /*\r\n        6) prev_utime、prev_stime是先前的运行时间\r\n    */\r\n    cputime_t prev_utime, prev_stime; \r\n    /*\r\n        7) nvcsw\r\n        自愿(voluntary)上下文切换计数\r\n        8) nivcsw\r\n        非自愿(involuntary)上下文切换计数\r\n    */\r\n    unsigned long nvcsw, nivcsw; \r\n    /*\r\n        9) start_time\r\n        进程创建时间\r\n        10) real_start_time\r\n        进程睡眠时间，还包含了进程睡眠时间，常用于/proc/pid/stat，\r\n    */\r\n    struct timespec start_time;          \r\n    struct timespec real_start_time;\r\n    /*\r\n        11) cputime_expires\r\n        用来统计进程或进程组被跟踪的处理器时间，其中的三个成员对应着cpu_timers[3]的三个链表\r\n    */\r\n    struct task_cputime cputime_expires;\r\n    struct list_head cpu_timers[3];\r\n    #ifdef CONFIG_DETECT_HUNG_TASK \r\n    /*\r\n        12) last_switch_count\r\n        nvcsw和nivcsw的总和\r\n    */\r\n    unsigned long last_switch_count;\r\n    #endif\r\n    struct task_io_accounting ioac;\r\n#if defined(CONFIG_TASK_XACCT)\r\n    u64 
acct_rss_mem1;     \r\n    u64 acct_vm_mem1;     \r\n    cputime_t acct_timexpd;     \r\n#endif\r\n\r\n    /*\r\n    28. 缺页统计 \r\n    */     \r\n    unsigned long min_flt, maj_flt; \r\n\r\n    /*\r\n    29. 进程权能 \r\n    */\r\n    const struct cred *real_cred;     \r\n    const struct cred *cred;     \r\n    struct mutex cred_guard_mutex;     \r\n    struct cred *replacement_session_keyring;  \r\n\r\n    /*\r\n    30. comm[TASK_COMM_LEN]\r\n    相应的程序名 \r\n    */\r\n    char comm[TASK_COMM_LEN]; \r\n\r\n    /*\r\n    31. 文件 \r\n        1) fs\r\n        用来表示进程与文件系统的联系，包括当前目录和根目录\r\n        2) files\r\n        表示进程当前打开的文件\r\n    */\r\n    int link_count, total_link_count; \r\n    struct fs_struct *fs; \r\n    struct files_struct *files;\r\n\r\n#ifdef CONFIG_SYSVIPC \r\n    /*\r\n    32. sysvsem\r\n    进程通信(SYSVIPC)\r\n    */\r\n    struct sysv_sem sysvsem;\r\n#endif\r\n\r\n    /*\r\n    33. 处理器特有数据\r\n    */\r\n    struct thread_struct thread;  \r\n\r\n    /*\r\n    34. nsproxy\r\n    命名空间 \r\n    */\r\n    struct nsproxy *nsproxy; \r\n\r\n    /*\r\n    35. 信号处理 \r\n        1) signal: 指向进程的信号描述符\r\n        2) sighand: 指向进程的信号处理程序描述符\r\n    */\r\n    struct signal_struct *signal;\r\n    struct sighand_struct *sighand;\r\n    /*\r\n        3) blocked: 表示被阻塞信号的掩码\r\n        4) real_blocked: 表示临时掩码\r\n    */\r\n    sigset_t blocked, real_blocked;\r\n    sigset_t saved_sigmask;     \r\n    /*\r\n        5) pending: 存放私有挂起信号的数据结构\r\n    */\r\n    struct sigpending pending;\r\n    /*\r\n        6) sas_ss_sp: 信号处理程序备用堆栈的地址\r\n        7) sas_ss_size: 表示堆栈的大小\r\n    */\r\n    unsigned long sas_ss_sp;\r\n    size_t sas_ss_size;\r\n    /*\r\n        8) notifier\r\n        设备驱动程序常用notifier指向的函数来阻塞进程的某些信号\r\n        9) otifier_data\r\n        指的是notifier所指向的函数可能使用的数据。\r\n        10) otifier_mask\r\n        标识这些信号的位掩码\r\n    */\r\n    int (*notifier)(void *priv);\r\n    void *notifier_data;\r\n    sigset_t *notifier_mask;\r\n\r\n    /*\r\n    36. 
进程审计 \r\n    */\r\n    struct audit_context *audit_context; \r\n#ifdef CONFIG_AUDITSYSCALL\r\n    uid_t loginuid;\r\n    unsigned int sessionid;\r\n#endif\r\n\r\n    /*\r\n    37. secure computing \r\n    */\r\n    seccomp_t seccomp;\r\n     \r\n     /*\r\n     38. 用于copy_process函数使用CLONE_PARENT标记时 \r\n     */\r\n       u32 parent_exec_id;\r\n       u32 self_exec_id;\r\n \r\n     /*\r\n     39. alloc_lock\r\n     用于保护资源分配或释放的自旋锁 \r\n     */\r\n    spinlock_t alloc_lock;\r\n\r\n    /*\r\n    40. 中断 \r\n    */\r\n#ifdef CONFIG_GENERIC_HARDIRQS \r\n    struct irqaction *irqaction;\r\n#endif\r\n#ifdef CONFIG_TRACE_IRQFLAGS\r\n    unsigned int irq_events;\r\n    int hardirqs_enabled;\r\n    unsigned long hardirq_enable_ip;\r\n    unsigned int hardirq_enable_event;\r\n    unsigned long hardirq_disable_ip;\r\n    unsigned int hardirq_disable_event;\r\n    int softirqs_enabled;\r\n    unsigned long softirq_disable_ip;\r\n    unsigned int softirq_disable_event;\r\n    unsigned long softirq_enable_ip;\r\n    unsigned int softirq_enable_event;\r\n    int hardirq_context;\r\n    int softirq_context;\r\n#endif\r\n     \r\n     /*\r\n     41. pi_lock\r\n     task_rq_lock函数所使用的锁 \r\n     */\r\n    spinlock_t pi_lock;\r\n\r\n#ifdef CONFIG_RT_MUTEXES \r\n    /*\r\n    42. 基于PI协议的等待互斥锁，其中PI指的是priority inheritance/9优先级继承)\r\n    */\r\n    struct plist_head pi_waiters; \r\n    struct rt_mutex_waiter *pi_blocked_on;\r\n#endif\r\n\r\n#ifdef CONFIG_DEBUG_MUTEXES \r\n    /*\r\n    43. blocked_on\r\n    死锁检测\r\n    */\r\n    struct mutex_waiter *blocked_on;\r\n#endif\r\n\r\n/*\r\n    44. lockdep，\r\n*/\r\n#ifdef CONFIG_LOCKDEP\r\n# define MAX_LOCK_DEPTH 48UL\r\n    u64 curr_chain_key;\r\n    int lockdep_depth;\r\n    unsigned int lockdep_recursion;\r\n    struct held_lock held_locks[MAX_LOCK_DEPTH];\r\n    gfp_t lockdep_reclaim_gfp;\r\n#endif\r\n \r\n     /*\r\n     45. journal_info\r\n     JFS文件系统\r\n     */\r\n    void *journal_info;\r\n     \r\n     /*\r\n     46. 
块设备链表\r\n     */\r\n    struct bio *bio_list, **bio_tail; \r\n\r\n    /*\r\n    47. reclaim_state\r\n    内存回收\r\n    */\r\n    struct reclaim_state *reclaim_state;\r\n\r\n    /*\r\n    48. backing_dev_info\r\n    存放块设备I/O数据流量信息\r\n    */\r\n    struct backing_dev_info *backing_dev_info;\r\n\r\n    /*\r\n    49. io_context\r\n    I/O调度器所使用的信息 \r\n    */\r\n    struct io_context *io_context;\r\n\r\n    /*\r\n    50. CPUSET功能 \r\n    */\r\n#ifdef CONFIG_CPUSETS\r\n    nodemask_t mems_allowed;     \r\n    int cpuset_mem_spread_rotor;\r\n#endif\r\n\r\n    /*\r\n    51. Control Groups \r\n    */\r\n#ifdef CONFIG_CGROUPS \r\n    struct css_set *cgroups; \r\n    struct list_head cg_list;\r\n#endif\r\n\r\n    /*\r\n    52. robust_list\r\n    futex同步机制 \r\n    */\r\n#ifdef CONFIG_FUTEX\r\n    struct robust_list_head __user *robust_list;\r\n#ifdef CONFIG_COMPAT\r\n    struct compat_robust_list_head __user *compat_robust_list;\r\n#endif\r\n    struct list_head pi_state_list;\r\n    struct futex_pi_state *pi_state_cache;\r\n#endif \r\n#ifdef CONFIG_PERF_EVENTS\r\n    struct perf_event_context *perf_event_ctxp;\r\n    struct mutex perf_event_mutex;\r\n    struct list_head perf_event_list;\r\n#endif\r\n\r\n    /*\r\n    53. 非一致内存访问(NUMA  Non-Uniform Memory Access)\r\n    */\r\n#ifdef CONFIG_NUMA\r\n    struct mempolicy *mempolicy;    /* Protected by alloc_lock */\r\n    short il_next;\r\n#endif\r\n\r\n    /*\r\n    54. fs_excl\r\n    文件系统互斥资源\r\n    */\r\n    atomic_t fs_excl;\r\n\r\n    /*\r\n    55. rcu\r\n    RCU链表 \r\n    */     \r\n    struct rcu_head rcu;\r\n\r\n    /*\r\n    56. splice_pipe\r\n    管道\r\n    */\r\n    struct pipe_inode_info *splice_pipe;\r\n\r\n    /*\r\n    57. delays\r\n    延迟计数\r\n    */\r\n#ifdef    CONFIG_TASK_DELAY_ACCT\r\n    struct task_delay_info *delays;\r\n#endif\r\n\r\n    /*\r\n    58. make_it_fail\r\n    fault injection\r\n    */\r\n#ifdef CONFIG_FAULT_INJECTION\r\n    int make_it_fail;\r\n#endif\r\n\r\n    /*\r\n    59. 
dirties\r\n    Floating proportions \r\n    */\r\n    struct prop_local_single dirties;\r\n\r\n    /*\r\n    60. Infrastructure for displaying latency \r\n    */\r\n#ifdef CONFIG_LATENCYTOP\r\n    int latency_record_count;\r\n    struct latency_record latency_record[LT_SAVECOUNT];\r\n#endif\r\n     \r\n    /*\r\n    61. time slack values, commonly used by poll and select \r\n    */\r\n    unsigned long timer_slack_ns;\r\n    unsigned long default_timer_slack_ns;\r\n\r\n    /*\r\n    62. scm_work_list\r\n    socket control messages\r\n    */\r\n    struct list_head    *scm_work_list;\r\n\r\n    /*\r\n    63. ftrace tracer\r\n    */\r\n#ifdef CONFIG_FUNCTION_GRAPH_TRACER \r\n    int curr_ret_stack; \r\n    struct ftrace_ret_stack    *ret_stack; \r\n    unsigned long long ftrace_timestamp;  \r\n    atomic_t trace_overrun; \r\n    atomic_t tracing_graph_pause;\r\n#endif\r\n#ifdef CONFIG_TRACING \r\n    unsigned long trace; \r\n    unsigned long trace_recursion;\r\n#endif  \r\n};\r\n```\r\n\r\n### 3 Summary\r\n\r\n(1) A pid is just a number. With pid namespaces, a single process can belong to several pid namespaces at once. The `sleep` in section 2.2, for example, lives in two levels of pid namespaces.\r\n\r\nThe first level is the parent's: the `unshare` process, `bash`, and the real pid-1 init process all share that pid namespace.\r\n\r\nThe second level is the newly created namespace.\r\n\r\nWhen pids are allocated, `sleep` first gets pid 1 in the second-level namespace (we did not enter that namespace, so this pid is not visible from outside), and then gets pid 3995 in the parent namespace.\r\n\r\nThis is what achieves process isolation: inside the second-level namespace the child sees itself as pid 1 and can only see the processes created within that namespace.\r\n\r\n### 4. References\r\n\r\n[容器原理之 - namespace](https://mp.weixin.qq.com/s/FnuOMbWAhLQoiCBA_NFYXA)\r\n\r\n[Linux-进程描述符 task_struct 详解](https://www.cnblogs.com/JohnABC/p/9084750.html)\r\n\r\n"
  },
  {
    "path": "docker/10. 如何下载并二进制编译docker源码.md",
    "content": "* [1\\. 如何下载docker源码](#1-如何下载docker源码)\n* [2\\. docker源码目录解析](#2-docker源码目录解析)\n* [3\\. 二进制编译docker源码-17.05.0版本](#3-二进制编译docker源码-17050版本)\n  * [3\\.1 下载需要编译的源代码](#31-下载需要编译的源代码)\n  * [3\\.2 通过容器编译](#32-通过容器编译)\n* [4\\. 二进制编译docker源码-19.03.9版本](#4-二进制编译docker源码-19039版本)\n  * [4\\.1 docker编译](#41-docker编译)\n  * [4\\.2 dockerd编译](#42-dockerd编译)\n\n### 1. 如何下载docker源码\n\n在下载docker源码的时候，发现有moby、docker-ce与docker-ee三个项目。\n\nDocker是一家公司，docker是它的产品之一。docker-ce是免费的社区版本，docker-ee是商用的企业版本。目前docker-ee没有git repo，docker-ce repo处于废弃状态。\n\nDocker公司将docker进行了开源，开源项目的名字是 moby。\n\n至于为什么这么做，可以参考以下的issue：\n\nhttps://www.zhihu.com/question/58805021\n\nhttps://github.com/moby/moby/pull/32691\n\n所以研究源码直接研究moby就可以了。\n\n### 2. docker源码目录解析\n\n```\n├── AUTHORS\n├── CHANGELOG.md\n├── CONTRIBUTING.md\n├── Dockerfile\n├── Dockerfile.aarch64\n├── Dockerfile.armhf\n├── Dockerfile.ppc64le\n├── Dockerfile.s390x\n├── Dockerfile.simple\n├── Dockerfile.solaris\n├── Dockerfile.windows\n├── LICENSE\n├── MAINTAINERS\n├── Makefile\n├── NOTICE\n├── README.md\n├── ROADMAP.md\n├── VENDORING.md\n├── VERSION\n\n├── api             api目录是docker cli或者第三方软件与docker daemon进行交互的api库，它是HTTP REST API. 
api/types:是被docker client和server共用的一些类型定义，比如多种对象，options, responses等。大部分是手工写的代码，也有部分是通过swagger自动生成的。\n├── builder        docker build dockerfile实现相关代码\n├── cli            Docker命令行接口,定义了docker支持的所有命令。例如docker stop等\n├── client         docker client端（发送http请求）。定义所有命令的client请求\n├── cmd            dockerd命令行实现，docker,dockerd的启动函数\n├── container      和容器相关的数据结构定义，比如容器状态，容器的io,容器的环境变量\n├── contrib        包括脚本，镜像和其它一些有用的工具，并不属于docker发布的一部分，正因为如此，它们可能会过时\n├── daemon         docker daemon实现\n├── distribution   docker镜像仓库相关功能代码，如docker push,docker pull\n├── dockerversion  编译期注入的版本信息，如版本号、git commit\n├── docs           文档相关\n├── experimental   开启docker实验特性的相关文档说明\n├── hack           与编译相关的工具目录\n├── hooks          编译相关的钩子\n├── image          镜像存储相关操作代码\n├── integration-cli  集成测试相关命令行\n├── keys             和测试相关的key\n├── layer            镜像层相关操作代码\n├── libcontainerd    与containerd通信相关lib\n├── man              生成docker手册相关的代码\n├── migrate          用于转换老的镜像层次，主要是转v1\n├── oci              支持oci相关实现（容器运行时标准）\n├── opts             处理命令选项相关\n├── pkg              工具包。处理字符串，url,系统相关信号，锁相关工具\n├── plugin           docker插件处理相关实现\n├── poule.yml        \n├── profiles         linux下安全相关处理,apparmor和seccomp.\n├── project          文档相关\n├── reference        镜像仓库reference管理\n├── registry         镜像仓库相关代码\n├── restartmanager   容器重启策略实现\n├── runconfig        容器运行相关配置操作\n├── vendor           go语言的目录，依赖第三方库目录\n├── vendor.conf\n└── volume           docker volume相关的代码实现\n```\n\n### 3. 
二进制编译docker源码-17.05.0版本\n\n直接看源码肯定会在一些地方卡住，所以最好的办法就是编译源码，通过打日志/调试的方式来确定具体实现细节。\n\n\n#### 3.1 下载需要编译的源代码\n\n这里我是下载的 https://github.com/moby/moby/tree/v17.05.0-ce\n\n```\n# git clone https://github.com/moby/moby.git -b v17.05.0-ce\n```\n\n然后将项目目录调整为： `/home/zoux/data/golang/src/github.com/docker/docker`\n\n#### 3.2 通过容器编译\n\ndocker开发环境本质上是创建一个docker镜像，镜像里包含了docker的所有开发运行环境，本地代码通过挂载的方式放到容器中运行。\n\ndockercore/docker就是官方提供的编译镜像。\n\n```\ndocker run --rm -it --privileged -v /home/zoux/data/golang/src/github.com/docker/docker:/go/src/github.com/docker/docker   dockercore/docker bash\n\n\n## 进去之后可以直接运行该命令进行编译\nroot@ab1bf697b6a6:/go/src/github.com/docker/docker# ./hack/make.sh binary\n\nbundles/17.05.0-ce already exists. Removing.\n\n---> Making bundle: binary (in bundles/17.05.0-ce/binary)\nBuilding: bundles/17.05.0-ce/binary-client/docker-17.05.0-ce\nCreated binary: bundles/17.05.0-ce/binary-client/docker-17.05.0-ce\nBuilding: bundles/17.05.0-ce/binary-daemon/dockerd-17.05.0-ce\nCreated binary: bundles/17.05.0-ce/binary-daemon/dockerd-17.05.0-ce\nCopying nested executables into bundles/17.05.0-ce/binary-daemon\n\n\n## 还可以自己设置tag\nroot@ab1bf697b6a6:/go/src/github.com/docker/docker# export DOCKER_GITCOMMIT=v17.05-zx\nroot@ab1bf697b6a6:/go/src/github.com/docker/docker# ./hack/make.sh binary\n\nbundles/17.05.0-ce already exists. 
Removing.\n\n---> Making bundle: binary (in bundles/17.05.0-ce/binary)\nBuilding: bundles/17.05.0-ce/binary-client/docker-17.05.0-ce\nCreated binary: bundles/17.05.0-ce/binary-client/docker-17.05.0-ce\nBuilding: bundles/17.05.0-ce/binary-daemon/dockerd-17.05.0-ce\nCreated binary: bundles/17.05.0-ce/binary-daemon/dockerd-17.05.0-ce\nCopying nested executables into bundles/17.05.0-ce/binary-daemon\n\nroot@ab1bf697b6a6:/go/src/github.com/docker/docker# ./bundles/17.05.0-ce/binary-daemon/dockerd --version\nDocker version 17.05.0-ce, build v17.05-zx\n\n\n## 下载到本地, 一定要是dockerd-17.05.0-ce，而不是dockerd, dockerd只是一个链接文件\ndocker cp ab1bf697b6a6:/go/src/github.com/docker/docker/bundles/17.05.0-ce/binary-daemon/dockerd-17.05.0-ce /home/zoux/dockerd\n\n\nroot@ab1bf697b6a6:/go/src/github.com/docker/docker/bundles/17.05.0-ce/binary-daemon# ls -l\ntotal 68704\n-rwxr-xr-x 1 root root  8997448 Feb 23 09:12 docker-containerd\n-rwxr-xr-x 1 root root  8448168 Feb 23 09:12 docker-containerd-ctr\n-rw-r--r-- 1 root root       56 Feb 23 09:12 docker-containerd-ctr.md5\n-rw-r--r-- 1 root root       88 Feb 23 09:12 docker-containerd-ctr.sha256\n-rwxr-xr-x 1 root root  3047240 Feb 23 09:12 docker-containerd-shim\n-rw-r--r-- 1 root root       57 Feb 23 09:12 docker-containerd-shim.md5\n-rw-r--r-- 1 root root       89 Feb 23 09:12 docker-containerd-shim.sha256\n-rw-r--r-- 1 root root       52 Feb 23 09:12 docker-containerd.md5\n-rw-r--r-- 1 root root       84 Feb 23 09:12 docker-containerd.sha256\n-rwxr-xr-x 1 root root   772400 Feb 23 09:12 docker-init\n-rw-r--r-- 1 root root       46 Feb 23 09:12 docker-init.md5\n-rw-r--r-- 1 root root       78 Feb 23 09:12 docker-init.sha256\n-rwxr-xr-x 1 root root  2530685 Feb 23 09:12 docker-proxy\n-rw-r--r-- 1 root root       47 Feb 23 09:12 docker-proxy.md5\n-rw-r--r-- 1 root root       79 Feb 23 09:12 docker-proxy.sha256\n-rwxr-xr-x 1 root root  7096504 Feb 23 09:12 docker-runc\n-rw-r--r-- 1 root root       46 Feb 23 09:12 docker-runc.md5\n-rw-r--r-- 1 
root root       78 Feb 23 09:12 docker-runc.sha256\nlrwxrwxrwx 1 root root       18 Feb 23 09:12 dockerd -> dockerd-17.05.0-ce\n-rwxr-xr-x 1 root root 39392304 Feb 23 09:12 dockerd-17.05.0-ce\n-rw-r--r-- 1 root root       53 Feb 23 09:12 dockerd-17.05.0-ce.md5\n-rw-r--r-- 1 root root       85 Feb 23 09:12 dockerd-17.05.0-ce.sha256\n```\n\n### 4. 二进制编译docker源码-19.03.9版本\n\n这个版本和17.05版本的不同在于，docker和dockerd分离了。\n\n在docker v17.06 之后，docker cli 和dockerd分离了，单独拆成了 https://github.com/docker/cli\n\n#### 4.1 docker编译\n\n将该项目下载到 $GOPATH/src/github.com/docker 目录，在有go环境的前提下，直接 `make binary` 就可以编译docker cli源码。\n\n```\nroot:/home/zoux/data/golang/src/github.com/docker/cli# source /home/zouxiang/config  // 设置go环境\nroot:/home/zoux/data/golang/src/github.com/docker/cli# make binary                   // 编译\n\n\nWARNING: you are not in a container.\nUse \"make -f docker.Makefile binary\" or set\nDISABLE_WARN_OUTSIDE_CONTAINER=1 to disable this warning.\n\nPress Ctrl+C now to abort.\n\nWARNING: binary creates a Linux executable. 
Use cross for macOS or Windows.\n./scripts/build/binary\nBuilding statically linked build/docker-linux-amd64\n```\n\n#### 4.2 dockerd编译\n\n同样通过二进制编译。将该项目下载到 $GOPATH/src/github.com/docker 目录，在有go环境的前提下，直接 `./hack/make.sh binary` 就可以编译dockerd源码。\n\n```\nroot:/home/zoux/data/golang/src/github.com/docker/docker# ./hack/make.sh binary\n\n#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n# GITCOMMIT = 811a247d06-unsupported\n# The version you are building is listed as unsupported because\n# there are some files in the git repository that are in an uncommitted state.\n# Commit these changes, or add to .gitignore to remove the -unsupported from the version.\n# Here is the current list:\n M cmd/dockerd/daemon.go\n#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nRemoving bundles/\n\n---> Making bundle: binary (in bundles/binary)\nBuilding: bundles/binary-daemon/dockerd-dev\nGOOS=\"linux\" GOARCH=\"amd64\" GOARM=\"\"\nCreated binary: bundles/binary-daemon/dockerd-dev\n```\n\n该过程可能会遇到报错。比如：\n\nNo package 'devmapper' found\n\nmake binary causes fatal error: btrfs/ioctl.h: No such file or directory\n\n<br>\n\n这是一些基础的包没装好。apt-get 或者yum安装就好了。\n\n```\napt-get install -y libdevmapper-dev\n\napt-get install -y btrfs-progs\napt-get install -y btrfs-progs-dev\n```\n\n"
  },
  {
    "path": "docker/11. dockercli 源码分析-docker run为例.md",
    "content": "* [0\\. 章节目的](#0-章节目的)\n* [1\\. docker run 客户端处理流程](#1-docker-run-客户端处理流程)\n  * [1\\.1 docker 函数入口](#11-docker-函数入口)\n* [2\\. 初始化docker cli客户端](#2-初始化docker-cli客户端)\n* [3\\. 实例化newDockerCommand对象](#3-实例化newdockercommand对象)\n  * [3\\.1 newDockerCommand](#31-newdockercommand)\n  * [3\\.2\\. NewRunCommand](#32-newruncommand)\n  * [3\\.3 runContainer](#33-runcontainer)\n  * [3\\.4 ContainerCreate &amp; ContainerStart](#34-containercreate--containerstart)\n  * [3\\.5 总结](#35-总结)\n\n\n\n### 0. 章节目的\n\n从本章节开始以 docker run nginx ls 为例，从源码角度弄清楚docker run nginx ls的具体过程。\n\n本章节的目的就是弄清楚该命令运行时，docker cli做了什么工作。\n\n```\nroot# docker run nginx ls\nbin\nboot\ndev\ndocker-entrypoint.d\ndocker-entrypoint.sh\netc\nhome\nlib\nlib64\nmedia\nmnt\nopt\nproc\nroot\nrun\nsbin\nsrv\nsys\ntmp\nusr\nvar\n```\n\n<br>\n\n顺便补充一下：\n\n在docker v17.06 之前，docker cli（就是我们经常使用的docker命令）和dockerd的源码是在一起的，都在 https://github.com/moby/moby 项目的cmd目录下。\n\ncmd/docker： 是docker cli的主函数目录\n\ncmd/dockerd： 是dockerd的主函数目录\n\n<br>在docker v17.06 之后，docker cli 和dockerd分离了，单独拆成了 https://github.com/docker/cli\n\n所以，本节基于https://github.com/docker/cli/tree/v19.03.9 进行研究。\n\n将该项目下载到 $GOPATH/src/github.com/docker 目录，在有go环境的前提下，直接 `make binary` 就可以编译源码。\n\n<br>\n\n### 1. 
docker run 客户端处理流程\n\n#### 1.1 docker 函数入口\n\ndocker cli 的 main 函数主要就是调用了 newDockerCommand（dockerd 的 main 函数则在 cmd/dockerd/docker.go）。\n\n```\nfunc runDocker(dockerCli *command.DockerCli) error {\n\ttcmd := newDockerCommand(dockerCli)\n\n\tcmd, args, err := tcmd.HandleGlobalFlags()\n\tif err != nil {\n\t\treturn err\n\t}\n\n\tif err := tcmd.Initialize(); err != nil {\n\t\treturn err\n\t}\n\n\targs, os.Args, err = processAliases(dockerCli, cmd, args, os.Args)\n\tif err != nil {\n\t\treturn err\n\t}\n\n\tif len(args) > 0 {\n\t\tif _, _, err := cmd.Find(args); err != nil {\n\t\t\terr := tryPluginRun(dockerCli, cmd, args[0])\n\t\t\tif !pluginmanager.IsNotFound(err) {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\t// For plugin not found we fall through to\n\t\t\t// cmd.Execute() which deals with reporting\n\t\t\t// \"command not found\" in a consistent way.\n\t\t}\n\t}\n\n\t// We've parsed global args already, so reset args to those\n\t// which remain.\n\tcmd.SetArgs(args)\n\treturn cmd.Execute()\n}\n```\n\n主要干了两件事：\n\n（1）实例化newDockerCommand对象\n\n（2）初始化了docker cli客户端\n\n先看看初始化客户端做了什么。\n\n### 2. 
初始化docker cli客户端\n\nInitialize函数进行了cli客户端的初始化。\n\ndocker是C/S结构的框架，但是client和server基本都在同一台机器上，所以docker使用了unix socket进行进程间通信。这样的好处就是快：省去了TCP/IP协议栈的开销。\n\ndockerd运行起来后，会创建一个socket，默认是 /var/run/docker.sock。基于这个sock文件就可以构造一个客户端，用于交互。\n\ndockerd运行起来后，会在 /var/run 目录增加两个文件：docker.pid（进程编号）和 docker.sock。\n\n可参考：[golang中基于http 和unix socket的通信代码实现（服务端基于gin框架）](https://blog.csdn.net/qq_33399567/article/details/107691339)\n\n<br>\n\n```\n// Initialize the dockerCli runs initialization that must happen after command\n// line flags are parsed.\nfunc (cli *DockerCli) Initialize(opts *cliflags.ClientOptions, ops ...InitializeOpt) error {\n\tvar err error\n\n\tfor _, o := range ops {\n\t\tif err := o(cli); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\tcliflags.SetLogLevel(opts.Common.LogLevel)\n\n\tif opts.ConfigDir != \"\" {\n\t\tcliconfig.SetDir(opts.ConfigDir)\n\t\tlogrus.Errorf(\"zoux Initialize opts.ConfigDir is: %v\", opts.ConfigDir)\n\t}\n\n\tif opts.Common.Debug {\n\t\tdebug.Enable()\n\t}\n\n\tcli.loadConfigFile()\n\n\tbaseContextStore := store.New(cliconfig.ContextStoreDir(), cli.contextStoreConfig)\n\tlogrus.Errorf(\"zoux Initialize baseContextStore is: %v\", baseContextStore)\n\tcli.contextStore = &ContextStoreWithDefault{\n\t\tStore: baseContextStore,\n\t\tResolver: func() (*DefaultContext, error) {\n\t\t\treturn ResolveDefaultContext(opts.Common, cli.ConfigFile(), cli.contextStoreConfig, cli.Err())\n\t\t},\n\t}\n\tcli.currentContext, err = resolveContextName(opts.Common, cli.configFile, cli.contextStore)\n\tif err != nil {\n\t\treturn err\n\t}\n\tcli.dockerEndpoint, err = resolveDockerEndpoint(cli.contextStore, cli.currentContext)\n\tif err != nil {\n\t\treturn errors.Wrap(err, \"unable to resolve docker endpoint\")\n\t}\n\tlogrus.Errorf(\"zoux Initialize dockerEndpoint TLSData is %v: host is %v\", cli.dockerEndpoint.TLSData, cli.dockerEndpoint.Host)\n\n\tif cli.client == nil {\n\t\tcli.client, err = newAPIClientFromEndpoint(cli.dockerEndpoint, cli.configFile)\n\t\tif 
tlsconfig.IsErrEncryptedKey(err) {\n\t\t\tpassRetriever := passphrase.PromptRetrieverWithInOut(cli.In(), cli.Out(), nil)\n\t\t\tnewClient := func(password string) (client.APIClient, error) {\n\t\t\t\tcli.dockerEndpoint.TLSPassword = password\n\t\t\t\treturn newAPIClientFromEndpoint(cli.dockerEndpoint, cli.configFile)\n\t\t\t}\n\t\t\tcli.client, err = getClientWithPassword(passRetriever, newClient)\n\t\t}\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\tlogrus.Errorf(\"zoux Initialize cli.client is %v\", cli.client)\n\n\treturn nil\n}\n```\n\n在上面的核心函数中增加了部分日志，可以看出来：docker cli构建的核心就是利用 /var/run/docker.sock 文件创建了go的客户端。\n\n```\nroot@k8s-node:~# docker run nginx ls\nERRO[0000] zoux initialize opts.configDir is /root/.docker\nERRO[0000] zoux initialize baseContextStore is &{0xc0002f4e80 0xc00005a3a0}\nERRO[0000] zoux initialize dockerEndpoint TLSData is <nil>, host is unix:///var/run/docker.sock\nERRO[0000] zoux initialize cli.client is &{http unix:///var/run/docker.sock unix /var/run/docker.sock  0xc000368720 1.40 map[User-Agent:Docker-Client/unknown-version (linux)] false false false}\nbin\nboot\ndev\ndocker-entrypoint.d\ndocker-entrypoint.sh\netc\nhome\nlib\nlib64\nmedia\nmnt\nopt\nproc\nroot\nrun\nsbin\nsrv\nsys\ntmp\nusr\nvar\n```\n\n### 3. 
实例化newDockerCommand对象\n\n#### 3.1 newDockerCommand\n\n```\nfunc newDockerCommand(dockerCli *command.DockerCli) *cli.TopLevelCommand {\n\tvar (\n\t\topts    *cliflags.ClientOptions\n\t\tflags   *pflag.FlagSet\n\t\thelpCmd *cobra.Command\n\t)\n\n\tcmd := &cobra.Command{\n\t\tUse:              \"docker [OPTIONS] COMMAND [ARG...]\",\n\t\tShort:            \"A self-sufficient runtime for containers\",\n\t\tSilenceUsage:     true,\n\t\tSilenceErrors:    true,\n\t\tTraverseChildren: true,\n\t\tRunE: func(cmd *cobra.Command, args []string) error {\n\t\t\tif len(args) == 0 {\n\t\t\t\treturn command.ShowHelp(dockerCli.Err())(cmd, args)\n\t\t\t}\n\t\t\treturn fmt.Errorf(\"docker: '%s' is not a docker command.\\nSee 'docker --help'\", args[0])\n\n\t\t},\n\t\tPersistentPreRunE: func(cmd *cobra.Command, args []string) error {\n\t\t\treturn isSupported(cmd, dockerCli)\n\t\t},\n\t\tVersion:               fmt.Sprintf(\"%s, build %s\", version.Version, version.GitCommit),\n\t\tDisableFlagsInUseLine: true,\n\t}\n\topts, flags, helpCmd = cli.SetupRootCommand(cmd)\n\tflags.BoolP(\"version\", \"v\", false, \"Print version information and quit\")\n\n\tsetFlagErrorFunc(dockerCli, cmd)\n\n\tsetupHelpCommand(dockerCli, cmd, helpCmd)\n\tsetHelpFunc(dockerCli, cmd)\n\n\tcmd.SetOutput(dockerCli.Out())\n\tcommands.AddCommands(cmd, dockerCli)\n\n\tcli.DisableFlagsInUseLine(cmd)\n\tsetValidateArgs(dockerCli, cmd)\n\n\t// flags must be the top-level command flags, not cmd.Flags()\n\treturn cli.NewTopLevelCommand(cmd, dockerCli, opts, flags)\n}\n```\n\nnewDockerCommand函数的核心就是：\n\n（1）RunE\n\n（2）PersistentPreRunE\n\n（3）commands.AddCommands(cmd, dockerCli)\n\n<br>\n\n**RunE**就是打印help函数，这和实操是一样的。输入docker，后面什么都不带就是打印help。因为docker 本身是不能运行的，后面必须跟子命令。\n\n**PersistentPreRunE**就是判断docker 输入的flags是否支持。\n\n```\nfunc areFlagsSupported(cmd *cobra.Command, details versionDetails) error {\n\terrs := []string{}\n\n\tcmd.Flags().VisitAll(func(f *pflag.Flag) {\n\t\tif !f.Changed {\n\t\t\treturn\n\t\t}\n\t\tif 
!isVersionSupported(f, details.Client().ClientVersion()) {\n\t\t\terrs = append(errs, fmt.Sprintf(`\"--%s\" requires API version %s, but the Docker daemon API version is %s`, f.Name, getFlagAnnotation(f, \"version\"), details.Client().ClientVersion()))\n\t\t\treturn\n\t\t}\n\t\tif !isOSTypeSupported(f, details.ServerInfo().OSType) {\n\t\t\terrs = append(errs, fmt.Sprintf(\n\t\t\t\t`\"--%s\" is only supported on a Docker daemon running on %s, but the Docker daemon is running on %s`,\n\t\t\t\tf.Name,\n\t\t\t\tgetFlagAnnotation(f, \"ostype\"), details.ServerInfo().OSType),\n\t\t\t)\n\t\t\treturn\n\t\t}\n\t\tif _, ok := f.Annotations[\"experimental\"]; ok && !details.ServerInfo().HasExperimental {\n\t\t\terrs = append(errs, fmt.Sprintf(`\"--%s\" is only supported on a Docker daemon with experimental features enabled`, f.Name))\n\t\t}\n\t\tif _, ok := f.Annotations[\"experimentalCLI\"]; ok && !details.ClientInfo().HasExperimental {\n\t\t\terrs = append(errs, fmt.Sprintf(`\"--%s\" is only supported on a Docker cli with experimental cli features enabled`, f.Name))\n\t\t}\n\t\t// buildkit-specific flags are noop when buildkit is not enabled, so we do not add an error in that case\n\t})\n\tif len(errs) > 0 {\n\t\treturn errors.New(strings.Join(errs, \"\\n\"))\n\t}\n\treturn nil\n}\n```\n\n**commands.AddCommands:**  就是增加子命令。\n\n这里我们主要关注 NewContainerCommand，而docker run就对应了NewRunCommand子命令。\n\n```\n// AddCommands adds all the commands from cli/command to the root command\nfunc AddCommands(cmd *cobra.Command, dockerCli command.Cli) {\n\tcmd.AddCommand(\n\t\t// checkpoint\n\t\tcheckpoint.NewCheckpointCommand(dockerCli),\n\n\t\t// config\n\t\tconfig.NewConfigCommand(dockerCli),\n\n\t\t// container\n\t\tcontainer.NewContainerCommand(dockerCli),\n\t\tcontainer.NewRunCommand(dockerCli),\n\n\t\t// image\n\t\timage.NewImageCommand(dockerCli),\n\t\timage.NewBuildCommand(dockerCli),\n\n\t\t// builder\n\t\tbuilder.NewBuilderCommand(dockerCli),\n\n\t\t// 
manifest\n\t\tmanifest.NewManifestCommand(dockerCli),\n\n\t\t// network\n\t\tnetwork.NewNetworkCommand(dockerCli),\n\n\t\t// node\n\t\tnode.NewNodeCommand(dockerCli),\n\n\t\t// plugin\n\t\tplugin.NewPluginCommand(dockerCli),\n\n\t\t// registry\n\t\tregistry.NewLoginCommand(dockerCli),\n\t\tregistry.NewLogoutCommand(dockerCli),\n\t\tregistry.NewSearchCommand(dockerCli),\n\n\t\t// secret\n\t\tsecret.NewSecretCommand(dockerCli),\n\n\t\t// service\n\t\tservice.NewServiceCommand(dockerCli),\n\n\t\t// system\n\t\tsystem.NewSystemCommand(dockerCli),\n\t\tsystem.NewVersionCommand(dockerCli),\n\n\t\t// stack\n\t\tstack.NewStackCommand(dockerCli),\n\n\t\t// swarm\n\t\tswarm.NewSwarmCommand(dockerCli),\n\n\t\t// trust\n\t\ttrust.NewTrustCommand(dockerCli),\n\n\t\t// volume\n\t\tvolume.NewVolumeCommand(dockerCli),\n\n\t\t// context\n\t\tcontext.NewContextCommand(dockerCli),\n\n\t\t// legacy commands may be hidden\n\t\thide(stack.NewTopLevelDeployCommand(dockerCli)),\n\t\thide(system.NewEventsCommand(dockerCli)),\n\t\thide(system.NewInfoCommand(dockerCli)),\n\t\thide(system.NewInspectCommand(dockerCli)),\n\t\thide(container.NewAttachCommand(dockerCli)),\n\t\thide(container.NewCommitCommand(dockerCli)),\n\t\thide(container.NewCopyCommand(dockerCli)),\n\t\thide(container.NewCreateCommand(dockerCli)),\n\t\thide(container.NewDiffCommand(dockerCli)),\n\t\thide(container.NewExecCommand(dockerCli)),\n\t\thide(container.NewExportCommand(dockerCli)),\n\t\thide(container.NewKillCommand(dockerCli)),\n\t\thide(container.NewLogsCommand(dockerCli)),\n\t\thide(container.NewPauseCommand(dockerCli)),\n\t\thide(container.NewPortCommand(dockerCli)),\n\t\thide(container.NewPsCommand(dockerCli)),\n\t\thide(container.NewRenameCommand(dockerCli)),\n\t\thide(container.NewRestartCommand(dockerCli)),\n\t\thide(container.NewRmCommand(dockerCli)),\n\t\thide(container.NewStartCommand(dockerCli)),\n\t\thide(container.NewStatsCommand(dockerCli)),\n\t\thide(container.NewStopCommand(dockerCli)),\n\t\thide(conta
iner.NewTopCommand(dockerCli)),\n\t\thide(container.NewUnpauseCommand(dockerCli)),\n\t\thide(container.NewUpdateCommand(dockerCli)),\n\t\thide(container.NewWaitCommand(dockerCli)),\n\t\thide(image.NewHistoryCommand(dockerCli)),\n\t\thide(image.NewImagesCommand(dockerCli)),\n\t\thide(image.NewImportCommand(dockerCli)),\n\t\thide(image.NewLoadCommand(dockerCli)),\n\t\thide(image.NewPullCommand(dockerCli)),\n\t\thide(image.NewPushCommand(dockerCli)),\n\t\thide(image.NewRemoveCommand(dockerCli)),\n\t\thide(image.NewSaveCommand(dockerCli)),\n\t\thide(image.NewTagCommand(dockerCli)),\n\t)\n\tif runtime.GOOS == \"linux\" {\n\t\t// engine\n\t\tcmd.AddCommand(engine.NewEngineCommand(dockerCli))\n\t}\n}\n```\n\n#### 3.2. NewRunCommand\n\n```\n// NewRunCommand create a new `docker run` command\nfunc NewRunCommand(dockerCli command.Cli) *cobra.Command {\n\tvar opts runOptions\n\tvar copts *containerOptions\n\n\tcmd := &cobra.Command{\n\t\tUse:   \"run [OPTIONS] IMAGE [COMMAND] [ARG...]\",\n\t\tShort: \"Run a command in a new container\",\n\t\tArgs:  cli.RequiresMinArgs(1),\n\t\tRunE: func(cmd *cobra.Command, args []string) error {\n\t\t\tcopts.Image = args[0]\n\t\t\tif len(args) > 1 {\n\t\t\t\tcopts.Args = args[1:]\n\t\t\t}\n\t\t\treturn runRun(dockerCli, cmd.Flags(), &opts, copts)\n\t\t},\n\t}\n\n\tflags := cmd.Flags()\n\tflags.SetInterspersed(false)\n\n\t// These are flags not stored in Config/HostConfig\n\tflags.BoolVarP(&opts.detach, \"detach\", \"d\", false, \"Run container in background and print container ID\")\n\tflags.BoolVar(&opts.sigProxy, \"sig-proxy\", true, \"Proxy received signals to the process\")\n\tflags.StringVar(&opts.name, \"name\", \"\", \"Assign a name to the container\")\n\tflags.StringVar(&opts.detachKeys, \"detach-keys\", \"\", \"Override the key sequence for detaching a container\")\n\n\t// Add an explicit help that doesn't have a `-h` to prevent the conflict\n\t// with hostname\n\tflags.Bool(\"help\", false, \"Print 
usage\")\n\n\tcommand.AddPlatformFlag(flags, &opts.platform)\n\tcommand.AddTrustVerificationFlags(flags, &opts.untrusted, dockerCli.ContentTrustEnabled())\n\tcopts = addFlags(flags)\n\treturn cmd\n}\n```\n\n<br>\n\n和其他命令一样，这里设置了一堆flags，还做了参数校验，比如image必须要有：args[0]就是image，后面的参数才是容器内要执行的命令。\n\n核心就是runRun函数，而runRun的核心是runContainer。\n\n```\nfunc runRun(dockerCli command.Cli, flags *pflag.FlagSet, ropts *runOptions, copts *containerOptions) error {\n\tproxyConfig := dockerCli.ConfigFile().ParseProxyConfig(dockerCli.Client().DaemonHost(), opts.ConvertKVStringsToMapWithNil(copts.env.GetAll()))\n\tnewEnv := []string{}\n\tfor k, v := range proxyConfig {\n\t\tif v == nil {\n\t\t\tnewEnv = append(newEnv, k)\n\t\t} else {\n\t\t\tnewEnv = append(newEnv, fmt.Sprintf(\"%s=%s\", k, *v))\n\t\t}\n\t}\n\tcopts.env = *opts.NewListOptsRef(&newEnv, nil)\n\tcontainerConfig, err := parse(flags, copts, dockerCli.ServerInfo().OSType)\n\t// just in case the parse does not exit\n\tif err != nil {\n\t\treportError(dockerCli.Err(), \"run\", err.Error(), true)\n\t\treturn cli.StatusError{StatusCode: 125}\n\t}\n\tif err = validateAPIVersion(containerConfig, dockerCli.Client().ClientVersion()); err != nil {\n\t\treportError(dockerCli.Err(), \"run\", err.Error(), true)\n\t\treturn cli.StatusError{StatusCode: 125}\n\t}\n\treturn runContainer(dockerCli, ropts, copts, containerConfig)\n}\n```\n\n<br>\n\n#### 3.3 runContainer\n\n从下面的函数逻辑可以看出来，run container分为两个过程：createContainer, ContainerStart。\n\n在`ContainerCreate()`和`ContainerStart()`中分别向daemon发送了create和start命令。下一步，就需要到docker daemon中分析daemon对create和start的处理。\n\n```\ncreateResponse, err := createContainer(ctx, dockerCli, containerConfig, opts.name)\nif err := client.ContainerStart(ctx, createResponse.ID, types.ContainerStartOptions{}); err != nil\n```\n\n<br>\n\n```\n// nolint: gocyclo\nfunc runContainer(dockerCli command.Cli, opts *runOptions, copts *containerOptions, containerConfig *containerConfig) error {\n\tconfig := containerConfig.Config\n\thostConfig := 
containerConfig.HostConfig\n\tstdout, stderr := dockerCli.Out(), dockerCli.Err()\n\tclient := dockerCli.Client()\n\n\tconfig.ArgsEscaped = false\n  \n  // 1.根据配置初始化是否attach、运行的os等\n\tif !opts.detach {\n\t\tif err := dockerCli.In().CheckTty(config.AttachStdin, config.Tty); err != nil {\n\t\t\treturn err\n\t\t}\n\t} else {\n\t\tif copts.attach.Len() != 0 {\n\t\t\treturn errors.New(\"Conflicting options: -a and -d\")\n\t\t}\n\n\t\tconfig.AttachStdin = false\n\t\tconfig.AttachStdout = false\n\t\tconfig.AttachStderr = false\n\t\tconfig.StdinOnce = false\n\t}\n\n\t// Telling the Windows daemon the initial size of the tty during start makes\n\t// a far better user experience rather than relying on subsequent resizes\n\t// to cause things to catch up.\n\tif runtime.GOOS == \"windows\" {\n\t\thostConfig.ConsoleSize[0], hostConfig.ConsoleSize[1] = dockerCli.Out().GetTtySize()\n\t}\n\n\tctx, cancelFun := context.WithCancel(context.Background())\n\tdefer cancelFun()\n\n  // 2.调用createContainer创建container\n\tcreateResponse, err := createContainer(ctx, dockerCli, containerConfig, &opts.createOptions)\n\tif err != nil {\n\t\treportError(stderr, \"run\", err.Error(), true)\n\t\treturn runStartContainerErr(err)\n\t}\n\tif opts.sigProxy {\n\t\tsigc := ForwardAllSignals(ctx, dockerCli, createResponse.ID)\n\t\tdefer signal.StopCatch(sigc)\n\t}\n\n\tvar (\n\t\twaitDisplayID chan struct{}\n\t\terrCh         chan error\n\t)\n\tif !config.AttachStdout && !config.AttachStderr {\n\t\t// Make this asynchronous to allow the client to write to stdin before having to read the ID\n\t\twaitDisplayID = make(chan struct{})\n\t\tgo func() {\n\t\t\tdefer close(waitDisplayID)\n\t\t\tfmt.Fprintln(stdout, createResponse.ID)\n\t\t}()\n\t}\n\tattach := config.AttachStdin || config.AttachStdout || config.AttachStderr\n\tif attach {\n\t\tif opts.detachKeys != \"\" {\n\t\t\tdockerCli.ConfigFile().DetachKeys = opts.detachKeys\n\t\t}\n\n\t\tclose, err := attachContainer(ctx, dockerCli, &errCh, config, 
createResponse.ID)\n\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tdefer close()\n\t}\n\n\tstatusChan := waitExitOrRemoved(ctx, dockerCli, createResponse.ID, copts.autoRemove)\n\n\t//start the container\n\t// 3.调用ContainerStart，运行容器\n\tif err := client.ContainerStart(ctx, createResponse.ID, types.ContainerStartOptions{}); err != nil {\n\t\t// If we have hijackedIOStreamer, we should notify\n\t\t// hijackedIOStreamer we are going to exit and wait\n\t\t// to avoid the terminal are not restored.\n\t\tif attach {\n\t\t\tcancelFun()\n\t\t\t<-errCh\n\t\t}\n\n\t\treportError(stderr, \"run\", err.Error(), false)\n\t\tif copts.autoRemove {\n\t\t\t// wait container to be removed\n\t\t\t<-statusChan\n\t\t}\n\t\treturn runStartContainerErr(err)\n\t}\n\n\tif (config.AttachStdin || config.AttachStdout || config.AttachStderr) && config.Tty && dockerCli.Out().IsTerminal() {\n\t\tif err := MonitorTtySize(ctx, dockerCli, createResponse.ID, false); err != nil {\n\t\t\tfmt.Fprintln(stderr, \"Error monitoring TTY size:\", err)\n\t\t}\n\t}\n\n\tif errCh != nil {\n\t\tif err := <-errCh; err != nil {\n\t\t\tif _, ok := err.(term.EscapeError); ok {\n\t\t\t\t// The user entered the detach escape sequence.\n\t\t\t\treturn nil\n\t\t\t}\n\n\t\t\tlogrus.Debugf(\"Error hijack: %s\", err)\n\t\t\treturn err\n\t\t}\n\t}\n\n\t// Detached mode: wait for the id to be displayed and return.\n\tif !config.AttachStdout && !config.AttachStderr {\n\t\t// Detached mode\n\t\t<-waitDisplayID\n\t\treturn nil\n\t}\n\n\tstatus := <-statusChan\n\tif status != 0 {\n\t\treturn cli.StatusError{StatusCode: status}\n\t}\n\treturn nil\n}\n```\n\n#### 3.4 ContainerCreate & ContainerStart\n\n从下面的代码很容易看出来。ContainerCreate核心逻辑如下：\n\n（1）通过配置获取镜像tag等信息\n\n（2）调用dockercli客户端，创建container。\n\n（3）如果创建失败，并且是因为image的问题，并且 --pull=always或者missing，就先pull image，然后再次创建\n\n```\n--pull\tmissing\tPull image before running (\"always\"|\"missing\"|\"never\")\n```\n\ndocker run参数详见： 
https://docs.docker.com/engine/reference/commandline/run/\n\n<br>\n\n```\nfunc createContainer(ctx context.Context, dockerCli command.Cli, containerConfig *containerConfig, opts *createOptions) (*container.ContainerCreateCreatedBody, error) {\n\tconfig := containerConfig.Config\n\thostConfig := containerConfig.HostConfig\n\tnetworkingConfig := containerConfig.NetworkingConfig\n\tstderr := dockerCli.Err()\n\n\twarnOnOomKillDisable(*hostConfig, stderr)\n\twarnOnLocalhostDNS(*hostConfig, stderr)\n\n\tvar (\n\t\ttrustedRef reference.Canonical\n\t\tnamedRef   reference.Named\n\t)\n\n\tcontainerIDFile, err := newCIDFile(hostConfig.ContainerIDFile)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tdefer containerIDFile.Close()\n\n\tref, err := reference.ParseAnyReference(config.Image)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tif named, ok := ref.(reference.Named); ok {\n\t\tnamedRef = reference.TagNameOnly(named)\n\n\t\tif taggedRef, ok := namedRef.(reference.NamedTagged); ok && !opts.untrusted {\n\t\t\tvar err error\n\t\t\ttrustedRef, err = image.TrustedReference(ctx, dockerCli, taggedRef, nil)\n\t\t\tif err != nil {\n\t\t\t\treturn nil, err\n\t\t\t}\n\t\t\tconfig.Image = reference.FamiliarString(trustedRef)\n\t\t}\n\t}\n\n\t//create the container\n\tresponse, err := dockerCli.Client().ContainerCreate(ctx, config, hostConfig, networkingConfig, opts.name)\n\n\t//if image not found try to pull it\n\tif err != nil {\n\t\tif apiclient.IsErrNotFound(err) && namedRef != nil {\n\t\t\tfmt.Fprintf(stderr, \"Unable to find image '%s' locally\\n\", reference.FamiliarString(namedRef))\n\n\t\t\t// we don't want to write to stdout anything apart from container.ID\n\t\t\tif err := pullImage(ctx, dockerCli, config.Image, opts.platform, stderr); err != nil {\n\t\t\t\treturn nil, err\n\t\t\t}\n\t\t\tif taggedRef, ok := namedRef.(reference.NamedTagged); ok && trustedRef != nil {\n\t\t\t\tif err := image.TagTrusted(ctx, dockerCli, trustedRef, taggedRef); err != nil {\n\t\t\t\t\treturn 
nil, err\n\t\t\t\t}\n\t\t\t}\n\t\t\t// Retry\n\t\t\tvar retryErr error\n\t\t\tresponse, retryErr = dockerCli.Client().ContainerCreate(ctx, config, hostConfig, networkingConfig, opts.name)\n\t\t\tif retryErr != nil {\n\t\t\t\treturn nil, retryErr\n\t\t\t}\n\t\t} else {\n\t\t\treturn nil, err\n\t\t}\n\t}\n\n\tfor _, warning := range response.Warnings {\n\t\tfmt.Fprintf(stderr, \"WARNING: %s\\n\", warning)\n\t}\n\terr = containerIDFile.Write(response.ID)\n\treturn &response, err\n}\n\n```\n\nContainerCreate, ContainerStart 直接就是Post /containers/create 或者/start 请求创建, 运行。\n\n```\n// ContainerCreate creates a new container based in the given configuration.\n// It can be associated with a name, but it's not mandatory.\nfunc (cli *Client) ContainerCreate(ctx context.Context, config *container.Config, hostConfig *container.HostConfig, networkingConfig *network.NetworkingConfig, containerName string) (container.ContainerCreateCreatedBody, error) {\n\tvar response container.ContainerCreateCreatedBody\n\n\tif err := cli.NewVersionError(\"1.25\", \"stop timeout\"); config != nil && config.StopTimeout != nil && err != nil {\n\t\treturn response, err\n\t}\n\n\t// When using API 1.24 and under, the client is responsible for removing the container\n\tif hostConfig != nil && versions.LessThan(cli.ClientVersion(), \"1.25\") {\n\t\thostConfig.AutoRemove = false\n\t}\n\n\tquery := url.Values{}\n\tif containerName != \"\" {\n\t\tquery.Set(\"name\", containerName)\n\t}\n\n\tbody := configWrapper{\n\t\tConfig:           config,\n\t\tHostConfig:       hostConfig,\n\t\tNetworkingConfig: networkingConfig,\n\t}\n\n\tserverResp, err := cli.post(ctx, \"/containers/create\", query, body, nil)\n\tdefer ensureReaderClosed(serverResp)\n\tif err != nil {\n\t\treturn response, err\n\t}\n\n\terr = json.NewDecoder(serverResp.body).Decode(&response)\n\treturn response, err\n}\n\n\n// ContainerStart sends a request to the docker daemon to start a container.\nfunc (cli *Client) ContainerStart(ctx 
context.Context, containerID string, options types.ContainerStartOptions) error {\n\tquery := url.Values{}\n\tif len(options.CheckpointID) != 0 {\n\t\tquery.Set(\"checkpoint\", options.CheckpointID)\n\t}\n\tif len(options.CheckpointDir) != 0 {\n\t\tquery.Set(\"checkpoint-dir\", options.CheckpointDir)\n\t}\n\n\tresp, err := cli.post(ctx, \"/containers/\"+containerID+\"/start\", query, nil, nil)\n\tensureReaderClosed(resp)\n\treturn err\n}\n```\n\n#### 3.5 总结\n\n可以看出来 ContainerCreate和ContainerStart的处理非常简单，就是：\n\n（1）利用 /var/run/docker.sock 文件创建了http客户端\n\n（2）调用cli客户端发送post请求，创建和启动容器\n"
  },
  {
    "path": "docker/12. dockerd源码分析-docker run为例.md",
    "content": "* [0\\. 章节目的](#0-章节目的)\n* [1\\. docker run服务器端处理流程](#1-docker-run服务器端处理流程)\n  * [1\\.1 dockerd 函数入口](#11-dockerd-函数入口)\n  * [1\\.2 runDaemon](#12-rundaemon)\n  * [1\\.3 daemonCli\\.start](#13-daemonclistart)\n  * [1\\.4 NewDaemon](#14-newdaemon)\n  * [1\\.5  dockerd的路由设置 containers](#15--dockerd的路由设置-containers)\n* [2\\. docker create container详细流程分析](#2-docker-create-container详细流程分析)\n  * [2\\.1 postContainersCreate](#21-postcontainerscreate)\n  * [2\\.2 containerCreate](#22-containercreate)\n  * [2\\.3 daemon\\.create](#23-daemoncreate)\n  * [2\\.4  newContainer](#24--newcontainer)\n  * [2\\.5 实验](#25-实验)\n    * [2\\.5\\.1 实验1\\-观察目录变化](#251-实验1-观察目录变化)\n    * [2\\.5\\.2 实验2\\-查看配置](#252-实验2-查看配置)\n  * [2\\.6 总结](#26-总结)\n* [3\\. Docker start container详细流程分析](#3-docker-start-container详细流程分析)\n  * [3\\.1 postContainerExecStart](#31-postcontainerexecstart)\n  * [3\\.2 ContainerStart](#32-containerstart)\n  * [3\\.3 containerStart](#33-containerstart)\n* [4\\. docker start 创建的详细过程](#4-docker-start-创建的详细过程)\n  * [4\\.1 containerd的初始化](#41-containerd的初始化)\n  * [4\\.2 容器的网络设置](#42-容器的网络设置)\n  * [4\\.3 容器的spec设置\\-createSpec函数](#43-容器的spec设置-createspec函数)\n  * [4\\.4 containerd创建容器的详细流程](#44-containerd创建容器的详细流程)\n* [5\\. 总结](#5-总结)\n\n### 0. 章节目的\n\n以 docker run niginx ls为例。从源码角度弄清楚dockerd具体的执行过程。\n\n源码版本：https://github.com/moby/moby/tree/v19.03.9-ce\n\n从上一篇分析中，docker run 其实是分为了container create, container start这两个步骤。\n\n### 1. docker run服务器端处理流程\n\n还是先从docker的main函数可以入手。在安装docker之后。查看docker的配置，发现docker运行没有带任何参数。\n\n```\nroot@k8s-node:~# ps -ef | grep docker\nroot      6164  5604  0 21:11 pts/1    00:00:00 grep docker\nroot     12493     1  0 17:40 ?        
00:01:04 /usr/bin/dockerd\n\nroot@k8s-node:~# cat /usr/lib/systemd/system/docker.service\n[Unit]\nDescription=Docker Application Container Engine\nDocumentation=https://docs.docker.com\nAfter=network-online.target firewalld.service\nWants=network-online.target\n\n[Service]\nType=notify\nExecStart=/usr/bin/dockerd\nExecReload=/bin/kill -s HUP\nLimitNOFILE=infinity\nLimitNPROC=infinity\nLimitCORE=infinity\nTimeoutStartSec=0\nDelegate=yes\nKillMode=process\nRestart=on-failure\nStartLimitBurst=3\nStartLimitInterval=60s\n\n[Install]\nWantedBy=multi-user.target\n```\n\n<br>\n\n#### 1.1 dockerd 函数入口\n\ndockerd main函数在cmd/dockerd/docker.go。还是熟悉的cobra框架，所以直接从newDaemonCommand入手。\n\nnewDaemonCommand在RunE中调用了runDaemon。从上面分析看，这里默认dockerd启动没有flags。在之前配置镜像源的时候，经常在 `/etc/docker/daemon.json` 文件中进行如下配置。这个其实是docker的默认配置文件。\n\n```\nroot@k8s-node:/etc/docker# cat daemon.json\n{\n  \"registry-mirrors\": [\"https://b9pmyelo.mirror.aliyuncs.com\"]\n}\n```\n\n<br>\n\n```\nfunc newDaemonCommand() (*cobra.Command, error) {\n\topts := newDaemonOptions(config.New())\n\n\tcmd := &cobra.Command{\n\t\tUse:           \"dockerd [OPTIONS]\",\n\t\tShort:         \"A self-sufficient runtime for containers.\",\n\t\tSilenceUsage:  true,\n\t\tSilenceErrors: true,\n\t\tArgs:          cli.NoArgs,\n\t\tRunE: func(cmd *cobra.Command, args []string) error {\n\t\t\topts.flags = cmd.Flags()\n\t\t\treturn runDaemon(opts)\n\t\t},\n\t\tDisableFlagsInUseLine: true,\n\t\tVersion:               fmt.Sprintf(\"%s, build %s\", dockerversion.Version, dockerversion.GitCommit),\n\t}\n\tcli.SetupRootCommand(cmd)\n\n\tflags := cmd.Flags()\n\tflags.BoolP(\"version\", \"v\", false, \"Print version information and quit\")\n\t\n\t// 读取默认的配置文件。默认是 /etc/docker/daemon.json\n\tdefaultDaemonConfigFile, err := getDefaultDaemonConfigFile()\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tflags.StringVar(&opts.configFile, \"config-file\", defaultDaemonConfigFile, \"Daemon configuration file\")\n\topts.InstallFlags(flags)\n\tif err := 
installConfigFlags(opts.daemonConfig, flags); err != nil {\n\t\treturn nil, err\n\t}\n\tinstallServiceFlags(flags)\n\n\treturn cmd, nil\n}\n```\n\n<br>\n\n#### 1.2 runDaemon\n\n```\nfunc runDaemon(opts *daemonOptions) error {\n\tdaemonCli := NewDaemonCli()\n\treturn daemonCli.start(opts)\n}\n```\n\n这里主要是  runDaemon -> daemonCli.start。\n\n#### 1.3 daemonCli.start\n\nstart函数的核心逻辑如下：\n\n1. 设置默认的配置，以及从命令行、文件读取配置。从打印出来的日志来看，确实没什么启动参数，基本都是默认值。比如\n\n   默认的docker目录是/var/lib/docker, 默认的sock是const DefaultDockerHost = \"unix:///var/run/docker.sock\"\n\n2. 检查一些配置，比如是否debug模式，是否开启实验模式，是否以root运行等等\n\n3. 创建docker-root目录，默认是 /var/lib/docker\n\n4. 创建docker.pid文件\n\n5. 创建server config\n\n6. 根据config，创建一个server\n\n7. daemon程序可以根据选项监听多个地址，loadListeners负责加载并监听这些地址\n\n8. initContainerD 初始化容器运行时：initContainerD会调用supervisor.Start，进而调用 startContainerd启动containerd，并在/var/run/docker/containerd目录下生成pid和sock文件\n\n9. 初始化pluginStore,实际就是生成一个map用来保存有哪些plugins\n\n10. 初始化Middlewares, http的中间件，这些中间件主要进行版本兼容性检查、添加CORS跨站点请求相关响应头、对请求进行认证。\n\n11. 实例化Daemon对象，做好 server端的一切准备，包括检查网络以及其他环境\n\n12. 实例化metric server\n\n13. docker可能以集群方式运行，创建并启动集群组件\n\n14. 运行 swarm containers \n\n15. 配置路由，包括container,image, driver等等\n\n16. 初始化路由,接下来会分析\n\n17. 
开启服务器，以及通知就绪等等\n\n```\nfunc (cli *DaemonCli) start(opts *daemonOptions) (err error) {\n\tstopc := make(chan bool)\n\tdefer close(stopc)\n\n\t// warn from uuid package when running the daemon\n\tuuid.Loggerf = logrus.Warnf\n\n  // 1.设置默认的配置，以及从命令行、文件读取配置。从打印出来的日志来看，确实没什么启动参数。例如指定了\n  // root is /var/lib/docker, conf.TrustKeyPath is /etc/docker/key.json\n\topts.SetDefaultOptions(opts.flags)\n\t// 增加日志打印输出。用于理解源码\n\tlogrus.Infof(\"zoux start flags.configFile is %v, damonConfig is %v, flags is %v, debug is %v, hosts is %v\", opts.configFile, opts.daemonConfig,opts.Debug, opts.Hosts)\n\n\tif cli.Config, err = loadDaemonCliConfig(opts); err != nil {\n\t\treturn err\n\t}\n\n\tif err := configureDaemonLogs(cli.Config); err != nil {\n\t\treturn err\n\t}\n\n\tlogrus.Info(\"Starting up\")\n\n\tcli.configFile = &opts.configFile\n\tcli.flags = opts.flags\n  \n  // 2.检查一些配置，比如是否debug模式，是否开启实验模式，是否以root运行等等\n\tif cli.Config.Debug {\n\t\tdebug.Enable()\n\t}\n  \n\tif cli.Config.Experimental {\n\t\tlogrus.Warn(\"Running experimental build\")\n\t\tif cli.Config.IsRootless() {\n\t\t\tlogrus.Warn(\"Running in rootless mode. Cgroups, AppArmor, and CRIU are disabled.\")\n\t\t}\n\t\tif rootless.RunningWithRootlessKit() {\n\t\t\tlogrus.Info(\"Running with RootlessKit integration\")\n\t\t\tif !cli.Config.IsRootless() {\n\t\t\t\treturn fmt.Errorf(\"rootless mode needs to be enabled for running with RootlessKit\")\n\t\t\t}\n\t\t}\n\t} else {\n\t\tif cli.Config.IsRootless() {\n\t\t\treturn fmt.Errorf(\"rootless mode is supported only when running in experimental mode\")\n\t\t}\n\t}\n\t// return human-friendly error before creating files\n\tif runtime.GOOS == \"linux\" && os.Geteuid() != 0 {\n\t\treturn fmt.Errorf(\"dockerd needs to be started with root. To see how to run dockerd in rootless mode with unprivileged user, see the documentation\")\n\t}\n\n\tsystem.InitLCOW(cli.Config.Experimental)\n\n\tif err := setDefaultUmask(); err != nil {\n\t\treturn err\n\t}\n   \n  // 3. 
创建docker-root目录文件，默认在 /var/lib/docker目录下\n\t// Create the daemon root before we create ANY other files (PID, or migrate keys)\n\t// to ensure the appropriate ACL is set (particularly relevant on Windows)\n\tif err := daemon.CreateDaemonRoot(cli.Config); err != nil {\n\t\treturn err\n\t}\n\n\tif err := system.MkdirAll(cli.Config.ExecRoot, 0700, \"\"); err != nil {\n\t\treturn err\n\t}\n\n\tpotentiallyUnderRuntimeDir := []string{cli.Config.ExecRoot}\n\n  // 4.创建docker.pid文件\n\tif cli.Pidfile != \"\" {\n\t\tpf, err := pidfile.New(cli.Pidfile)\n\t\tif err != nil {\n\t\t\treturn errors.Wrap(err, \"failed to start daemon\")\n\t\t}\n\t\tpotentiallyUnderRuntimeDir = append(potentiallyUnderRuntimeDir, cli.Pidfile)\n\t\tdefer func() {\n\t\t\tif err := pf.Remove(); err != nil {\n\t\t\t\tlogrus.Error(err)\n\t\t\t}\n\t\t}()\n\t}\n\n\tif cli.Config.IsRootless() {\n\t\t// Set sticky bit if XDG_RUNTIME_DIR is set && the file is actually under XDG_RUNTIME_DIR\n\t\tif _, err := homedir.StickRuntimeDirContents(potentiallyUnderRuntimeDir); err != nil {\n\t\t\t// StickRuntimeDirContents returns nil error if XDG_RUNTIME_DIR is just unset\n\t\t\tlogrus.WithError(err).Warn(\"cannot set sticky bit on files under XDG_RUNTIME_DIR\")\n\t\t}\n\t}\n  \n  // 5.创建sever config\n\tserverConfig, err := newAPIServerConfig(cli)\n\tif err != nil {\n\t\treturn errors.Wrap(err, \"failed to create API server\")\n\t}\n\t// 6.根据config，创建一个sever\n\tcli.api = apiserver.New(serverConfig)\n  \n  // 7.daemon程序可以根据选项监控多个地址，loadListeners遍历这些地址，也监听了多个地址。\n\thosts, err := loadListeners(cli, serverConfig)\n\tif err != nil {\n\t\treturn errors.Wrap(err, \"failed to load listeners\")\n\t}\n\n\tctx, cancel := context.WithCancel(context.Background())\n\t// 8.initcontainerD 初始化容器运行时, initContainerD会调用supervisor.Start然后调用 startContainerd，启动containerd。会在/var/run/docker/containerd目录下，pid和sock文件。\n\twaitForContainerDShutdown, err := cli.initContainerD(ctx)\n\tif waitForContainerDShutdown != nil {\n\t\tdefer 
waitForContainerDShutdown(10 * time.Second)\n\t}\n\tif err != nil {\n\t\tcancel()\n\t\treturn err\n\t}\n\tdefer cancel()\n\n\tsignal.Trap(func() {\n\t\tcli.stop()\n\t\t<-stopc // wait for daemonCli.start() to return\n\t}, logrus.StandardLogger())\n\n\t// Notify that the API is active, but before daemon is set up.\n\tpreNotifySystem()\n  \n  // 9.初始化pluginStore,实际就是生成一个map用来保存有哪些plugins\n\tpluginStore := plugin.NewStore()\n  \n  // 10.初始化Middlewares, http的中间件，这些中间件主要进行版本兼容性检查、添加CORS跨站点请求相关响应头、对请求进行认证。\n\tif err := cli.initMiddlewares(cli.api, serverConfig, pluginStore); err != nil {\n\t\tlogrus.Fatalf(\"Error creating middlewares: %v\", err)\n\t}\n  \n  // 11.实例化Daemon对象，做好 sever端的一切准备，包括检查网络以及其他环境\n\td, err := daemon.NewDaemon(ctx, cli.Config, pluginStore)\n\tif err != nil {\n\t\treturn errors.Wrap(err, \"failed to start daemon\")\n\t}\n\n\td.StoreHosts(hosts)\n\n\t// validate after NewDaemon has restored enabled plugins. Don't change order.\n\tif err := validateAuthzPlugins(cli.Config.AuthorizationPlugins, pluginStore); err != nil {\n\t\treturn errors.Wrap(err, \"failed to validate authorization plugin\")\n\t}\n\n\tcli.d = d\n  \n  // 12. 实例化metric server\n\tif err := cli.startMetricsServer(cli.Config.MetricsAddress); err != nil {\n\t\treturn err\n\t}\n  \n  // 13.docker可能以集群方式运行，开启\n\tc, err := createAndStartCluster(cli, d)\n\tif err != nil {\n\t\tlogrus.Fatalf(\"Error starting cluster component: %v\", err)\n\t}\n\n\t// Restart all autostart containers which has a swarm endpoint\n\t// and is not yet running now that we have successfully\n\t// initialized the cluster.\n\t// 14.运行 swarm containers \n\td.RestartSwarmContainers()\n\n\tlogrus.Info(\"Daemon has completed initialization\")\n  \n  // 15.配置路由，包括contianer,image, driver等等\n\trouterOptions, err := newRouterOptions(cli.Config, d)\n\tif err != nil {\n\t\treturn err\n\t}\n\trouterOptions.api = cli.api\n\trouterOptions.cluster = c\n  \n  // 16.初始化路由,接下里会分析\n\tinitRouter(routerOptions)\n\n\n  // 17. 
开启服务器，以及通知就绪等等\n\tgo d.ProcessClusterNotifications(ctx, c.GetWatchStream())\n\n\tcli.setupConfigReloadTrap()\n\n\t// The serve API routine never exits unless an error occurs\n\t// We need to start it as a goroutine and wait on it so\n\t// daemon doesn't exit\n\tserveAPIWait := make(chan error)\n\tgo cli.api.Wait(serveAPIWait)\n   \n  \n\t// after the daemon is done setting up we can notify systemd api\n\tnotifySystem()\n\n\t// Daemon is fully initialized and handling API traffic\n\t// Wait for serve API to complete\n\terrAPI := <-serveAPIWait\n\tc.Cleanup()\n\n\tshutdownDaemon(d)\n\n\t// Stop notification processing and any background processes\n\tcancel()\n\n\tif errAPI != nil {\n\t\treturn errors.Wrap(errAPI, \"shutting down due to ServeAPI error\")\n\t}\n\n\tlogrus.Info(\"Daemon shutdown complete\")\n\treturn nil\n}\n\n日志输出结果：\nlogrus.Infof(\"zoux start flags.configFile is %v, damonConfig is %v, flags is %v, debug is %v, hosts is %v\", opts.configFile, opts.daemonConfig,opts.Debug, opts.Hosts)\n\nFeb 28 16:56:58 k8s-node dockerd[28021]: time=\"2022-02-28T16:56:58.742186824+08:00\" level=info msg=\"zoux start flags.configFile is /etc/docker/daemon.json, damonConfig is &{{<nil> [] true map[] false []  [] [] 0 0 /var/run/docker.pid false  /var/lib/docker /var/run/docker docker   false  map[]  0xc0005f3bd0 0xc0005f3bd8 15 false []  false false {  }  0 0  {[] [] []} {json-file map[]} {{ } {0.0.0.0  <nil> <nil> true} false true true true true  } {{[]} 1500} {[] [] []} {0 0} map[] false []  false map[] {{false [] } {<nil> <nil>}} moby plugins.moby} {map[] runc }  false  map[] 0 0 -500 false   67108864 false private  false}, flags is false, debug is [], hosts is %!v(MISSING)\"\n```\n\n**initContainerD**会调用supervisor.Start然后调用 startContainerd，启动containerd。会在/var/run/docker/containerd目录下，pid和sock文件。\n\n```\nroot@k8s-node:/var/run/docker/containerd# ls\n0ea51049a3dde9b6ca6940f563b920997cc7ff05425bfe5174f2fbced72a9feb  
7f1343294ac385c400b076a0d0c62979909cede65e90b2a0d8615ddba36c19cd  containerd-debug.sock  containerd.toml  daemon\nroot@k8s-node:/var/run/docker/containerd#\nroot@k8s-node:/var/run/docker/containerd#\nroot@k8s-node:/var/run/docker/containerd# systemctl start docker.service\nroot@k8s-node:/var/run/docker/containerd#\nroot@k8s-node:/var/run/docker/containerd# ls\n492a9c1152120b8eafd70c476a04aa7d73b8ec359fbf01c55e55b70912872dfe  702cc9e5c234374195375cdc05bf34eb0484221f9da3e4288d2c37154f2325bd  containerd-debug.sock  containerd.pid  containerd.sock  containerd.toml  daemon\nroot@k8s-node:/var/run/docker/containerd# ls\n```\n\n<br>\n\n```\nfunc (cli *DaemonCli) initContainerD(ctx context.Context) (func(time.Duration) error, error) {\n\tvar waitForShutdown func(time.Duration) error\n\tif cli.Config.ContainerdAddr == \"\" {\n\t\tsystemContainerdAddr, ok, err := systemContainerdRunning(honorXDG)\n\t\tif err != nil {\n\t\t\treturn nil, errors.Wrap(err, \"could not determine whether the system containerd is running\")\n\t\t}\n\t\tif !ok {\n\t\t\tlogrus.Debug(\"Containerd not running, starting daemon managed containerd\")\n\t\t\topts, err := cli.getContainerdDaemonOpts()\n\t\t\tif err != nil {\n\t\t\t\treturn nil, errors.Wrap(err, \"failed to generate containerd options\")\n\t\t\t}\n\n\t\t\tr, err := supervisor.Start(ctx, filepath.Join(cli.Config.Root, \"containerd\"), filepath.Join(cli.Config.ExecRoot, \"containerd\"), opts...)\n\t\t\tif err != nil {\n\t\t\t\treturn nil, errors.Wrap(err, \"failed to start containerd\")\n\t\t\t}\n\t\t\tlogrus.Debug(\"Started daemon managed containerd\")\n\t\t\tcli.Config.ContainerdAddr = r.Address()\n\n\t\t\t// Try to wait for containerd to shutdown\n\t\t\twaitForShutdown = r.WaitTimeout\n\t\t} else {\n\t\t\tcli.Config.ContainerdAddr = systemContainerdAddr\n\t\t}\n\t}\n\n\treturn waitForShutdown, nil\n}\n```\n\n#### 1.4 NewDaemon\n\nNewDaemon核心就是为了接下来开启 
服务端路由做准备。包括\n\n（1）环境的检测调整\n\n（2）用户空间重映射特性\n\n（3）对存储目录进行必要的权限调整、对daemon进程的`oom_score_adj`参数进行必要的调整（减小daemon进程被OS杀掉的可能性）、创建临时目录。\n\n（4）调整进程的最大线程数限制、安装AppArmor相关的配置\n\n（5）创建并初始化了一系列与镜像存储相关的目录及Store，有以下几个：\n\n`/var/lib/docker/containers` 这个目录用来记录容器相关的信息，每运行一个容器，就在这个目录下面生成一个容器Id对应的子目录\n\n`/var/lib/docker/image/${graphDriverName}/layerdb` 这个目录是用来记录layer元数据的\n\n`/var/lib/docker/image/${graphDriverName}/imagedb` 这个目录是用来记录镜像元数据的\n\n`/var/lib/docker/image/${graphDriverName}/distribution` 这个目录用来记录layer元数据与镜像元数据之间的关联关系\n\n`/var/lib/docker/image/${graphDriverName}/repositories.json` 这个文件是用来记录镜像仓库元数据的\n\n`/var/lib/docker/trust` 这个目录用来放一些证书文件\n\n`/var/lib/docker/volumes` 这个目录是用来记录卷元数据的\n\n（6）如果配置了在集群中向外发布的访问地址，则需要初始化集群节点的服务发现Agent。一般来说就是定时向KV库报告自身的状态及公布访问地址\n\n（7）再然后就是给Daemon对象的一系列属性赋上值。\n\n（8）确保插件系统初始化完毕，然后根据`/var/lib/docker/containers`目录里容器目录还原部分容器、初始化容器依赖的网络环境，初始化容器之间的link关系等。\n\n具体每一步对应了什么，看代码和注释就知道了。代码位置在：daemon/daemon.go\n\n#### 1.5  dockerd的路由设置 containers\n\n在1.3的第16步中，initRouter就是负责设置路由规则的。可以看出来包括image, container, plugins等等。这里我们只关注container路由。\n\n```\nfunc initRouter(opts routerOptions) {\n\t。。。\n\trouters := []router.Router{\n\t\t// we need to add the checkpoint router before the container router or the DELETE gets masked\n\t\tcheckpointrouter.NewRouter(opts.daemon, decoder),\n\t\tcontainer.NewRouter(opts.daemon, decoder, opts.daemon.RawSysInfo().CgroupUnified),\n\t\timage.NewRouter(opts.daemon.ImageService()),\n\t\tsystemrouter.NewRouter(opts.daemon, opts.cluster, opts.buildkit, opts.features),\n\t\tvolume.NewRouter(opts.daemon.VolumesService()),\n\t\tbuild.NewRouter(opts.buildBackend, opts.daemon, opts.features),\n\t\tsessionrouter.NewRouter(opts.sessionManager),\n\t\tswarmrouter.NewRouter(opts.cluster),\n\t\tpluginrouter.NewRouter(opts.daemon.PluginManager()),\n\t\tdistributionrouter.NewRouter(opts.daemon.ImageService()),\n\t}\n\n\t。。。\n\topts.api.InitRouter(routers...)\n}\n```\n\n上述所有的路由实现都对应在 api/server/router目录。\n\n可以看出来：\n\ncontainer create: 对应了 
r.postContainersCreate 这个实现函数 \n\ncontainer start: 对应了 r.postContainersStart 这个实现函数（从下面的路由表可见，/exec/{name:.*}/start 对应的才是 r.postContainerExecStart）\n\n```\napi/server/router/container/container.go\n\n// NewRouter initializes a new container router\nfunc NewRouter(b Backend, decoder httputils.ContainerDecoder, cgroup2 bool) router.Router {\n\tr := &containerRouter{\n\t\tbackend: b,\n\t\tdecoder: decoder,\n\t\tcgroup2: cgroup2,\n\t}\n\tr.initRoutes()\n\treturn r\n}\n\n// Routes returns the available routes to the container controller\nfunc (r *containerRouter) Routes() []router.Route {\n\treturn r.routes\n}\n\n// initRoutes initializes the routes in container router\nfunc (r *containerRouter) initRoutes() {\n\tr.routes = []router.Route{\n\t\t// HEAD\n\t\trouter.NewHeadRoute(\"/containers/{name:.*}/archive\", r.headContainersArchive),\n\t\t// GET\n\t\trouter.NewGetRoute(\"/containers/json\", r.getContainersJSON),\n\t\trouter.NewGetRoute(\"/containers/{name:.*}/export\", r.getContainersExport),\n\t\trouter.NewGetRoute(\"/containers/{name:.*}/changes\", r.getContainersChanges),\n\t\trouter.NewGetRoute(\"/containers/{name:.*}/json\", r.getContainersByName),\n\t\trouter.NewGetRoute(\"/containers/{name:.*}/top\", r.getContainersTop),\n\t\trouter.NewGetRoute(\"/containers/{name:.*}/logs\", r.getContainersLogs),\n\t\trouter.NewGetRoute(\"/containers/{name:.*}/stats\", r.getContainersStats),\n\t\trouter.NewGetRoute(\"/containers/{name:.*}/attach/ws\", r.wsContainersAttach),\n\t\trouter.NewGetRoute(\"/exec/{id:.*}/json\", r.getExecByID),\n\t\trouter.NewGetRoute(\"/containers/{name:.*}/archive\", r.getContainersArchive),\n\t\t// POST\n\t\t//r.postContainersCreate 这个是 container create的实现函数\n\t\trouter.NewPostRoute(\"/containers/create\", r.postContainersCreate),\n\t\trouter.NewPostRoute(\"/containers/{name:.*}/kill\", r.postContainersKill),\n\t\trouter.NewPostRoute(\"/containers/{name:.*}/pause\", r.postContainersPause),\n\t\trouter.NewPostRoute(\"/containers/{name:.*}/unpause\", 
r.postContainersUnpause),\n\t\trouter.NewPostRoute(\"/containers/{name:.*}/restart\", r.postContainersRestart),\n\t\trouter.NewPostRoute(\"/containers/{name:.*}/start\", r.postContainersStart),\n\t\trouter.NewPostRoute(\"/containers/{name:.*}/stop\", r.postContainersStop),\n\t\trouter.NewPostRoute(\"/containers/{name:.*}/wait\", r.postContainersWait),\n\t\trouter.NewPostRoute(\"/containers/{name:.*}/resize\", r.postContainersResize),\n\t\trouter.NewPostRoute(\"/containers/{name:.*}/attach\", r.postContainersAttach),\n\t\trouter.NewPostRoute(\"/containers/{name:.*}/copy\", r.postContainersCopy), // Deprecated since 1.8, Errors out since 1.12\n\t\trouter.NewPostRoute(\"/containers/{name:.*}/exec\", r.postContainerExecCreate),\n\t\trouter.NewPostRoute(\"/exec/{name:.*}/start\", r.postContainerExecStart),\n\t\trouter.NewPostRoute(\"/exec/{name:.*}/resize\", r.postContainerExecResize),\n\t\trouter.NewPostRoute(\"/containers/{name:.*}/rename\", r.postContainerRename),\n\t\trouter.NewPostRoute(\"/containers/{name:.*}/update\", r.postContainerUpdate),\n\t\trouter.NewPostRoute(\"/containers/prune\", r.postContainersPrune),\n\t\trouter.NewPostRoute(\"/commit\", r.postCommit),\n\t\t// PUT\n\t\trouter.NewPutRoute(\"/containers/{name:.*}/archive\", r.putContainersArchive),\n\t\t// DELETE\n\t\trouter.NewDeleteRoute(\"/containers/{name:.*}\", r.deleteContainers),\n\t}\n}\n```\n\n### 2. docker create container详细流程分析\n\ndocker create container在后端调用的是postContainersCreate，首先从源码角度分析详细流程\n\n#### 2.1 postContainersCreate \n\npostContainersCreate 函数逻辑如下：\n\n1. 对request进行校验\n2. 从表单获取contaienr name\n3. 获取容器hostConfig， 网络config等配置\n4. 
传入配置信息，调用ContainerCreate进一步创建容器\n\n看起来核心是backend.ContainerCreate 函数\n\n```\nfunc (s *containerRouter) postContainersCreate(ctx context.Context, w http.ResponseWriter, r *http.Request, vars map[string]string) error {\n   // 1.对request进行校验\n\tif err := httputils.ParseForm(r); err != nil {\n\t\treturn err\n\t}\n\tif err := httputils.CheckForJSON(r); err != nil {\n\t\treturn err\n\t}\n  \n  // 2.从表单获取contaienr name\n\tname := r.Form.Get(\"name\")\n \n // 3.获取容器hostConfig， 网络config等配置\n\tconfig, hostConfig, networkingConfig, err := s.decoder.DecodeConfig(r.Body)\n\tif err != nil {\n\t\treturn err\n\t}\n\tversion := httputils.VersionFromContext(ctx)\n\tadjustCPUShares := versions.LessThan(version, \"1.19\")\n   \n\t// When using API 1.24 and under, the client is responsible for removing the container\n\tif hostConfig != nil && versions.LessThan(version, \"1.25\") {\n\t\thostConfig.AutoRemove = false\n\t}\n\n\tif hostConfig != nil && versions.LessThan(version, \"1.40\") {\n\t\t// Ignore BindOptions.NonRecursive because it was added in API 1.40.\n\t\tfor _, m := range hostConfig.Mounts {\n\t\t\tif bo := m.BindOptions; bo != nil {\n\t\t\t\tbo.NonRecursive = false\n\t\t\t}\n\t\t}\n\t\t// Ignore KernelMemoryTCP because it was added in API 1.40.\n\t\thostConfig.KernelMemoryTCP = 0\n\n\t\t// Ignore Capabilities because it was added in API 1.40.\n\t\thostConfig.Capabilities = nil\n\n\t\t// Older clients (API < 1.40) expects the default to be shareable, make them happy\n\t\tif hostConfig.IpcMode.IsEmpty() {\n\t\t\thostConfig.IpcMode = container.IpcMode(\"shareable\")\n\t\t}\n\t}\n\n\tif hostConfig != nil && hostConfig.PidsLimit != nil && *hostConfig.PidsLimit <= 0 {\n\t\t// Don't set a limit if either no limit was specified, or \"unlimited\" was\n\t\t// explicitly set.\n\t\t// Both `0` and `-1` are accepted as \"unlimited\", and historically any\n\t\t// negative value was accepted, so treat those as \"unlimited\" as well.\n\t\thostConfig.PidsLimit = nil\n\t}\n  \n  // 
4.传入配置信息，调用ContainerCreate进一步创建容器\n\tccr, err := s.backend.ContainerCreate(types.ContainerCreateConfig{\n\t\tName:             name,\n\t\tConfig:           config,\n\t\tHostConfig:       hostConfig,\n\t\tNetworkingConfig: networkingConfig,\n\t\tAdjustCPUShares:  adjustCPUShares,\n\t})\n\tif err != nil {\n\t\treturn err\n\t}\n\n\treturn httputils.WriteJSON(w, http.StatusCreated, ccr)\n}\n```\n\nbackend.ContainerCreate最终调用的是 daemon.ContainerCreate\n\n```\ndaemon/create.go\n\n// ContainerCreate creates a regular container\nfunc (daemon *Daemon) ContainerCreate(params types.ContainerCreateConfig) (containertypes.ContainerCreateCreatedBody, error) {\n   return daemon.containerCreate(createOpts{\n      params:                  params,\n      managed:                 false,\n      ignoreImagesArgsEscaped: false})\n}\n```\n\n<br>\n\n#### 2.2 containerCreate\n\ncontainerCreate的核心逻辑如下：\n\n1. 一开始记录时间，是统计耗时用的；从接下来的返回条件可以看出，中间是做一系列的验证\n2. 如果指定了镜像，就调用imageService.GetImage获取 image对象。这里只是为了获取镜像信息，如果镜像不存在也并不会拉取。原因是客户端docker会拉取镜像再重试\n3. 修改hostconfig的不正常值，例如CPUShares、Memory\n4. 继续调用daemon.create创建容器\n5. 记录已经创建容器的时间\n\n```\nfunc (daemon *Daemon) containerCreate(opts createOpts) (containertypes.ContainerCreateCreatedBody, error) {\n\tstart := time.Now()\n\tif opts.params.Config == nil {\n\t\treturn containertypes.ContainerCreateCreatedBody{}, errdefs.InvalidParameter(errors.New(\"Config cannot be empty in order to create a container\"))\n\t}\n\n\tos := runtime.GOOS\n\tif opts.params.Config.Image != \"\" {\n\t\timg, err := daemon.imageService.GetImage(opts.params.Config.Image)\n\t\tif err == nil {\n\t\t\tos = img.OS\n\t\t}\n\t} else {\n\t\t// This mean scratch. On Windows, we can safely assume that this is a linux\n\t\t// container. 
On other platforms, it's the host OS (which it already is)\n\t\tif runtime.GOOS == \"windows\" && system.LCOWSupported() {\n\t\t\tos = \"linux\"\n\t\t}\n\t}\n\n\twarnings, err := daemon.verifyContainerSettings(os, opts.params.HostConfig, opts.params.Config, false)\n\tif err != nil {\n\t\treturn containertypes.ContainerCreateCreatedBody{Warnings: warnings}, errdefs.InvalidParameter(err)\n\t}\n\n\terr = verifyNetworkingConfig(opts.params.NetworkingConfig)\n\tif err != nil {\n\t\treturn containertypes.ContainerCreateCreatedBody{Warnings: warnings}, errdefs.InvalidParameter(err)\n\t}\n\n\tif opts.params.HostConfig == nil {\n\t\topts.params.HostConfig = &containertypes.HostConfig{}\n\t}\n\terr = daemon.adaptContainerSettings(opts.params.HostConfig, opts.params.AdjustCPUShares)\n\tif err != nil {\n\t\treturn containertypes.ContainerCreateCreatedBody{Warnings: warnings}, errdefs.InvalidParameter(err)\n\t}\n\n\tcontainer, err := daemon.create(opts)\n\tif err != nil {\n\t\treturn containertypes.ContainerCreateCreatedBody{Warnings: warnings}, err\n\t}\n\tcontainerActions.WithValues(\"create\").UpdateSince(start)\n\n\tif warnings == nil {\n\t\twarnings = make([]string, 0) // Create an empty slice to avoid https://github.com/moby/moby/issues/38222\n\t}\n\n\treturn containertypes.ContainerCreateCreatedBody{ID: container.ID, Warnings: warnings}, nil\n}\n```\n\n<br>\n\n#### 2.3 daemon.create\n\ncreate主要逻辑如下：\n\n1. 定义一些全局变量\n2. 看起来还是只调用GetImage获取镜像信息，没有pull\n3. 根据镜像信息，再一次校验信息是否有误\n4. 调用daemon.newContainer创建容器\n5. 判断是否设置容器特权。 noNewPrivileges：设置为true后可以防止进程获取额外的权限(如使得suid和文件capabilities失效)，该标记位在内核4.10版本之后可以在/proc/$pid/status中查看NoNewPrivs的设置值。更多参见 https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt\n6. 为容器设置可读写（RW）layer层\n7. 以 root uid gid的属性创建目录，在/var/lib/docker/containers目录下创建容器文件，并在容器文件下创建checkpoints目录\n8. 根据特定的OS完成容器相关设置，比如在默认路径下创建volume（这些特性和os有关）\n9. 设置网络\n10. 
更新网络\n\n```\n// Create creates a new container from the given configuration with a given name.\nfunc (daemon *Daemon) create(opts createOpts) (retC *container.Container, retErr error) {\n\t // 1. 定义一些全局变量\n\tvar (\n\t\tcontainer *container.Container\n\t\timg       *image.Image\n\t\timgID     image.ID\n\t\terr       error\n\t)\n\n\tos := runtime.GOOS\n\t// 2. getImages 获取镜像信息\n\tif opts.params.Config.Image != \"\" {\n\t\timg, err = daemon.imageService.GetImage(opts.params.Config.Image)\n\t\tif err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t\tif img.OS != \"\" {\n\t\t\tos = img.OS\n\t\t} else {\n\t\t\t// default to the host OS except on Windows with LCOW\n\t\t\tif runtime.GOOS == \"windows\" && system.LCOWSupported() {\n\t\t\t\tos = \"linux\"\n\t\t\t}\n\t\t}\n\t\timgID = img.ID()\n\n\t\tif runtime.GOOS == \"windows\" && img.OS == \"linux\" && !system.LCOWSupported() {\n\t\t\treturn nil, errors.New(\"operating system on which parent image was created is not Windows\")\n\t\t}\n\t} else {\n\t\tif runtime.GOOS == \"windows\" {\n\t\t\tos = \"linux\" // 'scratch' case.\n\t\t}\n\t}\n\n\t// On WCOW, if are not being invoked by the builder to create this container (where\n\t// ignoreImagesArgEscaped will be true) - if the image already has its arguments escaped,\n\t// ensure that this is replicated across to the created container to avoid double-escaping\n\t// of the arguments/command line when the runtime attempts to run the container.\n\tif os == \"windows\" && !opts.ignoreImagesArgsEscaped && img != nil && img.RunConfig().ArgsEscaped {\n\t\topts.params.Config.ArgsEscaped = true\n\t}\n  \n  // 3.根据镜像信息，再一次校验信息是否有误\n\tif err := daemon.mergeAndVerifyConfig(opts.params.Config, img); err != nil {\n\t\treturn nil, errdefs.InvalidParameter(err)\n\t}\n\n\tif err := daemon.mergeAndVerifyLogConfig(&opts.params.HostConfig.LogConfig); err != nil {\n\t\treturn nil, errdefs.InvalidParameter(err)\n\t}\n  \n  // 4.调用daemon.newContainer创建容器\n\tif container, err = 
daemon.newContainer(opts.params.Name, os, opts.params.Config, opts.params.HostConfig, imgID, opts.managed); err != nil {\n\t\treturn nil, err\n\t}\n\tdefer func() {\n\t\tif retErr != nil {\n\t\t\tif err := daemon.cleanupContainer(container, true, true); err != nil {\n\t\t\t\tlogrus.Errorf(\"failed to cleanup container on create error: %v\", err)\n\t\t\t}\n\t\t}\n\t}()\n  \n  // 5. 判断是否设置容器特权。 noNewPrivileges：设置为true后可以防止进程获取额外的权限(如使得suid和文件capabilities失效)，该标记位在内核4.10版本\n  // 之后可以在/proc/$pid/status中查看NoNewPrivs的设置值。更多参见 https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt\n\tif err := daemon.setSecurityOptions(container, opts.params.HostConfig); err != nil {\n\t\treturn nil, err\n\t}\n\n\tcontainer.HostConfig.StorageOpt = opts.params.HostConfig.StorageOpt\n\n\t// Fixes: https://github.com/moby/moby/issues/34074 and\n\t// https://github.com/docker/for-win/issues/999.\n\t// Merge the daemon's storage options if they aren't already present. We only\n\t// do this on Windows as there's no effective sandbox size limit other than\n\t// physical on Linux.\n\tif runtime.GOOS == \"windows\" {\n\t\tif container.HostConfig.StorageOpt == nil {\n\t\t\tcontainer.HostConfig.StorageOpt = make(map[string]string)\n\t\t}\n\t\tfor _, v := range daemon.configStore.GraphOptions {\n\t\t\topt := strings.SplitN(v, \"=\", 2)\n\t\t\tif _, ok := container.HostConfig.StorageOpt[opt[0]]; !ok {\n\t\t\t\tcontainer.HostConfig.StorageOpt[opt[0]] = opt[1]\n\t\t\t}\n\t\t}\n\t}\n   \n  // 6. 为容器设置可读写（RW）layer层\n\t// Set RWLayer for container after mount labels have been set\n\trwLayer, err := daemon.imageService.CreateLayer(container, setupInitLayer(daemon.idMapping))\n\tif err != nil {\n\t\treturn nil, errdefs.System(err)\n\t}\n\tcontainer.RWLayer = rwLayer\n\n\trootIDs := daemon.idMapping.RootPair()\n  \n  // 7. 
以 root uid gid的属性创建目录，在/var/lib/docker/containers目录下创建容器文件，并在容器文件下创建checkpoints目录\n\tif err := idtools.MkdirAndChown(container.Root, 0700, rootIDs); err != nil {\n\t\treturn nil, err\n\t}\n\tif err := idtools.MkdirAndChown(container.CheckpointDir(), 0700, rootIDs); err != nil {\n\t\treturn nil, err\n\t}\n\n\tif err := daemon.setHostConfig(container, opts.params.HostConfig); err != nil {\n\t\treturn nil, err\n\t}\n  \n  // 8. 根据特定的OS创建容器，比如默认路径已经创建volume（这些特性和os有关）\n\tif err := daemon.createContainerOSSpecificSettings(container, opts.params.Config, opts.params.HostConfig); err != nil {\n\t\treturn nil, err\n\t}\n\n   // 9.设置网络\n\tvar endpointsConfigs map[string]*networktypes.EndpointSettings\n\tif opts.params.NetworkingConfig != nil {\n\t\tendpointsConfigs = opts.params.NetworkingConfig.EndpointsConfig\n\t}\n\t// Make sure NetworkMode has an acceptable value. We do this to ensure\n\t// backwards API compatibility.\n\trunconfig.SetDefaultNetModeIfBlank(container.HostConfig)\n  \n   // 10.更新网络\n\tdaemon.updateContainerNetworkSettings(container, endpointsConfigs)\n\tif err := daemon.Register(container); err != nil {\n\t\treturn nil, err\n\t}\n\tstateCtr.set(container.ID, \"stopped\")\n\tdaemon.LogContainerEvent(container, \"create\")\n\treturn container, nil\n}\n```\n\n<br>\n\n接下来继续看看第四步，daemon.newContainer做了什么\n\n#### 2.4  newContainer\n\n可以看出来new container只是创建容器这个对象。具体就是给对象赋值。而创建目录啥的在createContainerOSSpecificSettings做了\n\n```\nfunc (daemon *Daemon) newContainer(name string, operatingSystem string, config *containertypes.Config, hostConfig *containertypes.HostConfig, imgID image.ID, managed bool) (*container.Container, error) {\n\tvar (\n\t\tid             string\n\t\terr            error\n\t\tnoExplicitName = name == \"\"\n\t)\n\tid, name, err = daemon.generateIDAndName(name)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\tif hostConfig.NetworkMode.IsHost() {\n\t\tif config.Hostname == \"\" {\n\t\t\tconfig.Hostname, err = os.Hostname()\n\t\t\tif err != nil 
{\n\t\t\t\treturn nil, errdefs.System(err)\n\t\t\t}\n\t\t}\n\t} else {\n\t\tdaemon.generateHostname(id, config)\n\t}\n\tentrypoint, args := daemon.getEntrypointAndArgs(config.Entrypoint, config.Cmd)\n\n\tbase := daemon.newBaseContainer(id)\n\tbase.Created = time.Now().UTC()\n\tbase.Managed = managed\n\tbase.Path = entrypoint\n\tbase.Args = args //FIXME: de-duplicate from config\n\tbase.Config = config\n\tbase.HostConfig = &containertypes.HostConfig{}\n\tbase.ImageID = imgID\n\tbase.NetworkSettings = &network.Settings{IsAnonymousEndpoint: noExplicitName}\n\tbase.Name = name\n\tbase.Driver = daemon.imageService.GraphDriverForOS(operatingSystem)\n\tbase.OS = operatingSystem\n\treturn base, err\n}\n```\n\n#### 2.5 实验\n\n##### 2.5.1 实验1-观察目录变化\n\n在执行 `docker container create --name nginx nginx` 命令的过程中，时刻观察/var/lib/docker的变化，发现在create的阶段镜像文件以及挂载都准备好了。\n\n如果nginx镜像不存在，可以看到下载进行的整个过程。\n\n```\n03/03/22 15:08 /var/lib/docker/tmp/ GetImageBlob163641926 CREATE\n03/03/22 15:08 /var/lib/docker/tmp/ GetImageBlob307902189 CREATE\n03/03/22 15:08 /var/lib/docker/tmp/ GetImageBlob256086888 CREATE\n03/03/22 15:08 /var/lib/docker/tmp/ GetImageBlob630460839 CREATE\n03/03/22 15:08 /var/lib/docker/tmp/ GetImageBlob086739162 CREATE\n03/03/22 15:08 /var/lib/docker/tmp/ GetImageBlob105444465 CREATE\n```\n\n<br>\n\n```\nroot@k8s-node:~# inotifywait -mrq --timefmt '%d/%m/%y %H:%M' --format '%T %w %f %e' -e modify,delete,create,attrib /var/lib/docker\n\n03/03/22 11:44 /var/lib/docker/overlay2/ 4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init CREATE,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/l/ GRIVPJLK7YAT3OXDTS4V2QFCUA CREATE\n03/03/22 11:44 /var/lib/docker/overlay2/7a25fdc447cb19682434e15e2a721250a869eb3a75aa8d439bbd985e736f8ef4/ committed MODIFY\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/ merged CREATE,ISDIR\n03/03/22 11:44 
/var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/ work CREATE,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/ work ATTRIB,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/ work ATTRIB,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/work/  ATTRIB,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/ diff ATTRIB,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/  ATTRIB,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/work/ #3fed CREATE,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/work/ #3fed ATTRIB,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/ diff ATTRIB,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/  ATTRIB,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/ .dockerenv CREATE\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/ .dockerenv ATTRIB\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/work/ #3fee CREATE,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/work/ #3fee ATTRIB,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/ diff 
ATTRIB,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/  ATTRIB,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/work/ #3ff0 CREATE\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/work/work/ #3ff2 CREATE\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/dev/ shm CREATE,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/dev/ shm ATTRIB,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/dev/ console CREATE\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/diff/dev/ console ATTRIB\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/ merged DELETE,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/ 4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb CREATE,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/l/ PV2PZDA4VGO3PPNCMHCCT4YDVN CREATE\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/ link CREATE\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/ linkMODIFY\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/ work CREATE,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb-init/ committed CREATE\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/ lower CREATE\n03/03/22 11:44 
/var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/ lower MODIFY\n03/03/22 11:44 /var/lib/docker/image/overlay2/layerdb/mounts/ 15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52 CREATE,ISDIR\n03/03/22 11:44 /var/lib/docker/containers/ 15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52 CREATE,ISDIR\n03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-hostconfig.json389982635 ATTRIB\n03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-config.v2.json466278670 CREATE\n03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-config.v2.json466278670 MODIFY\n03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-hostconfig.json070462229 CREATE\n03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-hostconfig.json070462229 MODIFY\n03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-hostconfig.json070462229 ATTRIB\n03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-config.v2.json466278670 ATTRIB\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/ merged CREATE,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/work/ work CREATE,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/work/ work ATTRIB,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/work/ work ATTRIB,ISDIR\n03/03/22 11:44 
/var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/work/ work ATTRIB,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/work/work/  ATTRIB,ISDIR\n03/03/22 11:44 /var/lib/docker/overlay2/4f6ff7566e421e62c68b599f85585f08ea662bfa44eb0eeb7e9da0e2858745cb/ merged DELETE,ISDIR\n03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-config.v2.json231746416 CREATE\n03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-config.v2.json231746416 MODIFY\n03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-hostconfig.json732808207 CREATE\n03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-hostconfig.json732808207 ATTRIB\n03/03/22 11:44 /var/lib/docker/containers/15b8d0b7c46c37cf62a2c388aba016226e4ec327a5729991f6a1ae9e81b89e52/ .tmp-config.v2.json231746416 ATTRIB\n```\n\n##### 2.5.2 实验2-查看配置\n\n实际上docker create container 就已经指定了所有配置，包括运行命令，从inspect 的输出就可以看出来。\n\n```\n\"Config\": {\n            \"Hostname\": \"687c38e427a4\",\n            \"Domainname\": \"\",\n            \"User\": \"\",\n            \"AttachStdin\": false,\n            \"AttachStdout\": true,\n            \"AttachStderr\": true,\n            \"ExposedPorts\": {\n                \"80/tcp\": {}\n            },\n            \"Tty\": false,\n            \"OpenStdin\": false,\n            \"StdinOnce\": false,\n            \"Env\": [\n                \"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\",\n                \"NGINX_VERSION=1.21.5\",\n                \"NJS_VERSION=0.7.1\",\n                \"PKG_RELEASE=1~bullseye\"\n            ],\n            \"Cmd\": [\n                \"ls\"\n            ],\n            \"Image\": \"nginx\",\n            \"Volumes\": 
null,\n            \"WorkingDir\": \"\",\n            \"Entrypoint\": [\n                \"/docker-entrypoint.sh\"\n            ],\n            \"OnBuild\": null,\n            \"Labels\": {\n                \"maintainer\": \"NGINX Docker Maintainers <docker-maint@nginx.com>\"\n            },\n            \"StopSignal\": \"SIGQUIT\"\n        },\n```\n\n\n\n#### 2.6 总结\n\ndocker create只是根据docker的配置（包括使用什么存储系统，root目录等），完成了所有的初始化。\n\n主要是利用镜像层已有的数据，初始化container的所有数据。\n\n主要是初始化这个目录：/var/lib/docker/containers/containerId\n\n### 3. Docker start container详细流程分析\n\n从上面的分析可以得出，docker create 就已经将所有的准备工作做好了，包括运行的参数。接下来看看docker start做了什么。\n\n#### 3.1 postContainersStart\n\n和create一样，这里主要是调用了postContainersStart进行start\n\n```\nfunc (s *containerRouter) postContainersStart(ctx context.Context, w http.ResponseWriter, r *http.Request, vars map[string]string) error {\n\t// If contentLength is -1, we can assumed chunked encoding\n\t// or more technically that the length is unknown\n\t// https://golang.org/src/pkg/net/http/request.go#L139\n\t// net/http otherwise seems to swallow any headers related to chunked encoding\n\t// including r.TransferEncoding\n\t// allow a nil body for backwards compatibility\n\n\tversion := httputils.VersionFromContext(ctx)\n\tvar hostConfig *container.HostConfig\n\t// A non-nil json object is at least 7 characters.\n\tif r.ContentLength > 7 || r.ContentLength == -1 {\n\t\tif versions.GreaterThanOrEqualTo(version, \"1.24\") {\n\t\t\treturn bodyOnStartError{}\n\t\t}\n\n\t\tif err := httputils.CheckForJSON(r); err != nil {\n\t\t\treturn err\n\t\t}\n\n\t\tc, err := s.decoder.DecodeHostConfig(r.Body)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\thostConfig = c\n\t}\n\n\tif err := httputils.ParseForm(r); err != nil {\n\t\treturn err\n\t}\n\n\tcheckpoint := r.Form.Get(\"checkpoint\")\n\tcheckpointDir := r.Form.Get(\"checkpoint-dir\")\n\tif err := s.backend.ContainerStart(vars[\"name\"], hostConfig, checkpoint, checkpointDir); err != nil {\n\t\treturn 
err\n\t}\n\n\tw.WriteHeader(http.StatusNoContent)\n\treturn nil\n}\n```\n\n<br>\n\n#### 3.2 ContainerStart\n\nContainerStart主要逻辑如下：\n\n（1）根据容器name, 判断容器状态，比如paused状态的容器不能start等等。\n\n（2）判断hostconfig信息等，hostconfig必须在create的时候指定，start只管启动\n\n（3）调用containerStart进行start。核心是这个函数\n\n```\n// ContainerStart starts a container.\nfunc (daemon *Daemon) ContainerStart(name string, hostConfig *containertypes.HostConfig, checkpoint string, checkpointDir string) error {\n\tif checkpoint != \"\" && !daemon.HasExperimental() {\n\t\treturn errdefs.InvalidParameter(errors.New(\"checkpoint is only supported in experimental mode\"))\n\t}\n\n\tcontainer, err := daemon.GetContainer(name)\n\tif err != nil {\n\t\treturn err\n\t}\n\n\tvalidateState := func() error {\n\t\tcontainer.Lock()\n\t\tdefer container.Unlock()\n\n\t\tif container.Paused {\n\t\t\treturn errdefs.Conflict(errors.New(\"cannot start a paused container, try unpause instead\"))\n\t\t}\n\n\t\tif container.Running {\n\t\t\treturn containerNotModifiedError{running: true}\n\t\t}\n\n\t\tif container.RemovalInProgress || container.Dead {\n\t\t\treturn errdefs.Conflict(errors.New(\"container is marked for removal and cannot be started\"))\n\t\t}\n\t\treturn nil\n\t}\n\n\tif err := validateState(); err != nil {\n\t\treturn err\n\t}\n\n\t// Windows does not have the backwards compatibility issue here.\n\tif runtime.GOOS != \"windows\" {\n\t\t// This is kept for backward compatibility - hostconfig should be passed when\n\t\t// creating a container, not during start.\n\t\tif hostConfig != nil {\n\t\t\tlogrus.Warn(\"DEPRECATED: Setting host configuration options when the container starts is deprecated and has been removed in Docker 1.12\")\n\t\t\toldNetworkMode := container.HostConfig.NetworkMode\n\t\t\tif err := daemon.setSecurityOptions(container, hostConfig); err != nil {\n\t\t\t\treturn errdefs.InvalidParameter(err)\n\t\t\t}\n\t\t\tif err := daemon.mergeAndVerifyLogConfig(&hostConfig.LogConfig); err != nil {\n\t\t\t\treturn 
errdefs.InvalidParameter(err)\n\t\t\t}\n\t\t\tif err := daemon.setHostConfig(container, hostConfig); err != nil {\n\t\t\t\treturn errdefs.InvalidParameter(err)\n\t\t\t}\n\t\t\tnewNetworkMode := container.HostConfig.NetworkMode\n\t\t\tif string(oldNetworkMode) != string(newNetworkMode) {\n\t\t\t\t// if user has change the network mode on starting, clean up the\n\t\t\t\t// old networks. It is a deprecated feature and has been removed in Docker 1.12\n\t\t\t\tcontainer.NetworkSettings.Networks = nil\n\t\t\t\tif err := container.CheckpointTo(daemon.containersReplica); err != nil {\n\t\t\t\t\treturn errdefs.System(err)\n\t\t\t\t}\n\t\t\t}\n\t\t\tcontainer.InitDNSHostConfig()\n\t\t}\n\t} else {\n\t\tif hostConfig != nil {\n\t\t\treturn errdefs.InvalidParameter(errors.New(\"Supplying a hostconfig on start is not supported. It should be supplied on create\"))\n\t\t}\n\t}\n\n\t// check if hostConfig is in line with the current system settings.\n\t// It may happen cgroups are umounted or the like.\n\tif _, err = daemon.verifyContainerSettings(container.OS, container.HostConfig, nil, false); err != nil {\n\t\treturn errdefs.InvalidParameter(err)\n\t}\n\t// Adapt for old containers in case we have updates in this function and\n\t// old containers never have chance to call the new function in create stage.\n\tif hostConfig != nil {\n\t\tif err := daemon.adaptContainerSettings(container.HostConfig, false); err != nil {\n\t\t\treturn errdefs.InvalidParameter(err)\n\t\t}\n\t}\n\treturn daemon.containerStart(container, checkpoint, checkpointDir, true)\n}\n```\n\n<br>\n\n#### 3.3 containerStart\n\n核心逻辑如下：\n\n（1）判断容器状态，是否已经running或者dead\n\n（2）通过defer函数进行收尾，如果start过程出现了错误，调用daemon.Cleanup，ContainerRm进行清理工作\n\n（3）挂载目录。docker start过程中也会进行很多目录的创建和挂载（mount）\n\n（4）设置容器的网络模式，默认模式bridge：同一个host主机上容器的通信通过Linux bridge进行。与宿主机外部网络的通信需要通过宿主机端口进行NAT\n\n（5）创建/proc /dev等spec文件，对容器所特有的属性都进行设置，例如：资源限制，命名空间，安全模式等等配置信息\n\n（6）初始化libContainerd的 
createOptions，到这里就是调用containerd了\n\n（7）通过containerd创建容器\n\n（8）通过containerd启动容器\n\n（9）设置容器状态为running等等\n\n```\n// containerStart prepares the container to run by setting up everything the\n// container needs, such as storage and networking, as well as links\n// between containers. The container is left waiting for a signal to\n// begin running.\nfunc (daemon *Daemon) containerStart(container *container.Container, checkpoint string, checkpointDir string, resetRestartManager bool) (err error) {\n\tstart := time.Now()\n\tcontainer.Lock()\n\tdefer container.Unlock()\n  \n  // 1.判断容器状态，是否已经running或者dead\n\tif resetRestartManager && container.Running { // skip this check if already in restarting step and resetRestartManager==false\n\t\treturn nil\n\t}\n\n\tif container.RemovalInProgress || container.Dead {\n\t\treturn errdefs.Conflict(errors.New(\"container is marked for removal and cannot be started\"))\n\t}\n  \n\tif checkpointDir != \"\" {\n\t\t// TODO(mlaventure): how would we support that?\n\t\treturn errdefs.Forbidden(errors.New(\"custom checkpointdir is not supported\"))\n\t}\n   \n  // 2.通过defer函数进行收尾，如果start过程出现了错误，调用daemon.Cleanup，ContainerRm进行清理工作\n\t// if we encounter an error during start we need to ensure that any other\n\t// setup has been cleaned up properly\n\tdefer func() {\n\t\tif err != nil {\n\t\t\tcontainer.SetError(err)\n\t\t\t// if no one else has set it, make sure we don't leave it at zero\n\t\t\tif container.ExitCode() == 0 {\n\t\t\t\tcontainer.SetExitCode(128)\n\t\t\t}\n\t\t\tif err := container.CheckpointTo(daemon.containersReplica); err != nil {\n\t\t\t\tlogrus.Errorf(\"%s: failed saving state on start failure: %v\", container.ID, err)\n\t\t\t}\n\t\t\tcontainer.Reset(false)\n\n\t\t\tdaemon.Cleanup(container)\n\t\t\t// if containers AutoRemove flag is set, remove it after clean up\n\t\t\tif container.HostConfig.AutoRemove {\n\t\t\t\tcontainer.Unlock()\n\t\t\t\tif err := daemon.ContainerRm(container.ID, &types.ContainerRmConfig{ForceRemove: 
true, RemoveVolume: true}); err != nil {\n\t\t\t\t\tlogrus.Errorf(\"can't remove container %s: %v\", container.ID, err)\n\t\t\t\t}\n\t\t\t\tcontainer.Lock()\n\t\t\t}\n\t\t}\n\t}()\n  \n  // 3.挂载目录。docker start过程中也会进行很多目录的创建和挂载（mount）\n\tif err := daemon.conditionalMountOnStart(container); err != nil {\n\t\treturn err\n\t}\n  \n  // 4.设置容器的网络模式，默认模式bridge：同一个host主机上容器的通信通过Linux bridge进行。与宿主机外部网络的通信需要通过宿主机端口进行NAT\n\tif err := daemon.initializeNetworking(container); err != nil {\n\t\treturn err\n\t}\n  \n  // 5. 创建/proc /dev等spec文件，对容器所特有的属性都进行设置，例如：资源限制，命名空间，安全模式等等配置信息\n\tspec, err := daemon.createSpec(container)\n\tif err != nil {\n\t\treturn errdefs.System(err)\n\t}\n\n\tif resetRestartManager {\n\t\tcontainer.ResetRestartManager(true)\n\t\tcontainer.HasBeenManuallyStopped = false\n\t}\n\n\tif err := daemon.saveApparmorConfig(container); err != nil {\n\t\treturn err\n\t}\n\n\tif checkpoint != \"\" {\n\t\tcheckpointDir, err = getCheckpointDir(checkpointDir, checkpoint, container.Name, container.ID, container.CheckpointDir(), false)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n  \n  // 6.初始化libContainerd的 createOptions，到这里就是调用containerd了\n\tcreateOptions, err := daemon.getLibcontainerdCreateOptions(container)\n\tif err != nil {\n\t\treturn err\n\t}\n\n\tctx := context.TODO()\n  \n  // 7. 
通过containerd创建容器\n\terr = daemon.containerd.Create(ctx, container.ID, spec, createOptions)\n\tif err != nil {\n\t\tif errdefs.IsConflict(err) {\n\t\t\tlogrus.WithError(err).WithField(\"container\", container.ID).Error(\"Container not cleaned up from containerd from previous run\")\n\t\t\t// best effort to clean up old container object\n\t\t\tdaemon.containerd.DeleteTask(ctx, container.ID)\n\t\t\tif err := daemon.containerd.Delete(ctx, container.ID); err != nil && !errdefs.IsNotFound(err) {\n\t\t\t\tlogrus.WithError(err).WithField(\"container\", container.ID).Error(\"Error cleaning up stale containerd container object\")\n\t\t\t}\n\t\t\terr = daemon.containerd.Create(ctx, container.ID, spec, createOptions)\n\t\t}\n\t\tif err != nil {\n\t\t\treturn translateContainerdStartErr(container.Path, container.SetExitCode, err)\n\t\t}\n\t}\n  \n  // 8. 通过containerd启动容器\n\t// TODO(mlaventure): we need to specify checkpoint options here\n\tpid, err := daemon.containerd.Start(context.Background(), container.ID, checkpointDir,\n\t\tcontainer.StreamConfig.Stdin() != nil || container.Config.Tty,\n\t\tcontainer.InitializeStdio)\n\tif err != nil {\n\t\tif err := daemon.containerd.Delete(context.Background(), container.ID); err != nil {\n\t\t\tlogrus.WithError(err).WithField(\"container\", container.ID).\n\t\t\t\tError(\"failed to delete failed start container\")\n\t\t}\n\t\treturn translateContainerdStartErr(container.Path, container.SetExitCode, err)\n\t}\n \n  // 9.设置状态，已经running等等\n\tcontainer.SetRunning(pid, true)\n\tcontainer.HasBeenStartedBefore = true\n\tdaemon.setStateCounter(container)\n\n\tdaemon.initHealthMonitor(container)\n\n\tif err := container.CheckpointTo(daemon.containersReplica); err != nil {\n\t\tlogrus.WithError(err).WithField(\"container\", container.ID).\n\t\t\tErrorf(\"failed to store container\")\n\t}\n\n\tdaemon.LogContainerEvent(container, \"start\")\n\tcontainerActions.WithValues(\"start\").UpdateSince(start)\n\n\treturn nil\n}\n```\n\n<br>\n\n**docker 
start nginx 过程的目录变化**\n\n```\nroot@k8s-node:~# inotifywait -mrq --timefmt '%d/%m/%y %H:%M' --format '%T %w %f %e' -e modify,delete,create,attrib /var/lib/docker\n\n\n03/03/22 17:06 /var/lib/docker/overlay2/105b22191a32cf89aa1ffb96ee4a1a55032ed251a2877f5ea480c4d4921c5244/ merged CREATE,ISDIR\n03/03/22 17:06 /var/lib/docker/overlay2/105b22191a32cf89aa1ffb96ee4a1a55032ed251a2877f5ea480c4d4921c5244/work/ work DELETE,ISDIR\n03/03/22 17:06 /var/lib/docker/overlay2/105b22191a32cf89aa1ffb96ee4a1a55032ed251a2877f5ea480c4d4921c5244/work/ work CREATE,ISDIR\n03/03/22 17:06 /var/lib/docker/overlay2/105b22191a32cf89aa1ffb96ee4a1a55032ed251a2877f5ea480c4d4921c5244/work/ work ATTRIB,ISDIR\n03/03/22 17:06 /var/lib/docker/overlay2/105b22191a32cf89aa1ffb96ee4a1a55032ed251a2877f5ea480c4d4921c5244/work/ work ATTRIB,ISDIR\n03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY\n03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY\n03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY\n03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY\n03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY\n03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY\n03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY\n03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY\n03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY\n03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY\n03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY\n03/03/22 17:06 /var/lib/docker/network/files/ local-kv.db MODIFY\n03/03/22 17:06 /var/lib/docker/containers/8f7318baf651f7a9539e1d41486151946b36af202f0b0ef4257954911fb77edc/ hosts MODIFY\n03/03/22 17:06 /var/lib/docker/containers/8f7318baf651f7a9539e1d41486151946b36af202f0b0ef4257954911fb77edc/ hosts MODIFY\n03/03/22 17:06 /var/lib/docker/containers/8f7318baf651f7a9539e1d41486151946b36af202f0b0ef4257954911fb77edc/ resolv.conf MODIFY\n03/03/22 17:06 
/var/lib/docker/containers/8f7318baf651f7a9539e1d41486151946b36af2\n```\n\n<br>\n\n### 4. docker start 创建的详细过程\n\n上面已经知道了docker start的大致流程。接下来才是重点，就是containerd是如何创建容器的，以及runc是啥时候调用的等等。\n\n这一节就是详细弄清楚整个过程，可能会拆分章节。\n\n#### 4.1 containerd的初始化\n\n在dockerd启动的时候，通过initContainerD函数启动了containerd\n\n#### 4.2 容器的网络设置\n\n待补充，需要补充其他知识，可能会再开一章节\n\n#### 4.3 容器的spec设置-createSpec函数\n\nLinux 内核提供了一种通过`/proc`文件系统，在运行时访问内核内部数据结构、改变内核设置的机制。 proc文件系统是一个伪文件系统，它只存在内存当中，而不占用外存空间。 它以文件系统的方式为访问系统内核数据的操作提供接口。\n\n```\nfunc (daemon *Daemon) createSpec(c *container.Container) (retSpec *specs.Spec, err error) {\n\tvar (\n\t\topts []coci.SpecOpts\n\t\ts    = oci.DefaultSpec()\n\t)\n\topts = append(opts,\n\t\tWithCommonOptions(daemon, c),\n\t\tWithCgroups(daemon, c),\n\t\tWithResources(c),\n\t\tWithSysctls(c),\n\t\tWithDevices(daemon, c),\n\t\tWithUser(c),\n\t\tWithRlimits(daemon, c),\n\t\tWithNamespaces(daemon, c),\n\t\tWithCapabilities(c),\n\t\tWithSeccomp(daemon, c),\n\t\tWithMounts(daemon, c),\n\t\tWithLibnetwork(daemon, c),\n\t\tWithApparmor(c),\n\t\tWithSelinux(c),\n\t\tWithOOMScore(&c.HostConfig.OomScoreAdj),\n\t)\n\tif c.NoNewPrivileges {\n\t\topts = append(opts, coci.WithNoNewPrivileges)\n\t}\n\n\t// Set the masked and readonly paths with regard to the host config options if they are set.\n\tif c.HostConfig.MaskedPaths != nil {\n\t\topts = append(opts, coci.WithMaskedPaths(c.HostConfig.MaskedPaths))\n\t}\n\tif c.HostConfig.ReadonlyPaths != nil {\n\t\topts = append(opts, coci.WithReadonlyPaths(c.HostConfig.ReadonlyPaths))\n\t}\n\tif daemon.configStore.Rootless {\n\t\topts = append(opts, WithRootless)\n\t}\n\treturn &s, coci.ApplyOpts(context.Background(), nil, &containers.Container{\n\t\tID: c.ID,\n\t}, &s, opts...)\n}\n```\n\n#### 4.4 containerd创建容器的详细流程\n\n待补充\n\n### 5. 
总结\n\n（1）docker run nginx ls 其实是分成了两个步骤：`docker container create nginx ls`  和 `docker start nginx`\n\n（2）docker create 做了前期的准备工作，包括下载镜像，准备所有的文件和目录\n\n（3）docker start核心是调用containerd进行start，启动进程等等。这个过程涉及网络以及其他底层的知识。目前先了解到这里，还有很多细节比如第四章节还待补充。这个等补充一波知识后，再更新。"
  },
  {
    "path": "docker/2. linux cgroup 知识准备.md",
    "content": "* [0\\. 说明](#0-说明)\n* [1\\. cgroup简介](#1-cgroup简介)\n* [2\\. CGroup 使用](#2-cgroup-使用)\n* [3\\. CGroup 基本概念](#3-cgroup-基本概念)\n* [4\\. CGroup 操作规则](#4-cgroup-操作规则)\n* [5\\. CGroup的原理实现](#5-cgroup的原理实现)\n  * [5\\.1 cgroup 结构体](#51-cgroup-结构体)\n  * [5\\.2 CGroup 的挂载](#52-cgroup-的挂载)\n  * [5\\.3 向 CGroup 添加要进行资源控制的进程](#53-向-cgroup-添加要进行资源控制的进程)\n  * [5\\.4 限制 CGroup 的资源使用](#54-限制-cgroup-的资源使用)\n  * [5\\.5 限制进程使用资源](#55-限制进程使用资源)\n* [6\\.参考资料](#6参考资料)\n\n### 0. 说明\n\n本文章转载微信公众的一篇文章。地址如下：https://mp.weixin.qq.com/s/n796FnrKsfLLxcvV4-dAlg\n\n该笔记绝大部分来源于上诉公众号，用于自己对cgroup的理解，当做笔记记录。\n\n### 1. cgroup简介\n\n`CGroup` 全称 `Control Group` 中文意思为 `控制组`，用于控制（限制）进程对系统各种资源的使用，比如 `CPU`、`内存`、`网络` 和 `磁盘I/O` 等资源的限制，著名的容器引擎 `Docker` 就是使用 `CGroup` 来对容器进行资源限制。\n\n### 2. CGroup 使用\n\n本文主要以 `内存子系统（memory subsystem）` 作为例子来阐述 `CGroup` 的原理，所以这里先介绍怎么通过 `内存子系统` 来限制进程对内存的使用。\n\n> `子系统` 是 `CGroup` 用于控制某种资源（如内存或者CPU等）使用的逻辑或者算法\n>\n> 在系统的开机阶段，systemd会把支持的子系统挂载到默认的 `/sys/fs/cgroup` 目录下面。\n\n`CGroup` 使用了 `虚拟文件系统` 来进行管理限制的资源信息和被限制的进程列表等，例如要创建一个限制内存使用的 `CGroup` 可以使用下面命令：\n\n```\n$ mount -t cgroup -o memory memory /sys/fs/cgroup/memory\n```\n\n上面的命令用于创建内存子系统的根 `CGroup`，如果系统已经存在可以跳过。然后我们使用下面命令在这个目录下面创建一个新的目录 `test`，\n\n```\n$ mkdir /sys/fs/cgroup/memory/test\n```\n\n这样就在内存子系统的根 `CGroup` 下创建了一个子 `CGroup`，我们可以通过 `ls` 目录来查看这个目录下有哪些文件：\n\n```\n$ ls -l 
/sys/fs/cgroup/memory/test\ncgroup.clone_children       memory.kmem.max_usage_in_bytes      memory.limit_in_bytes            memory.numa_stat            memory.use_hierarchy\ncgroup.event_control        memory.kmem.slabinfo                memory.max_usage_in_bytes        memory.oom_control          notify_on_release\ncgroup.procs                memory.kmem.tcp.failcnt             memory.memsw.failcnt             memory.pressure_level       tasks\nmemory.failcnt              memory.kmem.tcp.limit_in_bytes      memory.memsw.limit_in_bytes      memory.soft_limit_in_bytes\nmemory.force_empty          memory.kmem.tcp.max_usage_in_bytes  memory.memsw.max_usage_in_bytes  memory.stat\nmemory.kmem.failcnt         memory.kmem.tcp.usage_in_bytes      memory.memsw.usage_in_bytes      memory.swappiness\nmemory.kmem.limit_in_bytes  memory.kmem.usage_in_bytes          memory.move_charge_at_immigrate  memory.usage_in_bytes\n```\n\n可以看到在目录下有很多文件，每个文件都是 `CGroup` 用于控制进程组的资源使用。我们可以向 `memory.limit_in_bytes` 文件写入限制进程（进程组）使用的内存大小，单位为字节(bytes)。例如可以使用以下命令写入限制使用的内存大小为 `1MB`：\n\n```\n$ echo 1048576 > /sys/fs/cgroup/memory/test/memory.limit_in_bytes\n```\n\n然后我们可以通过以下命令把要限制的进程加入到 `CGroup` 中：\n\n```\n$ echo task_pid > /sys/fs/cgroup/memory/test/tasks\n```\n\n上面的 `task_pid` 为进程的 `PID`，把进程PID添加到 `tasks` 文件后，进程对内存的使用就受到此 `CGroup` 的限制。\n\n### 3. 
CGroup 基本概念\n\n在介绍 `CGroup` 原理前，先介绍一下 `CGroup` 几个相关的概念，因为要理解 `CGroup` 就必须要理解它们：\n\n- `任务（task）`。任务指的是系统的一个进程，如上面介绍的 `tasks` 文件中的进程；\n- `控制组（control group）`。控制组就是受相同资源限制的一组进程。`CGroup` 中的资源控制都是以控制组为单位实现。一个进程可以加入到某个控制组，也可以从一个控制组迁移到另一个控制组。一个控制组的进程可以使用 `CGroup` 以控制组为单位分配的资源，同时受到 `CGroup` 以控制组为单位设定的限制；\n- `层级（hierarchy）`。由于控制组是以目录形式存在的，所以控制组可以组织成层级的形式，即一棵控制组组成的树。控制组树上的子节点控制组是父节点控制组的孩子，继承父控制组的特定的属性；\n- `子系统（subsystem）`。一个子系统就是一个资源控制器，比如 `CPU子系统` 就是控制 CPU 时间分配的一个控制器。子系统必须附加（attach）到一个层级上才能起作用，一个子系统附加到某个层级以后，这个层级上的所有控制组都受到这个子系统的控制。\n\n它们之间的关系如下图：\n\n![image-20220226162724297](./image/cgroup-1.png)\n\n\n\n我们可以把 `层级` 中的一个目录当成是一个 `CGroup`，那么目录里面的文件就是这个 `CGroup` 用于控制进程组使用各种资源的信息（比如 `tasks` 文件用于保存这个 `CGroup` 控制的进程组所有的进程PID，而 `memory.limit_in_bytes` 文件用于描述这个 `CGroup` 能够使用的内存字节数）。\n\n而附加在 `层级` 上的 `子系统` 表示这个 `层级` 中的 `CGroup` 可以控制哪些资源，每当向 `层级` 附加 `子系统` 时，`层级` 中的所有 `CGroup` 都会产生很多与 `子系统` 资源控制相关的文件。\n\n### 4. CGroup 操作规则\n\n使用 `CGroup` 时，必须按照 `CGroup` 的一些操作规则来进行操作，否则会出错。下面介绍一下关于 `CGroup` 的一些操作规则：\n\n1. 一个 `层级` 可以附加多个 `子系统`，如下图：\n\n![image-20220226162836054](./image/cgroup-2.png)\n\n2. 一个已经被挂载的 `子系统` 只能被再次挂载在一个空的 `层级` 上，不能挂载到已经挂载了其他 `子系统` 的 `层级`，如下图：\n\n![image-20220226163153346](./image/cgroup-3.png)\n\n3. 每个 `任务` 只能在同一个 `层级` 的唯一一个 `CGroup` 里，并且可以在多个不同层级的 `CGroup` 中，如下图：\n\n![image-20220226163311087](./image/cgroup-4.png)\n\n4. 子进程在被 `fork` 出时自动继承父进程所在 `CGroup`，但是 `fork` 之后就可以按需调整到其他 `CGroup`，如下图：\n\n   ![image-20220226163414589](./image/cgroup-5.png)\n\n### 5. 
CGroup的原理实现\n\n#### 5.1 `cgroup` 结构体\n\n前面介绍过，`cgroup` 是用来控制进程组对各种资源的使用，而在内核中，`cgroup` 是通过 `cgroup` 结构体来描述的，我们来看看其定义：\n\n```\nstruct cgroup {\n    unsigned long flags;        /* \"unsigned long\" so bitops work */\n    atomic_t count;\n    struct list_head sibling;   /* my parent's children */\n    struct list_head children;  /* my children */\n    struct cgroup *parent;      /* my parent */\n    struct dentry *dentry;      /* cgroup fs entry */\n    struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];\n    struct cgroupfs_root *root;\n    struct cgroup *top_cgroup;\n    struct list_head css_sets;\n    struct list_head release_list;\n};\n```\n\n下面我们来介绍一下 `cgroup` 结构体各个字段的用途：\n\n1. `flags`: 用于标识当前 `cgroup` 的状态。\n2. `count`: 引用计数器，表示有多少个进程在使用这个 `cgroup`。\n3. `sibling、children、parent`: 由于 `cgroup` 是通过 `层级` 来进行管理的，这三个字段就把同一个 `层级` 的所有 `cgroup` 连接成一棵树。`parent` 指向当前 `cgroup` 的父节点，`sibling` 连接着所有兄弟节点，而 `children` 连接着当前 `cgroup` 的所有子节点。\n4. `dentry`: 由于 `cgroup` 是通过 `虚拟文件系统` 来进行管理的，在介绍 `cgroup` 使用时说过，可以把 `cgroup` 当成是 `层级` 中的一个目录，所以 `dentry` 字段就是用来描述这个目录的。\n5. `subsys`: 前面说过，`子系统` 能够附加到 `层级`，而附加到 `层级` 的 `子系统` 都有其限制进程组使用资源的算法和统计数据。所以 `subsys` 字段就是提供给各个 `子系统` 存放其限制进程组使用资源的统计数据。我们可以看到 `subsys` 字段是一个数组，而数组中的每一个元素都代表了一个 `子系统` 相关的统计数据。从实现来看，`cgroup` 只是把多个进程组织成控制进程组，而真正限制资源使用的是各个 `子系统`。\n6. `root`: 用于保存 `层级` 的一些数据，比如：`层级` 的根节点，附加到 `层级` 的 `子系统` 列表（因为一个 `层级` 可以附加多个 `子系统`），还有这个 `层级` 有多少个 `cgroup` 节点等。\n7. `top_cgroup`: `层级` 的根节点（根cgroup）。\n\n我们通过下面图片来描述 `层级` 中各个 `cgroup` 组成的树状关系：\n\n![图片](./image/cgroup-6.png)\n\n`cgroup_subsys_state` 结构体\n\n每个 `子系统` 都有属于自己的资源控制统计信息结构，而且每个 `cgroup` 都绑定一个这样的结构，这种资源控制统计信息结构就是通过 `cgroup_subsys_state` 结构体实现的，其定义如下：\n\n```\nstruct cgroup_subsys_state {\n    struct cgroup *cgroup;\n    atomic_t refcnt;\n    unsigned long flags;\n};\n```\n\n下面介绍一下 `cgroup_subsys_state` 结构各个字段的作用：\n\n1. `cgroup`: 指向了这个资源控制统计信息所属的 `cgroup`。\n2. `refcnt`: 引用计数器。\n3. 
`flags`: 标志位，如果这个资源控制统计信息所属的 `cgroup` 是 `层级` 的根节点，那么就会将这个标志位设置为 `CSS_ROOT` 表示属于根节点。\n\n从 `cgroup_subsys_state` 结构的定义看不到各个 `子系统` 相关的资源控制统计信息，这是因为 `cgroup_subsys_state` 结构并不是真实的资源控制统计信息结构，比如 `内存子系统` 真正的资源控制统计信息结构是 `mem_cgroup`，那么怎样通过这个 `cgroup_subsys_state` 结构去找到对应的 `mem_cgroup` 结构呢？我们来看看 `mem_cgroup` 结构的定义：\n\n```\nstruct mem_cgroup {\n    struct cgroup_subsys_state css; // 注意这里\n    struct res_counter res;\n    struct mem_cgroup_lru_info info;\n    int prev_priority;\n    struct mem_cgroup_stat stat;\n};\n```\n\n从 `mem_cgroup` 结构的定义可以发现，`mem_cgroup` 结构的第一个字段就是一个 `cgroup_subsys_state` 结构。下面的图片展示了它们之间的关系：\n\n![图片](./image/cgroup-7.png)\n\n从上图可以看出，`mem_cgroup` 结构包含了 `cgroup_subsys_state` 结构，`内存子系统` 对外暴露出 `mem_cgroup` 结构的 `cgroup_subsys_state` 部分（即返回 `cgroup_subsys_state` 结构的指针），而其余部分由 `内存子系统` 自己维护和使用。\n\n由于 `cgroup_subsys_state` 部分在 `mem_cgroup` 结构的首部，所以要将 `cgroup_subsys_state` 结构转换成 `mem_cgroup` 结构，只需要通过指针类型转换即可。\n\n`cgroup` 结构与 `cgroup_subsys_state` 结构之间的关系如下图：\n\n![图片](./image/cgroup-8.png)\n\n`css_set` 结构体\n\n由于一个进程可以同时添加到不同的 `cgroup` 中（前提是这些 `cgroup` 属于不同的 `层级`）进行资源控制，而这些 `cgroup` 附加了不同的资源控制 `子系统`。所以需要使用一个结构把这些 `子系统` 的资源控制统计信息收集起来，方便进程通过 `子系统ID` 快速查找到对应的 `子系统` 资源控制统计信息，而 `css_set` 结构体就是用来做这件事情。`css_set` 结构体定义如下：\n\n```\nstruct css_set {\n    struct kref ref;\n    struct list_head list;\n    struct list_head tasks;\n    struct list_head cg_links;\n    struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];\n};\n```\n\n下面介绍一下 `css_set` 结构体各个字段的作用：\n\n1. `ref`: 引用计数器，用于计算有多少个进程在使用此 `css_set`。\n2. `list`: 用于连接所有 `css_set`。\n3. `tasks`: 由于可能存在多个进程同时受到相同的 `cgroup` 控制，所以用此字段把所有使用此 `css_set` 的进程连接起来。\n4. 
`subsys`: 用于收集各种 `子系统` 的统计信息结构。\n\n进程描述符 `task_struct` 有两个字段与此相关，如下：\n\n```\nstruct task_struct {\n    ...\n    struct css_set *cgroups;\n    struct list_head cg_list;\n    ...\n}\n```\n\n可以看出，`task_struct` 结构的 `cgroups` 字段就是指向 `css_set` 结构的指针，而 `cg_list` 字段用于连接所有使用此 `css_set` 结构的进程列表。\n\n`task_struct` 结构与 `css_set` 结构的关系如下图：\n\n![图片](./image/cgroup-9.png)\n\n`cgroup_subsys` 结构\n\n`CGroup` 通过 `cgroup_subsys` 结构操作各个 `子系统`，每个 `子系统` 都要实现一个这样的结构，其定义如下：\n\n```\nstruct cgroup_subsys {\n    struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,\n                          struct cgroup *cgrp);\n    void (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);\n    void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);\n    int (*can_attach)(struct cgroup_subsys *ss,\n              struct cgroup *cgrp, struct task_struct *tsk);\n    void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,\n            struct cgroup *old_cgrp, struct task_struct *tsk);\n    void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);\n    void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);\n    int (*populate)(struct cgroup_subsys *ss,\n            struct cgroup *cgrp);\n    void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);\n    void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);\n\n    int subsys_id;\n    int active;\n    int disabled;\n    int early_init;\n    const char *name;\n    struct cgroupfs_root *root;\n    struct list_head sibling;\n    void *private;\n};\n```\n\n`cgroup_subsys` 结构包含了很多函数指针，通过这些函数指针，`CGroup` 可以对 `子系统` 进行一些操作。比如向 `CGroup` 的 `tasks` 文件添加要控制的进程PID时，就会调用 `cgroup_subsys` 结构的 `attach()` 函数。当在 `层级` 中创建新目录时，就会调用 `create()` 函数创建一个 `子系统` 的资源控制统计信息对象 `cgroup_subsys_state`，并且调用 `populate()` 函数创建 `子系统` 相关的资源控制信息文件。\n\n除了函数指针外，`cgroup_subsys` 结构还包含了很多字段，下面说明一下各个字段的作用：\n\n1. `subsys_id`: 表示了子系统的ID。\n2. `active`: 表示子系统是否被激活。\n3. `disabled`: 子系统是否被禁止。\n4. `name`: 子系统名称。\n5. `root`: 被附加到的层级挂载点。\n6. 
`sibling`: 用于连接被附加到同一个层级的所有子系统。\n7. `private`: 私有数据。\n\n`内存子系统` 定义了一个名为 `mem_cgroup_subsys` 的 `cgroup_subsys` 结构，如下：\n\n```\nstruct cgroup_subsys mem_cgroup_subsys = {\n    .name = \"memory\",\n    .subsys_id = mem_cgroup_subsys_id,\n    .create = mem_cgroup_create,\n    .pre_destroy = mem_cgroup_pre_destroy,\n    .destroy = mem_cgroup_destroy,\n    .populate = mem_cgroup_populate,\n    .attach = mem_cgroup_move_task,\n    .early_init = 0,\n};\n```\n\n另外 Linux 内核还定义了一个 `cgroup_subsys` 结构的数组 `subsys`，用于保存所有 `子系统` 的 `cgroup_subsys` 结构，如下：\n\n```\nstatic struct cgroup_subsys *subsys[] = {\n    cpuset_subsys,\n    debug_subsys,\n    ns_subsys,\n    cpu_cgroup_subsys,\n    cpuacct_subsys,\n    mem_cgroup_subsys\n};\n```\n\n#### 5.2 `CGroup` 的挂载\n\n前面介绍了 `CGroup` 相关的几个结构体，接下来我们分析一下 `CGroup` 的实现。\n\n要使用 `CGroup` 功能首先必须先进行挂载操作，比如使用下面命令挂载一个 `CGroup`：\n\n```\n$ mount -t cgroup -o memory memory /sys/fs/cgroup/memory\n```\n\n在上面的命令中，`-t` 参数指定了要挂载的文件系统类型为 `cgroup`，而 `-o` 参数表示要附加到此 `层级` 的子系统，上面表示附加了 `内存子系统`，当然可以附加多个 `子系统`。而紧随 `-o` 参数后的 `memory` 指定了此 `CGroup` 的名字，最后一个参数表示要挂载的目录路径。\n\n挂载过程最终会调用内核函数 `cgroup_get_sb()` 完成，由于 `cgroup_get_sb()` 函数比较长，所以我们只分析重要部分：\n\n```\nstatic int cgroup_get_sb(struct file_system_type *fs_type,\n     int flags, const char *unused_dev_name,\n     void *data, struct vfsmount *mnt)\n{\n    ...\n    struct cgroupfs_root *root;\n    ...\n    root = kzalloc(sizeof(*root), GFP_KERNEL);\n    ...\n    ret = rebind_subsystems(root, root->subsys_bits);\n    ...\n\n    struct cgroup *cgrp = &root->top_cgroup;\n\n    cgroup_populate_dir(cgrp);\n    ...\n}\n```\n\n`cgroup_get_sb()` 函数会调用 `kzalloc()` 函数创建一个 `cgroupfs_root` 结构。`cgroupfs_root` 结构主要用于描述这个挂载点的信息，其定义如下：\n\n```\nstruct cgroupfs_root {\n    struct super_block *sb;\n    unsigned long subsys_bits;\n    unsigned long actual_subsys_bits;\n    struct list_head subsys_list;\n    struct cgroup top_cgroup;\n    int number_of_cgroups;\n    struct list_head root_list;\n    unsigned long flags;\n    char 
release_agent_path[PATH_MAX];\n};\n```\n\n下面介绍一下 `cgroupfs_root` 结构的各个字段含义：\n\n1. `sb`: 挂载的文件系统超级块。\n2. `subsys_bits/actual_subsys_bits`: 附加到此层级的子系统标志。\n3. `subsys_list`: 附加到此层级的子系统(cgroup_subsys)列表。\n4. `top_cgroup`: 此层级的根cgroup。\n5. `number_of_cgroups`: 层级中有多少个cgroup。\n6. `root_list`: 连接系统中所有的cgroupfs_root。\n7. `flags`: 标志位。\n\n其中最重要的是 `subsys_list` 和 `top_cgroup` 字段，`subsys_list` 表示了附加到此 `层级` 的所有 `子系统`，而 `top_cgroup` 表示此 `层级` 的根 `cgroup`。\n\n接着调用 `rebind_subsystems()` 函数把挂载时指定要附加的 `子系统` 添加到 `cgroupfs_root` 结构的 `subsys_list` 链表中，并且为根 `cgroup` 的 `subsys` 字段设置各个 `子系统` 的资源控制统计信息对象，最后调用 `cgroup_populate_dir()` 函数向挂载目录创建 `cgroup` 的管理文件（如 `tasks` 文件）和各个 `子系统` 的管理文件（如 `memory.limit_in_bytes` 文件）。\n\n#### 5.3 向 `CGroup` 添加要进行资源控制的进程\n\n通过向 `CGroup` 的 `tasks` 文件写入要进行资源控制的进程PID，即可以对进程进行资源控制。例如下面命令：\n\n```\n$ echo 123012 > /sys/fs/cgroup/memory/test/tasks\n```\n\n向 `tasks` 文件写入进程PID是通过 `attach_task_by_pid()` 函数实现的，代码如下：\n\n```\nstatic int attach_task_by_pid(struct cgroup *cgrp, char *pidbuf)\n{\n    pid_t pid;\n    struct task_struct *tsk;\n    int ret;\n\n    if (sscanf(pidbuf, \"%d\", &pid) != 1) // 读取进程pid\n        return -EIO;\n\n    if (pid) { // 如果有指定进程pid\n        ...\n        tsk = find_task_by_vpid(pid); // 通过pid查找对应进程的进程描述符\n        if (!tsk || tsk->flags & PF_EXITING) {\n            rcu_read_unlock();\n            return -ESRCH;\n        }\n        ...\n    } else {\n        tsk = current; // 如果没有指定进程pid, 就使用当前进程\n        ...\n    }\n\n    ret = cgroup_attach_task(cgrp, tsk); // 调用 cgroup_attach_task() 把进程添加到cgroup中\n    ...\n    return ret;\n}\n```\n\n`attach_task_by_pid()` 函数首先会判断是否指定了进程pid，如果指定了就通过进程pid查找到进程描述符，如果没指定就使用当前进程，然后通过调用 `cgroup_attach_task()` 函数把进程添加到 `cgroup` 中。\n\n我们接着看看 `cgroup_attach_task()` 函数的实现：\n\n```\nint cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)\n{\n    int retval = 0;\n    struct cgroup_subsys *ss;\n    struct cgroup *oldcgrp;\n    struct css_set *cg = tsk->cgroups;\n    struct css_set *newcg;\n    struct 
cgroupfs_root *root = cgrp->root;\n\n    ...\n    newcg = find_css_set(cg, cgrp); // 根据新的cgroup查找css_set对象\n    ...\n    rcu_assign_pointer(tsk->cgroups, newcg); // 把进程的cgroups字段设置为新的css_set对象\n    ...\n    // 把进程添加到css_set对象的tasks列表中\n    write_lock(&css_set_lock);\n    if (!list_empty(&tsk->cg_list)) {\n        list_del(&tsk->cg_list);\n        list_add(&tsk->cg_list, &newcg->tasks);\n    }\n    write_unlock(&css_set_lock);\n\n    // 调用各个子系统的attach函数\n    for_each_subsys(root, ss) {\n        if (ss->attach)\n            ss->attach(ss, cgrp, oldcgrp, tsk);\n    }\n    ...\n    return 0;\n}\n```\n\n`cgroup_attach_task()` 函数首先会调用 `find_css_set()` 函数查找或者创建一个 `css_set` 对象。前面说过 `css_set` 对象用于收集不同 `cgroup` 上附加的 `子系统` 资源统计信息对象。\n\n因为一个进程能够被加入到不同的 `cgroup` 进行资源控制，所以 `find_css_set()` 函数就是收集进程所在的所有 `cgroup` 上附加的 `子系统` 资源统计信息对象，并返回一个 `css_set` 对象。接着把进程描述符的 `cgroups` 字段设置为这个 `css_set` 对象，并且把进程添加到这个 `css_set` 对象的 `tasks` 链表中。\n\n最后，`cgroup_attach_task()` 函数会调用附加在 `层级` 上的所有 `子系统` 的 `attach()` 函数对新增进程进行一些其他的操作（这些操作由各自 `子系统` 去实现）。\n\n#### 5.4 限制 `CGroup` 的资源使用\n\n本文主要是使用 `内存子系统` 作为例子，所以这里分析内存限制的原理。\n\n可以向 `cgroup` 的 `memory.limit_in_bytes` 文件写入要限制使用的内存大小（单位为字节），如下面命令限制了这个 `cgroup` 只能使用 1MB 的内存：\n\n```\n$ echo 1048576 > /sys/fs/cgroup/memory/test/memory.limit_in_bytes\n```\n\n向 `memory.limit_in_bytes` 写入数据主要通过 `mem_cgroup_write()` 函数实现的，其实现如下：\n\n```\nstatic ssize_t mem_cgroup_write(struct cgroup *cont, struct cftype *cft,\n                struct file *file, const char __user *userbuf,\n                size_t nbytes, loff_t *ppos)\n{\n    return res_counter_write(&mem_cgroup_from_cont(cont)->res,\n                cft->private, userbuf, nbytes, ppos,\n                mem_cgroup_write_strategy);\n}\n```\n\n其主要工作就是把 `内存子系统` 的资源控制对象 `mem_cgroup` 的 `res.limit` 字段设置为指定的数值。\n\n#### 5.5 限制进程使用资源\n\n当设置好 `cgroup` 的资源使用限制信息，并且把进程添加到这个 `cgroup` 的 `tasks` 列表后，进程的资源使用就会受到这个 `cgroup` 的限制。这里使用 `内存子系统` 作为例子，来分析一下内核是怎么通过 `cgroup` 来限制进程对资源的使用的。\n\n当进程要使用内存时，会调用 `do_anonymous_page()` 来申请一些内存页，而 
`do_anonymous_page()` 函数会调用 `mem_cgroup_charge()` 函数来检测进程是否超过了 `cgroup` 设置的资源限制。而 `mem_cgroup_charge()` 最终会调用 `mem_cgroup_charge_common()` 函数进行检测，`mem_cgroup_charge_common()` 函数实现如下：\n\n```\nstatic int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,\n                gfp_t gfp_mask, enum charge_type ctype)\n{\n    struct mem_cgroup *mem;\n    ...\n    mem = rcu_dereference(mm->mem_cgroup); // 获取进程对应的内存限制对象\n    ...\n    while (res_counter_charge(&mem->res, PAGE_SIZE)) { // 判断进程使用内存是否超出限制\n        if (!(gfp_mask & __GFP_WAIT))\n            goto out;\n\n        if (try_to_free_mem_cgroup_pages(mem, gfp_mask)) // 如果超出限制, 就释放一些不用的内存\n            continue;\n\n        if (res_counter_check_under_limit(&mem->res))\n            continue;\n\n        if (!nr_retries--) {\n            mem_cgroup_out_of_memory(mem, gfp_mask); // 如果尝试过5次后还是超出限制, 那么发出oom信号\n            goto out;\n        }\n        ...\n    }\n    ...\n}\n```\n\n`mem_cgroup_charge_common()` 函数会对进程内存使用情况进行检测，如果进程已经超过了 `cgroup` 设置的限制，那么就会尝试进行释放一些不用的内存，如果还是超过限制，那么就会发出 `OOM (out of memory)` 的信号。\n\n### 6.参考资料\n\n[容器三把斧之 | cgroup原理与实现](https://mp.weixin.qq.com/s/n796FnrKsfLLxcvV4-dAlg)\n\n[CGroup 介绍](https://mp.weixin.qq.com/s/66MKhzWTVCZ_nJ07fPrVIw)"
  },
  {
    "path": "docker/3. chroot 命令详解.md",
    "content": "* [1\\. chroot命令介绍](#1-chroot命令介绍)\r\n* [2\\. chroot实践](#2-chroot实践)\r\n  * [2\\.1 执行bash, ls命令](#21-执行bash-ls命令)\r\n  * [2\\.2 执行ps命令](#22-执行ps命令)\r\n* [2\\.3 如何实现容器内pid 隔离](#23-如何实现容器内pid-隔离)\r\n    * [1\\. 在容器外面证明可以做到](#1-在容器外面证明可以做到)\r\n    * [2\\. 先取消之前的proc挂载](#2-先取消之前的proc挂载)\r\n* [3\\. 提取docker镜像中的rootfs文件](#3-提取docker镜像中的rootfs文件)\r\n* [4\\. 参考文档](#4-参考文档)\r\n\r\n### 1. chroot命令介绍\r\n\r\n把根目录换成指定的目的目录\r\n\r\n**chroot命令** 用来在指定的根目录下运行指令。chroot，即 change root directory （更改 root 目录）。在 linux 系统中，系统默认的目录结构都是以`/`，即是以根 (root) 开始的。而在使用 chroot 之后，系统的目录结构将以指定的位置作为`/`位置。\r\n\r\n在经过 chroot 命令之后，系统读取到的目录和文件将不在是旧系统根下的而是新根下（即被指定的新的位置）的目录结构和文件，因此它带来的好处大致有以下3个：\r\n\r\n**增加了系统的安全性，限制了用户的权力：**\r\n\r\n在经过 chroot 之后，在新根下将访问不到旧系统的根目录结构和文件，这样就增强了系统的安全性。这个一般是在登录 (login) 前使用 chroot，以此达到用户不能访问一些特定的文件。\r\n\r\n**建立一个与原系统隔离的系统目录结构，方便用户的开发：**\r\n\r\n使用 chroot 后，系统读取的是新根下的目录和文件，这是一个与原系统根下文件不相关的目录结构。在这个新的环境中，可以用来测试软件的静态编译以及一些与系统不相关的独立开发。\r\n\r\n**切换系统的根目录位置，引导 Linux 系统启动以及急救系统等：**\r\n\r\nchroot 的作用就是切换系统的根位置，而这个作用最为明显的是在系统初始引导磁盘的处理过程中使用，从初始 RAM 磁盘 (initrd) 切换系统的根位置并执行真正的 init。另外，当系统出现一些问题时，我们也可以使用 chroot 来切换到一个临时的系统。\r\n\r\n<br>\r\n\r\n### 2. 
chroot实践\r\n\r\n直接对一个空目录使用是不行的，所以需要先构建好test目录\r\n\r\n```\r\nroot@k8s-master:~# chroot test\r\nchroot: failed to run command ‘/bin/bash’: No such file or directory\r\n```\r\n\r\n<br>\r\n\r\n#### 2.1 执行bash, ls命令\r\n\r\n```\r\nroot@k8s-master:~/test# tree \r\n.\r\n├── bin\r\n│   ├── bash      // bin目录下要有bash可执行文件\r\n│   └── ls\r\n├── lib\r\n│   ├── libc.so.6     //还要有对应的动态链接库(.so)\r\n│   ├── libdl.so.2\r\n│   └── libtinfo.so.6\r\n└── lib64\r\n    └── ld-linux-x86-64.so.2\r\n\r\n// 还不能执行ls，因为没有ls依赖的动态链接库\r\nroot@k8s-master:~/test/bin# chroot /root/test ls\r\nls: error while loading shared libraries: libselinux.so.1: cannot open shared object file: No such file or directory\r\n\r\n// 通过ldd 查看ls依赖哪些动态链接库，然后拷贝到lib目录\r\nroot@k8s-master:~/test/bin# ldd ls\r\n        linux-vdso.so.1 (0x00007ffff6bb8000)\r\n        libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007f9580683000)\r\n        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f95804c2000)\r\n        libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3 (0x00007f958044e000)\r\n        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f9580449000)\r\n        /lib64/ld-linux-x86-64.so.2 (0x00007f95808d6000)\r\n        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f9580428000)\r\nroot@k8s-master:~/test/bin# \r\nroot@k8s-master:~/test/bin# \r\n\r\nroot@k8s-master:~/test/bin# cp /lib/x86_64-linux-gnu/libselinux.so.1 /root/test/lib\r\n\r\n有了这些，就可以chroot，执行bash, ls了\r\nroot@k8s-master:~/test# pwd \r\n/root/test\r\nroot@k8s-master:~/test# tree\r\n.\r\n├── bin\r\n│   ├── bash\r\n│   └── ls\r\n├── lib\r\n│   ├── libc.so.6\r\n│   ├── libdl.so.2\r\n│   ├── libpcre.so.3\r\n│   ├── libpthread.so.0\r\n│   ├── libselinux.so.1\r\n│   └── libtinfo.so.6\r\n└── lib64\r\n    └── ld-linux-x86-64.so.2\r\n\r\n3 directories, 9 files\r\n```\r\n\r\n<br>\r\n\r\n```\r\n成功chroot了，并且可以执行ls\r\nroot@k8s-master:~/test# chroot /root/test \r\nbash-5.0# ls\r\nbin  lib  lib64\r\n```\r\n\r\n#### 2.2 
执行ps命令\r\n\r\nps命令有点特殊，除了需要拷贝动态链接库文件之外，还需要挂载proc文件系统\r\n\r\n```\r\nroot@k8s-master:~/test# chroot . ps\r\nError, do this: mount -t proc proc /proc\r\n\r\n// 只能这样用，  其实最正确的做法应该是   mount -t proc proc /root/test/proc \r\nroot@k8s-master:~/test# mount -t proc proc proc\r\nroot@k8s-master:~/test# \r\nroot@k8s-master:~/test# pwd\r\n/root/test\r\nroot@k8s-master:~/test# ls\r\nbin  lib  lib64  proc\r\n\r\n// 可以看到，这里ps看到的是宿主机上所有的进程（还没有做pid隔离）\r\nroot@k8s-master:~# chroot test bash\r\nbash-5.0# l ps  \r\nbash-5.0# ps\r\n  PID TTY          TIME CMD\r\n20877 ?        00:00:00 bash\r\n20929 ?        00:00:00 ps\r\n32545 ?        00:00:00 bash\r\nbash: history: /root/.bash_history: cannot create: No such file or directory\r\nbash-5.0# ps -ef\r\nUID        PID  PPID  C STIME TTY          TIME CMD\r\n0            1     0  0 Oct23 ?        00:07:35 /sbin/init nopti nospectre_v2 nospec_store_bypass_disable\r\n0            2     0  0 Oct23 ?        00:00:00 [kthreadd]\r\n0            3     2  0 Oct23 ?        00:00:00 [rcu_gp]\r\n0            4     2  0 Oct23 ?        00:00:00 [rcu_par_gp]\r\n0            6     2  0 Oct23 ?        00:00:00 [kworker/0:0H-kblockd]\r\n0            8     2  0 Oct23 ?        00:00:00 [mm_percpu_wq]\r\n0            9     2  0 Oct23 ?        00:03:21 [ksoftirqd/0]\r\n0           10     2  0 Oct23 ?        
00:25:33 [rcu_sched]\r\n。。。。。\r\n\r\nbash-5.0# cd proc   \r\nbash-5.0# ls\r\n1      16   192    212    24     279    381   666        cmdline      kmsg          swaps\r\n10     17   193    213    240    28     3856  669        consoles     kpagecgroup   sys\r\n10696  170  194    214    241    281    3873  670        cpuinfo      kpagecount    sysrq-trigger\r\n10738  171  195    215    242    28614  3928  671        crypto       kpageflags    sysvipc\r\n11     172  196    216    243    29     3937  685        devices      loadavg       thread-self\r\n11292  173  19646  21635  244    3      4     688        diskstats    locks         timer_list\r\n11310  174  19654  217    245    30     455   692        dma          meminfo       tty\r\n115    175  197    22     246    31     4556  693        driver       misc          uptime\r\n116    176  198    224    247    32     4574  701        execdomains  modules       version\r\n11681  177  2      225    248    32521  4621  714        fb           mounts        vmallocinfo\r\n11700  178  20     226    249    32529  4629  718        filesystems  mtrr          vmstat\r\n118    179  200    227    25     32530  492   732        fs           net           zoneinfo\r\n119    180  206    228    250    32545  5271  8          interrupts   pagetypeinfo\r\n12     181  207    229    251    32560  54    8371       iomem        partitions\r\n122    187  208    230    26     33     5447  9          ioports      sched_debug\r\n14     188  20877  231    27     337    55    9586       irq          schedstat\r\n1485   189  209    232    27530  34     555   acpi       kallsyms     self\r\n15     19   21     233    276    35     6     buddyinfo  kcore        slabinfo\r\n1505   190  210    234    278    36     6134  bus        key-users    softirqs\r\n15434  191  211    235    27808  3728   65    cgroups    keys         stat\r\n```\r\n\r\n\r\n\r\n### 2.3 如何实现容器内pid 隔离\r\n\r\n##### 1. 
在容器外面证明可以做到\r\n\r\n```\r\nroot@k8s-master:~# unshare --fork --pid --mount-proc /bin/bash\r\nroot@k8s-master:~# \r\nroot@k8s-master:~# ps -ef\r\nUID        PID  PPID  C STIME TTY          TIME CMD\r\nroot         1     0  1 19:25 pts/0    00:00:00 /bin/bash\r\nroot        11     1  0 19:25 pts/0    00:00:00 ps -ef\r\nroot@k8s-master:~# \r\n```\r\n\r\n<br>\r\n\r\n##### 2. 先取消之前的proc挂载\r\n\r\n```\r\nroot@k8s-master:~/test# cd proc/\r\nroot@k8s-master:~/test/proc# ls\r\n1      16     190    211    235    27530  34    555   acpi         kallsyms      self\r\n10     16679  191    212    23982  276    35    6     buddyinfo    kcore         slabinfo\r\n10696  16776  192    213    23983  278    36    65    bus          keys          softirqs\r\n10738  17     193    214    24     27808  3728  666   cgroups      key-users     stat\r\n11     170    194    215    240    279    381   669   cmdline      kmsg          swaps\r\n11292  171    195    216    241    28     3856  670   consoles     kpagecgroup   sys\r\n11310  172    196    21635  242    281    3873  671   cpuinfo      kpagecount    sysrq-trigger\r\n115    173    19646  217    243    28614  3928  685   crypto       kpageflags    sysvipc\r\n116    174    19654  22     244    29     3937  688   devices      loadavg       thread-self\r\n11681  175    197    224    245    3      4     692   diskstats    locks         timer_list\r\n11700  176    198    225    246    30     455   693   dma          meminfo       tty\r\n118    177    2      226    24640  31     4556  701   driver       misc          uptime\r\n119    178    20     227    247    32     4574  714   execdomains  modules       version\r\n12     179    200    228    248    32521  4621  718   fb           mounts        vmallocinfo\r\n122    180    206    229    249    32529  4629  732   filesystems  mtrr          vmstat\r\n14     181    207    230    25     32530  492   8     fs           net           zoneinfo\r\n1485   187    208    231    250    32545  5271  9     
interrupts   pagetypeinfo\r\n15     188    209    232    251    32560  54    9362  iomem        partitions\r\n1505   189    21     233    26     33     5447  9586  ioports      sched_debug\r\n15434  19     210    234    27     337    55    9647  irq          schedstat\r\nroot@k8s-master:~/test/proc# \r\nroot@k8s-master:~/test/proc# \r\nroot@k8s-master:~/test/proc# cd ..\r\nroot@k8s-master:~/test# ls\r\nbin  lib  lib64  proc\r\nroot@k8s-master:~/test# umount /root/test/proc/\r\nroot@k8s-master:~/test# \r\nroot@k8s-master:~/test# ls\r\nbin  lib  lib64  proc\r\nroot@k8s-master:~/test# ls proc/\r\n```\r\n\r\n<br>\r\n\r\n```\r\n// 先通过unshare 隔离出来pid，就是这个/bin/bash 就是新的shell进程\r\nroot@k8s-master:~# unshare --fork --pid --mount-proc /bin/bash\r\n\r\n// 这个时候文件目录还是系统\r\nroot@k8s-master:~# ls\r\napiserver-to-kubelet-rbac.yaml      c.txt             kubernetes-server-linux-amd64.tar.gz  test1\r\na.sh                                cup               pod.yaml                              test.sh\r\na.txt                               kubectl           pod.yaml-1                            testYaml\r\nb.txt                               kube-flannel.yml  svc                                   TLS\r\ncni-plugins-linux-amd64-v0.8.6.tgz  kubernetes        test\r\nroot@k8s-master:~# \r\n\r\nroot@k8s-master:~# ls test/proc/\r\nroot@k8s-master:~# \r\nroot@k8s-master:~#  mount -t proc proc /root/test/proc\r\n\r\n// 修改root\r\nroot@k8s-master:~# chroot test\r\n\r\nbash-5.0# l ps\r\n  PID TTY          TIME CMD\r\n    1 ?        00:00:00 bash\r\n   21 ?        00:00:00 bash\r\n   23 ?        00:00:00 ps\r\nbash: history: /root/.bash_history: cannot create: No such file or directory\r\n\r\n// 进程已经改变了，只能看到自己的进程\r\nbash-5.0# ps -ef\r\nUID        PID  PPID  C STIME TTY          TIME CMD\r\n0            1     0  0 11:36 ?        00:00:00 /bin/bash\r\n0           21     1  0 11:38 ?        00:00:00 /bin/bash -i\r\n0           24    21  0 11:38 ?        
00:00:00 ps -ef\r\n```\r\n\r\n<br>\r\n\r\n**如何查看默认的shell**\r\n\r\n```\r\nroot# echo ${SHELL}\r\n/bin/bash\r\n```\r\n\r\n<br>\r\n\r\n### 3. 提取docker镜像中的rootfs文件\r\n\r\n参考： https://www.cnblogs.com/sparkdev/p/8556075.html\r\n\r\n通过 chroot 运行 busybox 为例\r\n\r\nbusybox 包含了丰富的工具，我们可以把这些工具放置在一个目录下，然后通过 chroot 构造出一个 mini 系统。简单起见我们直接使用 docker 的 busybox 镜像打包的文件系统。先在当前目录下创建一个目录 rootfs：\r\n\r\n<br>\r\n\r\n```\r\nroot# mkdir rootfs\r\n\r\n// 提取busybox镜像的rootfs到当前目录\r\nroot# (docker export $(docker create busybox) | tar -C rootfs -xvf -)\r\n.dockerenv\r\nbin/\r\nbin/[\r\nbin/[[\r\nbin/acpid\r\nbin/add-shell\r\nbin/addgroup\r\nbin/adduser\r\nbin/adjtimex\r\nbin/ar\r\nbin/arch\r\nbin/arp\r\nbin/arping\r\nbin/ash\r\nbin/awk\r\nbin/base32\r\nbin/base64\r\nbin/basename\r\nbin/bc\r\nbin/beep\r\nbin/blkdiscard\r\nbin/blkid\r\nbin/blockdev\r\nbin/bootchartd\r\nbin/brctl\r\nbin/bunzip2\r\nbin/busybox\r\nbin/bzcat\r\nbin/bzip2\r\nbin/cal\r\nbin/cat\r\nbin/chat\r\nbin/chattr\r\nbin/chgrp\r\nbin/chmod\r\nbin/chown\r\nbin/chpasswd\r\nbin/chpst\r\nbin/chroot\r\nbin/chrt\r\nbin/chvt\r\nbin/cksum\r\nbin/clear\r\nbin/cmp\r\nbin/comm\r\nbin/conspy\r\nbin/cp\r\nbin/cpio\r\nbin/crond\r\nbin/crontab\r\nbin/cryptpw\r\nbin/cttyhack\r\nbin/cut\r\nbin/date\r\nbin/dc\r\nbin/dd\r\nbin/deallocvt\r\nbin/delgroup\r\nbin/deluser\r\nbin/depmod\r\nbin/devmem\r\nbin/df\r\nbin/dhcprelay\r\nbin/diff\r\nbin/dirname\r\nbin/dmesg\r\nbin/dnsd\r\nbin/dnsdomainname\r\nbin/dos2unix\r\nbin/dpkg\r\nbin/dpkg-deb\r\nbin/du\r\nbin/dumpkmap\r\nbin/dumpleases\r\nbin/echo\r\nbin/ed\r\nbin/egrep\r\nbin/eject\r\nbin/env\r\nbin/envdir\r\nbin/envuidgid\r\nbin/ether-wake\r\nbin/expand\r\nbin/expr\r\nbin/factor\r\nbin/fakeidentd\r\nbin/fallocate\r\nbin/false\r\nbin/fatattr\r\nbin/fbset\r\nbin/fbsplash\r\nbin/fdflush\r\nbin/fdformat\r\nbin/fdisk\r\nbin/fgconsole\r\nbin/fgrep\r\nbin/find\r\nbin/findfs\r\nbin/flock\r\nbin/fold\r\nbin/free\r\nbin/freeramdisk\r\nbin/fsck\r\nbin/fsck.minix\r\nbin/fsfreeze\r\nbin/fstrim\r\nbin/fsync\r\nbin/ft
pd\r\nbin/ftpget\r\nbin/ftpput\r\nbin/fuser\r\nbin/getconf\r\nbin/getopt\r\nbin/getty\r\nbin/grep\r\nbin/groups\r\nbin/gunzip\r\nbin/gzip\r\nbin/halt\r\nbin/hd\r\nbin/hdparm\r\nbin/head\r\nbin/hexdump\r\nbin/hexedit\r\nbin/hostid\r\nbin/hostname\r\nbin/httpd\r\nbin/hush\r\nbin/hwclock\r\nbin/i2cdetect\r\nbin/i2cdump\r\nbin/i2cget\r\nbin/i2cset\r\nbin/i2ctransfer\r\nbin/id\r\nbin/ifconfig\r\nbin/ifdown\r\nbin/ifenslave\r\nbin/ifplugd\r\nbin/ifup\r\nbin/inetd\r\nbin/init\r\nbin/insmod\r\nbin/install\r\nbin/ionice\r\nbin/iostat\r\nbin/ip\r\nbin/ipaddr\r\nbin/ipcalc\r\nbin/ipcrm\r\nbin/ipcs\r\nbin/iplink\r\nbin/ipneigh\r\nbin/iproute\r\nbin/iprule\r\nbin/iptunnel\r\nbin/kbd_mode\r\nbin/kill\r\nbin/killall\r\nbin/killall5\r\nbin/klogd\r\nbin/last\r\nbin/less\r\nbin/link\r\nbin/linux32\r\nbin/linux64\r\nbin/linuxrc\r\nbin/ln\r\nbin/loadfont\r\nbin/loadkmap\r\nbin/logger\r\nbin/login\r\nbin/logname\r\nbin/logread\r\nbin/losetup\r\nbin/lpd\r\nbin/lpq\r\nbin/lpr\r\nbin/ls\r\nbin/lsattr\r\nbin/lsmod\r\nbin/lsof\r\nbin/lspci\r\nbin/lsscsi\r\nbin/lsusb\r\nbin/lzcat\r\nbin/lzma\r\nbin/lzop\r\nbin/makedevs\r\nbin/makemime\r\nbin/man\r\nbin/md5sum\r\nbin/mdev\r\nbin/mesg\r\nbin/microcom\r\nbin/mim\r\nbin/mkdir\r\nbin/mkdosfs\r\nbin/mke2fs\r\nbin/mkfifo\r\nbin/mkfs.ext2\r\nbin/mkfs.minix\r\nbin/mkfs.vfat\r\nbin/mknod\r\nbin/mkpasswd\r\nbin/mkswap\r\nbin/mktemp\r\nbin/modinfo\r\nbin/modprobe\r\nbin/more\r\nbin/mount\r\nbin/mountpoint\r\nbin/mpstat\r\nbin/mt\r\nbin/mv\r\nbin/nameif\r\nbin/nanddump\r\nbin/nandwrite\r\nbin/nbd-client\r\nbin/nc\r\nbin/netstat\r\nbin/nice\r\nbin/nl\r\nbin/nmeter\r\nbin/nohup\r\nbin/nologin\r\nbin/nproc\r\nbin/nsenter\r\nbin/nslookup\r\nbin/ntpd\r\nbin/nuke\r\nbin/od\r\nbin/openvt\r\nbin/partprobe\r\nbin/passwd\r\nbin/paste\r\nbin/patch\r\nbin/pgrep\r\nbin/pidof\r\nbin/ping\r\nbin/ping6\r\nbin/pipe_progress\r\nbin/pivot_root\r\nbin/pkill\r\nbin/pmap\r\nbin/popmaildir\r\nbin/poweroff\r\nbin/powertop\r\nbin/printenv\r\nbin/printf\r\nbin/ps\r\nbin/pscan\r\nb
in/pstree\r\nbin/pwd\r\nbin/pwdx\r\nbin/raidautorun\r\nbin/rdate\r\nbin/rdev\r\nbin/readahead\r\nbin/readlink\r\nbin/readprofile\r\nbin/realpath\r\nbin/reboot\r\nbin/reformime\r\nbin/remove-shell\r\nbin/renice\r\nbin/reset\r\nbin/resize\r\nbin/resume\r\nbin/rev\r\nbin/rm\r\nbin/rmdir\r\nbin/rmmod\r\nbin/route\r\nbin/rpm\r\nbin/rpm2cpio\r\nbin/rtcwake\r\nbin/run-init\r\nbin/run-parts\r\nbin/runlevel\r\nbin/runsv\r\nbin/runsvdir\r\nbin/rx\r\nbin/script\r\nbin/scriptreplay\r\nbin/sed\r\nbin/sendmail\r\nbin/seq\r\nbin/setarch\r\nbin/setconsole\r\nbin/setfattr\r\nbin/setfont\r\nbin/setkeycodes\r\nbin/setlogcons\r\nbin/setpriv\r\nbin/setserial\r\nbin/setsid\r\nbin/setuidgid\r\nbin/sh\r\nbin/sha1sum\r\nbin/sha256sum\r\nbin/sha3sum\r\nbin/sha512sum\r\nbin/showkey\r\nbin/shred\r\nbin/shuf\r\nbin/slattach\r\nbin/sleep\r\nbin/smemcap\r\nbin/softlimit\r\nbin/sort\r\nbin/split\r\nbin/ssl_client\r\nbin/start-stop-daemon\r\nbin/stat\r\nbin/strings\r\nbin/stty\r\nbin/su\r\nbin/sulogin\r\nbin/sum\r\nbin/sv\r\nbin/svc\r\nbin/svlogd\r\nbin/svok\r\nbin/swapoff\r\nbin/swapon\r\nbin/switch_root\r\nbin/sync\r\nbin/sysctl\r\nbin/syslogd\r\nbin/tac\r\nbin/tail\r\nbin/tar\r\nbin/taskset\r\nbin/tc\r\nbin/tcpsvd\r\nbin/tee\r\nbin/telnet\r\nbin/telnetd\r\nbin/test\r\nbin/tftp\r\nbin/tftpd\r\nbin/time\r\nbin/timeout\r\nbin/top\r\nbin/touch\r\nbin/tr\r\nbin/traceroute\r\nbin/traceroute6\r\nbin/true\r\nbin/truncate\r\nbin/ts\r\nbin/tty\r\nbin/ttysize\r\nbin/tunctl\r\nbin/ubiattach\r\nbin/ubidetach\r\nbin/ubimkvol\r\nbin/ubirename\r\nbin/ubirmvol\r\nbin/ubirsvol\r\nbin/ubiupdatevol\r\nbin/udhcpc\r\nbin/udhcpc6\r\nbin/udhcpd\r\nbin/udpsvd\r\nbin/uevent\r\nbin/umount\r\nbin/uname\r\nbin/unexpand\r\nbin/uniq\r\nbin/unix2dos\r\nbin/unlink\r\nbin/unlzma\r\nbin/unshare\r\nbin/unxz\r\nbin/unzip\r\nbin/uptime\r\nbin/users\r\nbin/usleep\r\nbin/uudecode\r\nbin/uuencode\r\nbin/vconfig\r\nbin/vi\r\nbin/vlock\r\nbin/volname\r\nbin/w\r\nbin/wall\r\nbin/watch\r\nbin/watchdog\r\nbin/wc\r\nbin/wget\r\nbin/which\r\n
bin/who\r\nbin/whoami\r\nbin/whois\r\nbin/xargs\r\nbin/xxd\r\nbin/xz\r\nbin/xzcat\r\nbin/yes\r\nbin/zcat\r\nbin/zcip\r\ndev/\r\ndev/console\r\ndev/pts/\r\ndev/shm/\r\netc/\r\netc/group\r\netc/hostname\r\netc/hosts\r\netc/localtime\r\netc/mtab\r\netc/network/\r\netc/network/if-down.d/\r\netc/network/if-post-down.d/\r\netc/network/if-pre-up.d/\r\netc/network/if-up.d/\r\netc/passwd\r\netc/resolv.conf\r\netc/shadow\r\nhome/\r\nproc/\r\nroot/\r\nsys/\r\ntmp/\r\nusr/\r\nusr/sbin/\r\nvar/\r\nvar/spool/\r\nvar/spool/mail/\r\nvar/www/\r\n\r\n\r\nroot#  ls rootfs\r\nbin  dev  etc  home  proc  root  sys  tmp  usr  var\r\n\r\n// proc是空的\r\nroot/rootfs# cd proc/\r\nroot /rootfs/proc# ls\r\nroot /rootfs/proc# \r\n\r\n没有任何进程（）\r\nroot # chroot rootfs /bin/ps\r\nPID   USER     TIME  COMMAND\r\n\r\n\r\nroot # chroot rootfs /bin/sh\r\n/ # ps -ef\r\nPID   USER     TIME  COMMAND\r\n/ # \r\n/ # ps ajxf\r\nPID   USER     TIME  COMMAND\r\n/ # \r\n/ # \r\n```\r\n\r\n\r\n\r\n### 4. 参考文档\r\n\r\n[chroot介绍和使用](https://wangchujiang.com/linux-command/c/chroot.html)\r\n\r\n[浅析Linux中的.a、.so、和.o文件](https://oldpan.me/archives/linux-a-so-o-tell)\r\n\r\n用linux命令实现容器: https://juejin.cn/post/6951639064843911175\r\n\r\nunshare详解： unshare 就是使用与父进程不共享的命名空间运行 子进程\r\n\r\nhttps://juejin.cn/post/6987564689606180900\r\n\r\n\r\n\r\n"
  },
  {
    "path": "docker/4. 如何用golang 实现一个 busybox的容器.md",
"content": "* [1\\. 背景](#1-背景)\n* [2\\. 如何运行](#2-如何运行)\n* [3\\. 参考](#3-参考)\n\n### 1. 背景\n\n在入手docker源码之前，这里先用一个例子来理解一下上面提到的Linux原理。\n\n主要参考这个repo：https://github.com/jiajunhuang/cup/blob/master/README.md\n\n\n原repo中需要的准备工作为：\n\n（1）创建rootfs，并且自己下载 busybox 二进制文件\n\n但是我按照要求，下载好这个二进制文件，放入rootfs/bin 目录后一直报错：\n```\nroot /data/golang/src/cup/cup# ./cup \\\n> \n2021/12/05 15:21:44 main start...\n2021/12/05 15:21:44 path is :\n2021/12/05 15:21:44 childProcess start...uid: 0, gid: 0\n2021/12/05 15:21:44 child: hostname: kmaster\n2021/12/05 15:21:44 child: hostname: cup-host\n2021/12/05 15:21:44 failed to run command: fork/exec /bin/busybox: no such file or directory\npanic: failed to run command: fork/exec /bin/busybox: no such file or directory\n```\n\n因此为了更好地运行和理解原理，这里做了一些修改，主要是修改了rootfs：rootfs的内容直接从busybox镜像中提取出来。\n\n```\nroot@zoux:/home/zoux/data/golang/src/cup/cup# (docker export $(docker create busybox) | tar -C rootfs -xvf -)\n.dockerenv\nbin/\nbin/[\n...\n```\n最终的目录结构：\n\n```\nroot /data/golang/src/cup/cup# tree -L 1\n.\n├── cup\n├── LICENSE\n├── main.go\n├── Makefile\n├── README.md\n└── rootfs\n\n1 directory, 5 files\n```\n<br>\n\n### 2. 如何运行\n\n(1) make 生成二进制文件 cup\n\n(2) ./cup 即可\n\n```\nroot  /data/golang/src/cup/cup# ./cup \n2021/12/05 18:28:16 main start...\n2021/12/05 18:28:16 childProcess start...uid: 0, gid: 0\n2021/12/05 18:28:16 child: hostname: kmaster\n2021/12/05 18:28:16 child: hostname: cup-host\n/ # ps ajxf\nPID   USER     TIME  COMMAND\n    1 root      0:00 {exe} childProcess\n    6 root      0:00 /bin/busybox sh\n    7 root      0:00 ps ajxf\n/ # ls\nbin   dev   etc   home  proc  root  sys   tmp   usr   var\n```\n\n\n### 3. 参考\n\n[Linux Namespace 技术与 Docker 原理浅析](https://www.cnblogs.com/dream397/p/13999018.html)\n\n"
  },
  {
    "path": "docker/5. docker-overlay技术.md",
"content": "* [0 背景](#0-背景)\n* [1 overlay介绍](#1-overlay介绍)\n* [2\\. 实验\\-通过实验来理解](#2-实验-通过实验来理解)\n  * [2\\.1 实验设置](#21-实验设置)\n  * [2\\.2 补充实验](#22-补充实验)\n  * [2\\.2 结论](#22-结论)\n    * [2\\.2\\.1 workdir作用是什么](#221-workdir作用是什么)\n    * [2\\.2\\.2 文件覆盖规则](#222-文件覆盖规则)\n* [3 源码分析\\-通过原理来理解](#3-源码分析-通过原理来理解)\n* [4 总结](#4-总结)\n\n### 0 背景\n\ncgroup, namespaces, chroot都是Linux已有的功能，这些技术已经可以做到隔离。但是docker在这些基础上又加上了联合文件系统，这是docker image的基础，使得镜像可以分层继承。overlay是docker使用的联合文件系统的一种。本节对overlay的基础知识进行整理总结。\n\n### 1 overlay介绍\n\n![image-20220226173105551](./image/overlay-1.png)\n\n`OverlayFS` 文件系统主要有三个角色，`lowerdir`、`upperdir` 和 `merged`。`lowerdir` 是只读层，用户不能修改这个层的文件；`upperdir` 是可读写层，用户能够修改这个层的文件；而 `merged` 是合并层，把 `lowerdir` 层和 `upperdir` 层的文件合并展示。\n\n<br>\n\n使用 `OverlayFS` 前需要进行挂载操作，挂载 `OverlayFS` 文件系统的基本命令如下：\n\n```\n$ mount -t overlay overlay -o lowerdir=lower1:lower2,upperdir=upper,workdir=work merged\n```\n\n参数 `-t` 表示挂载的文件系统类型，这里设置为 `overlay` 表示文件系统类型为 `OverlayFS`，而参数 `-o` 指定的是 `lowerdir`、`upperdir` 和 `workdir`，最后的 `merged` 目录就是最终的挂载点目录。下面说明一下 `-o` 参数几个目录的作用：\n\n1. `lowerdir`：指定用户需要挂载的lower层目录，指定多个目录可以使用 `:` 来分隔（最大支持500层）。\n2. `upperdir`：指定用户需要挂载的upper层目录。\n3. `workdir`：指定文件系统的工作基础目录，挂载后内容会被清空，且在使用过程中其内容用户不可见。\n\n### 2. 
实验-通过实验来理解\n\n#### 2.1 实验设置\n\n```\nroot@k8s-master:~/testOverlay# mkdir -p fileRoot A B C worker\n\nroot@k8s-master:~/testOverlay# echo \"from A\" > A/a.txt\nroot@k8s-master:~/testOverlay# echo \"from B\" > B/b.txt\nroot@k8s-master:~/testOverlay# echo \"from C\" > C/c.txt\nroot@k8s-master:~/testOverlay# mkdir -p A/aa\nroot@k8s-master:~/testOverlay# tree\n.\n├── A\n│   ├── aa\n│   └── a.txt\n├── B\n│   └── b.txt\n├── C\n│   └── c.txt\n├── fileRoot\n└── worker\n```\n\n<br>\n\n指定 A、B 是底层目录(lowerdir)；C 是上层目录(upperdir)；worker 为工作目录(workdir)；fileRoot 为最终的挂载点(merged)。\n\n```\nmount -t overlay overlay -o lowerdir=A:B,upperdir=C,workdir=worker fileRoot\n```\n\n查看fileRoot结果：\n\n```\nroot@k8s-master:~/testOverlay# mount -t overlay overlay -o lowerdir=A:B,upperdir=C,workdir=worker fileRoot\nroot@k8s-master:~/testOverlay# \nroot@k8s-master:~/testOverlay# ls fileRoot/\naa  a.txt  b.txt  c.txt\n\n// 1.对worker目录进行实验。 结果：worker目录可以写入，但是不会影响fileRoot文件\nroot@k8s-master:~/testOverlay# echo \"from worker\" > worker/work.txt\nroot@k8s-master:~/testOverlay# \nroot@k8s-master:~/testOverlay# ls fileRoot/\naa  a.txt  b.txt  c.txt\nroot@k8s-master:~/testOverlay# \nroot@k8s-master:~/testOverlay# ls worker/\nwork  work.txt\n\n\n// 2.文件覆盖规则实验;  lowerdir可以手动修改\nroot@k8s-master:~/testOverlay# echo \"from A1\" > A/a.txt\nroot@k8s-master:~/testOverlay# cat fileRoot/a.txt \nfrom A1\nroot@k8s-master:~/testOverlay# ls worker/\nwork  work.txt\nroot@k8s-master:~/testOverlay# ls worker/work\n\n\n// 3. 
覆盖顺序测试： upperdir优先级最高，lowerdir按照mount时从左到右的顺序，权重依次降低，左边的覆盖右边的同名文件或者文件夹。\nroot@k8s-master:~/testOverlay# echo \"from B\" > B/a.txt\nroot@k8s-master:~/testOverlay# \nroot@k8s-master:~/testOverlay# ls fileRoot/\naa  a.txt  b.txt  c.txt\nroot@k8s-master:~/testOverlay# cat fileRoot/a.txt \nfrom A1\nroot@k8s-master:~/testOverlay# cat A/a.txt \nfrom A1\nroot@k8s-master:~/testOverlay# cat B/a.txt \nfrom B\n\noot@k8s-master:~/testOverlay# echo \"from A\" > A/b.txt\nroot@k8s-master:~/testOverlay# cat A/b.txt \nfrom A\nroot@k8s-master:~/testOverlay# cat fileRoot/b.txt \nfrom A\nroot@k8s-master:~/testOverlay# cat B/b.txt \nfrom B\nroot@k8s-master:~/testOverlay# \nroot@k8s-master:~/testOverlay# cat C/c.txt \nfrom C\nroot@k8s-master:~/testOverlay# cat fileRoot/c.txt \nfrom C\nroot@k8s-master:~/testOverlay# echo \"from A\" > A/c.txt\nroot@k8s-master:~/testOverlay# \nroot@k8s-master:~/testOverlay# cat A/c.txt \nfrom A\nroot@k8s-master:~/testOverlay# cat C/c.txt \nfrom C\nroot@k8s-master:~/testOverlay# cat fileRoot/c.txt \nfrom C\n\n// 目录中的文件也是一样，存在同名的时，以左边的A为准\nroot@k8s-master:~/testOverlay# mkdir B/aa\nroot@k8s-master:~/testOverlay# echo \"from bb\" > B/aa/a.txt\nroot@k8s-master:~/testOverlay# \nroot@k8s-master:~/testOverlay# cat fileRoot/aa/a.txt \nfrom aa\n\n\nroot@k8s-master:~/testOverlay# echo \"from bb\" > B/aa/b.txt\n\n// 为什么aa目录下没有b.txt\nroot@k8s-master:~/testOverlay# ls fileRoot/aa\na.txt\nroot@k8s-master:~/testOverlay# ls fileRoot/aa\na.txt\nroot@k8s-master:~/testOverlay# ls B/aa/b.txt \nB/aa/b.txt\nroot@k8s-master:~/testOverlay# ls A/aa/b.txt \nls: cannot access 'A/aa/b.txt': No such file or directory\nroot@k8s-master:~/testOverlay# ls fileRoot/aa\na.txt\nroot@k8s-master:~/testOverlay# \nroot@k8s-master:~/testOverlay# ls fileRoot/\naa  a.txt  b.txt  c.txt\nroot@k8s-master:~/testOverlay# \nroot@k8s-master:~/testOverlay# cd fileRoot/\nroot@k8s-master:~/testOverlay/fileRoot# ls\naa  a.txt  b.txt  c.txt\nroot@k8s-master:~/testOverlay/fileRoot# cd 
aa/\nroot@k8s-master:~/testOverlay/fileRoot/aa# ls\na.txt\n\n// 破案了，因为 A//aa 目录的优先级 比 B/aa高，所以fileRoot/aa = A/aa\nroot@k8s-master:~/testOverlay# echo \"from aa\" > A/aa/e.txt\nroot@k8s-master:~/testOverlay# \nroot@k8s-master:~/testOverlay# ls fileRoot/a\naa/    a.txt  \nroot@k8s-master:~/testOverlay# ls fileRoot/a\naa/    a.txt  \nroot@k8s-master:~/testOverlay# ls fileRoot/aa/\na.txt  e.txt\nroot@k8s-master:~/testOverlay# \nroot@k8s-master:~/testOverlay# echo \"from bb\" > B/aa/f.txt\nroot@k8s-master:~/testOverlay# ls fileRoot/aa/\na.txt  e.txt\nroot@k8s-master:~/testOverlay#\n\n// 这个就有，所以\nroot@k8s-master:~/testOverlay# echo \"from bb\" > B/f.txt\nroot@k8s-master:~/testOverlay# ls fileRoot/\naa  a.txt  b.txt  c.txt  f.txt\nroot@k8s-master:~/testOverlay# cat fileRoot/f.txt \nfrom bb\n\n// 为啥这个f.txt 不是from aa ???, 看起来又不是A为主？？\nroot@k8s-master:~/testOverlay# echo \"from bb\" > B/f.txt\nroot@k8s-master:~/testOverlay# ls fileRoot/\naa  a.txt  b.txt  c.txt  f.txt\nroot@k8s-master:~/testOverlay# cat fileRoot/f.txt \nfrom bb\nroot@k8s-master:~/testOverlay# echo \"from aa\" > A/f.txt\nroot@k8s-master:~/testOverlay# cat fileRoot/f.txt \nfrom bb\nroot@k8s-master:~/testOverlay# cat fileRoot/f.txt \nfrom bb\n\n在merged文件夹所做的所有修改，最终都会存储到upperdir目录中\nroot@k8s-master:~/testOverlay# echo \"from fileRoot\" > fileRoot/fr.txt\nroot@k8s-master:~/testOverlay# ls C\nc.txt  fr.txt\n```\n\n<br>\n\n#### 2.2 补充实验\n\n```\nroot@k8s-master:~/testOver# mkdir -p  fileRoot A/aa B/aa C worker\nroot@k8s-master:~/testOver# echo \"from A\" > A/a.txt\nroot@k8s-master:~/testOver# echo \"from A\" > A/aa/a.txt\nroot@k8s-master:~/testOver# echo \"from B\" > B/aa/a.txt\nroot@k8s-master:~/testOver# echo \"from B\" > B/aa/b.txt\nroot@k8s-master:~/testOver# echo \"from B\" > B/a.txt\nroot@k8s-master:~/testOver# echo \"from B\" > B/b.txt\nroot@k8s-master:~/testOver# echo \"from C\" > C/c.txt\n\nroot@k8s-master:~/testOver# mount -t overlay overlay -o lowerdir=A:B,upperdir=C,workdir=worker 
fileRoot\nroot@k8s-master:~/testOver# ls fileRoot/\naa  a.txt  b.txt  c.txt\nroot@k8s-master:~/testOver# ls fileRoot/a.txt \nfileRoot/a.txt\nroot@k8s-master:~/testOver# cat fileRoot/a.txt \nfrom A\nroot@k8s-master:~/testOver# ls fileRoot/aa/\na.txt  b.txt\nroot@k8s-master:~/testOver# cat fileRoot/aa/a.txt \nfrom A\n```\n\n#### 2.3 结论\n\n##### 2.3.1 workdir作用是什么\n\n通过实验：workdir目录平时都是空的，但是可以手动写入文件，写入文件后不影响overlay文件(fileRoot)；\n\n单从实验看不出它的具体作用，查询资料后的解释是：\n\nworkdir选项是必需的，用于在原子操作中将文件切换到覆盖目标之前准备文件（workdir必须与upperdir在同一文件系统上）。\n\n资料来源：http://windsock.io/the-overlay-filesystem/\n\n这里的“覆盖目标”指的应该就是`upperdir`：某些文件（例如“whiteout”文件）会先在`workdir`里非原子地创建和设置好，然后再原子地移动（rename）到`upperdir`。\n\n链接：https://qastack.cn/unix/324515/linux-filesystem-overlay-what-is-workdir-used-for-overlayfs\n\n<br>\n\n##### 2.3.2 文件覆盖规则\n\n（1）lowerdir的值可以是多个文件夹组成的列表；这些层本身是只读的，在merged中修改来自lowerdir的文件时，会先把文件copy-up到upperdir再修改\n\n（2）merged文件夹是最终联合起来的文件系统，我们可以在merged文件夹中访问所有lowerdir和upperdir中的内容\n\n（3）文件的覆盖顺序，upperdir目录拥有最高覆盖权限，lowerdir按照mount时从左到右的顺序，权重依次降低，左边的覆盖右边的同名文件或者文件夹。\n\n（4）在merged文件夹所做的所有修改，最终都会存储到upperdir目录中\n\n（5）workdir指定的目录需要和upperdir位于同一文件系统上\n\n\n（6）mount的时候，lowerdir中相同的文件会被最左边的覆盖，不同的文件会合并到相同目录 （补充实验）\n\n\n### 3 源码分析-通过原理来理解\n\n目前暂时先不涉及这一块代码，了解大概的使用即可。如果需要，后面参考这两个链接再仔细研究。\n\nhttps://mp.weixin.qq.com/s/pgu0uXvokgBTXUNk1LpB6Q\n\nhttps://docs.docker.com/storage/storagedriver/overlayfs-driver/\n\n\n###  4 总结\n\n从实验结果来看，docker image的各层中，最新的一层应该放在lowerdir的最左边。\n"
  },
  {
    "path": "docker/6. docker pull原理分析.md",
    "content": "* [0\\. 章节目标](#0-章节目标)\r\n* [1\\. docker pull busybox 引入](#1-docker-pull-busybox-引入)\r\n  * [1\\.1 引入的问题](#11-引入的问题)\r\n* [2\\. docker pull 原理](#2-docker-pull-原理)\r\n  * [2\\.1 查看docker 信息](#21-查看docker-信息)\r\n  * [2\\.2 Root Dir](#22-root-dir)\r\n  * [2\\.3 image目录](#23-image目录)\r\n  * [2\\.4 如何获取dockerhub镜像的manifest](#24-如何获取dockerhub镜像的manifest)\r\n* [3\\. docker pull后的文件是如何存储的](#3-docker-pull后的文件是如何存储的)\r\n  * [3\\.1 查看image元数据信息\\-imageConfig](#31-查看image元数据信息-imageconfig)\r\n  * [3\\.2 sha256sum 作用](#32-sha256sum-作用)\r\n  * [3\\.3 diff\\_ids vs docker pull的layer\\-id](#33-diff_ids-vs-docker-pull的layer-id)\r\n  * [3\\.4 如何查看每一层的layer在哪](#34-如何查看每一层的layer在哪)\r\n* [4\\. 结论](#4-结论)\r\n* [5 参考](#5-参考)\r\n\r\n### 0. 章节目标\r\n\r\n从体验和原理入手， 弄清楚docker pull 镜像的过程； 弄清楚docker 镜像是如何存储的， 为后面docker pull 源码做准备。\r\n\r\n### 1. docker pull busybox 引入\r\n\r\n```\r\nroot@k8s-master:~# docker pull busybox\r\nUsing default tag: latest\r\nlatest: Pulling from library/busybox\r\n3cb635b06aa2: Pull complete                                                                 //该镜像只有一层\r\nDigest: sha256:b5cfd4befc119a590ca1a81d6bb0fa1fb19f1fbebd0397f25fae164abe1e8a6a\r\nStatus: Downloaded newer image for busybox:latest\r\ndocker.io/library/busybox:latest\r\n\r\n// 镜像id 是 ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af\r\nroot@k8s-master:/var/lib/docker/overlay2# docker images  --no-trunc\r\nREPOSITORY             TAG         IMAGE ID                                                              CREATED     SIZE\r\nbusybox               latest  sha256:ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af   4 days ago   1.24MB\r\n\r\n\r\n\r\nroot@k8s-master:~# docker pull busybox:latest \r\nlatest: Pulling from library/busybox\r\nDigest: sha256:b5cfd4befc119a590ca1a81d6bb0fa1fb19f1fbebd0397f25fae164abe1e8a6a\r\nStatus: Image is up to date for busybox:latest\r\ndocker.io/library/busybox:latest\r\n\r\n\r\n\r\n\r\nroot@k8s-master:~# docker rmi 
busybox:latest\r\nUntagged: busybox:latest\r\nUntagged: busybox@sha256:b5cfd4befc119a590ca1a81d6bb0fa1fb19f1fbebd0397f25fae164abe1e8a6a\r\nDeleted: sha256:ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af\r\nDeleted: sha256:64cac9eaf0da6a7ae6519b6c7198929f232324e0822b5e359ee0e27104e2d3ed       \r\n\r\n\r\nroot@k8s-master:~# docker pull zoux/pause-amd64:3.0\r\n3.0: Pulling from zoux/pause-amd64\r\n4f4fb700ef54: Pull complete \r\nce150f7a21ec: Pull complete \r\nDigest: sha256:f04288efc7e65a84be74d4fc63e235ac3c6c603cf832e442e0bd3f240b10a91b\r\nStatus: Downloaded newer image for zoux/pause-amd64:3.0\r\n```\r\n\r\n#### 1.1 引入的问题\r\n\r\n Q: Digest 是什么 ？  \r\n\r\nA：镜像的 manifest 在服务器端的 sha256 值。\r\n\r\nQ: rmi 的时候为什么还要delete: 64cac9eaf0da6a7ae6519b6c7198929f232324e0822b5e359ee0e27104e2d3ed ?\r\n\r\nA: 64cac9eaf0da6a7ae6519b6c7198929f232324e0822b5e359ee0e27104e2d3ed  是busybox 的rootfs_id（也就是image config里唯一一层的diff_id）\r\n\r\n<br>\r\n\r\n### 2. docker pull 原理\r\n\r\n![image-20220226174337490](./image/image-1.png)\r\n\r\n\r\n\r\n关键信息：\r\n\r\n（1）manifest 有什么信息\r\n\r\n（2）image config是什么\r\n\r\n（3）diff_ids是什么\r\n\r\n#### 2.1 查看docker 信息 \r\n\r\n```\r\nroot@k8s-master:~# docker info\r\nClient:\r\n Debug Mode: false\r\n\r\nServer:\r\n Containers: 11\r\n  Running: 4\r\n  Paused: 0\r\n  Stopped: 7\r\n Images: 7\r\n Server Version: 19.03.9\r\n Storage Driver: overlay2                       // 使用的是 overlay2文件系统\r\n  Backing Filesystem: extfs\r\n  Supports d_type: true\r\n  Native Overlay Diff: true\r\n Logging Driver: json-file\r\n Cgroup Driver: cgroupfs\r\n Plugins:\r\n  Volume: local\r\n  Network: bridge host ipvlan macvlan null overlay\r\n  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog\r\n Swarm: inactive\r\n Runtimes: runc\r\n Default Runtime: runc\r\n Init Binary: docker-init\r\n containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429\r\n runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd\r\n init version: fec3683\r\n Security Options:\r\n  apparmor\r\n  
seccomp\r\n   Profile: default\r\n Kernel Version: 4.19.0-17-amd64\r\n Operating System: Debian GNU/Linux 10 (buster)\r\n OSType: linux\r\n Architecture: x86_64\r\n CPUs: 2\r\n Total Memory: 3.854GiB\r\n Name: k8s-master\r\n ID: DN3J:XOLZ:VIGR:W4E2:LK47:PCEH:43KP:LFCW:XPRG:NPEZ:4DRR:TPTE\r\n Docker Root Dir: /var/lib/docker                             // docker 关键文件\r\n Debug Mode: false\r\n Registry: https://index.docker.io/v1/\r\n Labels:\r\n Experimental: true\r\n Insecure Registries:\r\n  127.0.0.0/8\r\n Registry Mirrors:\r\n  https://b9pmyelo.mirror.aliyuncs.com/\r\n Live Restore Enabled: false\r\n Product License: Community Engine\r\n\r\nWARNING: No swap limit support\r\n```\r\n\r\n#### 2.2 Root Dir\r\n\r\n这里一个非常关键的就是: Docker Root Dir: /var/lib/docker                  \r\n\r\n```\r\nroot@k8s-master:~# ls -l /var/lib/docker\r\ntotal 60\r\ndrwx------  2 root root  4096 Oct 23 16:13 builder\r\ndrwx--x--x  4 root root  4096 Oct 23 16:13 buildkit\r\ndrwx------  3 root root  4096 Oct 23 16:13 containerd\r\ndrwx------ 13 root root  4096 Dec 12 16:51 containers\r\ndrwx------  3 root root  4096 Oct 23 16:13 image\r\ndrwxr-x---  3 root root  4096 Oct 23 16:13 network\r\ndrwx------ 55 root root 12288 Dec 12 16:51 overlay2\r\ndrwx------  4 root root  4096 Oct 23 16:13 plugins\r\ndrwx------  2 root root  4096 Dec 12 16:50 runtimes\r\ndrwx------  2 root root  4096 Oct 23 16:13 swarm\r\ndrwx------  2 root root  4096 Dec 12 16:50 tmp\r\ndrwx------  2 root root  4096 Oct 23 16:13 trust\r\ndrwx------  2 root root  4096 Oct 23 16:13 volume\r\n```\r\n\r\n和镜像存储有关的信息如下：\r\n\r\n- overlay2: 镜像和容器的层信息\r\n- image：存储镜像元相关信息\r\n\r\n#### 2.3 image目录\r\n\r\n```\r\nroot@k8s-master:~# tree -L 1 /var/lib/docker/image/overlay2/\r\n/var/lib/docker/image/overlay2/\r\n├── distribution\r\n├── imagedb\r\n├── layerdb\r\n└── repositories.json\r\n\r\n3 directories, 1 file\r\n```\r\n\r\nrepositories.json就是存储镜像信息，主要是name和image id的对应，digest和image 
id的对应。当pull镜像的时候会更新这个文件。\r\n\r\n```\r\nroot@k8s-master:/var/lib/docker# cat image/overlay2/repositories.json\r\n{\r\n    \"Repositories\": {\r\n        \"busybox\": {\r\n            \"busybox:latest\": \"sha256:ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af\", \r\n            \"busybox@sha256:b5cfd4befc119a590ca1a81d6bb0fa1fb19f1fbebd0397f25fae164abe1e8a6a\": \"sha256:ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af\"\r\n        }, \r\n        \"zoux/pause-amd64\": {\r\n            \"zoux/pause-amd64:3.0\": \"sha256:99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2\", \r\n            \"zoux/pause-amd64@sha256:f04288efc7e65a84be74d4fc63e235ac3c6c603cf832e442e0bd3f240b10a91b\": \"sha256:99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2\"\r\n        }, \r\n        \"nginx\": {\r\n            \"nginx:latest\": \"sha256:f652ca386ed135a4cbe356333e08ef0816f81b2ac8d0619af01e2b256837ed3e\", \r\n            \"nginx@sha256:097c3a0913d7e3a5b01b6c685a60c03632fc7a2b50bc8e35bcaa3691d788226e\": \"sha256:ea335eea17ab984571cd4a3bcf90a0413773b559c75ef4cda07d0ce952b00291\", \r\n            \"nginx@sha256:644a70516a26004c97d0d85c7fe1d0c3a67ea8ab7ddf4aff193d9f301670cf36\": \"sha256:87a94228f133e2da99cb16d653cd1373c5b4e8689956386c1c12b60a20421a02\", \r\n            \"nginx@sha256:9522864dd661dcadfd9958f9e0de192a1fdda2c162a35668ab6ac42b465f0603\": \"sha256:f652ca386ed135a4cbe356333e08ef0816f81b2ac8d0619af01e2b256837ed3e\"\r\n        }, \r\n        \"quay.io/coreos/flannel\": {\r\n            \"quay.io/coreos/flannel:v0.15.0\": \"sha256:09b38f011a29c697679aa10918b7514e22136b50ceb6cf59d13151453fe8b7a0\", \r\n            \"quay.io/coreos/flannel@sha256:bf24fa829f753d20b4e36c64cf9603120c6ffec9652834953551b3ea455c4630\": \"sha256:09b38f011a29c697679aa10918b7514e22136b50ceb6cf59d13151453fe8b7a0\"\r\n        }, \r\n        \"rancher/mirrored-flannelcni-flannel-cni-plugin\": {\r\n            
\"rancher/mirrored-flannelcni-flannel-cni-plugin:v1.2\": \"sha256:98660e6e4c3ae49bf49cd640309f79626c302e1d8292e1971dcc2e6a6b7b8c4d\", \r\n            \"rancher/mirrored-flannelcni-flannel-cni-plugin@sha256:b69fb2dddf176edeb7617b176543f3f33d71482d5d425217f360eca5390911dc\": \"sha256:98660e6e4c3ae49bf49cd640309f79626c302e1d8292e1971dcc2e6a6b7b8c4d\"\r\n        }\r\n    }\r\n}\r\n\r\n```\r\n\r\n<br>\r\n\r\n```\r\nroot@k8s-master:~# docker images --digests\r\nREPOSITORY TAG     DIGEST                                                                     IMAGE ID     CREATED         SIZE\r\nbusybox    latest  sha256:b5cfd4befc119a590ca1a81d6bb0fa1fb19f1fbebd0397f25fae164abe1e8a6a    ffe9d497c324  4 days ago    1.24MB\r\n```\r\n\r\n**查看docker image信息**\r\n\r\n```\r\nroot@k8s-master:~# export DOCKER_CLI_EXPERIMENTAL=enabled   //需要开启docker cli\r\nroot@k8s-master:~# docker manifest inspect busybox:latest \r\n{\r\n   \"schemaVersion\": 2,\r\n   \"mediaType\": \"application/vnd.docker.distribution.manifest.list.v2+json\",\r\n   \"manifests\": [\r\n      {\r\n         \"mediaType\": \"application/vnd.docker.distribution.manifest.v2+json\",\r\n         \"size\": 527,\r\n         \"digest\": \"sha256:50e44504ea4f19f141118a8a8868e6c5bb9856efa33f2183f5ccea7ac62aacc9\",  //这个为啥不一样\r\n         \"platform\": {\r\n            \"architecture\": \"amd64\",\r\n            \"os\": \"linux\"\r\n         }\r\n      },\r\n      {    // 其他平台。。\r\n         \"mediaType\": \"application/vnd.docker.distribution.manifest.v2+json\",\r\n         \"size\": 527,\r\n         \"digest\": \"sha256:0252da5f2df7425dcf48afb4bc337966dfeb2d87079ea3f7fe25051d5b9e9c26\",\r\n         \"platform\": {\r\n            \"architecture\": \"arm\",\r\n            \"os\": \"linux\",\r\n            \"variant\": \"v5\"\r\n         }\r\n      },\r\n\r\n   ]\r\n}\r\n```\r\n\r\n\r\n\r\n**解答疑问：**\r\n\r\n从这里就可以看出来，repositories.json存储了 镜像id和  digestsid的对应关系。\r\n\r\ndigestsid 就是存储在服务器远端的 所有镜像文件的  sha256值。\r\n\r\n当第二次docker 
pull的时候，发现 busybox:latest 对应的  digestsid=b5cfd4befc119a590ca1a81d6bb0fa1fb19f1fbebd0397f25fae164abe1e8a6a。\r\n\r\n一查看repositories.json，发现本地有这个镜像，所以不会再下载了。\r\n\r\n<br>\r\n\r\ndigest是manifest的sha256:，因为manifest在本地没有，我们可以通过registry的结果去获取。\r\n\r\n#### 2.4 如何获取dockerhub镜像的manifest\r\n\r\nhttps://stackoverflow.com/questions/55269256/how-to-get-manifests-using-http-api-v2\r\n\r\nhttps://zhuanlan.zhihu.com/p/95900321\r\n\r\n这个看起来可以的\r\n\r\nhttps://gist.github.com/tnozicka/f46b37f57f7ac755fefa6a0f0c8a77bf\r\n\r\n```\r\nrepo=openshift/origin && curl -H \"Authorization: Bearer $(curl -sSL \"https://auth.docker.io/token?service=registry.docker.io&scope=repository:${repo}:pull\" | jq --raw-output .token)\" \"https://registry.hub.docker.com/v2/${repo}/manifests/latest\"\r\n\r\n\r\nroot@k8s-master:~# repo=zoux/pause-amd64 && curl -H \"Authorization: Bearer $(curl -sSL \"https://auth.docker.io/token?service=registry.docker.io&scope=repository:${repo}:pull\" | jq --raw-output .token)\" \"https://registry.hub.docker.com/v2/${repo}/manifests/3.0\"\r\n{\r\n   \"schemaVersion\": 1,\r\n   \"name\": \"zoux/pause-amd64\",\r\n   \"tag\": \"3.0\",\r\n   \"architecture\": \"amd64\",\r\n   \"fsLayers\": [\r\n      {\r\n         \"blobSum\": \"sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1\"\r\n      },\r\n      {\r\n         \"blobSum\": \"sha256:ce150f7a21ecb3a4150d71685079f2727057c1785323933f9fdd0750874e13e5\"\r\n      },\r\n      {\r\n         \"blobSum\": \"sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1\"\r\n      }\r\n   ],\r\n   \"history\": [\r\n      {\r\n         \"v1Compatibility\": 
\"{\\\"architecture\\\":\\\"amd64\\\",\\\"config\\\":{\\\"Hostname\\\":\\\"95722352e41d\\\",\\\"Domainname\\\":\\\"\\\",\\\"User\\\":\\\"\\\",\\\"AttachStdin\\\":false,\\\"AttachStdout\\\":false,\\\"AttachStderr\\\":false,\\\"Tty\\\":false,\\\"OpenStdin\\\":false,\\\"StdinOnce\\\":false,\\\"Env\\\":null,\\\"Cmd\\\":null,\\\"Image\\\":\\\"f8e2eec424cf985b4e41d6423991433fb7a93c90f9acc73a5e7bee213b789c52\\\",\\\"Volumes\\\":null,\\\"WorkingDir\\\":\\\"\\\",\\\"Entrypoint\\\":[\\\"/pause\\\"],\\\"OnBuild\\\":null,\\\"Labels\\\":{}},\\\"container\\\":\\\"a9873535145fe72b464d3055efbac36aab70d059914e221cbbd7fe3cac53ef6b\\\",\\\"container_config\\\":{\\\"Hostname\\\":\\\"95722352e41d\\\",\\\"Domainname\\\":\\\"\\\",\\\"User\\\":\\\"\\\",\\\"AttachStdin\\\":false,\\\"AttachStdout\\\":false,\\\"AttachStderr\\\":false,\\\"Tty\\\":false,\\\"OpenStdin\\\":false,\\\"StdinOnce\\\":false,\\\"Env\\\":null,\\\"Cmd\\\":[\\\"/bin/sh\\\",\\\"-c\\\",\\\"#(nop) ENTRYPOINT \\\\u0026{[\\\\\\\"/pause\\\\\\\"]}\\\"],\\\"Image\\\":\\\"f8e2eec424cf985b4e41d6423991433fb7a93c90f9acc73a5e7bee213b789c52\\\",\\\"Volumes\\\":null,\\\"WorkingDir\\\":\\\"\\\",\\\"Entrypoint\\\":[\\\"/pause\\\"],\\\"OnBuild\\\":null,\\\"Labels\\\":{}},\\\"created\\\":\\\"2016-05-04T06:26:41.522308365Z\\\",\\\"docker_version\\\":\\\"1.9.1\\\",\\\"id\\\":\\\"3d2e5b3ef4b070401482a8161420136e75da9354ccfc7cece40b2b5ba8d0f1be\\\",\\\"os\\\":\\\"linux\\\",\\\"parent\\\":\\\"58ca451648f521bb9749d929fab33c76c1aec4ac54990f4d33fb86705682ec32\\\"}\"\r\n      },\r\n      {\r\n         \"v1Compatibility\": \"{\\\"id\\\":\\\"58ca451648f521bb9749d929fab33c76c1aec4ac54990f4d33fb86705682ec32\\\",\\\"parent\\\":\\\"00fa447be331f70e08ea0dfff0174e514aac7f0f089a6c4d3a8f58d855a10b3e\\\",\\\"created\\\":\\\"2016-05-04T06:26:41.091672218Z\\\",\\\"container_config\\\":{\\\"Cmd\\\":[\\\"/bin/sh -c #(nop) ADD file:b7eb6a5df9d5fbe509cac16ed89f8d6513a4362017184b14c6a5fae151eee5c5 in /pause\\\"]}}\"\r\n      },\r\n      {\r\n         
\"v1Compatibility\": \"{\\\"id\\\":\\\"00fa447be331f70e08ea0dfff0174e514aac7f0f089a6c4d3a8f58d855a10b3e\\\",\\\"created\\\":\\\"2016-05-04T06:26:40.628395649Z\\\",\\\"container_config\\\":{\\\"Cmd\\\":[\\\"/bin/sh -c #(nop) ARG ARCH\\\"]}}\"\r\n      }\r\n   ],\r\n   \"signatures\": [\r\n      {\r\n         \"header\": {\r\n            \"jwk\": {\r\n               \"crv\": \"P-256\",\r\n               \"kid\": \"W2RG:USLL:S22T:VLMH:PO66:FQVK:M5BQ:WYME:FDIC:TNX4:J4TE:LKIW\",\r\n               \"kty\": \"EC\",\r\n               \"x\": \"abyPWJMVZM6xBosAkf1sUh4D30sa-4XEjXNTuIv72_s\",\r\n               \"y\": \"9miJIR5j2yXpcTaxqrFW491OEKc0npyWDYAa5KLxDNw\"\r\n            },\r\n            \"alg\": \"ES256\"\r\n         },\r\n         \"signature\": \"WZVTu9_Q2jFeNViqxIXUf_bLlLTjhH5tAjdcdCB0ohC1hgyxLIrt1hAeG2ZZkxg0wBuEaWm8ip6C1yt6Vad9SQ\",\r\n         \"protected\": \"eyJmb3JtYXRMZW5ndGgiOjIzOTEsImZvcm1hdFRhaWwiOiJDbjAiLCJ0aW1lIjoiMjAyMS0xMi0yNVQwMjozNjowN1oifQ\"\r\n      }\r\n   ]\r\n}\r\n```\r\n\r\n### 3. 
docker pull后的文件是如何存储的\r\n\r\n#### 3.1 查看image元数据信息-imageConfig\r\n\r\n镜像元数据存储在了/var/lib/docker/image/<storage_driver>/imagedb/content/sha256/目录下，名称是以镜像ID命名的文件，镜像ID可通过docker images查看。这些文件以json的形式保存了该镜像的rootfs信息、镜像创建时间、构建历史信息、所用容器、以及启动的Entrypoint和CMD等信息。\r\n\r\n这里以busybox镜像为例： 从docker pull的输出可以看出来，busybox只有一层， 3cb635b06aa2\r\n\r\n```\r\n// docker pull busybox之前\r\nroot@k8s-master:/var/lib/docker/image/overlay2/imagedb/content/sha256# ls\r\n09b38f011a29c697679aa10918b7514e22136b50ceb6cf59d13151453fe8b7a0\r\n87a94228f133e2da99cb16d653cd1373c5b4e8689956386c1c12b60a20421a02\r\n98660e6e4c3ae49bf49cd640309f79626c302e1d8292e1971dcc2e6a6b7b8c4d\r\n99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2\r\nea335eea17ab984571cd4a3bcf90a0413773b559c75ef4cda07d0ce952b00291\r\nf652ca386ed135a4cbe356333e08ef0816f81b2ac8d0619af01e2b256837ed3e\r\n\r\n// docker pull的时候，只有这个pull\r\n3cb635b06aa2: Pull complete\r\n\r\n// 下载镜像之后\r\nroot@k8s-master:/var/lib/docker/image/overlay2/imagedb/content/sha256# ls  \r\n09b38f011a29c697679aa10918b7514e22136b50ceb6cf59d13151453fe8b7a0\r\n87a94228f133e2da99cb16d653cd1373c5b4e8689956386c1c12b60a20421a02\r\n98660e6e4c3ae49bf49cd640309f79626c302e1d8292e1971dcc2e6a6b7b8c4d\r\n99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2\r\nea335eea17ab984571cd4a3bcf90a0413773b559c75ef4cda07d0ce952b00291\r\nf652ca386ed135a4cbe356333e08ef0816f81b2ac8d0619af01e2b256837ed3e\r\nffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af     // 多了这一个文件, 每个文件名就是一个imageid\r\n\r\n\r\n// 文件内容是镜像的详细信息\r\nroot@k8s-master:/var/lib/docker/image/overlay2/imagedb/content/sha256# cat ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af\r\n{\r\n    \"architecture\": \"amd64\", \r\n    \"config\": {\r\n        \"Hostname\": \"\", \r\n        \"Domainname\": \"\", \r\n        \"User\": \"\", \r\n        \"AttachStdin\": false, \r\n        \"AttachStdout\": false, \r\n        \"AttachStderr\": false, \r\n        \"Tty\": false, \r\n        
\"OpenStdin\": false, \r\n        \"StdinOnce\": false, \r\n        \"Env\": [\r\n            \"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\"\r\n        ], \r\n        \"Cmd\": [\r\n            \"sh\"\r\n        ], \r\n        \"Image\": \"sha256:47595422ea26649bce6768903b3f14aa220694e0811e1bdb5e5bd6fd3df852b2\", \r\n        \"Volumes\": null, \r\n        \"WorkingDir\": \"\", \r\n        \"Entrypoint\": null, \r\n        \"OnBuild\": null, \r\n        \"Labels\": null\r\n    }, \r\n    \"container\": \"0234093c99ba42a97028378063ca32364ca85f74b6804ae65da0f874c16cff69\", \r\n    \"container_config\": {\r\n        \"Hostname\": \"0234093c99ba\", \r\n        \"Domainname\": \"\", \r\n        \"User\": \"\", \r\n        \"AttachStdin\": false, \r\n        \"AttachStdout\": false, \r\n        \"AttachStderr\": false, \r\n        \"Tty\": false, \r\n        \"OpenStdin\": false, \r\n        \"StdinOnce\": false, \r\n        \"Env\": [\r\n            \"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\"\r\n        ], \r\n        \"Cmd\": [\r\n            \"/bin/sh\", \r\n            \"-c\", \r\n            \"#(nop) \", \r\n            \"CMD [\\\"sh\\\"]\"\r\n        ], \r\n        \"Image\": \"sha256:47595422ea26649bce6768903b3f14aa220694e0811e1bdb5e5bd6fd3df852b2\", \r\n        \"Volumes\": null, \r\n        \"WorkingDir\": \"\", \r\n        \"Entrypoint\": null, \r\n        \"OnBuild\": null, \r\n        \"Labels\": { }\r\n    }, \r\n    \"created\": \"2021-12-08T00:22:34.424256906Z\", \r\n    \"docker_version\": \"20.10.7\", \r\n    \"history\": [\r\n        {\r\n            \"created\": \"2021-12-08T00:22:34.228923742Z\", \r\n            \"created_by\": \"/bin/sh -c #(nop) ADD file:e2d2d9591696b14787114bccd6c84033d8e8433ce416045672e2870b983b6029 in / \"\r\n        }, \r\n        {\r\n            \"created\": \"2021-12-08T00:22:34.424256906Z\", \r\n            \"created_by\": \"/bin/sh -c #(nop)  CMD [\\\"sh\\\"]\", \r\n            
\"empty_layer\": true\r\n        }\r\n    ], \r\n    \"os\": \"linux\", \r\n    \"rootfs\": {\r\n        \"type\": \"layers\", \r\n        \"diff_ids\": [\r\n            \"sha256:64cac9eaf0da6a7ae6519b6c7198929f232324e0822b5e359ee0e27104e2d3ed\"   \r\n        ]\r\n    }\r\n}\r\n```\r\n\r\n<br>\r\n\r\n#### 3.2 sha256sum 作用\r\n\r\nsha256sum：计算文件的哈希值\r\n\r\n```\r\nroot@k8s-master:~# sha256sum a.sh\r\n96a9988dd952b0910d4d808187b52a623fda2a45b86337b61a76589618f901bf  a.sh\r\n\r\n\r\n没看错，镜像id就是 该image-config文件的hash值\r\nroot@k8s-master:/var/lib/docker/image/overlay2/imagedb/content/sha256# sha256sum  ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af           \r\nffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af      // 该文件的hash值 ffe9d497c32414b1c5cdad8178a85602ee72453082da2463f1dede592ac7d5af\r\n```\r\n\r\n**镜像id就是 该image-config文件的hash值！！！**\r\n\r\n#### 3.3 diff_ids vs docker pull的layer-id\r\n\r\n/var/lib/docker/image/overlay2/imagedb/content/sha256  目录存放了镜像的 config。并且指定了 diff_ids是： 64cac9eaf0da6a7ae6519b6c7198929f232324e0822b5e359ee0e27104e2d3ed\r\n\r\n这个看起来就是 具体镜像文件了。\r\n\r\ndocker pull是: 3cb635b06aa2  \r\n\r\ndiff_ids: 64cac9eaf0da6a7ae6519b6c7198929f232324e0822b5e359ee0e27104e2d3ed\r\n\r\n这两个为啥又不一样？\r\n\r\n在pull镜像的时候显示的是各个layer的digest信息，在image config存的是diffid。要区分这两个，还要先回答为什么manifest的layer的表达和image config的layer的表达不是一个东西。\r\n\r\n<br>\r\n\r\n**结论：**  image config里面的diffid 就是本地解压后的 layer 的 sha256sum值。 docker pull显示的是服务器端压缩后的 layer 的 sha256sum值。\r\n\r\n当我们去registry上拉layer的时候，拉什么格式是根据请求中的media type决定的，而layer存在本地的时候是未压缩的，或者说是解压过的。\r\n\r\n为了在网络上传输得更快，所以media type一般会指定压缩格式，比如gzip，具体有哪些格式，见：[media type](https://link.zhihu.com/?target=https%3A//docs.docker.com/registry/spec/manifest-v2-2/%23media-types)\r\n\r\n结合我最开始说的（manifest对应registry服务端的配置，image config针对本地存储端的），其实也就不难理解了。\r\n\r\n当docker发现本地不存在某个layer的时候，就会通过manifest里面的digest + mediaType（一般是\"application/vnd.docker.image.rootfs.diff.tar.gzip\"）去registry拉对应的layer。\r\n\r\n然后在image config里存的对应的diff 
id就是上面拿到的tar.gz包解压为tar包的id。\r\n\r\n```\r\n# curl -H \"Accept:application/vnd.docker.image.rootfs.diff.tar.gzip\" https://docker-search.4pd.io/v2/ubuntu/blobs/sha256:7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c -o layer1.tar.gz\r\n\r\n\r\n# sha256sum layer1.tar.gz\r\n7ddbc47eeb70dc7f08e410a6667948b87ff3883024eb41478b44ef9a81bf400c  layer1.tar.gz\r\n\r\n# sha256sum layer1.tar\r\ncc967c529ced563b7746b663d98248bc571afdb3c012019d7f54d6c092793b8b  layer1.tar\r\n```\r\n\r\n**distribution目录存放了对应的转换关系**\r\n\r\nv2metadata-by-diffid ： 文件名是 diffid， 文件的值是digest\r\n\r\ndiffid-by-digest: 文件名是digest, 文件值是 diffid\r\n\r\n```\r\nroot@k8s-master:/var/lib/docker/image/overlay2/distribution/v2metadata-by-diffid/sha256# cat 64cac9eaf0da6a7ae6519b6c7198929f232324e0822b5e359ee0e27104e2d3ed \r\n[{\"Digest\":\"sha256:3cb635b06aa273034d7080e0242e4b6628c59347d6ddefff019bfd82f45aa7d5\",\"SourceRepository\":\"docker.io/library/busybox\",\"HMAC\":\"\"}]\r\n\r\nroot@k8s-master:/var/lib/docker/image/overlay2/distribution/diffid-by-digest/sha256# cat 3cb635b06aa273034d7080e0242e4b6628c59347d6ddefff019bfd82f45aa7d5 \r\nsha256:64cac9eaf0da6a7ae6519b6c7198929f232324e0822b5e359ee0e27104e2d3ed\r\n```\r\n\r\n#### 3.4 如何查看每一层的layer在哪\r\n\r\n以curlimages/curl:7.75.0镜像为例：\r\n\r\n```\r\n{\r\n\t\"architecture\": \"amd64\",\r\n\t\"config\": {\r\n\t\t\"User\": \"curl_user\",\r\n\t\t\"Env\": [\"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\", \"CURL_VERSION=7_75_0\", \"CURL_RELEASE_TAG=curl-7_75_0\", \"CURL_GIT_REPO=https://github.com/curl/curl.git\", \"CURL_CA_BUNDLE=/cacert.pem\"],\r\n\t\t\"Entrypoint\": [\"/entrypoint.sh\"],\r\n\t\t\"Cmd\": [\"curl\"],\r\n\t\t\"Labels\": {\r\n\t\t\t\"Maintainer\": \"James Fuller \\u003cjim.fuller@webcomposite.com\\u003e\",\r\n\t\t\t\"Name\": \"curl\",\r\n\t\t\t\"Version\": \"1.0.0\",\r\n\t\t\t\"docker.cmd\": \"docker run -it curl/curl:7.75.0 -s -L http://curl.haxx.se\",\r\n\t\t\t\"se.haxx.curl\": 
\"curl\",\r\n\t\t\t\"se.haxx.curl.description\": \"network utility\",\r\n\t\t\t\"se.haxx.curl.release_tag\": \"curl-7_75_0\",\r\n\t\t\t\"se.haxx.curl.version\": \"7_75_0\"\r\n\t\t},\r\n\t\t\"ArgsEscaped\": true,\r\n\t\t\"OnBuild\": null\r\n\t},\r\n\t\"created\": \"2021-02-03T10:22:09.59342396Z\",\r\n\t\"history\": [{\r\n\t\t\"created\": \"2020-12-17T00:19:41.960367136Z\",\r\n\t\t\"created_by\": \"/bin/sh -c #(nop) ADD file:ec475c2abb2d46435286b5ae5efacf5b50b1a9e3b6293b69db3c0172b5b9658b in / \"\r\n\t}, {\r\n\t\t\"created\": \"2020-12-17T00:19:42.11518025Z\",\r\n\t\t\"created_by\": \"/bin/sh -c #(nop)  CMD [\\\"/bin/sh\\\"]\",\r\n\t\t\"empty_layer\": true\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:18:02.868616268Z\",\r\n\t\t\"created_by\": \"ARG CURL_RELEASE_TAG=latest\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\",\r\n\t\t\"empty_layer\": true\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:18:02.868616268Z\",\r\n\t\t\"created_by\": \"ARG CURL_RELEASE_VERSION\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\",\r\n\t\t\"empty_layer\": true\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:18:02.868616268Z\",\r\n\t\t\"created_by\": \"ARG CURL_GIT_REPO=https://github.com/curl/curl.git\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\",\r\n\t\t\"empty_layer\": true\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:18:02.868616268Z\",\r\n\t\t\"created_by\": \"ENV CURL_VERSION=7_75_0\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\",\r\n\t\t\"empty_layer\": true\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:18:02.868616268Z\",\r\n\t\t\"created_by\": \"ENV CURL_RELEASE_TAG=curl-7_75_0\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\",\r\n\t\t\"empty_layer\": true\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:18:02.868616268Z\",\r\n\t\t\"created_by\": \"ENV CURL_GIT_REPO=https://github.com/curl/curl.git\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\",\r\n\t\t\"empty_layer\": true\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:18:02.868616268Z\",\r\n\t\t\"created_by\": \"LABEL 
Maintainer=James Fuller \\u003cjim.fuller@webcomposite.com\\u003e\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\",\r\n\t\t\"empty_layer\": true\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:18:02.868616268Z\",\r\n\t\t\"created_by\": \"LABEL Name=curl\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\",\r\n\t\t\"empty_layer\": true\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:18:02.868616268Z\",\r\n\t\t\"created_by\": \"LABEL Version=\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\",\r\n\t\t\"empty_layer\": true\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:18:02.868616268Z\",\r\n\t\t\"created_by\": \"LABEL docker.cmd=docker run -it curl/curl:7.75.0 -s -L http://curl.haxx.se\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\",\r\n\t\t\"empty_layer\": true\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:18:02.868616268Z\",\r\n\t\t\"created_by\": \"RUN |3 CURL_RELEASE_TAG=curl-7_75_0 CURL_RELEASE_VERSION=7_75_0 CURL_GIT_REPO=https://github.com/curl/curl.git /bin/sh -c apk add --no-cache brotli brotli-dev libssh2 nghttp2-dev \\u0026\\u0026     rm -fr /var/cache/apk/* # buildkit\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\"\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:18:03.050522395Z\",\r\n\t\t\"created_by\": \"RUN |3 CURL_RELEASE_TAG=curl-7_75_0 CURL_RELEASE_VERSION=7_75_0 CURL_GIT_REPO=https://github.com/curl/curl.git /bin/sh -c addgroup -S curl_group \\u0026\\u0026 adduser -S curl_user -G curl_group # buildkit\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\"\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:22:08.691286411Z\",\r\n\t\t\"created_by\": \"COPY /cacert.pem /cacert.pem # buildkit\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\"\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:22:08.691286411Z\",\r\n\t\t\"created_by\": \"ENV CURL_CA_BUNDLE=/cacert.pem\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\",\r\n\t\t\"empty_layer\": true\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:22:08.768815145Z\",\r\n\t\t\"created_by\": \"COPY 
/alpine/usr/local/lib/libcurl.so.4.7.0 /usr/lib/ # buildkit\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\"\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:22:08.853211212Z\",\r\n\t\t\"created_by\": \"COPY /alpine/usr/local/bin/curl /usr/bin/curl # buildkit\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\"\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:22:09.262850838Z\",\r\n\t\t\"created_by\": \"RUN |3 CURL_RELEASE_TAG=curl-7_75_0 CURL_RELEASE_VERSION=7_75_0 CURL_GIT_REPO=https://github.com/curl/curl.git /bin/sh -c ln -s /usr/lib/libcurl.so.4.7.0 /usr/lib/libcurl.so.4 # buildkit\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\"\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:22:09.516766096Z\",\r\n\t\t\"created_by\": \"RUN |3 CURL_RELEASE_TAG=curl-7_75_0 CURL_RELEASE_VERSION=7_75_0 CURL_GIT_REPO=https://github.com/curl/curl.git /bin/sh -c ln -s /usr/lib/libcurl.so.4 /usr/lib/libcurl.so # buildkit\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\"\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:22:09.516766096Z\",\r\n\t\t\"created_by\": \"USER curl_user\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\",\r\n\t\t\"empty_layer\": true\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:22:09.59342396Z\",\r\n\t\t\"created_by\": \"COPY entrypoint.sh /entrypoint.sh # buildkit\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\"\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:22:09.59342396Z\",\r\n\t\t\"created_by\": \"CMD [\\\"curl\\\"]\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\",\r\n\t\t\"empty_layer\": true\r\n\t}, {\r\n\t\t\"created\": \"2021-02-03T10:22:09.59342396Z\",\r\n\t\t\"created_by\": \"ENTRYPOINT [\\\"/entrypoint.sh\\\"]\",\r\n\t\t\"comment\": \"buildkit.dockerfile.v0\",\r\n\t\t\"empty_layer\": true\r\n\t}],\r\n\t\"os\": \"linux\",\r\n\t\"rootfs\": {\r\n\t\t\"type\": \"layers\",\r\n\t\t\"diff_ids\": [\"sha256:777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf\",    \"sha256:019dd39b82bba02007b940007ee0662015ff0a11ddd55fb7b4a4f6f1e3f694f2\", 
\"sha256:ead19f98b65e2cb338cab0470d7ddadc8a23c32ccd34ab6511a35393c7b7335d\", \"sha256:bcbfcc5b87d4afa5cf8981569a2dcebfd01643a7ddbe82f191062cf677d024b2\", \"sha256:6e767bd912c28e4d667adfec7adcf1dab84f76ecf0b71cba76634b03a00e67e8\", \"sha256:9904f3d51f2e6e052fd2ce88494090739f23acec20f2a9c3b2d3deb86874dd0e\", \"sha256:56a8d17054bd206ae215f3b81ecbb2d2715b21f48966763fc8c9144ac8f8d46e\", \"sha256:939fe15ec48dad8528237a6330438426dd8627db92a891eb610e36075274e2f5\", \"sha256:3e7aa53fce9350e24217d0b33912c286a4748e36facfd174c32ec53303be025f\"]\r\n\t}\r\n}\r\n\r\n\r\n```\r\n\r\n/var/lib/docker/image/overlay2/layerdb/sha256目录存放的diffids的最上层信息，也就是777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf，这是个目录\r\n\r\n那这个里面到底是啥意思呢，这个里面是chainid，这个是因为chainid的一层是依赖上一层的，这就导致最后算出来的rootfs是统一的。 公式为（具体可见：[layer-chainid](https://link.zhihu.com/?target=https%3A//github.com/opencontainers/image-spec/blob/master/config.md%23layer-chainid)）： chaninid(1) = diffid(1) chainid(n) = sha256(chain(n-1) diffid(n) )\r\n\r\n```\r\nroot@k8s-node:~# cd /var/lib/docker/image/overlay2/layerdb/sha256\r\nroot@k8s-node:/var/lib/docker/image/overlay2/layerdb/sha256# ls\r\n02aca22ece6a3cd150e7df6e3a651c1386983a9cd525250e804957e5c8629a05\r\n1be8816ebbd7f52290964aa6df8ff27825772a40baded5d91a152ded7c2534a3\r\n1bfbb02dea047ad9341efddc61f0b8a9b473b86001bf7605df7a8880b157b8a9\r\n4006d6bc83834f41eae67f73db4fd4ed3364b06362780a529b44dd5015711092\r\n439f01e6ba92ba1e5b3be977f73014ab80e7997462b9ca86f44ae9b6cdc99cb7\r\n4d4eb19da25f4f4649cf74c7028acd317962959e4b9b55aec27b4cfc3b867b93\r\n5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef\r\n666604249ff52593858b7716232097daa6d721b7b4825aac8bf8a3f45dfba1ce\r\n722f29343eb01a012a210445f66fc22678ca5750ae3bba2cfde9a5c3b62c701d\r\n777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf     
//第一层\r\n7897c392c5f451552cd2eb20fdeadd1d557c6be8a3cd20d0355fb45c1f151738\r\n7fcb75871b2101082203959c83514ac8a9f4ecfee77a0fe9aa73bbe56afdf1b4\r\n8473ff61fb5a229cfb7e0410cc815321b3bbe7a88c22766fe4f3f643a7ea2e32\r\n85e5b916bf35f12eeb78c6d89d1cba758c0a60d516401beef41a2aa65f8ddb76\r\nb6b031f5155c8fdd924e4e2508b6ae4018ff646efa86734c8c34b0d61a82b5ea\r\nbfb718dadfd11e598f98dc1314421be5bdee044f417a4149bfd370083db78e6e\r\nd43d6edaff1c22bfd53fcb4b0aa1f00dcd987d45b38ac3971317350785c18574\r\nd8546a51a3203d6ac8eb7b5b0f23a97e77aa706e0ee2136e8747c000538926bd\r\ne8f232ecf2faa5a124d8025eaea6861ff94fc1a5c7da17d7b9712aa24431293e\r\neea7cd97478d04eff4f9fc36c229d9e9f3d42740e6dc02d6578104e945f38d9f\r\nf1dd685eb59e7d19dd353b02c4679d9fafd21ccffe1f51960e6c3645f3ceb0cd\r\nroot@k8s-node:/var/lib/docker/image/overlay2/layerdb/sha256# \r\nroot@k8s-node:/var/lib/docker/image/overlay2/layerdb/sha256# \r\n\r\n\r\n\r\nroot@k8s-node:/var/lib/docker/image/overlay2/layerdb/sha256/777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf# ls -l\r\ntotal 32\r\n-rw-r--r-- 1 root root    64 Dec 19 20:41 cache-id                真正对应的layer数据那个目录\r\n-rw-r--r-- 1 root root    71 Dec 19 20:41 diff                    该层的diffid\r\n-rw-r--r-- 1 root root     7 Dec 19 20:41 size                    该层的大小\r\n-rw-r--r-- 1 root root 19501 Dec 19 20:41 tar-split.json.gz       layer压缩包的split文件\r\n\r\n\r\nroot@k8s-node:/var/lib/docker/image/overlay2/layerdb/sha256/777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf# cat cache-id \r\n840c5d412d4af8d058a526074900c098c1469ecd2f08fb21c39d23ffd2a9d527\r\n\r\n\r\nroot@k8s-node:/var/lib/docker/image/overlay2/layerdb/sha256/777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf# cat diff \r\nsha256:777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf\r\n```\r\n\r\n<br>\r\n\r\n/var/lib/docker/overlay2/就是layer数据存放的目录，比如每个chainid里面cache-id都回应这个目录下面的一个目录\r\n\r\ndiff 目录就是所有数据的目录\r\n\r\n```\r\n// 没有lower, 
diff目录\r\nroot@k8s-node:/var/lib/docker/overlay2/840c5d412d4af8d058a526074900c098c1469ecd2f08fb21c39d23ffd2a9d527# ls\r\ncommitted  diff  link\r\n\r\nroot@k8s-node:/var/lib/docker/overlay2/840c5d412d4af8d058a526074900c098c1469ecd2f08fb21c39d23ffd2a9d527/diff# ls -l\r\ntotal 68\r\ndrwxr-xr-x  2 root root 4096 Dec 16  2020 bin\r\ndrwxr-xr-x  2 root root 4096 Dec 16  2020 dev\r\ndrwxr-xr-x 15 root root 4096 Dec 16  2020 etc\r\ndrwxr-xr-x  2 root root 4096 Dec 16  2020 home\r\ndrwxr-xr-x  7 root root 4096 Dec 16  2020 lib\r\ndrwxr-xr-x  5 root root 4096 Dec 16  2020 media\r\ndrwxr-xr-x  2 root root 4096 Dec 16  2020 mnt\r\ndrwxr-xr-x  2 root root 4096 Dec 16  2020 opt\r\ndr-xr-xr-x  2 root root 4096 Dec 16  2020 proc\r\ndrwx------  2 root root 4096 Dec 16  2020 root\r\ndrwxr-xr-x  2 root root 4096 Dec 16  2020 run\r\ndrwxr-xr-x  2 root root 4096 Dec 16  2020 sbin\r\ndrwxr-xr-x  2 root root 4096 Dec 16  2020 srv\r\ndrwxr-xr-x  2 root root 4096 Dec 16  2020 sys\r\ndrwxrwxrwt  2 root root 4096 Dec 16  2020 tmp\r\ndrwxr-xr-x  7 root root 4096 Dec 16  2020 usr\r\ndrwxr-xr-x 12 root root 4096 Dec 16  2020 var\r\n```\r\n\r\n<br>\r\n\r\n这里很奇怪的一点就是： 镜像中 diff_ids 这个为啥只有第一层在 /var/lib/docker/image/overlay2/layerdb/sha256 目录中，其他的都 不在吗？\r\n\r\n```\r\n\t\t\"diff_ids\": [\"sha256:777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf\",    \"sha256:019dd39b82bba02007b940007ee0662015ff0a11ddd55fb7b4a4f6f1e3f694f2\", \"sha256:ead19f98b65e2cb338cab0470d7ddadc8a23c32ccd34ab6511a35393c7b7335d\", \"sha256:bcbfcc5b87d4afa5cf8981569a2dcebfd01643a7ddbe82f191062cf677d024b2\", \"sha256:6e767bd912c28e4d667adfec7adcf1dab84f76ecf0b71cba76634b03a00e67e8\", \"sha256:9904f3d51f2e6e052fd2ce88494090739f23acec20f2a9c3b2d3deb86874dd0e\", \"sha256:56a8d17054bd206ae215f3b81ecbb2d2715b21f48966763fc8c9144ac8f8d46e\", \"sha256:939fe15ec48dad8528237a6330438426dd8627db92a891eb610e36075274e2f5\", 
\"sha256:3e7aa53fce9350e24217d0b33912c286a4748e36facfd174c32ec53303be025f\"]\r\n```\r\n\r\n其实不是。layerdb 里存放的是由 diff_ids 逐层累加算出来的 chainid。比如想知道第二层对应的 overlay 文件，按公式算出第二层的 chainid 就可以找到。\r\n\r\n```\r\n必须经过这样的换算，是因为 chainid 的每一层都依赖上一层，这样最后算出来的 rootfs 才是唯一确定的。公式为（具体可见：[layer-chainid](https://link.zhihu.com/?target=https%3A//github.com/opencontainers/image-spec/blob/master/config.md%23layer-chainid)）：chainid(1) = diffid(1)，chainid(n) = sha256(chainid(n-1) + \" \" + diffid(n))\r\n\r\n\r\n// 02aca22ece6a3cd150e7df6e3a651c1386983a9cd525250e804957e5c8629a05 就是第二层的目录，里面的cache_id就是 overlay-id\r\nroot@k8s-node:/var/lib/docker/image/overlay2/layerdb# echo -n \"sha256:777b2c648970480f50f5b4d0af8f9a8ea798eea43dbcf40ce4a8c7118736bdcf sha256:019dd39b82bba02007b940007ee0662015ff0a11ddd55fb7b4a4f6f1e3f694f2\" | sha256sum\r\n02aca22ece6a3cd150e7df6e3a651c1386983a9cd525250e804957e5c8629a05  -\r\n\r\n\r\n// 4d4eb19da25f4f4649cf74c7028acd317962959e4b9b55aec27b4cfc3b867b93 就是第三层的目录，里面的cache_id就是 overlay-id\r\nroot@k8s-node:/var/lib/docker/image/overlay2/layerdb/sha256# echo -n \"sha256:02aca22ece6a3cd150e7df6e3a651c1386983a9cd525250e804957e5c8629a05 sha256:ead19f98b65e2cb338cab0470d7ddadc8a23c32ccd34ab6511a35393c7b7335d\" | sha256sum\r\n4d4eb19da25f4f4649cf74c7028acd317962959e4b9b55aec27b4cfc3b867b93  -\r\n```\r\n\r\n这样 b4c36536404c5e7e468080cabf0c664a45b68eece4a37ff09cac8395869131fc（02aca22ece6a3cd150e7df6e3a651c1386983a9cd525250e804957e5c8629a05 里 cache-id 的内容）就是第二层的 overlay 文件。\r\n\r\n```\r\nroot@k8s-node:/var/lib/docker/overlay2/b4c36536404c5e7e468080cabf0c664a45b68eece4a37ff09cac8395869131fc/diff# ls\r\netc  lib  usr\r\n\r\n\r\nroot@k8s-node:/var/lib/docker/overlay2/0422e796ce6cdc75d11303c0018b65ca9285dc36b812f4e14c4f68dbc01bc6d9/diff# ls\r\netc  home\r\n// 有lower, work目录\r\nroot@k8s-node:/var/lib/docker/overlay2/0422e796ce6cdc75d11303c0018b65ca9285dc36b812f4e14c4f68dbc01bc6d9# ls\r\ncommitted  diff  link  lower  work\r\n```\r\n\r\n<br>\r\n\r\n再找一个最简单的镜像验证。也可以直接比较镜像中的文件和第一层 layer 的文件，会发现第一层 layer 是最基础的。\r\n\r\n```\r\nroot@# docker pull 
zoux/pause-amd64:3.0\r\n3.0: Pulling from zoux/pause-amd64\r\n4f4fb700ef54: Pull complete \r\nce150f7a21ec: Pull complete \r\nDigest: sha256:f04288efc7e65a84be74d4fc63e235ac3c6c603cf832e442e0bd3f240b10a91b\r\nStatus: Downloaded newer image for zoux/pause-amd64:3.0\r\n\r\n\r\n\"diff_ids\":[\r\n\"sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef\",\r\n\"sha256:41ff149e94f22c52b8f36c59cafe7538b70ea771e62d9fc6922dedac25392fdf\",\r\n\"sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef\"]}}\r\n\r\n\r\necho -n \"sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef sha256:41ff149e94f22c52b8f36c59cafe7538b70ea771e62d9fc6922dedac25392fdf\" | sha256sum\r\n```\r\n\r\n### 4. 结论\r\n\r\n（1）一个镜像有一个唯一的 imageid 和 digestid。imageid 可以认为是本地 image config 的 sha256sum 值，digestid 是服务器端该镜像 config 的 sha256sum 值。\r\n\r\n例如本地 image config 保存在 /var/lib/docker/image/overlay2/imagedb/content/sha256 目录。该目录下，一个文件就是一个 image config。\r\n\r\n对该文件内容计算 sha256sum 得出来的值就是 imageid，也就是文件名。\r\n\r\n/var/lib/docker/image/overlay2/repositories.json 存放了对应的转换关系。\r\n\r\n（2）为什么有了 imageid，还需要 digestid？因为本地的 image config 一般都是解压后的，服务器端一般都是压缩打包的，所以可以认为 digestid 是服务器端压缩好的 image config 的 sha256sum。\r\n\r\n（3）image config 里面的 diffid 就是本地解压后的 layer 的 sha256sum 值，docker pull 显示的是服务器端压缩后的 layer 的 sha256sum 值。\r\n\r\n/var/lib/docker/image/overlay2/distribution/v2metadata-by-diffid/sha256 目录下存放了对应的转换关系。\r\n\r\n（4）diffids 是本地镜像每一层的 sha256sum 值，pull 时显示的是服务器中每一层的 sha256sum 值。\r\n\r\n（5）/var/lib/docker/image/overlay2/layerdb/sha256 存放了 diffids -> overlay（实际文件）的转换关系（cache-id）。\r\n\r\n但非第一层的 diffid 要先按 chainid 公式换算后才能在该目录下找到对应条目。\r\n\r\n（6）/var/lib/docker/overlay2/0422e796ce6cdc75d11303c0018b65ca9285dc36b812f4e14c4f68dbc01bc6d9/diff 是实际每一层的文件内容。\r\n\r\n第一层没有 lower、work 目录，因为从第二层开始才是联合文件系统。\r\n\r\n<br>\r\n\r\n举例说明：\r\n\r\nf04288efc7e65a84be74d4fc63e235ac3c6c603cf832e442e0bd3f240b10a91b 是该镜像 config 在服务器端的 sha256sum 值\r\n\r\n4f4fb700ef54、ce150f7a21ec 表示该镜像有两层，是 layer 在服务器端压缩文件的 sha256sum\r\n\r\n```\r\nroot@k8s-master: # docker pull 
zoux/pause-amd64:3.0\r\n3.0: Pulling from zoux/pause-amd64\r\n4f4fb700ef54: Pull complete \r\nce150f7a21ec: Pull complete \r\nDigest: sha256:f04288efc7e65a84be74d4fc63e235ac3c6c603cf832e442e0bd3f240b10a91b\r\nStatus: Downloaded newer image for zoux/pause-amd64:3.0\r\n```\r\n\r\n<br>\r\n\r\n```\r\nroot@k8s-master:# docker rmi zoux/pause-amd64:3.0\r\nUntagged: zoux/pause-amd64:3.0     \r\n// untag 服务器端 digestid\r\nUntagged: zoux/pause-amd64@sha256:f04288efc7e65a84be74d4fc63e235ac3c6c603cf832e442e0bd3f240b10a91b\r\n\r\n// 删除镜像id\r\nDeleted: sha256:99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2\r\n\r\n// 删除 layer-id（不是 diff-ids，而是换算好的 chainid，通过这个 id 可以直接在 layerdb 目录下找到该层）\r\nDeleted: sha256:666604249ff52593858b7716232097daa6d721b7b4825aac8bf8a3f45dfba1ce\r\nDeleted: sha256:7897c392c5f451552cd2eb20fdeadd1d557c6be8a3cd20d0355fb45c1f151738\r\n\r\n// 找到真正的overlay目录\r\n/var/lib/docker/image/overlay2/layerdb/sha256/7897c392c5f451552cd2eb20fdeadd1d557c6be8a3cd20d0355fb45c1f151738# cat cache-id \r\nd932ba5b6deb33a4933760be2010ffb5a81bfd874a42b36678fbcf5a3091f827\r\n```\r\n\r\n### 5. 参考\r\n\r\nhttps://zhuanlan.zhihu.com/p/95900321"
  },
  {
    "path": "docker/7. docker 命令详解.md",
"content": "* [1\\.docker 常见命令行用法](#1docker-常见命令行用法)\r\n  * [1\\.1 docker 系统本身相关](#11-docker-系统本身相关)\r\n    * [1\\.1\\.1 docker info](#111-docker-info)\r\n    * [1\\.1\\.2 docker system](#112-docker-system)\r\n    * [1\\.1\\.3 docker events](#113-docker-events)\r\n  * [1\\.2 docker image相关](#12-docker-image相关)\r\n    * [1\\-虚悬镜像](#1-虚悬镜像)\r\n    * [2\\-docker image ls 格式化展示](#2-docker-image-ls-格式化展示)\r\n    * [3\\-Untagged 和 Deleted](#3-untagged-和-deleted)\r\n  * [1\\.3 docker container相关](#13-docker-container相关)\r\n    * [1\\-docker diff](#1-docker-diff)\r\n    * [2\\-docker top](#2-docker-top)\r\n    * [3\\-docker attach](#3-docker-attach)\r\n    * [4\\-docker logs \\-f containerId](#4-docker-logs--f-containerid)\r\n* [2\\. docker api](#2-docker-api)\r\n  * [2\\.1  Unix domain socket介绍](#21--unix-domain-socket介绍)\r\n  * [2\\.2 如何通过 unix socket 使用docker](#22-如何通过-unix-socket-使用docker)\r\n  * [2\\.3 如何通过restful api 使用docker](#23-如何通过restful-api-使用docker)\r\n* [3\\. 参考](#3-参考)\r\n\r\n## 1.docker 常见命令行用法\r\n\r\n### 1.1 docker 系统本身相关\r\n\r\n#### 1.1.1 docker info\r\n\r\n查看 docker 的详细信息，例如 docker root 目录、使用的联合文件系统等\r\n\r\n```\r\nroot@k8s-node:~# docker info\r\nClient:\r\n Debug Mode: false\r\n\r\nServer:\r\n Containers: 9\r\n  Running: 4\r\n  Paused: 0\r\n  Stopped: 5\r\n Images: 4\r\n Server Version: 19.03.9\r\n Storage Driver: overlay2\r\n  Backing Filesystem: extfs\r\n  Supports d_type: true\r\n  Native Overlay Diff: true\r\n Logging Driver: json-file\r\n Cgroup Driver: cgroupfs\r\n Plugins:\r\n  Volume: local\r\n  Network: bridge host ipvlan macvlan null overlay\r\n  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog\r\n Swarm: inactive\r\n Runtimes: runc\r\n Default Runtime: runc\r\n Init Binary: docker-init\r\n containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429\r\n runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd\r\n init version: fec3683\r\n Security Options:\r\n  apparmor\r\n  seccomp\r\n   Profile: default\r\n 
Kernel Version: 4.19.0-17-amd64\r\n Operating System: Debian GNU/Linux 10 (buster)\r\n OSType: linux\r\n Architecture: x86_64\r\n CPUs: 2\r\n Total Memory: 3.854GiB\r\n Name: k8s-node\r\n ID: FZUV:UMD7:U4L5:KUOH:WYWM:HI6I:HYOD:WSXF:E4D7:RUP2:4ETP:OQTY\r\n Docker Root Dir: /var/lib/docker\r\n Debug Mode: false\r\n Registry: https://index.docker.io/v1/\r\n Labels:\r\n Experimental: false\r\n Insecure Registries:\r\n  127.0.0.0/8\r\n Registry Mirrors:\r\n  https://b9pmyelo.mirror.aliyuncs.com/\r\n Live Restore Enabled: false\r\n Product License: Community Engine\r\n\r\nWARNING: No swap limit support\r\n```\r\n\r\n\r\n\r\n\r\n\r\n#### 1.1.2 docker system\r\n\r\n```\r\nUsage:  docker system COMMAND\r\n\r\nManage Docker\r\n\r\nCommands:\r\n  df          Show docker disk usage\r\n  events      Get real time events from the server\r\n  info        Display system-wide information\r\n  prune       Remove unused data\r\n\r\nRun 'docker system COMMAND --help' for more information on a command.\r\n// 查看镜像实际占用的磁盘空间\r\nroot@k8s-master:~# docker system df   \r\nTYPE                TOTAL               ACTIVE              SIZE                RECLAIMABLE\r\nImages              7                   5                   491.1MB             140.1MB (28%)\r\nContainers          11                  4                   3.557kB             2.324kB (65%)\r\nLocal Volumes       0                   0                   0B                  0B\r\nBuild Cache         0                   0                   0B                  0B\r\n```\r\n\r\n#### 1.1.3 docker events\r\n\r\n获取docker server的实时事件\r\n\r\n```\r\n# docker events --since 112141543\r\n\r\n2022-01-17T12:29:19.046917401+08:00 container die 78deadc2dcd6a3fafc9ac6f8380e1cd8853ffd6bc33796a224ece76d17dd1d92 (Maintainer=James Fuller <jim.fuller@webcomposite.com>, Name=curl, Version=1.0.0, annotation.io.kubernetes.container.hash=bef672e5, annotation.io.kubernetes.container.restartCount=686, 
annotation.io.kubernetes.container.terminationMessagePath=/dev/termination-log, annotation.io.kubernetes.container.terminationMessagePolicy=File, annotation.io.kubernetes.pod.terminationGracePeriod=10, docker.cmd=docker run -it curl/curl:7.75.0 -s -L http://curl.haxx.se, exitCode=0, image=sha256:26a9afb7027cca51ed4f7915474a04822a13e99fce2e1eecad3d43aab6199387, io.kubernetes.container.logpath=/var/log/pods/default_nginx1_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21/nginx/686.log, io.kubernetes.container.name=nginx, io.kubernetes.docker.type=container, io.kubernetes.pod.name=nginx1, io.kubernetes.pod.namespace=default, io.kubernetes.pod.uid=cc8a9cfb-872c-44ba-9899-b4c8bbc93a21, io.kubernetes.sandbox.id=e93a3ae70771ca0e4954fcb6ecf0ffd091eebfc64bcb3cbf461c94eb5474c9aa, name=k8s_nginx_nginx1_default_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21_686, se.haxx.curl=curl, se.haxx.curl.description=network utility, se.haxx.curl.release_tag=curl-7_75_0, se.haxx.curl.version=7_75_0)\r\n2022-01-17T12:29:19.386628039+08:00 container destroy 98c26f5e6c744e7733eaf39fd4a0bfc3692d312213f0504664353157d5d446d9 (Maintainer=James Fuller <jim.fuller@webcomposite.com>, Name=curl, Version=1.0.0, annotation.io.kubernetes.container.hash=bef672e5, annotation.io.kubernetes.container.restartCount=685, annotation.io.kubernetes.container.terminationMessagePath=/dev/termination-log, annotation.io.kubernetes.container.terminationMessagePolicy=File, annotation.io.kubernetes.pod.terminationGracePeriod=10, docker.cmd=docker run -it curl/curl:7.75.0 -s -L http://curl.haxx.se, image=sha256:26a9afb7027cca51ed4f7915474a04822a13e99fce2e1eecad3d43aab6199387, io.kubernetes.container.logpath=/var/log/pods/default_nginx1_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21/nginx/685.log, io.kubernetes.container.name=nginx, io.kubernetes.docker.type=container, io.kubernetes.pod.name=nginx1, io.kubernetes.pod.namespace=default, io.kubernetes.pod.uid=cc8a9cfb-872c-44ba-9899-b4c8bbc93a21, 
io.kubernetes.sandbox.id=e93a3ae70771ca0e4954fcb6ecf0ffd091eebfc64bcb3cbf461c94eb5474c9aa, name=k8s_nginx_nginx1_default_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21_685, se.haxx.curl=curl, se.haxx.curl.description=network utility, se.haxx.curl.release_tag=curl-7_75_0, se.haxx.curl.version=7_75_0)\r\n2022-01-17T12:29:19.448410928+08:00 container create 21c6aa12859cf40f78c0a80f6ef4b782e86b86a84a23fefb860b73cfed55cf31 (Maintainer=James Fuller <jim.fuller@webcomposite.com>, Name=curl, Version=1.0.0, annotation.io.kubernetes.container.hash=bef672e5, annotation.io.kubernetes.container.restartCount=687, annotation.io.kubernetes.container.terminationMessagePath=/dev/termination-log, annotation.io.kubernetes.container.terminationMessagePolicy=File, annotation.io.kubernetes.pod.terminationGracePeriod=10, docker.cmd=docker run -it curl/curl:7.75.0 -s -L http://curl.haxx.se, image=sha256:26a9afb7027cca51ed4f7915474a04822a13e99fce2e1eecad3d43aab6199387, io.kubernetes.container.logpath=/var/log/pods/default_nginx1_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21/nginx/687.log, io.kubernetes.container.name=nginx, io.kubernetes.docker.type=container, io.kubernetes.pod.name=nginx1, io.kubernetes.pod.namespace=default, io.kubernetes.pod.uid=cc8a9cfb-872c-44ba-9899-b4c8bbc93a21, io.kubernetes.sandbox.id=e93a3ae70771ca0e4954fcb6ecf0ffd091eebfc64bcb3cbf461c94eb5474c9aa, name=k8s_nginx_nginx1_default_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21_687, se.haxx.curl=curl, se.haxx.curl.description=network utility, se.haxx.curl.release_tag=curl-7_75_0, se.haxx.curl.version=7_75_0)\r\n```\r\n\r\n### 1.2 docker image相关\r\n\r\n| 命令    | 解释                                          |\r\n| ------- | --------------------------------------------- |\r\n| pull    | 从某个registry拉取镜像或者仓库                |\r\n| history | 展示镜像历史信息                              |\r\n| export  | 打包一个容器文件系统到tar文件                 |\r\n| build   | 从一个Dockerfile构建镜像                      |\r\n| commit  | 从一个容器的修改创建一个新的镜像              |\r\n| images  | 展示镜像列表   
                               |\r\n| import  | 用tar文件导入并创建镜像文件                   |\r\n| load    | 从tar文件或者标准输入载入镜像                 |\r\n| login   | 登录Docker registry                           |\r\n| logout  | 从Docker registry退出                         |\r\n| save    | 打包一个或多个镜像到tar文件(默认是到标准输出) |\r\n| rmi     | 移除一个或多个镜像                            |\r\n| version | 显示Docker版本信息                            |\r\n| tag     | 标记一个镜像到仓库                            |\r\n\r\n补充说明\r\n\r\n#### 1-虚悬镜像\r\n\r\n镜像列表中，还可以看到一个特殊的镜像，这个镜像既没有仓库名，也没有标签，均为 `<none>`：\r\n\r\n```\r\n<none>     <none>   00285df0df87    5 days ago    342 MB\r\n```\r\n\r\n这个镜像原本是有镜像名和标签的，原来为 mongo:3.2，随着官方镜像维护，发布了新版本后，重新 docker pull mongo:3.2 时，mongo:3.2 这个镜像名被转移到了新下载的镜像身上，而旧的镜像上的这个名称则被取消，从而成为了虚悬镜像。除了 docker pull 可能导致这种情况，docker build 也同样可以导致这种现象。由于新旧镜像同名，旧镜像名称被取消，从而出现仓库名、标签均为 `<none>` 的镜像。\r\n\r\n这类无标签镜像也被称为虚悬镜像 (dangling image)，可以用下面的命令专门显示这类镜像：\r\n\r\n```\r\n$ docker image ls -f dangling=true \r\nREPOSITORY   TAG      IMAGE ID       CREATED      SIZE\r\n<none>       <none>   00285df0df87   5 days ago   342 MB\r\n```\r\n\r\n一般来说，虚悬镜像已经失去了存在的价值，是可以随意删除的，可以用下面的命令删除。\r\n\r\n```\r\n$ docker image prune\r\n```\r\n\r\n#### 2-docker image ls 格式化展示\r\n\r\n不加任何参数的情况下，docker image ls 会列出所有顶级镜像，但是有时候我们只希望列出部分镜像。docker image ls 有好几个参数可以帮助做到这个事情。根据仓库名列出镜像：\r\n\r\n```\r\n$ docker image ls ubuntu\r\nREPOSITORY TAG IMAGE ID CREATED SIZE\r\nubuntu 16.04 f753707788c5 4 weeks ago 127 MB\r\nubuntu latest f753707788c5 4 weeks ago 127 MB\r\nubuntu 14.04 1e0c3dd64ccd 4 weeks ago 188 MB\r\n```\r\n\r\n列出特定的某个镜像，也就是说指定仓库名和标签：\r\n\r\n```\r\n docker image ls ubuntu:16.04\r\nREPOSITORY TAG IMAGE ID CREATED SIZE\r\nubuntu 16.04 f753707788c5 4 weeks ago 127 MB\r\n```\r\n\r\n除此以外，docker image ls 还支持强大的过滤器参数 --filter（简写 -f）。之前我们已经看到了使用过滤器来列出虚悬镜像的用法，它还有更多的用法。比如，我们希望看到在 mongo:3.2 之后建立的镜像，可以用下面的命令：\r\n\r\n```\r\n docker image ls -f since=mongo:3.2\r\nREPOSITORY TAG IMAGE ID CREATED SIZE\r\nredis latest 5f515359c7f8 5 days ago 183 MB\r\nnginx latest 05a60462f8ba 5 days 
ago 181 MB\r\n```\r\n\r\n想查看某个位置之前的镜像也可以，只需要把 since 换成 before 即可。此外，如果镜像构建时定义了 LABEL，还可以通过 LABEL 来过滤。\r\n\r\n```\r\n$ docker image ls -f label=com.example.version=0.1\r\n```\r\n\r\n**以特定格式显示**\r\n\r\n默认情况下，docker image ls 会输出一个完整的表格，但是我们并非所有时候都会需要这些内容。比如，刚才删除虚悬镜像的时候，我们需要利用 docker image ls 把所有的虚悬镜像的 ID 列出来，然后才可以交给 docker image rm 命令作为参数来删除指定的这些镜像，这个时候就用到了 -q 参数。\r\n\r\n```\r\n$ docker image ls -q     //展示所有镜像的id\r\n5f515359c7f8\r\n05a60462f8ba\r\nfe9198c04d62\r\n00285df0df87\r\nf753707788c5\r\nf753707788c5\r\n1e0c3dd64ccd\r\n```\r\n\r\n--filter 配合 -q 产生出指定范围的 ID 列表，然后送给另一个 docker 命令作为参数，从而针对这组实体成批地进行某种操作。这种做法在 Docker 命令行使用过程中非常常见，不仅仅是镜像，将来我们会在各个命令中看到这类搭配以完成很强大的功能。因此每次在文档看到过滤器后，可以多注意一下它们的用法。\r\n\r\n另外一些时候，我们可能只是对表格的结构不满意，希望自己组织列；或者不希望有标题，这样方便其它程序解析结果等，这就用到了 Go 的模板语法。比如，下面的命令会直接列出镜像结果，并且只包含镜像ID和仓库名：\r\n\r\n```\r\n$ docker image ls --format \"{{.ID}}: {{.Repository}}\"\r\n5f515359c7f8: redis\r\n05a60462f8ba: nginx\r\nfe9198c04d62: mongo\r\n00285df0df87: <none>\r\nf753707788c5: ubuntu\r\nf753707788c5: ubuntu\r\n1e0c3dd64ccd: ubuntu\r\n```\r\n\r\n或者打算以表格等距显示，并且有标题行，和默认一样，不过自己定义列：\r\n\r\n```\r\n$ docker image ls --format \"table {{.ID}}\\t{{.Repository}}\\t{{.Tag}}\"\r\nIMAGE ID REPOSITORY TAG\r\n5f515359c7f8 redis latest\r\n05a60462f8ba nginx latest\r\nfe9198c04d62 mongo 3.2\r\n00285df0df87 <none> <none>\r\nf753707788c5 ubuntu 16.04\r\nf753707788c5 ubuntu latest\r\n1e0c3dd64ccd ubuntu 14.04\r\n```\r\n\r\n#### 3-Untagged 和 Deleted \r\n\r\n如果观察上面这几个命令的运行输出信息的话，你会注意到删除行为分为两类，一类是 Untagged，另一类是 Deleted。\r\n\r\n我们之前介绍过，镜像的唯一标识是其 ID 和摘要，而一个镜像可以有多个标签。因此当我们使用上面命令删除镜像的时候，实际上是在要求删除某个标签的镜像。所以首先需要做的是将满足我们要求的所有镜像标签都取消，这就是我们看到的 Untagged 的信息。因为一个镜像可以对应多个标签，因此当我们删除了所指定的标签后，可能还有别的标签指向了这个镜像，如果是这种情况，那么 Delete 行为就不会发生。所以并非所有的 docker rmi 都会产生删除镜像的行为，有可能仅仅是取消了某个标签而已。\r\n\r\n当该镜像所有的标签都被取消了，该镜像很可能会失去了存在的意义，因此会触发删除行为。镜像是多层存储结构，因此在删除的时候也是从上层向基础层方向依次进行判断删除。镜像的多层结构让镜像复用变得非常容易，因此很有可能某个其它镜像正依赖于当前镜像的某一层。这种情况，依旧不会触发删除该层的行为。直到没有任何层依赖当前层时，才会真实地删除当前层。这就是为什么有时候会奇怪，为什么明明没有别的标签指向这个镜像，
但是它还是存在的原因，也是为什么有时候会发现所删除的层数和自己 docker pull 看到的层数不一样的原因。除了镜像依赖以外，还需要注意的是容器对镜像的依赖。如果有用这个镜像启动的容器存在（即使容器没有运行），那么同样不可以删除这个镜像。之前讲过，容器是以镜像为基础，再加一层容器存储层，组成这样的多层存储结构去运行的。因此该镜像如果被这个容器所依赖，那么删除必然会导致故障。如果这些容器是不需要的，应该先将它们删除，然后再来删除镜像。\r\n\r\n### 1.3 docker container相关\r\n\r\n| 命令    | 解释                                            |\r\n| ------- | ----------------------------------------------- |\r\n| attach  | 附加到一个运行的容器                            |\r\n| cp      | 在容器与本地文件系统之间复制文件/文件夹         |\r\n| create  | 创建新的容器                                    |\r\n| diff    | 检阅一个容器文件系统的修改                      |\r\n| exec    | 在运行的容器内执行命令                          |\r\n| inspect | 展示一个容器/镜像或者任务的底层信息             |\r\n| kill    | 终止一个或者多个运行中的容器                    |\r\n| logs    | 获取容器的日志                                  |\r\n| network | 管理Docker网络                                  |\r\n| node    | 管理Docker Swarm节点                            |\r\n| pause   | 暂停一个或者多个容器的所有进程                  |\r\n| port    | 管理容器的端口映射                              |\r\n| ps      | 展示容器列表                                    |\r\n| rename  | 重命名容器                                      |\r\n| restart | 重启容器                                        |\r\n| rm      | 移除一个或多个容器                              |\r\n| run     | 运行一个新的容器                                |\r\n| search  | 在Docker Hub搜索镜像                            |\r\n| service | 管理Docker services（和k8s svc 乍看起来差不多） |\r\n| top     | 展示容器运行进程（方便查看container对应的Pid）  |\r\n| unpause | 解除暂停一个或多个容器的所有进程                |\r\n| swarm   | 管理Docker Swarm                                |\r\n| stop    | 停止一个或多个运行容器                          |\r\n| stats   | 获取容器的实时资源使用统计                      |\r\n| update  | 更新一个或多个容器的配置                        |\r\n| volume  | 管理Docker volumes                              |\r\n| wait    | 阻塞直到容器停止，然后打印退出代码              |\r\n| start   | 启动一个或者多个容器                            |\r\n\r\n补充说明\r\n\r\n#### 1-docker 
diff\r\n\r\n```\r\nroot@k8s-node:~# docker diff 3596feb5ce62\r\nC /run\r\nA /run/secrets\r\nA /run/secrets/kubernetes.io\r\nA /run/secrets/kubernetes.io/serviceaccount\r\n```\r\n\r\n A代表新增文件\r\nC代表修改过的文件\r\nD代表被删除的文件\r\n\r\n#### 2-docker top\r\n\r\n快速查看containerid 对应的pid\r\n\r\n```\r\nroot@k8s-node:~# docker top c3a457fe7cc5\r\nUID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD\r\nroot                3709                3692                0                   2021                ?                   00:20:16            /opt/bin/flanneld --ip-masq --kube-subnet-mgr\r\n```\r\n\r\n#### 3-docker attach\r\n\r\nDocker attach可以attach到一个已经运行的容器的stdin，然后进行命令执行的动作。\r\n但是需要注意的是，如果从这个stdin中exit，会导致容器的停止。 （docker exec则不会）\r\n\r\n```\r\nroot@k8s-master:~# docker run -d nginx:latest \r\na66f0b29a030b4b0fbe9128faaa373b995526ea1cb8ca714db7e3b3dc821d09d\r\nroot@k8s-master:~# \r\nroot@k8s-master:~# docker ps\r\nCONTAINER ID        IMAGE                         COMMAND                  CREATED             STATUS              PORTS               NAMES\r\na66f0b29a030        nginx:latest                  \"/docker-entrypoint.…\"   5 seconds ago       Up 4 seconds        80/tcp              eager_cannon\r\n\r\nroot@k8s-master:~# \r\nroot@k8s-master:~# docker attach a66f0b29a030b4b0fbe9128faaa373b995526ea1cb8ca714db7e3b3dc821d09d\r\n\r\nls\r\n\r\n/bahs\r\nexit\r\n^Z^Z\r\n\r\n\r\n^C2021/12/14 13:28:15 [notice] 1#1: signal 2 (SIGINT) received, exiting\r\n2021/12/14 13:28:15 [notice] 32#32: exiting\r\n2021/12/14 13:28:15 [notice] 31#31: exiting\r\n2021/12/14 13:28:15 [notice] 31#31: exit\r\n2021/12/14 13:28:15 [notice] 32#32: exit\r\n2021/12/14 13:28:15 [notice] 1#1: signal 17 (SIGCHLD) received from 31\r\n2021/12/14 13:28:15 [notice] 1#1: worker process 31 exited with code 0\r\n2021/12/14 13:28:15 [notice] 1#1: worker process 32 exited with code 0\r\n2021/12/14 13:28:15 [notice] 1#1: 
exit\r\n^Zroot@k8s-master:~# \r\nroot@k8s-master:~# docker ps\r\nCONTAINER ID        IMAGE                         COMMAND                  CREATED             STATUS              PORTS         \r\n```\r\n\r\n#### 4-docker logs -f containerId\r\n\r\n```\r\nroot@k8s-master:~# docker logs -f f051884c5784\r\n/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration\r\n/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/\r\n/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh\r\n10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf\r\n10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf\r\n/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh\r\n/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh\r\n/docker-entrypoint.sh: Configuration complete; ready for start up\r\n2021/12/12 08:51:21 [notice] 1#1: using the \"epoll\" event method\r\n2021/12/12 08:51:21 [notice] 1#1: nginx/1.21.4\r\n2021/12/12 08:51:21 [notice] 1#1: built by gcc 10.2.1 20210110 (Debian 10.2.1-6) \r\n2021/12/12 08:51:21 [notice] 1#1: OS: Linux 4.19.0-17-amd64\r\n2021/12/12 08:51:21 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576\r\n2021/12/12 08:51:21 [notice] 1#1: start worker processes\r\n2021/12/12 08:51:21 [notice] 1#1: start worker process 31\r\n2021/12/12 08:51:21 [notice] 1#1: start worker process 32\r\n```\r\n\r\n## 2. 
docker api\r\n\r\n在Docker生态系统中一共有 3 种 API。\r\n\r\n（1）Registry API：提供了与存储Docker镜像的Docker Registry集成的功能。\r\n\r\n（2）Docker Hub API：提供了与Docker Hub集成的功能。\r\n\r\n（3）Docker Remote API：提供与Docker守护进程进行集成的功能。\r\n\r\n这里主要熟悉一下第三种，Docker Remote API。\r\n\r\n<br>\r\n\r\n### 2.1  Unix domain socket介绍\r\n\r\n**Unix domain socket 又叫 IPC(inter-process communication 进程间通信) socket，用于实现同一主机上的进程间通信。**socket 原本是为网络通讯设计的，但后来在 socket 的框架上发展出一种 IPC 机制，就是 UNIX domain socket。虽然网络 socket 也可用于同一台主机的进程间通讯(通过 loopback 地址 127.0.0.1)，但是 UNIX domain socket 用于 IPC 更有效率：不需要经过网络协议栈，不需要打包拆包、计算校验和、维护序号和应答等，只是将应用层数据从一个进程拷贝到另一个进程。这是因为，IPC 机制本质上是可靠的通讯，而网络协议是为不可靠的通讯设计的。\r\n\r\nUNIX domain socket 是全双工的，API 接口语义丰富，相比其它 IPC 机制有明显的优越性，目前已成为使用最广泛的 IPC 机制，比如 X Window 服务器和 GUI 程序之间就是通过 UNIX domain socket 通讯的。Unix domain socket 是 POSIX 标准中的一个组件，所以不要被名字迷惑，linux 系统也是支持它的。\r\n\r\n了解Docker的同学应该知道Docker daemon监听一个docker.sock文件，这个docker.sock文件的默认路径是`/var/run/docker.sock`，这个Socket就是一个Unix domain socket。\r\n\r\n### 2.2 如何通过 unix socket 使用docker\r\n\r\n例如：参考下面的文档，查看所有的 container 信息\r\n\r\nhttps://docs.docker.com/engine/api/sdk/examples/\r\n\r\n```\r\nroot@k8s-dnode:~# curl --unix-socket /var/run/docker.sock http://127.0.0.1/v1.40/containers/json\r\n[{\"Id\":\"64a14bf3626b576f9fd7dd56555d0e091f770eb31926d48211dd604874805f92\",\"Names\":[\"/k8s_container-0_nginx-78f97d8d6d-8vtw8_default_dcf8f5c4-315b-4e43-a623-dc8842f36d36_0\"],\"Image\":\"nginx@sha256:9522864dd661dcadfd9958f9e0de192a1fdda2c162a35668ab6ac42b465f0603\",\"ImageID\":\"sha256:f652ca386ed135a4cbe356333e08ef0816f81b2ac8d0619af01e2b256837ed3e\",\"Command\":\"/docker-entrypoint.sh nginx -g 'daemon 
off;'\",\"Created\":1639901315,\"Ports\":[],\"Labels\":{\"annotation.io.kubernetes.container.hash\":\"a36242a4\",\"annotation.io.kubernetes.container.restartCount\":\"0\",\"annotation.io.kubernetes.container.terminationMessagePath\":\"/dev/termination-log\",\"annotation.io.kubernetes.container.terminationMessagePolicy\":\"File\",\"annotation.io.kubernetes.pod.terminationGracePeriod\":\"30\",\"io.kubernetes.container.logpath\":\"/var/log/pods/default_nginx-78f97d8d6d-8vtw8_dcf8f5c4-315b-4e43-a623-dc8842f36d36/container-0/0.log\",\"io.kubernetes.container.name\":\"container-0\",\"io.kubernetes.docker.type\":\"container\",\"io.kubernetes.pod.name\":\"nginx-78f97d8d6d-8vtw8\",\"io.kubernetes.pod.namespace\":\"default\",\"io.kubernetes.pod.uid\":\"dcf8f5c4-315b-4e43-a623-dc8842f36d36\",\"io.kubernetes.sandbox.id\":\"d35b5a6084bc009340f77a3594a7891c794bad76d2e80c8eafa4e0c95cd772cd\",\"maintainer\":\"NGINX Docker Maintainers <docker-maint@nginx.com>\"},\"State\":\"running\",\"Status\":\"Up 17 minutes\",\"HostConfig\":{\"NetworkMode\":\"container:d35b5a6084bc009340f77a3594a7891c794bad76d2e80c8eafa4e0c95cd772cd\"},\"NetworkSettings\":{\"Networks\":{}},\"Mounts\":[{\"Type\":\"bind\",\"Source\":\"/var/lib/kubelet/pods/dcf8f5c4-315b-4e43-a623-dc8842f36d36/etc-hosts\",\"Destination\":\"/etc/hosts\",\"Mode\":\"\",\"RW\":true,\"Propagation\":\"rprivate\"},{\"Type\":\"bind\",\"Source\":\"/var/lib/kubelet/pods/dcf8f5c4-315b-4e43-a623-dc8842f36d36/volumes/kubernetes.io~secret/default-token-f8snr\",\"Destination\":\"/var/run/secrets/kubernetes.io/serviceaccount\",\"Mode\":\"ro\",\"RW\":false,\"Propagation\":\"rprivate\"},{\"Type\":\"bind\",\"Source\":\"/var/lib/kubelet/pods/dcf8f5c4-315b-4e\r\n```\r\n\r\n\r\n\r\n### 2.3 如何通过restful api 使用docker\r\n\r\n这里需要先将unix socker 和 tcp:port 绑定。操作如下：\r\n\r\n```\r\nroot@k8s-master:~# cat /usr/lib/systemd/system/docker.service\r\n[Unit]\r\nDescription=Docker Application Container 
Engine\r\nDocumentation=https://docs.docker.com\r\nAfter=network-online.target firewalld.service\r\nWants=network-online.target\r\n\r\n[Service]\r\nType=notify  \r\nExecStart=/usr/bin/dockerd -H tcp://0.0.0.0:2375     //之前是 ExecStart=/usr/bin/dockerd\r\nExecReload=/bin/kill -s HUP \r\nLimitNOFILE=infinity\r\nLimitNPROC=infinity\r\nLimitCORE=infinity\r\nTimeoutStartSec=0\r\nDelegate=yes\r\nKillMode=process\r\nRestart=on-failure\r\nStartLimitBurst=3\r\nStartLimitInterval=60s\r\n\r\n[Install]\r\nWantedBy=multi-user.target\r\n```\r\n\r\n<br>\r\n\r\n```\r\nroot@k8s-master:~# sudo docker -H 192.168.0.4:2375 info\r\nClient:\r\n Debug Mode: false\r\n\r\nServer:\r\n Containers: 13\r\n  Running: 0\r\n  Paused: 0\r\n  Stopped: 13\r\n Images: 6\r\n Server Version: 19.03.9\r\n Storage Driver: overlay2\r\n  Backing Filesystem: extfs\r\n  Supports d_type: true\r\n  Native Overlay Diff: true\r\n Logging Driver: json-file\r\n Cgroup Driver: cgroupfs\r\n Plugins:\r\n  Volume: local\r\n  Network: bridge host ipvlan macvlan null overlay\r\n  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog\r\n Swarm: inactive\r\n ...\r\n```\r\n\r\n## 3. 参考\r\n\r\n[docker api](https://docs.docker.com/engine/api/v1.22/?spm=a2c6h.12873639.0.0.481a90afUqk0rt#2-endpoints)\r\n\r\n[手撕Linux Socket——Socket原理与实践分析](https://zhuanlan.zhihu.com/p/234806787)"
  },
  {
    "path": "docker/8. docker核心组件介绍.md",
"content": "\n\n* [0\\. 章节目的](#0-章节目的)\n* [1\\.docker 组件介绍](#1docker-组件介绍)\n* [2\\. docker 组件分析](#2-docker-组件分析)\n  * [2\\.1 docker](#21-docker)\n  * [2\\.2 docker proxy](#22-docker-proxy)\n  * [2\\.3 docker\\-init](#23-docker-init)\n  * [2\\.4 runc](#24-runc)\n  * [2\\.5 dockerd](#25-dockerd)\n  * [2\\.6 containerd](#26-containerd)\n  * [2\\.7 <strong>containerd\\-shim</strong>](#27-containerd-shim)\n  * [2\\.8 ctr](#28-ctr)\n  * [2\\.9 组件总结](#29-组件总结)\n* [3\\. 进程关系](#3-进程关系)\n* [4\\. docker为什么是这种结构](#4-docker为什么是这种结构)\n* [5\\. 参考文档](#5-参考文档)\n\n### 0. 章节目的\n\n本节的目的就是为了弄清楚：\n\n（1）docker各组件有什么功能\n\n（2）通过docker运行的容器，进程关系是什么样子，为什么会这样\n\n### 1.docker 组件介绍\n\n二进制安装docker的时候，可以发现，docker由以下的组件组成。\n\n```\nroot@k8s-node:~#  tar zxvf docker-19.03.9.tgz\ndocker/\ndocker/docker-init\ndocker/runc\ndocker/docker\ndocker/docker-proxy\ndocker/containerd\ndocker/ctr\ndocker/dockerd\ndocker/containerd-shim\n```\n\n<br>\n\n### 2. docker 组件分析\n\n#### 2.1 docker\n\ndocker 是 Docker 客户端的一个完整实现，它是一个二进制文件，对用户可见的操作形式为 docker 命令，通过 docker 命令可以完成所有的 Docker 客户端与服务端的通信。\n\nDocker 客户端与服务端的交互过程是：docker 组件向服务端发送请求后，服务端根据请求执行具体的动作并将结果返回给 docker，docker 解析服务端的返回结果，并将结果通过命令行标准输出展示给用户。这样一次完整的客户端服务端请求就完成了。\n\n例如常见的命令  docker run/ps 等等\n\n<br>\n\n#### 2.2 docker proxy\n\ndocker-proxy 主要是用来做端口映射的。当我们使用 docker run 命令启动容器时，如果使用了 -p 参数，docker-proxy 组件就会把容器内相应的端口映射到主机上来，底层是依赖于 iptables 实现的。\n\n```\nroot@cld-dnode1-1051:/usr/bin# docker run --name=nginx -d -p 8080:80 nginx\n\n\nroot@cld-dnode1-1051:/usr/bin# docker inspect --format '{{ .NetworkSettings.IPAddress }}' nginx\n172.17.0.2\n\n// 会多一个docker-proxy的进程\nroot@cld-dnode1-1051:/usr/bin# ps aux |grep docker-proxy\nroot     1983163  0.0  0.0 105912  4252 ?        
Sl   15:42   0:00 /bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 8080 -container-ip 172.17.0.2 -container-port 80\nroot     1985160  0.0  0.0  13544  2600 pts/1    S+   15:43   0:00 grep docker-proxy\n```\n\n#### 2.3 docker-init\n\n在执行 docker run 启动容器时可以添加 --init 参数，此时 Docker 会使用 docker-init 作为1号进程，帮你管理容器内子进程，例如回收僵尸进程等。\n\n```\nroot@cld-dnode1-1051:/usr/bin# ls docker*\ndocker\tdockerd  dockerd-ce  docker-init  docker-proxy\nroot@cld-dnode1-1051:/usr/bin#\n\nroot@cld-dnode1-1051:/usr/bin# docker-init version\n[WARN  tini (1973230)] Tini is not running as PID 1 and isn't registered as a child subreaper.\nZombie processes will not be re-parented to Tini, so zombie reaping won't work.\nTo fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.\n[FATAL tini (1973231)] exec version failed: No such file or directory\n\n\nroot@cld-dnode1-1051:/usr/bin# docker run -it busybox sh\n/ # ps aux\nPID   USER     TIME  COMMAND\n    1 root      0:00 sh\n    6 root      0:00 ps aux\n\n// 容器里面的init就是docker-init（看起来就是tini）\nroot@cld-dnode1-1051:/usr/bin# docker run -it --init busybox sh\n/ # ps aux\nPID   USER     TIME  COMMAND\n    1 root      0:00 /dev/init -- sh\n    6 root      0:00 sh\n    7 root      0:00 ps aux\n\n/ # /dev/init version\n[WARN  tini (8)] Tini is not running as PID 1 and isn't registered as a child subreaper.\nZombie processes will not be re-parented to Tini, so zombie reaping won't work.\nTo fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.\n[FATAL tini (9)] exec version failed: No such file or directory\n```\n\n#### 2.4 runc\n\nrunc 是一个标准的 OCI 容器运行时的实现，它是一个命令行工具，可以直接用来创建和运行容器。接下来直接进行演示：\n\n(1) 准备容器运行时文件。可以看出这一步和 docker 本身没有关系，只是一些基础的目录和文件\n\n```\nroot@cld-dnode1-1051:/ cd /root\nroot@cld-dnode1-1051:/ mkdir runc\nroot@cld-dnode1-1051:/ mkdir rootfs && docker export $(docker 
create busybox) | tar -C rootfs -xvf -\n\nroot@cld-dnode1-1051:/home/zouxiang/runc# tree -L 2\n.\n└── rootfs\n    ├── bin\n    ├── dev\n    ├── etc\n    ├── home\n    ├── proc\n    ├── root\n    ├── sys\n    ├── tmp\n    ├── usr\n    └── var\n```\n\n（2）准备config文件\n\n使用 runc spec 命令根据文件系统生成对应的 config.json 文件。\n\n在config.json里指定了容器运行的args，env等等。\n\n```\nroot@cld-dnode1-1051:/home/zouxiang/runc# runc spec\n\nroot@cld-dnode1-1051:/home/zouxiang/runc# cat config.json\n{\n\t\"ociVersion\": \"1.0.1-dev\",\n\t\"process\": {\n\t\t\"terminal\": true,\n\t\t\"user\": {\n\t\t\t\"uid\": 0,\n\t\t\t\"gid\": 0\n\t\t},\n\t\t\"args\": [\n\t\t\t\"sh\"\n\t\t],\n\t\t\"env\": [\n\t\t\t\"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\",\n\t\t\t\"TERM=xterm\"\n\t\t],\n\t\t\"cwd\": \"/\",\n\t\t\"capabilities\": {\n\t\t\t\"bounding\": [\n\t\t\t\t\"CAP_AUDIT_WRITE\",\n\t\t\t\t\"CAP_KILL\",\n\t\t\t\t\"CAP_NET_BIND_SERVICE\"\n\t\t\t],\n\t\t\t\"effective\": [\n\t\t\t\t\"CAP_AUDIT_WRITE\",\n\t\t\t\t\"CAP_KILL\",\n\t\t\t\t\"CAP_NET_BIND_SERVICE\"\n\t\t\t],\n\t\t\t\"inheritable\": [\n\t\t\t\t\"CAP_AUDIT_WRITE\",\n\t\t\t\t\"CAP_KILL\",\n\t\t\t\t\"CAP_NET_BIND_SERVICE\"\n\t\t\t],\n\t\t\t\"permitted\": [\n\t\t\t\t\"CAP_AUDIT_WRITE\",\n\t\t\t\t\"CAP_KILL\",\n\t\t\t\t\"CAP_NET_BIND_SERVICE\"\n\t\t\t],\n\t\t\t\"ambient\": [\n\t\t\t\t\"CAP_AUDIT_WRITE\",\n\t\t\t\t\"CAP_KILL\",\n\t\t\t\t\"CAP_NET_BIND_SERVICE\"\n\t\t\t]\n\t\t},\n\t\t\"rlimits\": [\n\t\t\t{\n\t\t\t\t\"type\": \"RLIMIT_NOFILE\",\n\t\t\t\t\"hard\": 1024,\n\t\t\t\t\"soft\": 1024\n\t\t\t}\n\t\t],\n\t\t\"noNewPrivileges\": true\n\t},\n\t\"root\": {\n\t\t\"path\": \"rootfs\",\n\t\t\"readonly\": true\n\t},\n\t\"hostname\": \"runc\",\n\t\"mounts\": [\n\t\t{\n\t\t\t\"destination\": \"/proc\",\n\t\t\t\"type\": \"proc\",\n\t\t\t\"source\": \"proc\"\n\t\t},\n\t\t{\n\t\t\t\"destination\": \"/dev\",\n\t\t\t\"type\": \"tmpfs\",\n\t\t\t\"source\": \"tmpfs\",\n\t\t\t\"options\": 
[\n\t\t\t\t\"nosuid\",\n\t\t\t\t\"strictatime\",\n\t\t\t\t\"mode=755\",\n\t\t\t\t\"size=65536k\"\n\t\t\t]\n\t\t},\n\t\t{\n\t\t\t\"destination\": \"/dev/pts\",\n\t\t\t\"type\": \"devpts\",\n\t\t\t\"source\": \"devpts\",\n\t\t\t\"options\": [\n\t\t\t\t\"nosuid\",\n\t\t\t\t\"noexec\",\n\t\t\t\t\"newinstance\",\n\t\t\t\t\"ptmxmode=0666\",\n\t\t\t\t\"mode=0620\",\n\t\t\t\t\"gid=5\"\n\t\t\t]\n\t\t},\n\t\t{\n\t\t\t\"destination\": \"/dev/shm\",\n\t\t\t\"type\": \"tmpfs\",\n\t\t\t\"source\": \"shm\",\n\t\t\t\"options\": [\n\t\t\t\t\"nosuid\",\n\t\t\t\t\"noexec\",\n\t\t\t\t\"nodev\",\n\t\t\t\t\"mode=1777\",\n\t\t\t\t\"size=65536k\"\n\t\t\t]\n\t\t},\n\t\t{\n\t\t\t\"destination\": \"/dev/mqueue\",\n\t\t\t\"type\": \"mqueue\",\n\t\t\t\"source\": \"mqueue\",\n\t\t\t\"options\": [\n\t\t\t\t\"nosuid\",\n\t\t\t\t\"noexec\",\n\t\t\t\t\"nodev\"\n\t\t\t]\n\t\t},\n\t\t{\n\t\t\t\"destination\": \"/sys\",\n\t\t\t\"type\": \"sysfs\",\n\t\t\t\"source\": \"sysfs\",\n\t\t\t\"options\": [\n\t\t\t\t\"nosuid\",\n\t\t\t\t\"noexec\",\n\t\t\t\t\"nodev\",\n\t\t\t\t\"ro\"\n\t\t\t]\n\t\t},\n\t\t{\n\t\t\t\"destination\": \"/sys/fs/cgroup\",\n\t\t\t\"type\": \"cgroup\",\n\t\t\t\"source\": \"cgroup\",\n\t\t\t\"options\": [\n\t\t\t\t\"nosuid\",\n\t\t\t\t\"noexec\",\n\t\t\t\t\"nodev\",\n\t\t\t\t\"relatime\",\n\t\t\t\t\"ro\"\n\t\t\t]\n\t\t}\n\t],\n\t\"linux\": {\n\t\t\"resources\": {\n\t\t\t\"devices\": [\n\t\t\t\t{\n\t\t\t\t\t\"allow\": false,\n\t\t\t\t\t\"access\": \"rwm\"\n\t\t\t\t}\n\t\t\t]\n\t\t},\n\t\t\"namespaces\": [\n\t\t\t{\n\t\t\t\t\"type\": \"pid\"\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"type\": \"network\"\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"type\": \"ipc\"\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"type\": \"uts\"\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"type\": \"mount\"\n\t\t\t}\n\t\t],\n\t\t\"maskedPaths\": 
[\n\t\t\t\"/proc/acpi\",\n\t\t\t\"/proc/asound\",\n\t\t\t\"/proc/kcore\",\n\t\t\t\"/proc/keys\",\n\t\t\t\"/proc/latency_stats\",\n\t\t\t\"/proc/timer_list\",\n\t\t\t\"/proc/timer_stats\",\n\t\t\t\"/proc/sched_debug\",\n\t\t\t\"/sys/firmware\",\n\t\t\t\"/proc/scsi\"\n\t\t],\n\t\t\"readonlyPaths\": [\n\t\t\t\"/proc/bus\",\n\t\t\t\"/proc/fs\",\n\t\t\t\"/proc/irq\",\n\t\t\t\"/proc/sys\",\n\t\t\t\"/proc/sysrq-trigger\"\n\t\t]\n\t}\n}\n\n注意 config.json 和rootfs是同一级\nroot@cld-dnode1-1051:/home/zouxiang/runc# tree -L 1\n.\n├── config.json\n└── rootfs\n```\n\n（3）运行容器\n\n```\nroot@cld-dnode1-1051:/home/zouxiang/runc# runc run container1\n/ # ps aux\nPID   USER     TIME  COMMAND\n    1 root      0:00 sh\n    7 root      0:00 ps aux\n\n另一个窗口就能看到\nroot@cld-dnode1-1051:/home/zouxiang# runc list\nID           PID         STATUS      BUNDLE                CREATED                          OWNER\ncontainer1   2040317     running     /home/zouxiang/runc   2022-01-26T08:14:57.916602955Z   root\n```\n\n#### 2.5 dockerd\n\ndockerd 是 Docker 服务端的后台常驻进程，用来接收客户端发送的请求，执行具体的处理任务，处理完成后将结果返回给客户端。\n\ndocker run/ps 是客户端。dockerd是服务器端。但是dockerd不是真正干活的，正在干活的是containerd。\n\n#### 2.6 containerd\n\ncontainerd 组件是从 Docker 1.11 版本正式从 dockerd 中剥离出来的，它的诞生完全遵循 OCI 标准，是容器标准化后的产物。containerd 完全遵循了 OCI 标准，并且是完全社区化运营的，因此被容器界广泛采用。\n\ncontainerd 不仅负责容器生命周期的管理，同时还负责一些其他的功能：\n\n- 镜像的管理，例如容器运行前从镜像仓库拉取镜像到本地；\n- 接收 dockerd 的请求，通过适当的参数调用 runc 启动容器；\n- 管理存储相关资源；\n- 管理网络相关资源。\n\ncontainerd 包含一个后台常驻进程，默认的 socket 路径为 /run/containerd/containerd.sock，dockerd 通过 UNIX 套接字向 containerd 发送请求，containerd 接收到请求后负责执行相关的动作并把执行结果返回给 dockerd。\n\n如果你不想使用 dockerd，也可以直接使用 containerd 来管理容器，由于 containerd 更加简单和轻量，生产环境中越来越多的人开始直接使用 containerd 来管理容器。\n\n#### 2.7 **containerd-shim**\n\ncontainerd-shim 的意思是垫片，类似于拧螺丝时夹在螺丝和螺母之间的垫片。containerd-shim 的主要作用是将 containerd 和真正的容器进程解耦，使用 containerd-shim 作为容器进程的父进程，从而实现重启 containerd 不影响已经启动的容器进程。\n\n```\nroot@cld-dnode1-1051:/usr/bin# containerd-shim -h\nUsage of containerd-shim:\n  -address string\n    \tgrpc 
address back to main containerd\n  -containerd-binary containerd publish\n    \tpath to containerd binary (used for containerd publish) (default \"containerd\")\n  -criu string\n    \tpath to criu binary\n  -debug\n    \tenable debug output in logs\n  -namespace string\n    \tnamespace that owns the shim\n  -runtime-root string\n    \troot directory for the runtime (default \"/run/containerd/runc\")\n  -socket string\n    \tabstract socket path to serve\n  -systemd-cgroup\n    \tset runtime to use systemd-cgroup\n  -workdir string\n    \tpath used to storge large temporary data\n```\n\n#### 2.8 ctr \n\nctr 实际上是 containerd-ctr，它是 containerd 的客户端，主要用来开发和调试，在没有 dockerd 的环境中，ctr 可以充当 docker 客户端的部分角色，直接向 containerd 守护进程发送操作容器的请求。\n\n```\nroot@cld-dnode1-1051:/usr/bin# ctr -h\nNAME:\n   ctr -\n        __\n  _____/ /______\n / ___/ __/ ___/\n/ /__/ /_/ /\n\\___/\\__/_/\n\ncontainerd CLI\n\n\nUSAGE:\n   ctr [global options] command [command options] [arguments...]\n\nVERSION:\n   1.2.13\n\nCOMMANDS:\n     plugins, plugin           provides information about containerd plugins\n     version                   print the client and server versions\n     containers, c, container  manage containers\n     content                   manage content\n     events, event             display containerd events\n     images, image, i          manage images\n     leases                    manage leases\n     namespaces, namespace     manage namespaces\n     pprof                     provide golang pprof outputs for containerd\n     run                       run a container\n     snapshots, snapshot       manage snapshots\n     tasks, t, task            manage tasks\n     install                   install a new package\n     shim                      interact with a shim directly\n     cri                       interact with cri plugin\n     help, h                   Shows a list of commands or help for one command\n\nGLOBAL OPTIONS:\n   --debug                      enable debug output in 
logs\n   --address value, -a value    address for containerd's GRPC server (default: \"/run/containerd/containerd.sock\")\n   --timeout value              total timeout for ctr commands (default: 0s)\n   --connect-timeout value      timeout for connecting to containerd (default: 0s)\n   --namespace value, -n value  namespace to use with commands (default: \"default\") [$CONTAINERD_NAMESPACE]\n   --help, -h                   show help\n   --version, -v                print the version\n```\n\n<br>\n\n#### 2.9 组件总结\n\n| 组件类别           | 组件名称        | 核心功能                                                     |\n| ------------------ | --------------- | ------------------------------------------------------------ |\n| docker相关组件     | Docker          | docker 的客户端，负责发送 docker 操作请求                    |\n| docker相关组件     | Dockerd         | docker 服务端的入口，负责处理客户端请求                      |\n| docker相关组件     | Docker-init     | 使用 docker-init 作为1号进程（当业务1号进程没有回收僵尸进程的能力时）。 |\n| docker相关组件     | Docker-proxy    | docker 网络实现，通过操作 iptables 实现。                    |\n| Containerd相关组件 | Containerd      | 负责管理容器生命周期，通过接收 dockerd 的请求，执行启动或者销毁容器等操作 |\n| Containerd相关组件 | Containerd-shim | 将真正运行的容器进程和 containerd 解耦，Containerd-shim 作为容器进程的父进程 |\n| Containerd相关组件 | Ctr             | Containerd的客户端，可以直接向containerd发送容器操作的请求，主要用于开发和调试 |\n| 容器运行时组件     | Runc            | 通过 namespaces, cgroups 等内核机制，实现容器的创建和运行    |\n\n<br>\n\n### 3. 进程关系\n\n查看进程树，发现进程关系为：\n\n```\ndocker     ctr\n  |         |\n  V         V\ndockerd -> containerd ---> shim -> runc -> runc init -> process\n                      |-- > shim -> runc -> runc init -> process\n                      +-- > shim -> runc -> runc init -> process\n```\n\n<br>\n\n```\nroot     3250772  ...    /usr/bin/dockerd -p /var/run/docker.pid\nroot        2010 ...  /usr/bin/containerd\nroot     3467567  ... 
containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/cabf53bfcd5f079159b8891520c2c2c0dee811568f7d0942b80dd8d12459ab06 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc\n```\n\ndockerd, containerd 的父进程都是1号进程。从进程树来看，两者并没有直接的父子关系。\n\n但从这篇文章可以看出来，[docker进程模型，架构分析](https://segmentfault.com/a/1190000011294361)，containerd 进程是 docker 启动的。\n\n<br>\n\n### 4. docker为什么是这种结构\n\n当 Kubelet 想要创建一个**容器**时, 有这么几步:\n\n1. Kubelet 通过 **CRI 接口**(gRPC) 调用 dockershim, 请求创建一个容器. **CRI** 即容器运行时接口(Container Runtime Interface), 这一步中, Kubelet 可以视作一个简单的 CRI Client, 而 dockershim 就是接收请求的 Server. 目前 dockershim 的代码其实是内嵌在 Kubelet 中的, 所以接收调用的凑巧就是 Kubelet 进程;\n2. dockershim 收到请求后, 转化成 Docker Daemon 能听懂的请求, 发到 Docker Daemon 上请求创建一个容器;\n3. Docker Daemon 早在 1.11 版本中就已经将针对容器的操作移到另一个守护进程: containerd 中了, 因此 Docker Daemon 仍然不能帮我们创建容器, 而是要请求 containerd 创建一个容器;\n4. containerd 收到请求后, 并不会自己直接去操作容器, 而是创建一个叫做 containerd-shim 的进程, 让 containerd-shim 去操作容器. 这是因为容器进程需要一个父进程来做诸如收集状态, 维持 stdin 等 fd 打开等工作. 而假如这个父进程就是 containerd, 那每次 containerd 挂掉或升级, 整个宿主机上所有的容器都得退出了. 而引入了 containerd-shim 就规避了这个问题(containerd 和 shim 并不是父子进程关系);\n5. 我们知道创建容器需要做一些设置 namespaces 和 cgroups, 挂载 root filesystem 等等操作, 而这些事该怎么做已经有了公开的规范了, 那就是 [OCI(Open Container Initiative, 开放容器标准)](https://link.zhihu.com/?target=https%3A//github.com/opencontainers/runtime-spec). 它的一个参考实现叫做 [runc](https://link.zhihu.com/?target=https%3A//github.com/opencontainers/runc). 于是, containerd-shim 在这一步需要调用 `runc` 这个命令行工具, 来启动容器;\n6. 
`runc` 启动完容器后本身会直接退出, containerd-shim 则会成为容器进程的父进程, 负责收集容器进程的状态, 上报给 containerd, 并在容器中 pid 为 1 的进程退出后接管容器中的子进程进行清理, 确保不会出现僵尸进程;\n\n\n\n![image-20220226174613059](./image/struct-1.png)\n\n\n\n（1）为什么需要dockershim?\n\n因为k8s定义了CRI，这样可以和docker, rkt等容器运行时解耦。但是dockerd并不支持CRI，所以需要dockershim进行一次转换。\n\n（2）为什么需要containerd\n\n其实 k8s 最开始的 Runtime 架构远没这么复杂: kubelet 想要创建容器直接跟 Docker Daemon 说一声就行, 而那时也不存在 containerd, Docker Daemon 自己调一下 `libcontainer` 这个库把容器跑起来, 整个过程就搞完了.\n\n但是大佬们为了不让容器运行时标准被 Docker 一家公司控制, 于是就撺掇着搞了开放容器标准 OCI. Docker 则把 `libcontainer` 封装了一下, 变成 runC 捐献出来作为 OCI 的参考实现。\n\n所以：runc 的前身就是 libcontainer\n\ncontainerd 就变成了负责兼容、处理客户端请求的那一层，具体执行则交给 runc\n\n（3）为什么需要containerd-shim\n\n主要作用是将 containerd 和真正的容器进程解耦，使用 containerd-shim 作为容器进程的父进程，从而实现重启 containerd 不影响已经启动的容器进程。\n\n### 5. 参考文档\n\n[组件组成：剖析 Docker 组件作用及其底层工作原理](https://blog.csdn.net/qq_34556414/article/details/112247223)\n\n[系列好文 ｜ Kubernetes 弃用 Docker，我们该何去何从？](http://blog.itpub.net/70002215/viewspace-2779207/)\n\n[docker进程模型，架构分析](https://segmentfault.com/a/1190000011294361)\n\n[白话 Kubernetes Runtime](https://zhuanlan.zhihu.com/p/58784095)\n\n[Docker源码分析](https://www.huweihuang.com/article/docker/code-analysis/code-analysis-of-docker-server/)\n\n[docker exec 失败问题排查之旅](https://xyz.uscwifi.xyz/post/DdS5a690E/)\n\n[kubectl exec 是怎么工作的](https://www.techclone.cn/post/tech/k8s/k8s-exec-failure/)"
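上面提到 containerd-shim 会作为容器进程的父进程，在容器内 1 号进程退出后负责回收子进程、避免僵尸进程。下面用一段最小的 Python 示例（假设在 Linux 上运行，与 docker 本身无关，仅演示原理）展示"父进程调用 waitpid 之前，已退出的子进程会一直处于 Z 状态"：

```python
import os
import time

# 父进程 fork 出一个立刻退出的子进程
pid = os.fork()
if pid == 0:
    os._exit(7)          # 子进程立刻退出，在父进程 wait 之前它就是僵尸进程

time.sleep(0.2)          # 给子进程退出留一点时间

# /proc/<pid>/stat 里 ")" 之后的第一个字段是进程状态，僵尸进程为 Z
with open("/proc/{}/stat".format(pid)) as f:
    state = f.read().rsplit(")", 1)[1].split()[0]
print("state before waitpid:", state)

# 父进程（类比 containerd-shim 对容器 1 号进程做的事）调用 waitpid 回收子进程
reaped, status = os.waitpid(pid, 0)
print("reaped:", reaped == pid, "exit code:", os.WEXITSTATUS(status))
```

这也正是第 6 步里 shim 的职责：容器内 1 号进程退出后，由 shim 来完成这里父进程所做的 waitpid 清理工作。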
  },
  {
    "path": "docker/9. docker问题链路排查实例.md",
    "content": "* [1\\. 确定问题](#1-确定问题)\r\n* [2\\. 开始排查](#2-开始排查)\r\n  * [2\\.1 排除是否是dockerd出现了问题](#21-排除是否是dockerd出现了问题)\r\n  * [2\\.2 排除是否是containerd出现了问题](#22-排除是否是containerd出现了问题)\r\n* [3\\.参考文档](#3参考文档)\r\n\r\n再熟悉docker核心组件的基础上，以docker exec ls 执行失败为例。提供思路：排查docker哪个组件出现了问题。\r\n\r\n### 1. 确定问题\r\n\r\n以exec容器里面执行 ls为例\r\n\r\n```\r\nroot@k8s-node:~# docker ps \r\nCONTAINER ID        IMAGE                         COMMAND                  CREATED             STATUS              PORTS               NAMES\r\n490008a7de69        26a9afb7027c                  \"sleep 3600\"             19 minutes ago      Up 19 minutes                           k8s_nginx_nginx1_default_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21_1266\r\ne93a3ae70771        lizhenliang/pause-amd64:3.0   \"/pause\"                 7 weeks ago         Up 7 weeks                              k8s_POD_nginx1_default_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21_0\r\nc3a457fe7cc5        e6ea68648f0c                  \"/opt/bin/flanneld -…\"   7 weeks ago         Up 7 weeks                              k8s_kube-flannel_kube-flannel-ds-97qn4_kube-system_a2533c24-53f2-4dea-86df-4c61de17415f_0\r\n6fb0829d3f0d        lizhenliang/pause-amd64:3.0   \"/pause\"                 7 weeks ago         Up 7 weeks                              k8s_POD_kube-flannel-ds-97qn4_kube-system_a2533c24-53f2-4dea-86df-4c61de17415f_0\r\n```\r\n假设执行 `docker exec -it 490008a7de69 ls`出现了问题。\r\n```\r\nroot@k8s-node:~# docker exec -it 490008a7de69 ls\r\nbin            etc            mnt            run            tmp\r\ncacert.pem     home           opt            sbin           usr\r\ndev            lib            proc           srv            var\r\nentrypoint.sh  media          root           sys\r\n```\r\n\r\n### 2. 
开始排查\r\n先弄清kubelet-> docker的调用链路\r\n\r\n![image.png](./image/struct-1.png)\r\n\r\n一般而言Kubelet->docker 是不是有问题很好排除。这里主要介绍当docker出现了问题，定位到时哪里出现了问题。\r\n\r\n#### 2.1 排除是否是dockerd出现了问题\r\n\r\ndockerd只是一个服务器端，它其实就是一个工具人，最终请求的都是转发到containerd进行处理的。\r\n\r\n这里利用了一个工具就是ctr。 之前叫docker-containerd-ctr，安装docker的时候会自动安装这个。\r\n这个工具就是用来调试的。\r\n\r\nctr 常见操作如下：\r\n注意：-a 是 address的意思。这个一定要指定socket。这个可以 `ps -ef | grep socket` 找出来。\r\n\r\n```\r\n查看有哪些命名空间的容器（和Pod的ns不是一个东西）\r\nroot@k8s-node:~# ctr -a /var/run/docker/containerd/containerd.sock namespaces ls\r\nNAME LABELS \r\nmoby   \r\n\r\n查看moby ns下有哪些容器，这个其实就是对应的docker ps的容器\r\nroot@k8s-node:~# ctr -a /var/run/docker/containerd/containerd.sock -n moby containers ls \r\nCONTAINER                                                           IMAGE    RUNTIME                           \r\n6fb0829d3f0dae7f8e0328ef88748ed1c7bdb8d6783059461c790031232da19d    -        io.containerd.runtime.v1.linux    \r\n97a519dcd3d6622b9650af95450fbb2b9e6c4761c277c43dd9e7b0e9f74e703d    -        io.containerd.runtime.v1.linux    \r\nc3a457fe7cc56185375ff67faa34a0141712c09f7b12f740f4fe4ebf18023984    -        io.containerd.runtime.v1.linux    \r\ne93a3ae70771ca0e4954fcb6ecf0ffd091eebfc64bcb3cbf461c94eb5474c9aa    -        io.containerd.runtime.v1.linux    \r\n\r\nroot@k8s-node:~# docker ps\r\nCONTAINER ID        IMAGE                         COMMAND                  CREATED             STATUS              PORTS               NAMES\r\n97a519dcd3d6        26a9afb7027c                  \"sleep 3600\"             12 minutes ago      Up 12 minutes                           k8s_nginx_nginx1_default_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21_1267\r\ne93a3ae70771        lizhenliang/pause-amd64:3.0   \"/pause\"                 7 weeks ago         Up 7 weeks                              k8s_POD_nginx1_default_cc8a9cfb-872c-44ba-9899-b4c8bbc93a21_0\r\nc3a457fe7cc5        e6ea68648f0c                  \"/opt/bin/flanneld -…\"   7 weeks ago         Up 7 weeks                         
     k8s_kube-flannel_kube-flannel-ds-97qn4_kube-system_a2533c24-53f2-4dea-86df-4c61de17415f_0\r\n6fb0829d3f0d        lizhenliang/pause-amd64:3.0   \"/pause\"                 7 weeks ago         Up 7 weeks                              k8s_POD_kube-flannel-ds-97qn4_kube-system_a2533c24-53f2-4dea-86df-4c61de17415f_0\r\n```\r\n\r\n**排查是不是dockerd出现了问题**\r\n\r\n* 如果 docker exec ls 有问题，但是 ctr exec ls 没有问题，那就是 dockerd 有问题，因为 ctr 替代 dockerd 发送了命令。\r\n\r\n* 如果 docker exec ls 有问题，ctr exec ls 也出现了同样的问题，那 dockerd 没有问题，是后面某一层出现了问题。\r\n\r\n\r\n下面是参数介绍: （可以结合ctr -h查看）\r\n-a address 指定socket\r\n-n namespaces 指定ns\r\nt tasks 表示要执行一个任务\r\nexec 表示是exec类型的任务\r\n--exec-id 表示任务Id，后面的stupig1是随便起的一个名字，aa/bb都可以\r\n```\r\nroot@k8s-node:~# ctr -a /var/run/docker/containerd/containerd.sock  -n moby t exec --exec-id stupig1 97a519dcd3d6622b9650af95450fbb2b9e6c4761c277c43dd9e7b0e9f74e703d ls\r\nbin\r\ncacert.pem\r\ndev\r\nentrypoint.sh\r\netc\r\nhome\r\nlib\r\nmedia\r\nmnt\r\nopt\r\nproc\r\nroot\r\nrun\r\nsbin\r\nsrv\r\nsys\r\ntmp\r\nusr\r\nvar\r\nroot@k8s-node:~# docker exec 97a519dcd3d6622b9650af95450fbb2b9e6c4761c277c43dd9e7b0e9f74e703d ls\r\nbin\r\ncacert.pem\r\ndev\r\nentrypoint.sh\r\netc\r\nhome\r\nlib\r\nmedia\r\nmnt\r\nopt\r\nproc\r\nroot\r\nrun\r\nsbin\r\nsrv\r\nsys\r\ntmp\r\nusr\r\nvar\r\n```\r\n\r\n#### 2.2 排除是否是containerd出现了问题\r\n\r\ndockerd -> containerd -> containerd-shim -> runc\r\n\r\n由于 containerd 这一层包含了 containerd + containerd-shim，这两个都不好单独排查，所以直接使用 runc 排查。\r\n\r\n/var/run/docker/runtime-runc/moby/ 是root目录，这个是containerd运行的时候指定的\r\n\r\n```\r\nroot@k8s-node:~/docker# runc --root /var/run/docker/runtime-runc/moby/ exec 97a519dcd3d6622b9650af95450fbb2b9e6c4761c277c43dd9e7b0e9f74e703d ls\r\nbin\r\ncacert.pem\r\ndev\r\nentrypoint.sh\r\netc\r\nhome\r\nlib\r\nmedia\r\nmnt\r\nopt\r\nproc\r\nroot\r\nrun\r\nsbin\r\nsrv\r\nsys\r\ntmp\r\nusr\r\nvar\r\nroot@k8s-node:~/docker# \r\nroot@k8s-node:~/docker# ps -ef |grep moby  
查看root目录。这个是containerd运行的时候指定的\r\n```\r\n\r\n如果runc执行没有问题，那就是containerd有问题，否则就是runc有问题。\r\n\r\nrunc有问题的时候会打印log日志。或者直接debug模式查看具体过程。\r\n\r\n如果是docker有问题，docker是有日志输出的\r\n\r\n```\r\nroot@k8s-node:~# runc  --debug --root /var/run/docker/runtime-runc/moby/ exec  d6cef7d7206d22873050d3c5b303b32d962803bb53ddb6c3386e5b1ead3cbf5d  ls \r\nDEBU[0000] nsexec:601 nsexec started                    \r\nDEBU[0000] child process in init()                      \r\nDEBU[0000] logging has already been configured          \r\nbin\r\ncacert.pem\r\ndev\r\nentrypoint.sh\r\netc\r\nhome\r\nlib\r\nmedia\r\nmnt\r\nopt\r\nproc\r\nroot\r\nrun\r\nsbin\r\nsrv\r\nsys\r\ntmp\r\nusr\r\nvar\r\nDEBU[0000] log pipe has been closed: EOF                \r\nDEBU[0000] process exited                                pid=3901 status=0\r\n```\r\n\r\n\r\n\r\n### 3.参考文档\r\n\r\n\r\n[containerd的本地CLI工具ctr使用](https://www.mdnice.com/writing/78929e9fe39442fbba982009faf371b1)\r\n\r\n[docker exec 失败问题排查之旅](https://plpan.github.io/docker-exec-%E5%A4%B1%E8%B4%A5%E9%97%AE%E9%A2%98%E6%8E%92%E6%9F%A5%E4%B9%8B%E6%97%85/)"
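上面"逐层替代发请求"的排查思路可以总结成一个简单的判定逻辑。下面用一小段 Python 示意（函数 suspect_layer 是本文为演示自拟的，并非真实存在的工具）：

```python
def suspect_layer(docker_ok: bool, ctr_ok: bool, runc_ok: bool) -> str:
    """按本文思路定位出问题的层：
    docker exec 走完整链路；ctr 绕过 dockerd；runc 绕过 containerd(+shim)。
    """
    if docker_ok:
        return "链路正常"
    if ctr_ok:
        # ctr 替代 dockerd 发请求也能成功，说明问题在 dockerd
        return "dockerd"
    if runc_ok:
        # 直接用 runc 能成功，说明问题在 containerd / containerd-shim
        return "containerd 或 containerd-shim"
    # runc 也失败，结合 runc --debug 日志继续排查
    return "runc 或更底层"

print(suspect_layer(False, True, True))   # dockerd
```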
  },
  {
    "path": "docker/其他/补充-僵尸进程处理.md",
"content": "### 1. 背景\n\n在使用容器时，用户不当的使用可能会造成大量的僵尸进程没有被回收，从而导致容器 kill 失败：发送 kill -9、kill -15 信号都没有反应。\n\n因此针对这个问题，输出一个处理报告。该报告分为两个部分：用户如何预防僵尸进程的产生？如果确定产生了僵尸进程，我们如何解决？\n\n<br>\n\n### 2. 如何预防僵尸进程的产生\n\n#### 2.1 僵尸进程的产生\n\n在UNIX 系统中，**任何一个子进程(init除外)在exit()之后，并非马上就消失掉，而是留下一个称为僵尸进程(Zombie)的数据结构，等待父进程处理。**这是每个子进程在结束时都要经过的阶段。如果子进程在exit()之后，父进程没有来得及处理，这时用ps命令就能看到子进程的状态是“Z”。如果父进程能及时处理，可能用ps命令就来不及看到子进程的僵尸状态，但这并不等于子进程不经过僵尸状态。如果父进程在子进程结束之前退出，则子进程将由init接管。init将会以父进程的身份对僵尸状态的子进程进行处理。\n\n<br>\n\n#### 2.2 如何回收僵尸进程\n\n**核心点：**  让容器的1号进程可以回收僵尸进程\n\n##### 方法1  用户层次解决\n\n1、父进程通过wait和waitpid等函数等待子进程结束，这会导致父进程挂起\n\n2、如果父进程很忙，那么可以用signal函数为SIGCHLD安装handler，因为子进程结束后，父进程会收到该信号，可以在handler中调用wait回收\n\n3、如果父进程不关心子进程什么时候结束，那么可以用signal(SIGCHLD, SIG_IGN) 通知内核，自己对子进程的结束不感兴趣，那么子进程结束后，内核会回收，并不再给父进程发送信号\n\n<br>\n\n##### 方法2  容器层次解决\n\n在镜像中替换1号进程\n\n某些时候，用户运行在容器中的1号进程没办法处理僵尸进程，这个时候就需要引入init进程，让init进程为1号进程，用户需要运行的进程为子进程。这样用户进程创造出来的僵尸进程在用户进程死掉之后，init进程可以回收。\n\n目前常见的做法是在镜像中加入 [tini](https://github.com/krallin/tini) 或 [dumb-init](https://github.com/Yelp/dumb-init) 实现，范例如下（详细建议阅读官方 guide）：\n\n```\n## 使用tini作为1号进程\n# Add Tini\nENV TINI_VERSION v0.18.0\nADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini\nRUN chmod +x /tini\nENTRYPOINT [\"/tini\", \"--\"]\n# Run your program under Tini\nCMD [\"/your/program\", \"-and\", \"-its\", \"arguments\"]\n# or docker run your-image /your/program ...\n\n## 或者使用dumb-init作为1号进程\n# Runs \"/usr/bin/dumb-init -- /my/script --with --args\"\nENTRYPOINT [\"/usr/bin/dumb-init\", \"--\"]\n## 用户需要执行的代码\nCMD [\"/my/script\", \"--with\", \"--args\"]\n```\n\n**实验：**\n\n**构造用户示例代码，该代码会产生一个僵尸进程**\n\n```\nimport os\nimport subprocess\n\n\npid = os.fork()\nif pid == 0:  # child\n    pid2 = os.fork()\n    if pid2 != 0:  # parent\n        print('The zombie pid will be: {}'.format(pid2))\nelse:  # parent\n    os.waitpid(pid, 0)\n    subprocess.check_call(('ps', 'xawuf'))\n```\n\n**对应的Dockerfile**\n\n```\nFROM python:3\nCOPY test.sh /root/\nCMD 
[\"/root/test.sh\"]\n```\n\n\n\n**运行后的结果**\n\n出现了一个僵尸进程（状态为 Z）：\n\n```\nroot@cld-dnode1-1091:/home/zouxiang/DockerFiles# docker run --rm zoux/tini:sh2\nThe zombie pid will be: 7\nUSER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND\nroot           1 14.0  0.0  14140 11476 ?        Ss   08:04   0:00 python3 /root/test.sh\nroot           7  0.0  0.0      0     0 ?        Z    08:04   0:00 [python3] <defunct>\nroot           8  0.0  0.0   9392  3000 ?        R    08:04   0:00 ps xawuf\n```\n\n<br>\n\n**对比：使用tini**\n\n用户代码不需要改变。修改Dockerfile如下：\n\n```\nFROM python:3\n\nADD tini /                      ##增加tini，将其作为1号进程\nRUN chmod +x /tini\nENTRYPOINT [\"/tini\",\"--\"]\n\nCOPY test.sh /root/\nCMD [\"/root/test.sh\"]\n```\n\n<br>\n\n**运行后的结果：**\n\n**8号僵尸进程已经被回收**\n\n```\nroot@cld-dnode1-1091:/home/zouxiang/DockerFiles# docker run --rm zoux/tini:sh3\nThe zombie pid will be: 8\nUSER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND\nroot           1  0.0  0.0   2280   756 ?        Ss   08:04   0:00 /tini -- /root/test.sh /root/test.sh\nroot           6  0.0  0.0  14140 11580 ?        S    08:04   0:00 python3 /root/test.sh /root/test.sh\nroot           9  0.0  0.0   9392  3044 ?        
R    08:04   0:00  \\_ ps xawuf\n```\n\n\n\n除了tini或者dumb-init外，用户也可以自己定制 init进程，例如：https://github.com/fpco/pid1\n\n<br>\n\n##### 方法3  k8s层次解决\n\n让pause成为init进程，并且回收僵尸进程。\n\n每个 K8s Pod 有一个 [pause](https://github.com/kubernetes/kubernetes/blob/master/build/pause/pause.c) 容器组件，一般我们说起它的功能就是 Pod 内容器共享网络。其实除了共享网络（剩下的时间基本都在\"睡觉\"）之外，它还会捕获僵尸进程。默认 K8s Pod 内的 PID namespace 是不共享的，早期我们可以通过 kubelet `--docker-disable-shared-pid=false` 选项开启 Pod 内 PID namespace 共享，如此对应节点的 Pod 中 PID 为 1 的进程就是 pause 了，它便可以捕获处理僵尸进程了。kubelet 选项有一个坏处，就是调度到节点的 Pod 都会共享 PID namespace，社区就觉得应该移除这个选项，在 Pod 层实现，社区讨论见 [Remove `–docker-disable-shared-pid` from kubelet](https://github.com/kubernetes/kubernetes/issues/41938) 。在 K8s 1.10 就开始支持 Pod Spec 添加 `ShareProcessNamespace` 字段，支持在 Pod 层开启 PID namespace 共享。\n\n硬性条件：docker >= 1.13.1。此时pause有回收僵尸进程的能力。\n\n<br>\n\n### 3. 处理僵尸进程\n\n如果真的出现了僵尸进程，导致pod kill失败，应该如何处理？\n\n目前调研来看最常用的解决方法就是： \n\n（1） kill 僵尸进程的父进程，这样僵尸进程会被init接管并回收。\n\n（2）重启docker"
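针对上面"方法1"中的第 2 点（为 SIGCHLD 安装 handler 并在其中调用 wait 回收），下面给出一个最小的 Python 示例（假设在 Linux 上运行，仅为演示）：父进程注册 SIGCHLD handler，fork 出的子进程退出后都会被非阻塞的 waitpid 回收，不会留下僵尸进程。

```python
import os
import signal
import time

reaped = []

def on_sigchld(signum, frame):
    # 循环调用非阻塞 waitpid，把所有已退出的子进程都回收掉
    # （多个 SIGCHLD 可能被合并成一次投递，所以必须循环）
    while True:
        try:
            pid, _ = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break
        if pid == 0:
            break
        reaped.append(pid)

signal.signal(signal.SIGCHLD, on_sigchld)

children = []
for _ in range(3):
    pid = os.fork()
    if pid == 0:
        os._exit(0)      # 子进程直接退出
    children.append(pid)

time.sleep(0.5)          # 等待信号处理完成
print(sorted(reaped) == sorted(children))
```

如果父进程根本不关心子进程的退出状态，更简单的做法就是正文第 3 点的 signal(SIGCHLD, SIG_IGN)，由内核直接回收。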
  },
  {
    "path": "docker/其他/补充-容器进程.md",
    "content": "### 1. 为什么杀不死 容器的1号进程\n\n动作：在容器内杀死1号进程。\n\n```\n# kubectl exec -it zx-hpa-7c669876bb-bddsr -n test-zx /bin/sh\n# ps -ef\nUID          PID    PPID  C STIME TTY          TIME CMD\nroot           1       0  0 07:57 ?        00:00:00 sleep 3600\nroot        3896       0  0 08:30 pts/0    00:00:00 /bin/sh\nroot        3912    3896  0 08:30 pts/0    00:00:00 ps -ef\n#\n# kill 1\n#\n# kill -9 1\n```\n\n<br>\n\n原因：出现了信号屏蔽\n\n如何检查进程正在监听的信号？https://qastack.cn/unix/85364/how-can-i-check-what-signals-a-process-is-listening-to\n\nSIGTERM（15）和 SIGKILL（9）\n\n| 1号进程 | kill -9 1 | kill 1 |\n| ------- | --------- | ------ |\n| bash    | 不行      | 不行   |\n| c++     | 不行      | 不行   |\n| golang  | 不行      | 行     |\n\n<br>\n\n第一个概念是 Linux 1 号进程。它是第一个用户态的进程。它直接或者间接创建了 Namespace 中的其他进程。\n\n第二个概念是 Linux 信号。Linux 有 31 个基本信号，进程在处理大部分信号时有三个选择：忽略、捕获和缺省行为。其中两个特权信号 SIGKILL 和 SIGSTOP 不能被忽略或者捕获。\n\n容器里 1 号进程对信号处理的两个要点，这也是这一讲里我想让你记住的两句话：\n\n(1)  在容器中，1 号进程永远不会响应 SIGKILL 和 SIGSTOP 这两个特权信号；\n\n(2)   对于其他的信号，如果用户自己注册了 handler，1 号进程可以响应。\n\n<br>\n\n### 2. 如何通过找到容器的父进程\n\n```\n## 第一步：通过pod找到容器名字，这里容器名字为 zx-nginx\n\n## 第二步：通过容器名字，找到容器id\n # docker ps | grep zx-nginx\n8803c7c666d9        68cb644cdf30                                        \"./main\"                 22 minutes ago      Up 22 minutes                           k8s_zx-nginx_istio-ingressgateway-fc76bb8c9-667qv_test-zx_f24bdce8-277e-4e1f-8338-b6204068c6ec_1\n\n## 第三步：通过容器id，找到父进程id\n# ps -ef |grep 8803c7c666d9\nroot      961776    1703  0 14:45 ?        
00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/8803c7c666d918c37cd81891e586d2173db45d173debe567fe5aa56df12111b0 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc\nroot      997057  977180  0 15:01 pts/0    00:00:00 grep 8803c7c666d9\n\n## 父进程是containerd-shim, containerd-shim的父进程是 /usr/bin/containerd。\n# ps -ef | grep 1703\nroot        1703       1  1  2020 ?        2-19:54:09 /usr/bin/containerd\n\n## 通过父进程id还能反推回去找到 容器的1号进程 ./main\n# ps -ef | grep 961776\nroot      961776    1703  0 14:45 ?        00:00:03 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/8803c7c666d918c37cd81891e586d2173db45d173debe567fe5aa56df12111b0 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc\nroot      961793  961776 99 14:45 ?        00:19:07 ./main\nroot     1003020  977180  0 15:04 pts/0    00:00:00 grep 961776\n\n## 直接通过进程名也可以找到\n# ps -ef | grep /ma\nroot      961793  961776 99 14:45 ?        00:15:49 ./main\nroot      996535  977180  0 15:01 pts/0    00:00:00 grep /ma\n```\n\n<br>\n\n### 3. 如何找到容器和pod的cgroup设置\n\n#### 3.1 方法一\n\n（1）查看pod的yaml. 
获得以下信息\n\n```\npodName: zx-hpa-7c669876bb-cv8gm\nnodeip: XXXXX\nuid: c7902b75-88ed-473c-806d-9c419bcff548\nqosClass: Burstable\n```\n\n<br>\n\n（2）在node节点的对应路径上就可以看见\n\n```\n/sys/fs/cgroup/memory/kubepods/burstable# ls\ncgroup.clone_children\t\t    memory.usage_in_bytes\ncgroup.event_control\t\t    memory.use_hierarchy\ncgroup.procs\t\t\t    notify_on_release\nmemory.failcnt\t\t\t    pod1a337b4b-0c5a-4197-bc1a-f725b126f9df\nmemory.force_empty\t\t    pod210e5803-331b-4e13-9bf7-66d4854f6e3f\nmemory.kmem.failcnt\t\t    pod2f287b5c-f979-48d5-b8dd-6399164952bc\nmemory.kmem.limit_in_bytes\t    pod392ecb7f-e16c-4dbd-aed7-445f066a17da\nmemory.kmem.max_usage_in_bytes\t    pod4000ac86-1aa1-47ff-b194-3e80a8073fd7\nmemory.kmem.slabinfo\t\t    pod54d7ff80-cde1-47f1-b6ea-39ec0f34beb5\nmemory.kmem.tcp.failcnt\t\t    pod7ece8e2d-1a67-4fc5-a1be-bde0b64cc8c6\nmemory.kmem.tcp.limit_in_bytes\t    pod7f5ba05e-e962-40d3-a598-6e0a93a1b139\nmemory.kmem.tcp.max_usage_in_bytes  pod8820126f-4278-4f56-babd-00b751e82520\nmemory.kmem.tcp.usage_in_bytes\t    pod91cbcae8-1d2a-430a-a414-6d5f1c374c04\nmemory.kmem.usage_in_bytes\t    pod9a56e405-e0c1-4851-9242-18d6526d1aa5\nmemory.limit_in_bytes\t\t    poda08d46d1-0eeb-4cf9-bb01-7bdc07cfad08\nmemory.max_usage_in_bytes\t    podae792623-f895-4726-97a4-aafbaab23fea\nmemory.memsw.failcnt\t\t    podafd1b6ea-158b-464d-816b-d307d7b67ba0\nmemory.memsw.limit_in_bytes\t    podb57fe34b-5e44-472f-baba-837d85bb84fa\nmemory.memsw.max_usage_in_bytes     podb6628cac-fc22-4e21-a074-c54bd25a0204\nmemory.memsw.usage_in_bytes\t    podb6c4f1fc-566d-455b-b4d2-4c135dfc41ca\nmemory.move_charge_at_immigrate     podc7902b75-88ed-473c-806d-9c419bcff548       //就是这个pod\nmemory.numa_stat\t\t    podcbf6dc19-5058-4016-b9f1-e17ebd41a751\nmemory.oom_control\t\t    pode0651b8f-9c80-4c08-9bc9-8f92681cc6de\nmemory.pressure_level\t\t    podefa14886-4a76-4297-88da-07a6fb572ab5\nmemory.soft_limit_in_bytes\t    podf304b287-417c-4812-82f6-e88d7bd010c0\nmemory.stat\t\t\t    
podf50a670b-058a-4ee9-945e-7a57cec8549b\nmemory.swappiness\t\t    tasks\n```\n\n<br>\n\n(3) 进入改路径，就可以看见container的\n\n```\n:/sys/fs/cgroup/memory/kubepods/burstable/podc7902b75-88ed-473c-806d-9c419bcff548# ls\n//这个就是container的。 docker ps的container\tid就是钱12位\n5cf735569b3adcf74582e1a6082adf3ebebf0250bd68170b7ce980a952760b73    \nmemory.max_usage_in_bytes\ncgroup.clone_children\t\t\t\t\t\t  memory.memsw.failcnt\ncgroup.event_control\t\t\t\t\t\t  memory.memsw.limit_in_bytes\ncgroup.procs\t\t\t\t\t\t\t  memory.memsw.max_usage_in_bytes\nfc1c7dfcfa73e6ed23fbfb4011f20d24646ea2b3f1ce0fbbaa23802a5cdf7f79  memory.memsw.usage_in_bytes\nmemory.failcnt\t\t\t\t\t\t\t  memory.move_charge_at_immigrate\nmemory.force_empty\t\t\t\t\t\t  memory.numa_stat\nmemory.kmem.failcnt\t\t\t\t\t\t  memory.oom_control\nmemory.kmem.limit_in_bytes\t\t\t\t\t  memory.pressure_level\nmemory.kmem.max_usage_in_bytes\t\t\t\t\t  memory.soft_limit_in_bytes\nmemory.kmem.slabinfo\t\t\t\t\t\t  memory.stat\nmemory.kmem.tcp.failcnt\t\t\t\t\t\t  memory.swappiness\nmemory.kmem.tcp.limit_in_bytes\t\t\t\t\t  memory.usage_in_bytes\nmemory.kmem.tcp.max_usage_in_bytes\t\t\t\t  memory.use_hierarchy\nmemory.kmem.tcp.usage_in_bytes\t\t\t\t\t  notify_on_release\nmemory.kmem.usage_in_bytes\t\t\t\t\t  tasks\nmemory.limit_in_bytes\n```\n\n<br>\n\n#### 3.2 方法二\n\n（1） docker ps 找出来 containerid\n\n（2）docker inspect containerid | grep \\\"Pid\\\", 找出来 pidId\n\n```\ndocker inspect fc1c7dfcfa73 | grep \"Pid\"\n            \"Pid\": 1139631,\n            \"PidMode\": \"\",\n            \"PidsLimit\": 0,\n```\n\n（3）cat /proc/pidId/cgroup | grep memory \n\n```\n# cat /proc/1139631/cgroup | grep memory\n11:memory:/kubepods/burstable/podc7902b75-88ed-473c-806d-9c419bcff548/fc1c7dfcfa73e6ed23fbfb4011f20d24646ea2b3f1ce0fbbaa23802a5cdf7f79\n```\n\n(3). 
找出来 memory对应的  cgroup链接\n\n前缀是/sys/fs/cgroup/memory/\n\n上一层就是pod的\n\n```\n# ls /sys/fs/cgroup/memory/kubepods/burstable/podc7902b75-88ed-473c-806d-9c419bcff548/fc1c7dfcfa73e6ed23fbfb4011f20d24646ea2b3f1ce0fbbaa23802a5cdf7f79\ncgroup.clone_children\t\tmemory.kmem.tcp.max_usage_in_bytes  memory.oom_control\ncgroup.event_control\t\tmemory.kmem.tcp.usage_in_bytes\t    memory.pressure_level\ncgroup.procs\t\t\tmemory.kmem.usage_in_bytes\t    memory.soft_limit_in_bytes\nmemory.failcnt\t\t\tmemory.limit_in_bytes\t\t    memory.stat\nmemory.force_empty\t\tmemory.max_usage_in_bytes\t    memory.swappiness\nmemory.kmem.failcnt\t\tmemory.memsw.failcnt\t\t    memory.usage_in_bytes\nmemory.kmem.limit_in_bytes\tmemory.memsw.limit_in_bytes\t    memory.use_hierarchy\nmemory.kmem.max_usage_in_bytes\tmemory.memsw.max_usage_in_bytes     notify_on_release\nmemory.kmem.slabinfo\t\tmemory.memsw.usage_in_bytes\t    tasks\nmemory.kmem.tcp.failcnt\t\tmemory.move_charge_at_immigrate\nmemory.kmem.tcp.limit_in_bytes\tmemory.numa_stat\n```\n\n<br>\n\n#### 3.3 qos介绍\n\nQoS（Quality of Service），大部分译为“服务质量等级”，又译作“服务质量保证”，是作用在 Pod 上的一个配置，当 Kubernetes 创建一个 Pod 时，它就会给这个 Pod 分配一个 QoS 等级，可以是以下等级之一：\n\n- **Guaranteed**：Pod 里的每个容器都必须有内存/CPU 限制和请求，而且值必须相等。\n- **Burstable**：Pod 里至少有一个容器有内存或者 CPU 请求且不满足 Guarantee 等级的要求，即内存/CPU 的值设置的不同。\n- **BestEffort**：容器必须没有任何内存或者 CPU 的限制或请求。\n\n该配置不是通过一个配置项来配置的，而是通过配置 CPU/内存的 `limits` 与 `requests` 值的大小来确认服务质量等级的。使用 `kubectl get pod -o yaml` 可以看到 pod 的配置输出中有 `qosClass` 一项。该配置的作用是为了给资源调度提供策略支持，调度算法根据不同的服务质量等级可以确定将 pod 调度到哪些节点上。"
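上面 3.2 方法二的后半段是手工拼接 cgroup 路径：把 /proc/<pid>/cgroup 里 memory 那一行的路径，加上前缀 /sys/fs/cgroup/memory，就是容器的 memory cgroup 目录。下面用一个小的 Python 函数把这一步写出来（函数名为本文演示自拟，假设是 cgroup v1 布局）：

```python
def memory_cgroup_dir(proc_cgroup_text: str) -> str:
    """从 /proc/<pid>/cgroup 的内容中找到 memory 子系统对应的行，
    拼出 /sys/fs/cgroup/memory 下的完整目录（cgroup v1）。"""
    for line in proc_cgroup_text.splitlines():
        # 每行格式: <层级id>:<子系统列表>:<路径>
        _, subsystems, path = line.split(":", 2)
        if "memory" in subsystems.split(","):
            return "/sys/fs/cgroup/memory" + path
    raise ValueError("no memory cgroup line found")

# 正文方法二第（3）步里 cat 出来的那一行
sample = "11:memory:/kubepods/burstable/podc7902b75-88ed-473c-806d-9c419bcff548/fc1c7dfcfa73e6ed23fbfb4011f20d24646ea2b3f1ce0fbbaa23802a5cdf7f79"
print(memory_cgroup_dir(sample))
```

去掉结果末尾的容器 id 一级目录，上一层就是 pod 级别的 cgroup 目录。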
  },
  {
    "path": "etcd/0. etcd常用操作.md",
"content": "\r\n\r\n### 1. 实用脚本\r\n\r\n```\r\n### cat zoux_etcdctl.sh\r\n#! /bin/bash\r\n# Already tested with etcd V3.0.4 and V3.1.7\r\n\r\nENDPOINTS=\"http://7.33.96.71:24001,http://7.33.96.72:24001,http://7.33.96.73:24001\"\r\nETCDCTL_ABSPATH=\"/usr/local/bin/etcdctl-v3.4.3\"\r\nCERT_ARGS=\"\"\r\n\r\nexport ETCDCTL_API=3\r\n\r\n# Deal with ls command\r\nif [ \"$1\" == \"ls\" ]\r\nthen\r\n    keys=\"$2\"\r\n    if [ -z \"$keys\" ]\r\n    then\r\n        keys=\"/\"\r\n    fi\r\n    if [ \"${keys: -1}\" != \"/\" ]\r\n    then\r\n        keys=\"$keys/\"\r\n    fi\r\n    num=`echo \"$keys\" | grep -o \"/\" | wc -l`\r\n    (( num=$num+1 ))\r\n    $ETCDCTL_ABSPATH --endpoints=\"$ENDPOINTS\" get \"$keys\" --prefix=true --keys-only=true $CERT_ARGS | cut -d '/' -f 1-$num | grep -v \"^$\" | grep -v \"compact_rev_key\" | uniq | sort\r\n    exit 0\r\nfi\r\n# Deal with get command\r\nif [ \"$1\" == \"get\" ]\r\nthen\r\n    $ETCDCTL_ABSPATH --endpoints=\"$ENDPOINTS\" \"$@\" $CERT_ARGS\r\n#--print-value-only=true\r\n    exit 0\r\nfi\r\n# Deal with other command\r\n$ETCDCTL_ABSPATH --endpoints=\"$ENDPOINTS\" \"$@\" $CERT_ARGS\r\nexit 0\r\n\r\n\r\neg.\r\nbash zoux_etcdctl.sh --debug=true ls\r\nbash zoux_etcdctl.sh endpoint status -w table\r\nbash zoux_etcdctl.sh --command-timeout=15s ls\r\nbash zoux_etcdctl.sh endpoint status\r\n```\r\n\r\n### 2. 
etcd的基本操作\r\n\r\n#### 2.1 查看所有的key\r\n\r\n```\r\n[root@k8s-master ssl]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=ca.pem --cert=server.pem  --key=server-key.pem  --endpoints=\"https://192.168.0.4:2379,https://192.168.0.5:2379\" get / \r\n\r\n\r\n/registry/services/endpoints/kube-system/kube-controller-manager\r\n\r\n/registry/services/endpoints/kube-system/kube-scheduler\r\n\r\n/registry/services/specs/default/kubernetes\r\n```\r\n\r\n#### 2.2 查看某个pod的内容\r\n\r\n```\r\n\r\n[root@k8s-master ssl]# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=ca.pem --cert=server.pem  --key=server-key.pem  --endpoints=\"https://192.168.0.4:2379,https://192.168.0.5:2379\" get /registry/pods/default/my-nginx-756f645cd7-4ws6k  -w=json | jq .\r\n\r\n{  \"header\": {    \"cluster_id\": 12138850119299830000,\r\n    \"member_id\": 6539934570868143000,\r\n    \"revision\": 7643164,\r\n    \"raft_term\": 1892\r\n  },\r\n  \"kvs\": [\r\n    {\r\n      \"key\": \"L3JlZ2lzdHJ5L3BvZHMvZGVmYXVsdC9teS1uZ2lueC03NTZmNjQ1Y2Q3LTR3czZr\",\r\n      \"create_revision\": 7642432,\r\n      \"mod_revision\": 7642554,\r\n      \"version\": 4,\r\n      \"value\": 
\"azhzAAoJCgJ2MRIDUG9kEtUICvoBChlteS1uZ2lueC03NTZmNjQ1Y2Q3LTR3czZrEhRteS1uZ2lueC03NTZmNjQ1Y2Q3LRoHZGVmYXVsdCIAKiRjM2U2ZDE3ZS03ZjJlLTExZWItOTY4OC1mYTI3MDAwNGIwMGQyADgAQggI99GSggYQAFofChFwb2QtdGVtcGxhdGUtaGFzaBIKNzU2ZjY0NWNkN1oPCgNydW4SCG15LW5naW54alQKClJlcGxpY2FTZXQaE215LW5naW54LTc1NmY2NDVjZDciJGMzZTE4NzlkLTdmMmUtMTFlYi05Njg4LWZhMjcwMDA0YjAwZCoHYXBwcy92MTABOAF6ABKlAwoxChNkZWZhdWx0LXRva2VuLTY5Yzk1EhoyGAoTZGVmYXVsdC10b2tlbi02OWM5NRikAxKcAQoIbXktbmdpbngSBW5naW54KgAyDQoAEAAYUCIDVENQKgBCAEpIChNkZWZhdWx0LXRva2VuLTY5Yzk1EAEaLS92YXIvcnVuL3NlY3JldHMva3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudCIAahQvZGV2L3Rlcm1pbmF0aW9uLWxvZ3IGQWx3YXlzgAEAiAEAkAEAogEERmlsZRoGQWx3YXlzIB4yDENsdXN0ZXJGaXJzdEIHZGVmYXVsdEoHZGVmYXVsdFILMTkyLjE2OC4wLjVYAGAAaAByAIIBAIoBAJoBEWRlZmF1bHQtc2NoZWR1bGVysgE2Chxub2RlLmt1YmVybmV0ZXMuaW8vbm90LXJlYWR5EgZFeGlzdHMaACIJTm9FeGVjdXRlKKwCsgE4Ch5ub2RlLmt1YmVybmV0ZXMuaW8vdW5yZWFjaGFibGUSBkV4aXN0cxoAIglOb0V4ZWN1dGUorALCAQDIAQAarQMKB1J1bm5pbmcSIwoLSW5pdGlhbGl6ZWQSBFRydWUaACIICPfRkoIGEAAqADIAEh0KBVJlYWR5EgRUcnVlGgAiCAjL0pKCBhAAKgAyABInCg9Db250YWluZXJzUmVhZHkSBFRydWUaACIICMvSkoIGEAAqADIAEiQKDFBvZFNjaGVkdWxlZBIEVHJ1ZRoAIggI99GSggYQACoAMgAaACIAKgsxOTIuMTY4LjAuNTILMTcyLjE3LjgzLjI6CAj30ZKCBhAAQtgBCghteS1uZ2lueBIMEgoKCAjK0pKCBhAAGgAgASgAMgxuZ2lueDpsYXRlc3Q6X2RvY2tlci1wdWxsYWJsZTovL25naW54QHNoYTI1NjpmMzY5M2ZlNTBkNWIxZGYxZWNkMzE1ZDU0ODEzYTc3YWZkNTZiMDI0NWE0MDQwNTVhOTQ2NTc0ZGViNmIzNGZjQklkb2NrZXI6Ly9iNDc4NTBmYWY2NGM1YjFiZWRjNjg0M2EzNzZlZTA1YTVlOGFmZmU4Y2VlZGNlMzNhOWJjNzQxY2EzNDVlOGRjSgpCZXN0RWZmb3J0WgAaACIA\"\r\n    }\r\n  ],\r\n  \"count\": 1\r\n}\r\n[root@k8s-master ssl]# \r\n[root@k8s-master ssl]# \r\n\r\nkey 是 base64加密的\r\n[root@k8s-master ssl]# \r\n[root@k8s-master ssl]# echo L3JlZ2lzdHJ5L3BvZHMvZGVmYXVsdC9teS1uZ2lueC03NTZmNjQ1Y2Q3LTR3czZr|base64 -d\r\n/registry/pods/default/my-nginx-756f645cd7-4ws6k[root@k8s-master ssl]# ^C\r\n\r\n但是为什么值 还有乱码\r\n[root@k8s-master ssl]# echo 
azhzAAoJCgJ2MRIDUG9kEtUICvoBChlteS1uZ2lueC03NTZmNjQ1Y2Q3LTR3czZrEhRteS1uZ2lueC03NTZmNjQ1Y2Q3LRoHZGVmYXVsdCIAKiRjM2U2ZDE3ZS03ZjJlLTExZWItOTY4OC1mYTI3MDAwNGIwMGQyADgAQggI99GSggYQAFofChFwb2QtdGVtcGxhdGUtaGFzaBIKNzU2ZjY0NWNkN1oPCgNydW4SCG15LW5naW54alQKClJlcGxpY2FTZXQaE215LW5naW54LTc1NmY2NDVjZDciJGMzZTE4NzlkLTdmMmUtMTFlYi05Njg4LWZhMjcwMDA0YjAwZCoHYXBwcy92MTABOAF6ABKlAwoxChNkZWZhdWx0LXRva2VuLTY5Yzk1EhoyGAoTZGVmYXVsdC10b2tlbi02OWM5NRikAxKcAQoIbXktbmdpbngSBW5naW54KgAyDQoAEAAYUCIDVENQKgBCAEpIChNkZWZhdWx0LXRva2VuLTY5Yzk1EAEaLS92YXIvcnVuL3NlY3JldHMva3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudCIAahQvZGV2L3Rlcm1pbmF0aW9uLWxvZ3IGQWx3YXlzgAEAiAEAkAEAogEERmlsZRoGQWx3YXlzIB4yDENsdXN0ZXJGaXJzdEIHZGVmYXVsdEoHZGVmYXVsdFILMTkyLjE2OC4wLjVYAGAAaAByAIIBAIoBAJoBEWRlZmF1bHQtc2NoZWR1bGVysgE2Chxub2RlLmt1YmVybmV0ZXMuaW8vbm90LXJlYWR5EgZFeGlzdHMaACIJTm9FeGVjdXRlKKwCsgE4Ch5ub2RlLmt1YmVybmV0ZXMuaW8vdW5yZWFjaGFibGUSBkV4aXN0cxoAIglOb0V4ZWN1dGUorALCAQDIAQAarQMKB1J1bm5pbmcSIwoLSW5pdGlhbGl6ZWQSBFRydWUaACIICPfRkoIGEAAqADIAEh0KBVJlYWR5EgRUcnVlGgAiCAjL0pKCBhAAKgAyABInCg9Db250YWluZXJzUmVhZHkSBFRydWUaACIICMvSkoIGEAAqADIAEiQKDFBvZFNjaGVkdWxlZBIEVHJ1ZRoAIggI99GSggYQACoAMgAaACIAKgsxOTIuMTY4LjAuNTILMTcyLjE3LjgzLjI6CAj30ZKCBhAAQtgBCghteS1uZ2lueBIMEgoKCAjK0pKCBhAAGgAgASgAMgxuZ2lueDpsYXRlc3Q6X2RvY2tlci1wdWxsYWJsZTovL25naW54QHNoYTI1NjpmMzY5M2ZlNTBkNWIxZGYxZWNkMzE1ZDU0ODEzYTc3YWZkNTZiMDI0NWE0MDQwNTVhOTQ2NTc0ZGViNmIzNGZjQklkb2NrZXI6Ly9iNDc4NTBmYWY2NGM1YjFiZWRjNjg0M2EzNzZlZTA1YTVlOGFmZmU4Y2VlZGNlMzNhOWJjNzQxY2EzNDVlOGRjSgpCZXN0RWZmb3J0WgAaACIA|base64 -d\r\nk8s\r\n\r\nv1Pod�\r\n�\r\nmy-nginx-756f645cd7-4ws6kmy-nginx-756f645cd7-default\"*$c3e6d17e-7f2e-11eb-9688-fa270004b00d2�ђ�Z\r\npod-template-hash\r\n756f645cd7Z\r\nrumy-nginxjT\r\n\r\nReplicaSetmy-nginx-756f645cd7\"$c3e1879d-7f2e-11eb-9688-fa270004b00d*apps/v108z�\r\n1\r\ndefault-token-69c952\r\ndefault-token-69c95��\r\nmy-nginxnginx*2\r\nP\"TCP*BJH\r\ndefault-token-69c95-/var/run/secrets/kubernetes.io/serviceaccount\"j/dev/termination-logrAlways����FileAlways 2\r\n         
                                                                                                     ClusterFirstBdefaultJdefaultR\r\n          192.168.0.5X`hr���default-scheduler�6\r\nnode.kubernetes.io/not-readyExists\"     NoExecute(��8\r\nnode.kubernetes.io/unreachableExists\"   NoExecute(����\r\nRunning#\r\n\r\nInitializedTru�ђ�*2\r\nReadyTru�Ғ�*2'\r\nContainersReadyTru�Ғ�*2$\r\n```\r\n\r\nhttps://github.com/openshift/origin/tree/master/tools/etcdhelper\r\n\r\nvalue 是乱码的原因找到了：kube-apiserver 默认把对象序列化成 protobuf 之后再存入 etcd，所以 base64 解码之后仍是二进制。etcdhelper 这个工具可以反序列化并以可读形式显示，见上面的链接。\r\n\r\n**参考的操作链接:**\r\n\r\nhttps://jimmysong.io/kubernetes-handbook/guide/using-etcdctl-to-access-kubernetes-data.html\r\n\r\nhttps://yq.aliyun.com/articles/561888\r\n\r\n<br>"
  },
  {
    "path": "etcd/协议理论知识/1. cap原理.md",
    "content": "### 1. CAP\n\nCAP 理论对分布式系统的特性做了高度抽象，形成了三个指标：\n\n* 一致性\n* 可用性\n* 分区容错性\n\n一致性（Consistency）：客户端每次读写操作，不管访问的是哪个节点，读到的都是一样的数据。\n\n可用性（Availability）：不管客户端访问的是哪个节点，都会给你返回数据。\n\n分区容错性（Partition Tolerance）：当节点间出现任意数量的消息丢失或高延迟的时候，系统仍然可以继续提供服务。也就是说，分布式系统在告诉访问本系统的客户端：不管我的内部出现什么样的数据同步问题，我会一直运行，提供服务。\n\n**注意点**\n\n（1）一致性并不代表完整性。\n\n（2）CAP 三者不可兼得的前提是发生了网络分区。当网络正常、没有分区时，C、A、P 可以同时满足；一旦发生分区，就要根据业务场景在 A 和 C 之间做取舍。\n\n### 2.总结\n\n(1) CA 模型，在分布式系统中不存在。因为舍弃 P，意味着舍弃分布式系统，就比如单机版关系型数据库 MySQL，如果 MySQL 要考虑主备或集群部署时，它必须考虑 P。\n\n(2) CP 模型，采用 CP 模型的分布式系统，一旦因为消息丢失、延迟过高发生了网络分区，就会影响用户的体验和业务的可用性。因为为了防止数据不一致，集群将拒绝新数据的写入，典型的应用是 ZooKeeper、Etcd 和 HBase。\n\n(3) AP 模型，采用 AP 模型的分布式系统，实现了服务的高可用。用户访问系统的时候，都能得到响应数据，不会出现响应错误，但当出现分区故障时，相同的读操作，访问不同的节点，得到响应数据可能不一样。典型应用就比如 Cassandra 和 DynamoDB。\n\n"
  },
  {
    "path": "etcd/协议理论知识/2. ACID理论.md",
    "content": "### 1.ACID是什么\n\n事务是由一组SQL语句组成的逻辑处理单元，事务具有以下4个属性，通常简称为事务的ACID属性。\n\nACID\nAtomic（原子性）\n\nConsistency（一致性）\n\nIsolation（隔离性）\n\nDurability（持久性）的英文缩写。\n\n### 2. 分布式系统如何实现ACID\n#### 2.1 二阶段提交协议\n\n两阶段提交协议（2PC：Two-Phase Commit）\n两阶段提交协议的目标在于为分布式系统保证数据的一致性，许多分布式系统采用该协议提供对分布式事务的支持。顾名思义，该协议将一个分布式的事务过程拆分成两个阶段： 投票 和 事务提交 。为了让整个数据库集群能够正常的运行，该协议指定了一个 协调者 单点，用于协调整个数据库集群各节点的运行。为了简化描述，我们将数据库集群中的各个节点称为 参与者 ，三阶段提交协议中同样包含协调者和参与者这两个角色定义。\n\n##### 2.1.1 原理\n\n**第一阶段：投票**\n该阶段的主要目的在于打探数据库集群中的各个参与者是否能够正常的执行事务，具体步骤如下：\n\n协调者向所有的参与者发送事务执行请求，并等待参与者反馈事务执行结果；\n事务参与者收到请求之后，执行事务但不提交，并记录事务日志；\n参与者将自己事务执行情况反馈给协调者，同时阻塞等待协调者的后续指令。\n**第二阶段：事务提交**\n在经过第一阶段协调者的询盘之后，各个参与者会回复自己事务的执行情况，这时候存在 3 种可能性：\n\n（1）所有的参与者都回复能够正常执行事务。\n\n（2）一个或多个参与者回复事务执行失败。\n\n（3）协调者等待超时。\n\n<br>\n\n对于第 1 种情况，协调者将向所有的参与者发出提交事务的通知，具体步骤如下：\n\n协调者向各个参与者发送 commit 通知，请求提交事务；\n参与者收到事务提交通知之后执行 commit 操作，然后释放占有的资源；\n参与者向协调者返回事务 commit 结果信息。\n\n![image-20220406164951426](../images/acid-1.png)\n\n对于第 2 和第 3 种情况，协调者均认为参与者无法成功执行事务，为了整个集群数据的一致性，所以要向各个参与者发送事务回滚通知，具体步骤如下：\n\n协调者向各个参与者发送事务 rollback 通知，请求回滚事务；\n参与者收到事务回滚通知之后执行 rollback 操作，然后释放占有的资源；\n参与者向协调者返回事务 rollback 结果信息。\n\n![image-20220406165445685](../images/acid-2.png)\n\n__两阶段提交协议解决的是分布式数据库数据强一致性问题__，实际应用中更多的是用来解决事务操作的原子性，下图描绘了协调者与参与者的状态转换。\n\n![image-20220406165528208](../images/acid-3.png)\n\n站在协调者的角度，在发起投票之后就进入了 WAIT 状态，等待所有参与者回复各自事务执行状态，并在收到所有参与者的回复后决策下一步是发送 commit 或 rollback 信息。站在参与者的角度，当回复完协调者的投票请求之后便进入 READY 状态（能够正常执行事务），接下去就是等待协调者最终的决策通知，一旦收到通知便可依据决策执行 commit 或 rollback 操作。\n\n##### 2.1.2 优缺点\n\n两阶段提交协议原理简单、易于实现，但是缺点也是显而易见的，包含如下：\n\n（1）单点问题\n协调者在整个两阶段提交过程中扮演着举足轻重的作用，一旦协调者所在服务器宕机，就会影响整个数据库集群的正常运行。比如在第二阶段中，如果协调者因为故障不能正常发送事务提交或回滚通知，那么参与者们将一直处于阻塞状态，整个数据库集群将无法提供服务。\n\n（2）同步阻塞\n两阶段提交执行过程中，所有的参与者都需要听从协调者的统一调度，期间处于阻塞状态而不能从事其他操作，这样效率极其低下。\n\n（3）数据不一致性\n两阶段提交协议虽然是分布式数据强一致性所设计，但仍然存在数据不一致性的可能性。比如在第二阶段中，假设协调者发出了事务 commit 通知，但是因为网络问题该通知仅被一部分参与者所收到并执行了commit 操作，其余的参与者则因为没有收到通知一直处于阻塞状态，这时候就产生了数据的不一致性。\n\n针对上述问题可以引入 超时机制 和 互询机制 在很大程度上予以解决。\n\n对于协调者来说如果在指定时间内没有收到所有参与者的应答，则可以自动退出 
WAIT 状态，并向所有参与者发送 rollback 通知。对于参与者来说如果位于 READY 状态，但是在指定时间内没有收到协调者的第二阶段通知，则不能武断地执行 rollback 操作，因为协调者可能发送的是 commit 通知，这个时候执行 rollback 就会导致数据不一致。\n\n此时，我们可以引入互询机制，让参与者 A 去询问其他参与者 B 的执行情况。如果 B 执行了 rollback 或 commit 操作，则 A 可以大胆地与 B 执行相同的操作；如果 B 此时还没有到达 READY 状态，则可以推断出协调者发出的肯定是 rollback 通知；如果 B 同样位于 READY 状态，则 A 可以继续询问另外的参与者。只有当所有的参与者都位于 READY 状态时，两阶段提交协议才无法处理，将陷入长时间的阻塞状态。\n\n**三阶段提交协议多了一个预询盘阶段**\n\n\n#### 2.2 TCC（Try-Confirm-Cancel）\n\nTCC 其实采用的是补偿机制，其核心思想是：针对每个操作，都要注册一个与其对应的确认和补偿（撤销）操作。它分为三个阶段：\n\n（1）Try 阶段主要是对业务系统做检测及资源预留。\n\n（2）Confirm 阶段主要是对业务系统做确认提交。Try 阶段执行成功并开始执行 Confirm 阶段时，默认 Confirm 阶段是不会出错的。即：只要 Try 成功，Confirm 一定成功。\n\n（3）Cancel 阶段主要是在业务执行错误、需要回滚的状态下执行业务取消，释放预留资源。\n\n举个例子，假如 Bob 要向 Smith 转账，思路大概是：我们有一个本地方法，里面依次调用\n\n1、首先在 Try 阶段，要先调用远程接口把 Smith 和 Bob 的钱给冻结起来。\n\n2、在 Confirm 阶段，执行远程调用的转账操作，转账成功进行解冻。\n\n3、如果第 2 步执行成功，那么转账成功；如果第 2 步执行失败，则调用远程冻结接口对应的解冻方法（Cancel）。\n\n#### 2.3 二阶段提交和TCC区别\n\n经常在网络上看见有人介绍 TCC 时，都提一句，“TCC 是两阶段提交的一种”。其理由是 TCC 将业务逻辑分成 try、confirm/cancel 在两个不同的阶段中执行。其实这个说法是不正确的。\n\n可能是因为既不太了解两阶段提交机制、也不太了解 TCC 机制的缘故，于是将两阶段提交机制的 prepare、commit 两个事务提交阶段和 TCC 机制的 try、confirm/cancel 两个业务执行阶段互相混淆，才有了这种说法。两阶段提交（Two Phase Commit，下文简称 2PC），简单地说，是将事务的提交操作分成了 prepare、commit 两个阶段。\n\n其事务处理方式为：\n\n1、在全局事务决定提交时，\n\n​     a）逐个向 RM 发送 prepare 请求；\n\n​     b）若所有 RM 都返回 OK，则逐个发送 commit 请求最终提交事务；否则，逐个发送 rollback 请求来回滚事务；\n\n2、在全局事务决定回滚时，直接逐个发送 rollback 请求即可，不必分阶段。\n\n需要注意的是：2PC 机制需要 RM 提供底层支持（一般是兼容 XA），而 TCC 机制则不需要。\n\nTCC（Try-Confirm-Cancel），则是将业务逻辑分成 try、confirm/cancel 两个阶段执行，其事务处理方式为：\n\n1、在全局事务决定提交时，调用与 try 业务逻辑相对应的 confirm 业务逻辑；\n\n2、在全局事务决定回滚时，调用与 try 业务逻辑相对应的 cancel 业务逻辑。\n\n可见，TCC 在事务处理方式上是很简单的：要么调用 confirm 业务逻辑，要么调用 cancel 逻辑。这里为什么没有提到 try 业务逻辑呢？因为 try 逻辑与全局事务处理无关。\n\n当讨论 2PC 时，我们只专注于事务处理阶段，因而只讨论 prepare 和 commit。所以，可能很多人都忘了，使用 2PC 事务管理机制时也是有业务逻辑阶段的。正是因为业务逻辑的执行，发起了全局事务，这才有其后的事务处理阶段。\n\n实际上，使用 2PC 机制时\n\n————以提交为例————\n\n一个完整的事务生命周期是：begin -> 业务逻辑 -> prepare -> 
commit。\n\n再看 TCC，也不外乎如此。我们要发起全局事务，同样也必须通过执行一段业务逻辑来实现。该业务逻辑\n\n一来通过执行触发 TCC 全局事务的创建；二来也需要执行部分数据写操作；\n\n此外，还要通过执行来向 TCC 全局事务注册自己，以便后续 TCC 全局事务 commit/rollback 时回调其相应的 confirm/cancel 业务逻辑。\n\n所以，使用 TCC 机制时\n\n————以提交为例————\n\n一个完整的事务生命周期是：begin -> 业务逻辑(try业务) -> commit(confirm业务)。\n\n综上，我们可以从执行的阶段上将二者一一对应起来：\n\n1、 2PC机制的业务阶段 等价于 TCC机制的try业务阶段；\n\n2、 2PC机制的提交阶段（prepare & commit） 等价于 TCC机制的提交阶段（confirm）；\n\n3、 2PC机制的回滚阶段（rollback） 等价于 TCC机制的回滚阶段（cancel）。\n\n因此，可以看出，虽然TCC机制中有两个阶段都存在业务逻辑的执行，但其中try业务阶段其实是与全局事务处理无关的。认清了这一点，当我们再比较TCC和2PC时，就会很容易地发现，TCC不是两阶段提交，只是它对事务的提交/回滚是通过执行一段confirm/cancel业务逻辑来实现的，仅此而已。"
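上文两阶段提交第二阶段的三种情况（全部同意、有人失败、等待超时），可以用一小段 Go 代码示意协调者的决策逻辑（参与者的应答被抽象成枚举值，`Vote`、`decide` 等类型与函数名均为示意假设，并非真实实现）：

```go
package main

import "fmt"

// Vote 表示参与者在投票（prepare）阶段的应答。
type Vote int

const (
	VoteYes     Vote = iota // 参与者可以正常执行事务
	VoteNo                  // 参与者回复事务执行失败
	VoteTimeout             // 协调者等待该参与者应答超时
)

// decide 对应上文第二阶段的三种情况：
// 只有所有参与者都回复 Yes 时才发送 commit，出现失败或超时都发送 rollback。
func decide(votes []Vote) string {
	for _, v := range votes {
		if v != VoteYes {
			return "rollback"
		}
	}
	return "commit"
}

func main() {
	fmt.Println(decide([]Vote{VoteYes, VoteYes, VoteYes})) // commit
	fmt.Println(decide([]Vote{VoteYes, VoteNo, VoteYes}))  // rollback
	fmt.Println(decide([]Vote{VoteYes, VoteTimeout}))      // rollback
}
```

真实的 2PC 还要考虑协调者宕机、参与者阻塞等问题（即上文提到的单点、同步阻塞缺陷），这段代码只覆盖正常路径的决策。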
  },
  {
    "path": "etcd/协议理论知识/3. base理论.md",
    "content": "BASE 理论是 CAP 理论中 AP 方案的延伸，是对互联网大规模分布式系统的实践总结，强调可用性。几乎所有的互联网后台分布式系统都有 BASE 的支持，这个理论很重要，地位也很高。一旦掌握它，你就能掌握绝大部分场景的分布式系统的架构技巧，设计出适合业务场景特点的、高可用性的分布式系统。\n\n它的核心就是基本可用（Basically Available）和最终一致性（Eventually consistent）。也有人会提到软状态（Soft state）：软状态描述的是实现服务可用性时系统数据的一种过渡状态，也就是说不同节点间，数据副本存在短暂的不一致。你只需要知道软状态是一种过渡状态就可以了，这里不展开。\n\n如何做到基本可用？\n\n当发生系统故障时：\n\n掌握流量削峰（不同项目、业务分时间段访问）、延迟响应、体验降级、过载保护（拒绝请求）这 4 板斧；\n\n还可以考虑：重试、幂等、异步、负载均衡、故障隔离、流量切换、自动扩缩容、兜底（熔断限流降级）、容量规划。\n\nACID 是数据库系统的经典之作；BASE 则是在实践中受挫后的思想松绑，提出了一种重要的设计指导，给人以信心。"
  },
  {
    "path": "etcd/协议理论知识/4. raft协议.md",
    "content": "[toc]\n\n### 1. raft算法是如何初始化的\n\n初始状态下，集群中所有的节点都是跟随者状态。\n\nRaft 算法实现了随机超时时间的特性。也就是说，每个节点等待领导者节点心跳信息的超时时间间隔是随机的。通过下面的图片你可以看到，集群中没有领导者，而节点 A 的等待超时时间最小（150ms），它会最先因为没有等到领导者的心跳信息，发生超时。\n\n所以A节点最先没有收到领导者的心跳。所以这个时候，节点 A 就增加自己的任期编号，并推举自己为候选人，先给自己投上一张选票，然后向其他节点发送请求投票 RPC 消息，请它们选举自己为领导者。\n\n如果其他节点接收到候选人 A 的请求投票 RPC 消息，在编号为 1 的这届任期内，也还没有进行过投票，那么它将把选票投给节点 A，并增加自己的任期编号。\n\n如果候选人在选举超时时间内赢得了大多数的选票，那么它就会成为本届任期内新的领导者。\n\n节点 A 当选领导者后，他将周期性地发送心跳消息，通知其他服务器我是领导者，阻止跟随者发起新的选举，篡权。\n\n![image-20220406170608476](../images/raft-1.png)\n\n#### 1.1 节点之间是如何通信的\n\n在 Raft 算法中，服务器节点间的沟通联络采用的是远程过程调用（RPC），在领导者选举中，需要用到这样两类的 RPC：\n\n1. 请求投票（RequestVote）RPC，是由候选人在选举期间发起，通知各节点进行投票；\n\n2. 日志复制（AppendEntries）RPC，是由领导者发起，用来复制日志和提供心跳消息。\n\n我想强调的是，日志复制 RPC 只能由领导者发起，这是实现强领导者模型的关键之一，希望你能注意这一点，后续能更好地理解日志复制，理解日志的一致是怎么实现的。\n\n#### 1.2 任期编号有什么用\n\n任期编号是递增的，随着选举的举行而不断变化的。具体有：\n\n（1）跟随者在等待领导者心跳信息超时后，推举自己为候选人时，会增加自己的任期号，比如节点 A 的当前任期编号为 0，那么在推举自己为候选人时，会将自己的任期编号增加为 1。\n\n（2）如果一个服务器节点，发现自己的任期编号比其他节点小，那么它会更新自己的编号到较大的编号值。比如节点 B 的任期编号是 0，当收到来自节点 A 的请求投票 RPC 消息时，因为消息中包含了节点 A 的任期编号，且编号为 1，那么节点 B 将把自己的任期编号更新为 1。\n\n\n#### 1.3 选举规则\n（1）领导者周期性地向所有跟随者发送心跳消息（即不包含日志项的日志复制 RPC 消息），通知大家我是领导者，阻止跟随者发起新的选举。\n\n（2）如果在指定时间内，跟随者没有接收到来自领导者的消息，那么它就认为当前没有领导者，推举自己为候选人，发起领导者选举。\n\n（3）在一次选举中，赢得大多数选票的候选人，将晋升为领导者。\n\n（4）在一个任期内，领导者一直都会是领导者，直到它自身出现问题（比如宕机），或者因为网络延迟，其他节点发起一轮新的选举。\n\n（5）在一次选举中，每一个服务器节点最多会对一个任期编号投出一张选票，并且按照“先来先服务”的原则进行投票。比如节点 C 的任期编号为 3，先收到了 1 个包含任期编号为 4 的投票请求（来自节点 A），然后又收到了 1 个包含任期编号为 4 的投票请求（来自节点 B）。那么节点 C 将会把唯一一张选票投给节点 A，当再收到节点 B 的投票请求 RPC 消息时，对于编号为 4 的任期，已没有选票可投了。\n\n（6）当任期编号相同时，日志完整性高的跟随者（也就是最后一条日志项对应的任期编号值更大，索引号更大），拒绝投票给日志完整性低的候选人。比如节点 B、C 的任期编号都是 3，节点 B 的最后一条日志项对应的任期编号为 3，而节点 C 为 2，那么当节点 C 请求节点 B 投票给自己时，节点 B 将拒绝投票。\n\n选举是跟随者发起的，推举自己为候选人；大多数选票是指集群成员半数以上的选票；大多数选票规则的目标，是为了保证在一个给定的任期内最多只有一个领导者。\n\n\n#### 1.4 如何理解随机超时时间\n\n在议会选举中，常出现未达到指定票数，选举无效，需要重新选举的情况。在 Raft 算法的选举中，也存在类似的问题，那它是如何处理选举无效的问题呢？\n\n其实，Raft 
算法巧妙地使用随机选举超时时间的方法，把超时时间都分散开来，在大多数情况下只有一个服务器节点先发起选举，而不是同时发起选举，这样就能减少因选票瓜分导致选举失败的情况。\n\n随机超时包括两层含义：\n\n（1）跟随者等待领导者心跳信息超时的时间间隔，是随机的；\n\n（2）当没有候选人赢得过半票数，选举无效了，这时需要等待一个随机时间间隔，也就是说，等待选举超时的时间间隔，是随机的。\n\n\n#### 1.5.疑问\n\n（1）选举规则 5、6 看起来有矛盾：一个是先来先服务，一个是根据日志完整性判断。\n\n跟随者是不是有等待时间，等所有候选人发的 RPC 都到了之后，再选择一个投票呢？\n\n目前看起来就是和自己比。如果当前节点收到了一个候选人的投票请求 RPC：如果任期编号（term）比自己小直接拒绝；如果 term 一致，但是日志 index 比自己小也直接拒绝。因为集群成功写入一条数据，必须在大多数节点上都写入成功才算，所以候选人的日志会和挂掉之前的 leader 保持一致。\n\n（2）如果 A 是候选人，B、C 是跟随者，但是 A->C 出现了网络问题，C 发起重新选举，并且 C 和 A 的日志是一样的，会怎么样？\n\n目前看起来是会切换为 C 的。\n\n\n#### 1.6. raft的选举机制的局限\n\n关于raft的领导者选举限制和局限：\n\n1.读写请求和数据转发压力落在领导者节点，导致领导者压力过大。\n\n2.大规模跟随者的集群，领导者需要承担大量元数据维护和心跳通知的成本。\n\n3.领导者单点问题，故障后直到新领导者选举出来期间集群不可用。\n\n4.随着候选人规模增长，收集半数以上投票的成本更大。\n<br>\n\n### 2.raft日志机制\n\n#### 2.1 什么是日志\n\n日志项是一种数据格式，它主要包含用户指定的数据，也就是指令（Command），还包含一些附加信息，比如索引值（Log index）、任期编号（Term）。\n\n![image-20220406170705593](../images/raft-2.png)\n\n（1）指令：一条由客户端请求指定的、状态机需要执行的指令。你可以将指令理解成客户端指定的数据。\n\n（2）索引值：日志项对应的整数索引值。它其实就是用来标识日志项的，是一个连续的、单调递增的整数号码。\n\n（3）任期编号：创建这条日志项的领导者的任期编号。\n\n从图中可以看到，一届领导者任期，往往有多条日志项。而且日志项的索引值是连续的。\n\n上述四个节点的日志不一致的原因在于，由于网络原因或者服务器其他问题，导致某些节点的日志进度没跟上。所以这个时候就要进行日志同步了。\n\n#### 2.2 日志同步\n\n![image-20220406170731474](../images/raft-3.png)\n\n正常的日志同步是这样的：\n\n（1）接收到客户端请求后，领导者基于客户端请求中的指令，创建一个新日志项，并附加到本地日志中。\n\n（2）领导者通过日志复制 RPC，将新的日志项复制到其他的服务器。\n\n（3）当领导者将日志项，成功复制到大多数的服务器上的时候，领导者会将这条日志项提交到它的状态机中。\n\n（4）领导者将执行的结果返回给客户端。\n\n（5）当跟随者接收到心跳信息，或者新的日志复制 RPC 消息后，如果跟随者发现领导者已经提交了某条日志项，而它还没提交，那么跟随者就将这条日志项提交到本地的状态机中。\n\n但是由于网络或者其他原因，某些节点并不是大多数之一，所以日志就一直落后。这个时候就需要复制日志了。\n\n在 Raft 算法中，领导者通过强制跟随者直接复制自己的日志项，处理不一致日志。也就是说，Raft 是通过以领导者的日志为准，来实现各节点日志的一致的。具体有 2 个步骤。\n\n首先，领导者通过日志复制 RPC 的一致性检查，找到跟随者节点上，与自己相同日志项的最大索引值。也就是说，这个索引值之前的日志，领导者和跟随者是一致的，之后的日志是不一致的了。\n\n然后，领导者强制跟随者更新覆盖不一致的日志项，实现日志的一致。\n\n详细过程如下：\n\nPrevLogEntry：表示当前要复制的日志项，前面一条日志项的索引值。比如在图中，如果领导者将索引值为 8 的日志项发送给跟随者，那么此时 PrevLogEntry 值为 7。\n\nPrevLogTerm：表示当前要复制的日志项，前面一条日志项的任期编号，比如在图中，如果领导者将索引值为 8 的日志项发送给跟随者，那么此时 PrevLogTerm 值为 4。\n\n![image-20220406170756696](../images/raft-4.png)\n\n那么复制日志的过程如下：\n\n（1）领导者通过日志复制 RPC 
消息，发送当前最新日志项到跟随者（为了演示方便，假设当前需要复制的日志项是最新的），这个消息的 PrevLogEntry 值为 7，PrevLogTerm 值为 4。\n\n（2）如果跟随者在它的日志中，找不到与 PrevLogEntry 值为 7、PrevLogTerm 值为 4 的日志项，也就是说它的日志和领导者的不一致了，那么跟随者就会拒绝接收新的日志项，并返回失败信息给领导者。\n\n（3）这时，领导者会递减要复制的日志项的索引值，并发送新的日志项到跟随者，这个消息的 PrevLogEntry 值为 6，PrevLogTerm 值为 3。\n\n（4）如果跟随者在它的日志中，找到了 PrevLogEntry 值为 6、PrevLogTerm 值为 3 的日志项，那么日志复制 RPC 返回成功，这样一来，领导者就知道在 PrevLogEntry 值为 6、PrevLogTerm 值为 3 的位置，跟随者的日志项与自己相同。\n\n（5）领导者通过日志复制 RPC，复制并更新覆盖该索引值之后的日志项（也就是不一致的日志项），最终实现了集群各节点日志的一致。\n\n从上面步骤中你可以看到，领导者通过日志复制 RPC 一致性检查，找到跟随者节点上与自己相同日志项的最大索引值，然后复制并更新覆盖该索引值之后的日志项，实现了各节点日志的一致。需要你注意的是，跟随者中的不一致日志项会被领导者的日志覆盖，而且领导者从来不会覆盖或者删除自己的日志。\n<br>\n\n当这个跟随者与 leader 恢复通信后，leader 通过日志复制 RPC 的一致性检查来进行日志同步。但这里有个问题：如果跟随者与 leader 的日志相差太多，会产生很频繁的 RPC 一致性检查。\n\n上面描述的只是思想，代码实现时可以优化，不必逐一递减地寻找。\n\n**etcd** 中就是 leader 节点定期向每个 follower 节点发送 PrevLogEntry+PrevLogTerm 用于判断日志同步。\n\n### 3.raft集群成员变更\n\n（1）成员变更的问题，主要在于进行成员变更时，可能存在新旧配置的 2 个“大多数”，导致集群中同时出现两个领导者，破坏了 Raft 的领导者的唯一性原则，影响了集群的稳定运行。\n\n（2）单节点变更是利用“一次变更一个节点，不会同时存在旧配置和新配置 2 个‘大多数’”的特性，实现成员变更。\n\n（3）因为联合共识实现起来复杂，所以绝大多数 Raft 算法的实现，采用的都是单节点变更的方法（比如 Etcd、Hashicorp Raft）。其中，Hashicorp Raft 单节点变更的实现，是由 Raft 算法的作者迭戈·安加罗（Diego Ongaro）设计的，很有参考价值。"
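上面日志复制 RPC 的一致性检查，可以用一小段 Go 代码示意跟随者侧的判断（`Entry`、`hasPrev` 均为示意假设，并非 etcd 的实现）。示例数据对应上文的情形：PrevLogEntry=7、PrevLogTerm=4 被拒绝，PrevLogEntry=6、PrevLogTerm=3 被接受：

```go
package main

import "fmt"

// Entry 为简化的日志项：任期编号 + 索引值（指令省略）。
type Entry struct {
	Term  int
	Index int
}

// hasPrev 对应跟随者收到日志复制 RPC 时的一致性检查：
// 日志中必须存在索引为 prevIndex、任期为 prevTerm 的日志项，否则拒绝新日志项。
func hasPrev(log []Entry, prevIndex, prevTerm int) bool {
	for _, e := range log {
		if e.Index == prevIndex {
			return e.Term == prevTerm
		}
	}
	// prevIndex 为 0 表示从日志开头复制，总是接受
	return prevIndex == 0
}

func main() {
	// 跟随者的日志：索引 1~6，最后一条的任期为 3（落后于领导者）
	log := []Entry{{1, 1}, {1, 2}, {3, 3}, {3, 4}, {3, 5}, {3, 6}}
	fmt.Println(hasPrev(log, 7, 4)) // false：找不到 <7,4>，拒绝，领导者会递减索引重试
	fmt.Println(hasPrev(log, 6, 3)) // true：找到 <6,3>，接受，并覆盖其后不一致的日志
}
```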
  },
  {
    "path": "k8s/README.md",
    "content": "## 版本说明\n\n如无特别说明，本章节所涉及的k8s源码版本皆为 v1.17.4"
  },
  {
    "path": "k8s/client-go/1- clientGo简介与章节安排.md",
    "content": "Table of Contents\r\n=================\r\n\r\n  * [1. client-go简介](#1-client-go简介)\r\n     * [1.1 client-go章节安排](#11-client-go章节安排)\r\n  * [2. client-go如何使用kubeconfig配置](#2-client-go如何使用kubeconfig配置)\r\n     * [2.1 kube-config介绍](#21-kube-config介绍)\r\n     * [2.1 client-go加载kubeconfig](#21-client-go加载kubeconfig)\r\n     * [2.2 BuildConfigFromFlags](#22-buildconfigfromflags)\r\n  * [3.总结](#3总结)\r\n\r\n### 1. client-go简介\r\n\r\nclient-go就是  Go client for Kubernetes。它提供了与k8s交互的各种方法。\r\n\r\nKubernetes官方从2016年8月份开始，将Kubernetes资源操作相关的核心源码抽取出来，独立出来一个项目Client-go，作为官方提供的Go client。Kubernetes的部分代码也是基于这个client实现的，所以对这个client的质量、性能等方面还是非常有信心的。\r\n\r\nclient-go是一个调用kubernetes集群资源对象API的客户端，即通过client-go实现对kubernetes集群中资源对象（包括deployment、service、ingress、replicaSet、pod、namespace、node等）的增删改查等操作。大部分对kubernetes进行前置API封装的二次开发都通过client-go这个第三方包来实现。\r\n\r\nclient-go的代码库已经集成到Kubernetes源码中了，无须考虑版本兼容性问题，源码结构示例如下。client-go源码目录结构如下所示：\r\n\r\n```\r\n[root@k8s-node client-go]# tree -L 1\r\n.\r\n├── code-of-conduct.md  \r\n├── CONTRIBUTING.md\r\n├── discovery            提供discovery client客户端\r\n├── dynamic              提供dynamic客户端\r\n├── examples             几个常见的example示例\r\n├── Godeps               godeps的简单说明\r\n├── informers            每种资源的informer实现\r\n├── INSTALL.md \r\n├── kubernetes           提供clientset客户端\r\n├── LICENSE \r\n├── listers              每种资源的list实现\r\n├── OWNERS\r\n├── pkg\r\n├── plugin               提供openstack, GCP, Azure等云服务商授权插件\r\n├── rest                 提供restful客户端，执行restful操作\r\n├── restmapper\r\n├── scale                提供scale客户端，用于deploy,rs,rc等的扩缩容。\r\n├── SECURITY_CONTACTS\r\n├── testing\r\n├── third_party\r\n├── tools                提供常用的工具，例如cache,Indexers,DealtFIFO\r\n├── transport            提供安全的TCP连接，支持Http Stream\r\n└── util                 提供常用方法，例如workqueue,证书管理等。\r\n```\r\n\r\n#### 1.1 client-go章节安排\r\n\r\n打算主要从这三个方面入手，研究client-go的源码\r\n\r\n（1）client-go提供四种连接apiserver的客户端\r\n\r\n（2）client-go 
list-watch功能实现\r\n\r\n（3）与之配套提供的cache，DeltaFIFO，queue等辅助功能\r\n\r\n希望加强对这些部分更深入的了解，对k8s整体以及以后控制器的编写更加得心应手。\r\n\r\n接下来的文章安排就是了解上述的功能如何使用，如何实现。\r\n\r\n<br>\r\n\r\n### 2. client-go如何使用kubeconfig配置\r\n\r\n#### 2.1 kube-config介绍\r\n\r\nkubeconfig用于管理访问kube-apiserver的配置信息，同时也支持访问多kube-apiserver的配置管理，可以在不同的环境下管理不同的kube-apiserver集群配置，不同的业务线也可以拥有不同的集群。Kubernetes的其他组件都使用kubeconfig配置信息来连接kube-apiserver组件，例如当kubectl访问kube-apiserver时，会默认加载kubeconfig配置信息。kubeconfig中存储了集群、用户、命名空间和身份验证等信息，在默认的情况下，kubeconfig存放在$HOME/.kube/config路径下。Kubeconfig配置信息如下：\r\n\r\n```\r\ncat /root/.kube/config\r\napiVersion: v1\r\nclusters:\r\n- cluster:\r\n    server: https://39.98.210.73:6443\r\n    certificate-authority-data: \r\n  name: kubernetes\r\ncontexts:\r\n- context:\r\n    cluster: kubernetes\r\n    user: \"kubernetes-admin\"\r\n  name: kubernetes-admin-cd0201255113548b782faa6fbf68c80cd\r\ncurrent-context: kubernetes-admin-cd0201255113548b782faa6fbf68c80cd\r\nkind: Config\r\npreferences: {}\r\nusers:\r\n- name: \"kubernetes-admin\"\r\n  user:\r\n    client-certificate-data: \r\n    client-key-data: \r\n```\r\n\r\nkubeconfig配置信息通常包含3个部分，分别介绍如下。\r\n\r\n● clusters：定义Kubernetes集群信息，例如kube-apiserver的服务地址及集群的证书信息等。\r\n\r\n● users：定义Kubernetes集群用户身份验证的客户端凭据，例如client-certificate、client-key、token及username/password等。\r\n\r\n● contexts：定义Kubernetes集群用户信息和命名空间等，用于将请求发送到指定的集群。\r\n\r\n这里其实就很好理解：就是定义集群、用户和上下文，上下文可以有多个。例如 context1 <集群A，用户A>\r\n\r\ncontext2 <集群B，用户A>\r\n\r\n这样使用 kubectl config 指定 context2 就能马上以用户A的角色连接到集群B。\r\n\r\n#### 2.1 client-go加载kubeconfig\r\n\r\nclient-go会读取kubeconfig配置信息并生成config对象，用于与kube-apiserver通信。这里主要就是通过 tools/clientcmd包实现的。更具体就是通过 clientcmd.BuildConfigFromFlags\r\n\r\n```\r\n像kube-eventwatcher组件也是通过这个 连接集群。\r\nfunc NewPodController(opt *config.Option) (*PodController, error) {\r\n\tcfg, err := clientcmd.BuildConfigFromFlags(\"\", opt.KubeConfig)\r\n\tif err != nil {\r\n\t\tglog.Errorf(\"can not read the cfg: %v\\n\", err)\r\n\t\treturn nil, 
err\r\n\t}\r\n```\r\n\r\n<br>\r\n\r\n#### 2.2 BuildConfigFromFlags\r\n\r\n这个函数的主要作用就是，通过 path，或者命令行输入，实例化一个 restclient.Config对象。\r\n\r\n从主函数BuildConfigFromFlags可以看出来。还是命令行输入的config优先使用\r\n\r\n```go\r\n// 1.主函数BuildConfigFromFlags\r\n// BuildConfigFromFlags is a helper function that builds configs from a master\r\n// url or a kubeconfig filepath. These are passed in as command line flags for cluster\r\n// components. Warnings should reflect this usage. If neither masterUrl or kubeconfigPath\r\n// are passed in we fallback to inClusterConfig. If inClusterConfig fails, we fallback\r\n// to the default config.\r\nfunc BuildConfigFromFlags(masterUrl, kubeconfigPath string) (*restclient.Config, error) {\r\n\tif kubeconfigPath == \"\" && masterUrl == \"\" {\r\n\t\tglog.Warningf(\"Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.\")\r\n\t\tkubeconfig, err := restclient.InClusterConfig()\r\n\t\tif err == nil {\r\n\t\t\treturn kubeconfig, nil\r\n\t\t}\r\n\t\tglog.Warning(\"error creating inClusterConfig, falling back to default config: \", err)\r\n\t}\r\n\treturn NewNonInteractiveDeferredLoadingClientConfig(\r\n\t\t&ClientConfigLoadingRules{ExplicitPath: kubeconfigPath},\r\n\t\t&ConfigOverrides{ClusterInfo: clientcmdapi.Cluster{Server: masterUrl}}).ClientConfig()\r\n}\r\n\r\n\r\n// 2. 
调用了clientconfig()看起来就是合并，因为可能一份kubeconfig可能要操作多个集群。并且还是通过文件指定一部分集群，通过命令行指定一部分集群。\r\n// ClientConfig implements ClientConfig\r\nfunc (config *DeferredLoadingClientConfig) ClientConfig() (*restclient.Config, error) {\r\n\tmergedClientConfig, err := config.createClientConfig()\r\n\tif err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\r\n\t// load the configuration and return on non-empty errors and if the\r\n\t// content differs from the default config\r\n\tmergedConfig, err := mergedClientConfig.ClientConfig()\r\n\tswitch {\r\n\tcase err != nil:\r\n\t\tif !IsEmptyConfig(err) {\r\n\t\t\t// return on any error except empty config\r\n\t\t\treturn nil, err\r\n\t\t}\r\n\tcase mergedConfig != nil:\r\n\t\t// the configuration is valid, but if this is equal to the defaults we should try\r\n\t\t// in-cluster configuration\r\n\t\tif !config.loader.IsDefaultConfig(mergedConfig) {\r\n\t\t\treturn mergedConfig, nil\r\n\t\t}\r\n\t}\r\n\r\n\t// check for in-cluster configuration and use it\r\n\tif config.icc.Possible() {\r\n\t\tglog.V(4).Infof(\"Using in-cluster configuration\")\r\n\t\treturn config.icc.ClientConfig()\r\n\t}\r\n\r\n\t// return the result of the merged client config\r\n\treturn mergedConfig, err\r\n}\r\n\r\n// 3.调用NewNonInteractiveDeferredLoadingClientConfig函数\r\n// NewNonInteractiveDeferredLoadingClientConfig creates a ConfigClientClientConfig using the passed context name\r\nfunc NewNonInteractiveDeferredLoadingClientConfig(loader ClientConfigLoader, overrides *ConfigOverrides) ClientConfig {\r\n\treturn &DeferredLoadingClientConfig{loader: loader, overrides: overrides, icc: &inClusterClientConfig{overrides: overrides}}\r\n}\r\n\r\n// 4.最终就是实例化这样一个对象\r\n// DeferredLoadingClientConfig is a ClientConfig interface that is backed by a client config loader.\r\n// It is used in cases where the loading rules may change after you've instantiated them and you want to be sure that\r\n// the most recent rules are used.  
This is useful in cases where you bind flags to loading rule parameters before\r\n// the parse happens and you want your calling code to be ignorant of how the values are being mutated to avoid\r\n// passing extraneous information down a call stack\r\ntype DeferredLoadingClientConfig struct {\r\n\tloader         ClientConfigLoader\r\n\toverrides      *ConfigOverrides\r\n\tfallbackReader io.Reader\r\n\r\n\tclientConfig ClientConfig\r\n\tloadingLock  sync.Mutex\r\n\r\n\t// provided for testing\r\n\ticc InClusterConfig\r\n}\r\n```\r\n\r\n合并Kubeconfig的效果如下：\r\n![merged-config.png](../images/merged-config.png)\r\n\r\n### 3.总结\r\n\r\n（1）简单了解client-go的结构和对kube-config的使用"
  },
  {
    "path": "k8s/client-go/10. Controller-runtime原理分析.md",
    "content": "- [1. Controller-runtime结构介绍](#1-controller-runtime----)\n- [2. Controller-runtime 底层原理](#2-controller-runtime-----)\n  * [2.1 manager相关结构体介绍](#21-manager-------)\n  * [2.2 controller相关结构体介绍](#22-controller-------)\n  * [2.3 controller启动流程](#23-controller----)\n  * [2.4 manager是如何启动controller的](#24-manager-----controller-)\n    + [2.4.1 第一步-manager的初始化](#241-----manager----)\n    + [2.4.2 第二步-将controller绑定到manager](#242------controller---manager)\n    + [2.4.3 第三步-启动manager.start](#243-------managerstart)\n  * [2.5 runtime cache](#25-runtime-cache)\n    + [2.5.1 cache是什么](#251-cache---)\n    + [2.5.2 cache初始化逻辑](#252-cache-----)\n- [3.总结](#3--)\n- [4. 参考](#4---)\n\n### 1. Controller-runtime结构介绍\n\nkubebuilder底层使用的就是Controller-runtime，Controller-runtime为 Controller 的开发提供了各种功能模块，每个模块中包括了一个 或多个实现，通过这些模块，开发者可以灵活地构建自己的 Controller，主要包括以下内容:\n\n（1） Client：用于读写 Kubernetes 资源对象的客户端。 \n\n（2） Cache：本地缓存，用于保存需要监听的 Kubernetes 资源。缓存提供了只读客户端， 用于从缓存中读取对象。缓存还可以注册处理方法（EventHandler），以响应更新的事件。 \n\n（3） Manager：用于控制多个 Controller，提供 Controller 共用的依赖项，如 Client、 Cache、Schemes 等。通过调用 Manager.Start 方法，可以启动 Controller。\n\n（4） Controller：控制器，响应事件（Kubernetes 资源对象的创建、更新、删除）并 确保对象规范（Spec 字段）中指定的状态与系统状态匹配，如果不匹配，则控制器需要根 据事件的对象，通过协调器（Reconciler）进行同步。在实现上，Controller 是用于处理 reconcile.Requests 的工作队列，reconcile.Requests 包含了需要匹配状态的资源对象。\n\n ① Controller 需要提供 Reconciler 来处理从工作队列中获取的请求。\n\n ② Controller 需要配置相应的资源监听，根据监听到的 Event 生成 reconcile.Requests 并加入队列。\n\n （5） Reconciler：为 Controller 提供同步的功能，Controller 可以随时通过资源对象的 Name 和 Namespace 来调用 Reconciler，调用时，Reconciler 将确保系统状态与资源对象 所表示的状态相匹配。例如，当某个 ReplicaSet 的副本数为 5，但系统中只有 3 个 Pod 时， 同步 ReplicaSet 资源的 Reconciler 需要新建两个 Pod，并将它们的 OwnerReference 字段 指向对应的 ReplicaSet。\n\n ① Reconciler 包含了 Controller 所有的业务逻辑。\n\n ② Reconciler 通常只处理单个对象类型，例如只处理 ReplicaSets 的 Reconciler，不 处理其他的对象类型。如果需要处理多种对象类型，需要实现多个 Controller。如果你 希望通过其他类型来触发 Reconciler，例如，通过 Pod 对象的事件来触发 ReplicaSet 的 Recon- ciler，则可以提供一个映射，通过该映射将触发 Reconciler 的类型映射到需要匹 配的类型。 \n\n③ 提供给 Reconciler 
的参数是需要匹配的资源对象的 Name 和 Namespace。 \n\n④ Reconciler 不关心触发它的事件的内容和类型。例如，对于同步 ReplicaSet 资源的 Reconciler 来说，触发它的是 ReplicaSet 的创建还是更新并不重要，Reconciler 总是会比 较系统中相应的 Pod 数量和 ReplicaSet 中指定的副本数量。 \n\n（6） WebHook：准 入 WebHook（Admission WebHook） 是 扩 展 Kubernetes API 的 一种机制，WebHook 可以根据事件类型进行配置，比如资源对象的创建、删除、更改等 事件，当配置的事件发生时，Kubernetes 的 APIServer 会向 WebHook 发送准入请求 （AdmissionRequests），WebHook 可以对请求中的资源对象进行更改或准入验证，然后将 处理结果响应给 APIServer。\n\n（7） Source：resource.Source 是 Controller.Watch 的参数，提供事件，事件通常是来 自 Kubernetes 的 APIServer（如 Pod 创建、更新和删除）。例如，source.Kind 使用指定 对象（通过 GroupVersionKind 指定）的 Kubernetes API Watch 接口来提供此对象的创建、 更新、删除事件。 \n\n① Source 通过 Watch API 提供 Kubernetes 指定对象的事件流。\n\n② 建议开发者使用 Controller-runtime 中已有的 Source 实现，而不是自己实现此接口。 \n\n（8） EventHandler：handler.EventHandler 是 Controller.Watch 的 参 数， 用 于 将 事 件对应的 reconcile.Requests 加入队列。例如，从 Source 中接收到一个 Pod 的创建事 件，eventhandler.EnqueueHandler 会 根 据 Pod 的 Name 与 Namespace 生 成 reconcile. Requests 后，加入队列。\n\n ① EventHandlers 处理事件的方式是将一个或多个 reconcile.Requests 加入队列。 \n\n② 在 EventHandler 的处理中，事件所属的对象的类型（比如 Pod 的创建事件属于 Pod 对象），可能与 reconcile.Requests 所加入的对象类型相同。\n\n③ 事件所属的对象的类型也可能与 reconcile.Requests 所加入的对象类型不同。例如 将 Pod 的事件映射为所属的 ReplicaSet 的 reconcile.Requests。 \n\n④ EventHandler 可能会将一个事件映射为多个 reconcile.Requests 并加入队列，多个 reconcile.Requests 可能属于一个对象类型，也可能涉及多个对象类型。例如，由于集群扩 展导致的 Node 事件。\n\n ⑤ 在大多数情况下，建议开发者使用 Controller-runtime 中已有的 EventHandler 来 实现，而不是自己实现此接口。\n\n（9） Predicate：predicate.Predicate 是 Controller.Watch 的参数，是用于过滤事件的 过滤器，过滤器可以复用或者组合。\n\n ① Predicate 接口以事件作为输入，以布尔值作为输出，当返回 True 时，表示需要将 事件加入队列。\n\n ② Predicate 是可选的。\n\n ③ 建议开发者使用 Controller-runtime 中已有的 Predicate 实现，但可以使用其他 Predicate 进行过滤。\n\n\n\n![image-20220826141310531](../images/image-20220826141310531.png)\n\nController-runtime 核心流程如下：\n\n* Source 通过 Kubernetes APIServer 监听指定资源对象\n* EventHandler 根据资源对象变化事件，将 reconcile.Request 加入队列\n* 从队列中获取 reconcile.Request，并调用 Reconciler 进行同步\n\n![image-20220826142338724](../images/image-20220826142338724.png)\n\n### 2. 
Controller-runtime 底层原理\n\n#### 2.1 manager相关结构体介绍\n\nManager的方法\n\n```\n\ntype Manager interface {\n\tcluster.Cluster                   //cluster.Cluster  提供了一系列方法，以获取与集群相关的对象。\n\tAdd(Runnable) error               //添加controller\n\tElected() <-chan struct{}         // 选举相关, 返回一个 Channel 结构，用于判断选举状态。当未配\n置选举或当选 Leader 时，Channel 将被关闭。\n\tAddMetricsExtraHandler(path string, handler http.Handler) error     // metrics相关\n\tAddHealthzCheck(name string, check healthz.Checker) error           // 健康检查相关\n\tAddReadyzCheck(name string, check healthz.Checker) error            // 是否就绪\n\tStart(ctx context.Context) error                                    // 启动所有的controller\n\tGetWebhookServer() *webhook.Server                                  \n\tGetLogger() logr.Logger\n\tGetControllerOptions() v1alpha1.ControllerConfigurationSpec             \n}\n```\n\nManager启动时Options介绍。这里介绍几个关键的。\n\n（1） Scheme 结构。一般先通过 k8s.io/apimachinery/pkg/runtime 中的 NewScheme() 方法获取 Kubernetes 的 Scheme，然后再将 CRD 注册到 Scheme\n\n（2） MapperProvider 是一个函数对象，其定义为 func（c *rest.Config) （meta.RESTMapper，error)，用于定义 Manager 如何获取 RESTMapper。默认通过 k8s.io/client-go 中的 DiscoveryClient 请求获取 Kube-APIServer。 \n\n（3） Logger 用于定义 Manager 的日志输出对象，默认使用 pkg/internal/log 包下的 全局参数 RuntimeLog。 \n\n（4）SyncPeriod 参数用于指定 Informer 重新同步并处理资源的时间间隔，默认为 10 小时。此参数也决定了 Controller 重新同步的时间间隔，每个 Controller 的时间间隔以此 参数为基准有 10% 的抖动，以避免多个 Controller 同时进行重新同步。\n\n（5） Namespace 参数用于限制 Manager.Cache 只监听指定 Namespace 的资源，默认 情况下无限制。 \n\n（6） EventBroadcaster 参数用于提供 Manager，以获取 EventRecorder，当前已不 推荐使用，因为当Manager或Controller的生命周期短于EventBroadcaster的生命周期时， 可能会导致 goroutine 泄露。\n\n```\n// Options are the arguments for creating a new Manager.\ntype Options struct {\n\t// Scheme is the scheme used to resolve runtime.Objects to GroupVersionKinds / Resources\n\t// Defaults to the kubernetes/client-go scheme.Scheme, but it's almost always better\n\t// idea to pass your own scheme in.  
See the documentation in pkg/scheme for more information.\n\tScheme *runtime.Scheme\n\n\t// MapperProvider provides the rest mapper used to map go types to Kubernetes APIs\n\tMapperProvider func(c *rest.Config) (meta.RESTMapper, error)\n\n\tSyncPeriod *time.Duration\n\t。。。\n}\n```\n\n<br>\n\n#### 2.2 controller相关结构体介绍\n\n**接口**\n\n```\ntype Controller interface {\n  // 匿名接口，定义了 Reconcile（context.Context，Request） （Result，error）\n\treconcile.Reconciler\n\n  // Watch() 方法会从 source.Source 中获取 Event，并根据参数 Eventhandler 来决定如何入队，根据参数 Predicates 进行 Event 过滤。Predicates 可能有多个，只有所有的 Predicates 都返回 True 时，才会将 Event 发送给 Eventhandler 处理。\n\tWatch(src source.Source, eventhandler handler.EventHandler, predicates ...predicate.Predicate) error\n\n  // Controller 的启动方法，实现了 Controller 接口的对象，也实现了 Runnable，因此，该方法可以被 Manager 管理。\n\tStart(ctx context.Context) error\n\n\t// 获取 Controller 内的 Logger，用于日志输出。\n\tGetLogger() logr.Logger\n}\n```\n\n**结构体实现**\n\nController 的实现在 pkg/internal/controller/controller.go 下，为结构体 Controller，Controller 结构体中包括的主要成员如下。\n\n（1） Name string：必须设置，用于标识 Controller，会在 Controller 的日志输出中进行关联。\n\n（2） MaxConcurrentReconciles int：定义允许 reconcile.Reconciler 同时运行的最多个数，默认为 1。\n\n（3） Do reconcile.Reconciler：定义了 Reconcile() 方法，包含了 Controller 同步的业务逻辑。Reconcile() 能在任意时刻被调用，接收一个对象的 Name 与 Namespace，并同步集群当前实际状态至该对象被设置的期望状态。\n\n（4） MakeQueue func() workqueue.RateLimitingInterface：用于在 Controller 启动时，创建工作队列。由于标准的 Kubernetes 工作队列创建后会立即启动，因此，如果在 Controller 启动前就创建队列，在重复调用 controller.New() 方法创建 Controller 的情况下，就会导致 Goroutine 泄露。\n\n（5） Queue workqueue.RateLimitingInterface：使用上面方法创建的工作队列。\n\n（6） SetFields func(i interface{}) error：用于从 Manager 中获取 Controller 依赖的方法，依赖包括 Sources、EventHandlers 和 Predicates 等。此方法存储的是 controllerManager.SetFields() 方法。\n\n（7） Started bool：用于表示 Controller 是否已经启动。\n\n（8） CacheSyncTimeout time.Duration：定义了 Cache 完成同步的等待时长，超过时长会被认为是同步失败。默认时长为 2 分钟。\n\n（9） startWatches []watchDescription：定义了一组 Watch 操作的属性，会在 Controller 
启动时，根据属性进行 Watch 操作。watchDescription 的定义见代码清单 3-30，watchDescription 包 括 Event 的 源 source.Source、Event 的 入 队 方 法 handler. EventHandler 以及 Event 的过滤方法 predicate.Predicate。\n\n```\n// Controller implements controller.Controller.\ntype Controller struct {\n\t// Name is used to uniquely identify a Controller in tracing, logging and monitoring.  Name is required.\n\tName string\n\n\t// MaxConcurrentReconciles is the maximum number of concurrent Reconciles which can be run. Defaults to 1.\n\tMaxConcurrentReconciles int\n\n\t// Reconciler is a function that can be called at any time with the Name / Namespace of an object and\n\t// ensures that the state of the system matches the state specified in the object.\n\t// Defaults to the DefaultReconcileFunc.\n\tDo reconcile.Reconciler\n\n\t// MakeQueue constructs the queue for this controller once the controller is ready to start.\n\t// This exists because the standard Kubernetes workqueues start themselves immediately, which\n\t// leads to goroutine leaks if something calls controller.New repeatedly.\n\tMakeQueue func() workqueue.RateLimitingInterface\n\n\t// Queue is an listeningQueue that listens for events from Informers and adds object keys to\n\t// the Queue for processing\n\tQueue workqueue.RateLimitingInterface\n\n\t// SetFields is used to inject dependencies into other objects such as Sources, EventHandlers and Predicates\n\t// Deprecated: the caller should handle injected fields itself.\n\tSetFields func(i interface{}) error\n\n\t// mu is used to synchronize Controller setup\n\tmu sync.Mutex\n\n\t// Started is true if the Controller has been Started\n\tStarted bool\n\n\t// ctx is the context that was passed to Start() and used when starting watches.\n\t//\n\t// According to the docs, contexts should not be stored in a struct: https://golang.org/pkg/context,\n\t// while we usually always strive to follow best practices, we consider this a legacy case and it should\n\t// undergo a major refactoring and redesign to allow 
for context to not be stored in a struct.\n\tctx context.Context\n\n\t// CacheSyncTimeout refers to the time limit set on waiting for cache to sync\n\t// Defaults to 2 minutes if not set.\n\tCacheSyncTimeout time.Duration\n\n\t// startWatches maintains a list of sources, handlers, and predicates to start when the controller is started.\n\tstartWatches []watchDescription\n\n\t// Log is used to log messages to users during reconciliation, or for example when a watch is started.\n\tLog logr.Logger\n\n\t// RecoverPanic indicates whether the panic caused by reconcile should be recovered.\n\tRecoverPanic bool\n}\n```\n\n#### 2.3 controller启动流程\n\ncontroller 跟随 manager.Start 而启动，然后按照下面的流程运行。在 c.Do.Reconcile 函数中调用了我们实现的 Reconcile 函数，进行真正的控制器逻辑处理。\n\n![image-20220826145939603](../images/image-20220826145939603.png)\n\n#### 2.4 manager是如何启动controller的\n\n##### 2.4.1 第一步-manager的初始化\n\n一般在 main 函数就调用 ctrl.NewManager 函数进行初始化。ctrl.NewManager 函数有 2 个参数：第一个参数就是 k8s 集群的 *rest.Config，第二个就是 Options，也就是前面 Manager 的 Options 介绍的参数，比如可以自定义 SyncPeriod 等等。\n\n```\nmgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{\n\t\tScheme:                 scheme,\n\t\tMetricsBindAddress:     metricsAddr,\n\t\tPort:                   9443,\n\t\tHealthProbeBindAddress: probeAddr,\n\t\tLeaderElection:         enableLeaderElection,\n\t\tLeaderElectionID:       "ec7e1f70.github.com",\n\t\t// LeaderElectionReleaseOnCancel defines if the leader should step down voluntarily\n\t\t// when the Manager ends. This requires the binary to immediately end when the\n\t\t// Manager is stopped, otherwise, this setting is unsafe. Setting this significantly\n\t\t// speeds up voluntary leader transitions as the new leader don't have to wait\n\t\t// LeaseDuration time first.\n\t\t//\n\t\t// In the default scaffold provided, the program ends immediately after\n\t\t// the manager stops, so would be fine to enable this option. 
However,\n\t\t// if you are doing or is intended to do any operation such as perform cleanups\n\t\t// after the manager stops then its usage might be unsafe.\n\t\t// LeaderElectionReleaseOnCancel: true,\n\t})\n```\n\n<br>\n\nctrl.NewManager实际就是初始化这个结构体，有了这个结构体就可以和k8s集群打交道了。\n\n```\nreturn &controllerManager{\n\t\tstopProcedureEngaged:          pointer.Int64(0),\n\t\tcluster:                       cluster,\n\t\trunnables:                     runnables,\n\t\terrChan:                       errChan,\n\t\trecorderProvider:              recorderProvider,\n\t\tresourceLock:                  resourceLock,\n\t\tmetricsListener:               metricsListener,\n\t\tmetricsExtraHandlers:          metricsExtraHandlers,\n\t\tcontrollerOptions:             options.Controller,\n\t\tlogger:                        options.Logger,\n\t\telected:                       make(chan struct{}),\n\t\tport:                          options.Port,\n\t\thost:                          options.Host,\n\t\tcertDir:                       options.CertDir,\n\t\twebhookServer:                 options.WebhookServer,\n\t\tleaseDuration:                 *options.LeaseDuration,\n\t\trenewDeadline:                 *options.RenewDeadline,\n\t\tretryPeriod:                   *options.RetryPeriod,\n\t\thealthProbeListener:           healthProbeListener,\n\t\treadinessEndpointName:         options.ReadinessEndpointName,\n\t\tlivenessEndpointName:          options.LivenessEndpointName,\n\t\tgracefulShutdownTimeout:       *options.GracefulShutdownTimeout,\n\t\tinternalProceduresStop:        make(chan struct{}),\n\t\tleaderElectionStopped:         make(chan struct{}),\n\t\tleaderElectionReleaseOnCancel: options.LeaderElectionReleaseOnCancel,\n\t}, nil\n```\n\n##### 2.4.2 第二步-将controller绑定到manager\n\n这一步需要调用SetupWithManager函数，这个是每个controller自己实现的。最简单就是使用通用的方法。\n\n```\n// SetupWithManager sets up the controller with the Manager.\nfunc (r *PodCountReconciler) SetupWithManager(mgr ctrl.Manager) error {\n\treturn 
ctrl.NewControllerManagedBy(mgr).\n\t\tFor(&zouxappv1.PodCount{}).\n\t\tComplete(r)\n}\n```\n\n详细地说，创建 Controller 基本分为 3 步。\n\n**第一步**，通过 ControllerManagedBy(m manager.Manager) *Builder 方法实例化一个 Builder 对象，其中传入的 Manager 提供创建 Controller 所需的依赖。\n\n这一步的意思是：定义一个 Builder，并把它和 manager 绑定。\n\n```\n// Builder builds a Controller.\ntype Builder struct {\n\tforInput         ForInput\n\townsInput        []OwnsInput\n\twatchesInput     []WatchesInput\n\tmgr              manager.Manager\n\tglobalPredicates []predicate.Predicate\n\tctrl             controller.Controller\n\tctrlOptions      controller.Options\n\tname             string\n}\n\n// ControllerManagedBy returns a new controller builder that will be started by the provided Manager.\nfunc ControllerManagedBy(m manager.Manager) *Builder {\n\treturn &Builder{mgr: m}\n}\n```\n\n**第二步**，使用 For(object client.Object, opts ...ForOption) 方法设置需要监听的资源类型。\n\n实际就是完善 Builder 的 forInput 结构体。\n\n**注意：**这里就相当于调用了 Watches(&source.Kind{Type: apiType}, &handler.EnqueueRequestForObject{})。\n\n如果想让一个 controller 监听多个对象，或者想实现自己的监听逻辑，比如不监听删除操作、只监听特定的 update 操作，就需要自己调用 NewController 来实现了。\n\n```\n// For defines the type of Object being *reconciled*, and configures the ControllerManagedBy to respond to create / delete /\n// update events by *reconciling the object*.\n// This is the equivalent of calling\n// Watches(&source.Kind{Type: apiType}, &handler.EnqueueRequestForObject{}).\nfunc (blder *Builder) For(object client.Object, opts ...ForOption) *Builder {\n\tif blder.forInput.object != nil {\n\t\tblder.forInput.err = fmt.Errorf("For(...) 
should only be called once, could not assign multiple objects for reconciliation")\n\t\treturn blder\n\t}\n\tinput := ForInput{object: object}\n\tfor _, opt := range opts {\n\t\topt.ApplyToFor(&input)\n\t}\n\n\tblder.forInput = input\n\treturn blder\n}\n```\n\n**第三步**，使用 Complete 函数将 controller 绑定到 manager。\n\n```\n// Complete builds the Application Controller.\nfunc (blder *Builder) Complete(r reconcile.Reconciler) error {\n\t_, err := blder.Build(r)\n\treturn err\n}\n```\n\n**总结：** controller-runtime 实际是通过 builder 这个对象，将 mgr 和 controller 绑定。\n\n##### 2.4.3 第三步-启动manager.start\n\ncontrollerManager.Start 会依次启动 serveMetrics，serveHealthProbes，Webhooks，Caches，startLeaderElectionRunnables。\n\n这里主要关注每个 controller 是如何启动的，调用链为：manager.Start -> startLeaderElectionRunnables -> cm.runnables.LeaderElection.Start -> go r.reconcile() -> for 循环中用 goroutine 启动每个 controller。\n\n```\nif err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {\n\t\tsetupLog.Error(err, "problem running manager")\n\t\tos.Exit(1)\n\t}\n\n\n// Start starts the manager and waits indefinitely.\n// There is only two ways to have start return:\n// An error has occurred during in one of the internal operations,\n// such as leader election, cache start, webhooks, and so on.\n// Or, the context is cancelled.\nfunc (cm *controllerManager) Start(ctx context.Context) (err error) {\n\tcm.Lock()\n\tif cm.started {\n\t\tcm.Unlock()\n\t\treturn errors.New("manager already started")\n\t}\n\tvar ready bool\n\tdefer func() {\n\t\t// Only unlock the manager if we haven't reached\n\t\t// the internal readiness condition.\n\t\tif !ready {\n\t\t\tcm.Unlock()\n\t\t}\n\t}()\n\n\t// Initialize the internal context.\n\tcm.internalCtx, cm.internalCancel = context.WithCancel(ctx)\n\n\t// This chan indicates that stop is complete, in other words all runnables have returned or timeout on stop request\n\tstopComplete := make(chan struct{})\n\tdefer close(stopComplete)\n\t// This must be deferred after closing stopComplete, otherwise we deadlock.\n\tdefer 
func() {\n\t\t// https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/gettyimages-459889618-1533579787.jpg\n\t\tstopErr := cm.engageStopProcedure(stopComplete)\n\t\tif stopErr != nil {\n\t\t\tif err != nil {\n\t\t\t\t// Utilerrors.Aggregate allows to use errors.Is for all contained errors\n\t\t\t\t// whereas fmt.Errorf allows wrapping at most one error which means the\n\t\t\t\t// other one can not be found anymore.\n\t\t\t\terr = kerrors.NewAggregate([]error{err, stopErr})\n\t\t\t} else {\n\t\t\t\terr = stopErr\n\t\t\t}\n\t\t}\n\t}()\n\n\t// Add the cluster runnable.\n\tif err := cm.add(cm.cluster); err != nil {\n\t\treturn fmt.Errorf(\"failed to add cluster to runnables: %w\", err)\n\t}\n\n\t// Metrics should be served whether the controller is leader or not.\n\t// (If we don't serve metrics for non-leaders, prometheus will still scrape\n\t// the pod but will get a connection refused).\n\tif cm.metricsListener != nil {\n\t\tcm.serveMetrics()\n\t}\n\n\t// Serve health probes.\n\tif cm.healthProbeListener != nil {\n\t\tcm.serveHealthProbes()\n\t}\n\n\t// First start any webhook servers, which includes conversion, validation, and defaulting\n\t// webhooks that are registered.\n\t//\n\t// WARNING: Webhooks MUST start before any cache is populated, otherwise there is a race condition\n\t// between conversion webhooks and the cache sync (usually initial list) which causes the webhooks\n\t// to never start because no cache can be populated.\n\tif err := cm.runnables.Webhooks.Start(cm.internalCtx); err != nil {\n\t\tif err != wait.ErrWaitTimeout {\n\t\t\treturn err\n\t\t}\n\t}\n\n\t// Start and wait for caches.\n\tif err := cm.runnables.Caches.Start(cm.internalCtx); err != nil {\n\t\tif err != wait.ErrWaitTimeout {\n\t\t\treturn err\n\t\t}\n\t}\n\n\t// Start the non-leaderelection Runnables after the cache has synced.\n\tif err := cm.runnables.Others.Start(cm.internalCtx); err != nil {\n\t\tif err != wait.ErrWaitTimeout {\n\t\t\treturn err\n\t\t}\n\t}\n\n\t// 
Start the leader election and all required runnables.\n\t{\n\t\tctx, cancel := context.WithCancel(context.Background())\n\t\tcm.leaderElectionCancel = cancel\n\t\tgo func() {\n\t\t\tif cm.resourceLock != nil {\n\t\t\t\tif err := cm.startLeaderElection(ctx); err != nil {\n\t\t\t\t\tcm.errChan <- err\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t   // 启动每个controller\n\t\t\t\t// Treat not having leader election enabled the same as being elected.\n\t\t\t\tif err := cm.startLeaderElectionRunnables(); err != nil {\n\t\t\t\t\tcm.errChan <- err\n\t\t\t\t}\n\t\t\t\tclose(cm.elected)\n\t\t\t}\n\t\t}()\n\t}\n\n\tready = true\n\tcm.Unlock()\n\tselect {\n\tcase <-ctx.Done():\n\t\t// We are done\n\t\treturn nil\n\tcase err := <-cm.errChan:\n\t\t// Error starting or running a runnable\n\t\treturn err\n\t}\n}\n```\n\n<br>\n\n最终Controller就像内置的控制器一样，通过processNextWorkItem函数一个个处理。主要这里还可以通过**MaxConcurrentReconciles**提高并发。\n\n```\n// Start implements controller.Controller.\nfunc (c *Controller) Start(ctx context.Context) error {\n\t// use an IIFE to get proper lock handling\n\t// but lock outside to get proper handling of the queue shutdown\n\tc.mu.Lock()\n\tif c.Started {\n\t\treturn errors.New(\"controller was started more than once. 
This is likely to be caused by being added to a manager multiple times\")\n\t}\n\n\tc.initMetrics()\n\n\t// Set the internal context.\n\tc.ctx = ctx\n\n\tc.Queue = c.MakeQueue()\n\tgo func() {\n\t\t<-ctx.Done()\n\t\tc.Queue.ShutDown()\n\t}()\n\n\twg := &sync.WaitGroup{}\n\terr := func() error {\n\t\tdefer c.mu.Unlock()\n\n\t\t// TODO(pwittrock): Reconsider HandleCrash\n\t\tdefer utilruntime.HandleCrash()\n\n\t\t// NB(directxman12): launch the sources *before* trying to wait for the\n\t\t// caches to sync so that they have a chance to register their intendeded\n\t\t// caches.\n\t\tfor _, watch := range c.startWatches {\n\t\t\tc.Log.Info(\"Starting EventSource\", \"source\", fmt.Sprintf(\"%s\", watch.src))\n\n\t\t\tif err := watch.src.Start(ctx, watch.handler, c.Queue, watch.predicates...); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t}\n\n\t\t// Start the SharedIndexInformer factories to begin populating the SharedIndexInformer caches\n\t\tc.Log.Info(\"Starting Controller\")\n\n\t\tfor _, watch := range c.startWatches {\n\t\t\tsyncingSource, ok := watch.src.(source.SyncingSource)\n\t\t\tif !ok {\n\t\t\t\tcontinue\n\t\t\t}\n\n\t\t\tif err := func() error {\n\t\t\t\t// use a context with timeout for launching sources and syncing caches.\n\t\t\t\tsourceStartCtx, cancel := context.WithTimeout(ctx, c.CacheSyncTimeout)\n\t\t\t\tdefer cancel()\n\n\t\t\t\t// WaitForSync waits for a definitive timeout, and returns if there\n\t\t\t\t// is an error or a timeout\n\t\t\t\tif err := syncingSource.WaitForSync(sourceStartCtx); err != nil {\n\t\t\t\t\terr := fmt.Errorf(\"failed to wait for %s caches to sync: %w\", c.Name, err)\n\t\t\t\t\tc.Log.Error(err, \"Could not wait for Cache to sync\")\n\t\t\t\t\treturn err\n\t\t\t\t}\n\n\t\t\t\treturn nil\n\t\t\t}(); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t}\n\n\t\t// All the watches have been started, we can reset the local slice.\n\t\t//\n\t\t// We should never hold watches more than necessary, each watch source can hold a backing 
cache,\n\t\t// which won't be garbage collected if we hold a reference to it.\n\t\tc.startWatches = nil\n\n\t\t// Launch workers to process resources\n\t\tc.Log.Info(\"Starting workers\", \"worker count\", c.MaxConcurrentReconciles)\n\t\twg.Add(c.MaxConcurrentReconciles)\n\t\tfor i := 0; i < c.MaxConcurrentReconciles; i++ {\n\t\t\tgo func() {\n\t\t\t\tdefer wg.Done()\n\t\t\t\t// Run a worker thread that just dequeues items, processes them, and marks them done.\n\t\t\t\t// It enforces that the reconcileHandler is never invoked concurrently with the same object.\n\t\t\t\tfor c.processNextWorkItem(ctx) {\n\t\t\t\t}\n\t\t\t}()\n\t\t}\n\n\t\tc.Started = true\n\t\treturn nil\n\t}()\n\tif err != nil {\n\t\treturn err\n\t}\n\n\t<-ctx.Done()\n\tc.Log.Info(\"Shutdown signal received, waiting for all workers to finish\")\n\twg.Wait()\n\tc.Log.Info(\"All workers finished\")\n\treturn nil\n}\n```\n\n\n\n\n\n#### 2.5 runtime cache\n\n##### 2.5.1 cache是什么\n\nCache 接口定义了如下两个接口:\n\n （1）client.Reader：用于从 Cache 中获取及列举 Kubernetes 集群的资源。 \n\n（2）Informers：可为不同的 GVK 创建或获取对应的 Informer，并将 Index 添加到对 应的 Informer 中。\n\nKubernetes 是典型的 Server-Client 的架构，APIServer 作为集群统一的操作入口，任 何对资源所做的操作（包括增删改查）都必须经过 APIServer。为了减轻 APIServer 的压力， Controller-runtime 抽象出一个 Cache 层，Client 端对 APIServer 数据的读取和监听操作都 将通过 Cache 层来进行。\n\n```\n// Cache knows how to load Kubernetes objects, fetch informers to request\n// to receive events for Kubernetes objects (at a low-level),\n// and add indices to fields on the objects stored in the cache.\ntype Cache interface {\n\t// Cache acts as a client to objects stored in the cache.\n\tclient.Reader\n\n\t// Cache loads informers and adds field indices.\n\tInformers\n}\n```\n\n<br>\n\n##### 2.5.2 cache初始化逻辑\n\n在new Manager的时候就初始化了缓存，具体的步骤是 New-> cluster.New -> \n\n```\n// New returns a new Manager for creating Controllers.\nfunc New(config *rest.Config, options Options) (Manager, error) {\n\t// Set default values for options fields\n\toptions = 
setOptionsDefaults(options)\n\n\tcluster, err := cluster.New(config, func(clusterOptions *cluster.Options) {\n\t\tclusterOptions.Scheme = options.Scheme\n\t\tclusterOptions.MapperProvider = options.MapperProvider\n\t\tclusterOptions.Logger = options.Logger\n\t\tclusterOptions.SyncPeriod = options.SyncPeriod\n\t\tclusterOptions.Namespace = options.Namespace\n\t\tclusterOptions.NewCache = options.NewCache\n\t\tclusterOptions.NewClient = options.NewClient\n\t\tclusterOptions.ClientDisableCacheFor = options.ClientDisableCacheFor\n\t\tclusterOptions.DryRunClient = options.DryRunClient\n\t\tclusterOptions.EventBroadcaster = options.EventBroadcaster //nolint:staticcheck\n\t})\n\n\n// options.NewCache 初始化 cache\n// Create the cache for the cached read client and registering informers\n\tcache, err := options.NewCache(config, cache.Options{Scheme: options.Scheme, Mapper: mapper, Resync: options.SyncPeriod, Namespace: options.Namespace})\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\t\n// Allow newCache to be mocked\n\tif options.NewCache == nil {\n\t\toptions.NewCache = cache.New\n\t}\n```\n\n在 Controller Manager 的初始化启动过程中，将会构建 Cache 层，以供 Manager 使用。在用户没有指定 Cache 初始化函数的前提下，将使用 Controller-runtime 默认提供的 Cache 初始化函数，默认 Cache 初始化的流程如下：\n\n```\n// New initializes and returns a new Cache.\nfunc New(config *rest.Config, opts Options) (Cache, error) {\n   opts, err := defaultOpts(config, opts)\n   if err != nil {\n      return nil, err\n   }\n   selectorsByGVK, err := convertToSelectorsByGVK(opts.SelectorsByObject, opts.DefaultSelector, opts.Scheme)\n   if err != nil {\n      return nil, err\n   }\n   disableDeepCopyByGVK, err := convertToDisableDeepCopyByGVK(opts.UnsafeDisableDeepCopyByObject, opts.Scheme)\n   if err != nil {\n      return nil, err\n   }\n   im := internal.NewInformersMap(config, opts.Scheme, opts.Mapper, *opts.Resync, opts.Namespace, selectorsByGVK, disableDeepCopyByGVK)\n   return &informerCache{InformersMap: im}, nil\n}\n```\n\n（1）设置默认参数：若 Scheme 为空，则设置为 scheme.Scheme；若 Mapper 为空，则通过 apiutil.NewDiscoveryRESTMapper 基于 Discovery 的信息构建出一个 RESTMapper，用于管理所有 Object 的信息；若同步时间为空，则将 Informer 的同步时间设置为 10 小时。\n\n（2）初始化 InformersMap，为 3 种不同类型的 Object（structured、unstructured、metadata-only）分别构建 InformersMap。\n\n（3）初始化 specificInformersMap：该接口通过 Object 与 GVK 的组合信息创建并缓存 Informers。\n\n（4）定义 List-Watch 函数：为 3 种不同类型的 Object 实现 List-Watch 函数，通过该函数可对 GVK 进行 List 和 Watch 操作。\n\n通过 Cache 的初始化流程，我们可以看出 Cache 主要创建了 InformersMap，Scheme 中的每个 GVK 都会创建对应的 Informer，再通过 informersByGVK 的 Map，实现 GVK 到 Informer 的映射；每个 Informer 都会通过 List-Watch 函数对相应的 GVK 进行 List 和 Watch 操作。\n\n![image-20220829162503586](../images/image-20220829162503586.png)\n\n\n\nCache 启动的核心是启动创建的所有 Informer：\n\n```\n// Start calls Run on each of the informers and sets started to true.  Blocks on the context.\nfunc (m *InformersMap) Start(ctx context.Context) error {\n   go m.structured.Start(ctx)\n   go m.unstructured.Start(ctx)\n   go m.metadata.Start(ctx)\n   <-ctx.Done()\n   return nil\n}\n\n\n// Start calls Run on each of the informers and sets started to true.  
Blocks on the context.\n// It doesn't return start because it can't return an error, and it's not a runnable directly.\nfunc (ip *specificInformersMap) Start(ctx context.Context) {\n\tfunc() {\n\t\tip.mu.Lock()\n\t\tdefer ip.mu.Unlock()\n\n\t\t// Set the stop channel so it can be passed to informers that are added later\n\t\tip.stop = ctx.Done()\n\n\t\t// Start each informer\n\t\tfor _, informer := range ip.informersByGVK {\n\t\t\tgo informer.Informer.Run(ctx.Done())\n\t\t}\n\n\t\t// Set started to true so we immediately start any informers added later.\n\t\tip.started = true\n\t\tclose(ip.startWait)\n\t}()\n\t<-ctx.Done()\n}\n```\n\nInformer 的启动流程主要包含以下 3 个步骤：\n\n（1）初始化 Delta FIFO 队列。\n\n（2）创建内部 Controller：配置 Delta FIFO 队列和事件的处理函数。\n\n（3）启动 Controller：创建 Reflector，负责监听 APIServer 上指定的 GVK，将 Add、Update、Delete 变更事件写入 Delta FIFO 队列中，作为变更事件的生产者；Controller 中的事件处理函数 HandleDeltas() 会消费这些变更事件，负责将更新写入本地 Indexer，同时将这些 Add、Update、Delete 事件分发给之前注册的监听器。\n\n### 3.总结\n\ncontroller-runtime 其实就是利用 client-go informer 那套机制，底层创建的是 sharedIndexInformer。\n\ncontroller-runtime 通过屏蔽底层细节，让 CRD operator 的实现非常简单。梳理一下，整体的工作流程如下所示：\n\n![image-20220826161016681](../images/image-20220826161016681.png)\n\n### 4. 参考\n\n云原生应用开发：Operator原理与实践"
  },
  {
    "path": "k8s/client-go/2-clientGo提供的四种客户端.md",
    "content": "Table of Contents\r\n=================\r\n\r\n  * [0. 四种客户端简介](#0-四种客户端简介)\r\n  * [1.discovery](#1discovery)\r\n     * [1.1 ServerGroups](#11-servergroups)\r\n     * [1.2 ServerGroupsAndResources](#12-servergroupsandresources)\r\n     * [1.3 缓存](#13-缓存)\r\n     * [1.4 实例展示](#14-实例展示)\r\n  * [2.restClient客户端](#2restclient客户端)\r\n  * [3.clientSet客户端](#3clientset客户端)\r\n     * [2.1 Clientset的定义](#21-clientset的定义)\r\n  * [4.DynamicClient客户端](#4dynamicclient客户端)\r\n\r\n### 0. 四种客户端简介\r\n\r\nclient-go的客户端对象有4个，作用各有不同：\r\n\r\n- RESTClient：是对HTTP Request进行了封装，实现了RESTful风格的API。其他客户端都是在RESTClient基础上的实现。可用于k8s内置资源和CRD资源。\r\n- ClientSet：是对k8s内置资源对象的客户端的集合，默认情况下，不能操作CRD资源，但是通过client-gen代码生成的话，也是可以操作CRD资源的。\r\n- DynamicClient：不仅能对K8S内置资源进行处理，还可以对CRD资源进行处理，不需要client-gen生成代码即可实现。\r\n- DiscoveryClient：用于发现kube-apiserver所支持的资源组、资源版本、资源信息（即Group、Version、Resources）。\r\n\r\n![client](../images/client.png)\r\n\r\n\r\nRESTClient是最基础的客户端。RESTClient对HTTP Request进行了封装，实现了RESTful风格的API。ClientSet、DynamicClient及DiscoveryClient客户端都是基于RESTClient实现的。\r\n\r\n\r\n\r\nClientSet在RESTClient的基础上封装了对Resource和Version的管理方法。每一个Resource可以理解为一个客户端，而ClientSet则是多个客户端的集合，每一个Resource和Version都以函数的方式暴露给开发者。ClientSet只能够处理Kubernetes内置资源，它是通过client-gen代码生成器自动生成的。\r\n\r\n\r\n\r\nDynamicClient与ClientSet最大的不同之处是，ClientSet仅能访问Kubernetes自带的资源（即Client集合内的资源），不能直接访问CRD自定义资源。DynamicClient能够处理Kubernetes中的所有资源对象，包括Kubernetes内置资源与CRD自定义资源。\r\n\r\nDiscoveryClient发现客户端，用于发现kube-apiserver所支持的资源组、资源版本、资源信息（即Group、Versions、Resources）。以上4种客户端：RESTClient、ClientSet、DynamicClient、DiscoveryClient都可以通过kubeconfig配置信息连接到指定的Kubernetes API Server。\r\n\r\n**总结下**：RESTClient、ClientSet和DynamicClient都可以对K8S内置资源和CRD资源进行操作，只是ClientSet需要生成代码才能操作CRD资源。\r\n\r\n而ClientSet和DynamicClient的不同还在于：DynamicClient可以操作任意的对象，而ClientSet初始化时只能指定操作一种对象。\r\n\r\n<br>\r\n\r\n\r\n### 1.discovery\r\n\r\ndiscovery包的主要作用就是提供当前k8s集群支持哪些资源以及版本信息。\r\n\r\nKubernetes API Server暴露出/api和/apis接口。DiscoveryClient通过RESTClient分别请求/api和/apis接口，从而获取Kubernetes API 
Server所支持的资源组、资源版本信息。这个是通过ServerGroups函数实现的。\r\n\r\n有了group, version信息后还是不够，因为还没有具体到资源。\r\n\r\nServerGroupsAndResources 就获得了所有的资源信息（所有的GVR资源信息），而在Resource资源的定义中，会定义好该资源支持哪些操作：list, delete, get等等。\r\n\r\n所以kubectl中就使用discovery做了资源的校验：获取所有资源的版本信息以及支持的操作，就可以判断客户端当前操作是否合理。\r\n\r\n#### 1.1 ServerGroups\r\n\r\nstaging/src/k8s.io/client-go/discovery/discovery_client.go\r\n\r\n```\r\n// ServerGroups returns the supported groups, with information like supported versions and the\r\n// preferred version.\r\nfunc (d *DiscoveryClient) ServerGroups() (apiGroupList *metav1.APIGroupList, err error) {\r\n\t// Get the groupVersions exposed at /api\r\n\tv := &metav1.APIVersions{}\r\n\t// 先请求 https://192.168.0.4:6443/api，获得core下面的组\r\n\terr = d.restClient.Get().AbsPath(d.LegacyPrefix).Do().Into(v)\r\n\tapiGroup := metav1.APIGroup{}\r\n\tif err == nil && len(v.Versions) != 0 {\r\n\t\tapiGroup = apiVersionsToAPIGroup(v)\r\n\t}\r\n\tif err != nil && !errors.IsNotFound(err) && !errors.IsForbidden(err) {\r\n\t\treturn nil, err\r\n\t}\r\n\r\n\t// Get the groupVersions exposed at /apis\r\n\tapiGroupList = &metav1.APIGroupList{}\r\n\t// 再请求 https://192.168.0.4:6443/apis，获得其他的组\r\n\terr = d.restClient.Get().AbsPath("/apis").Do().Into(apiGroupList)\r\n\tif err != nil && !errors.IsNotFound(err) && !errors.IsForbidden(err) {\r\n\t\treturn nil, err\r\n\t}\r\n\t// to be compatible with a v1.0 server, if it's a 403 or 404, ignore and return whatever we got from /api\r\n\tif err != nil && (errors.IsNotFound(err) || errors.IsForbidden(err)) {\r\n\t\tapiGroupList = &metav1.APIGroupList{}\r\n\t}\r\n\r\n\t// prepend the group retrieved from /api to the list if not empty\r\n\tif len(v.Versions) != 0 {\r\n\t\tapiGroupList.Groups = append([]metav1.APIGroup{apiGroup}, apiGroupList.Groups...)\r\n\t}\r\n\treturn apiGroupList, nil\r\n}\r\n```\r\n\r\n<br>\r\n\r\napiGroupList 就是获取到的所有组，以及每个组所有的version信息\r\n\r\n```\r\n// APIGroupList is a list of APIGroup, to allow clients to discover the API at\r\n// 
/apis.\r\ntype APIGroupList struct {\r\n\tTypeMeta `json:\",inline\"`\r\n\t// groups is a list of APIGroup.\r\n\tGroups []APIGroup `json:\"groups\" protobuf:\"bytes,1,rep,name=groups\"`\r\n}\r\n\r\n// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object\r\n\r\n// APIGroup contains the name, the supported versions, and the preferred version\r\n// of a group.\r\ntype APIGroup struct {\r\n\tTypeMeta `json:\",inline\"`\r\n\t// name is the name of the group.\r\n\tName string `json:\"name\" protobuf:\"bytes,1,opt,name=name\"`\r\n\t// versions are the versions supported in this group.\r\n\tVersions []GroupVersionForDiscovery `json:\"versions\" protobuf:\"bytes,2,rep,name=versions\"`\r\n\t// preferredVersion is the version preferred by the API server, which\r\n\t// probably is the storage version.\r\n\t// +optional\r\n\tPreferredVersion GroupVersionForDiscovery `json:\"preferredVersion,omitempty\" protobuf:\"bytes,3,opt,name=preferredVersion\"`\r\n\t// a map of client CIDR to server address that is serving this group.\r\n\t// This is to help clients reach servers in the most network-efficient way possible.\r\n\t// Clients can use the appropriate server address as per the CIDR that they match.\r\n\t// In case of multiple matches, clients should use the longest matching CIDR.\r\n\t// The server returns only those CIDRs that it thinks that the client can match.\r\n\t// For example: the master will return an internal IP CIDR only, if the client reaches the server using an internal IP.\r\n\t// Server looks at X-Forwarded-For header or X-Real-Ip header or request.RemoteAddr (in that order) to get the client IP.\r\n\t// +optional\r\n\tServerAddressByClientCIDRs []ServerAddressByClientCIDR `json:\"serverAddressByClientCIDRs,omitempty\" protobuf:\"bytes,4,rep,name=serverAddressByClientCIDRs\"`\r\n}\r\n```\r\n\r\n直接访问 /api 和 /apis 就能获取 group, version 信息。\r\n\r\n```\r\nroot@k8s-master:~# curl https://192.168.0.4:6443/api --cert /opt/kubernetes/ssl/server.pem --key 
/opt/kubernetes/ssl/server-key.pem --cacert /opt/kubernetes/ssl/ca.pem\r\n{\r\n  \"kind\": \"APIVersions\",\r\n  \"versions\": [  //这里省略了 group=core，其实core也只是我们后来的叫法，可以认为没有group的概念。\r\n    \"v1\"\r\n  ],\r\n  \"serverAddressByClientCIDRs\": [\r\n    {\r\n      \"clientCIDR\": \"0.0.0.0/0\",\r\n      \"serverAddress\": \"192.168.0.4:6443\"\r\n    }\r\n  ]\r\n}\r\n\r\nroot@k8s-master:~# curl https://192.168.0.4:6443/apis --cert /opt/kubernetes/ssl/server.pem --key /opt/kubernetes/ssl/server-key.pem --cacert /opt/kubernetes/ssl/ca.pem\r\n{\r\n  \"kind\": \"APIGroupList\",\r\n  \"apiVersion\": \"v1\",\r\n  \"groups\": [\r\n    {\r\n      \"name\": \"apiregistration.k8s.io\",\r\n      \"versions\": [\r\n        {\r\n          \"groupVersion\": \"apiregistration.k8s.io/v1\",\r\n          \"version\": \"v1\"\r\n        },\r\n        {\r\n          \"groupVersion\": \"apiregistration.k8s.io/v1beta1\",\r\n          \"version\": \"v1beta1\"\r\n        }\r\n      ],\r\n      \"preferredVersion\": {\r\n        \"groupVersion\": \"apiregistration.k8s.io/v1\",\r\n        \"version\": \"v1\"\r\n      }\r\n    },\r\n    ...\r\n}\r\n```\r\n\r\n#### 1.2 ServerGroupsAndResources \r\n\r\n```\r\nfunc ServerGroupsAndResources(d DiscoveryInterface) ([]*metav1.APIGroup, []*metav1.APIResourceList, error) {\r\n\t\r\n\t...\r\n\tgroupVersionResources, failedGroups := fetchGroupVersionResources(d, sgs)\r\n    ...\r\n\r\n}\r\n\r\n\r\n// fetchServerResourcesForGroupVersions uses the discovery client to fetch the resources for the specified groups in parallel.\r\nfunc fetchGroupVersionResources(d DiscoveryInterface, apiGroups *metav1.APIGroupList) (map[schema.GroupVersion]*metav1.APIResourceList, map[schema.GroupVersion]error) {\r\n\r\n\tfor _, apiGroup := range apiGroups.Groups {\r\n\t\tfor _, version := range apiGroup.Versions {\r\n\r\n\t\t\t\tapiResourceList, err := d.ServerResourcesForGroupVersion(groupVersion.String())\r\n\r\n\t\t\t\r\n}\r\n\r\n\r\n// ServerResourcesForGroupVersion returns the 
supported resources for a group and version.\r\nfunc (d *DiscoveryClient) ServerResourcesForGroupVersion(groupVersion string) (resources *metav1.APIResourceList, err error) {\r\n\turl := url.URL{}\r\n\tif len(groupVersion) == 0 {\r\n\t\treturn nil, fmt.Errorf(\"groupVersion shouldn't be empty\")\r\n\t}\r\n\t// 如果是core v1，直接访问 curl https://192.168.0.4:6443/api/v1， 获得所有的资源\r\n\tif len(d.LegacyPrefix) > 0 && groupVersion == \"v1\" {\r\n\t\turl.Path = d.LegacyPrefix + \"/\" + groupVersion\r\n\t} else {\r\n\t\turl.Path = \"/apis/\" + groupVersion\r\n\t}\r\n\tresources = &metav1.APIResourceList{\r\n\t\tGroupVersion: groupVersion,\r\n\t}\r\n\terr = d.restClient.Get().AbsPath(url.String()).Do().Into(resources)\r\n\tif err != nil {\r\n\t\t// ignore 403 or 404 error to be compatible with an v1.0 server.\r\n\t\tif groupVersion == \"v1\" && (errors.IsNotFound(err) || errors.IsForbidden(err)) {\r\n\t\t\treturn resources, nil\r\n\t\t}\r\n\t\treturn nil, err\r\n\t}\r\n\treturn resources, nil\r\n}\r\n```\r\n\r\n实践：\r\n\r\n```\r\nroot@k8s-master:~# curl https://192.168.0.4:6443/api/v1 --cert /opt/kubernetes/ssl/server.pem --key /opt/kubernetes/ssl/server-key.pem --cacert /opt/kubernetes/ssl/ca.pem\r\n{  //省略了很多输出\r\n  \"kind\": \"APIResourceList\",\r\n  \"groupVersion\": \"v1\",\r\n  \"resources\": [\r\n    {\r\n      \"name\": \"bindings\",\r\n      \"singularName\": \"\",\r\n      \"namespaced\": true,\r\n      \"kind\": \"Binding\",\r\n      \"verbs\": [\r\n        \"create\"\r\n      ]\r\n    {\r\n      \"name\": \"pods\",\r\n      \"singularName\": \"\",\r\n      \"namespaced\": true,\r\n      \"kind\": \"Pod\",\r\n      \"verbs\": [\r\n        \"create\",\r\n        \"delete\",\r\n        \"deletecollection\",\r\n        \"get\",\r\n        \"list\",\r\n        \"patch\",\r\n        \"update\",\r\n        \"watch\"\r\n      ],\r\n      \"shortNames\": [\r\n        \"po\"\r\n      ],\r\n      \"categories\": [\r\n        \"all\"\r\n      ],\r\n      \"storageVersionHash\": 
\"xPOwRZ+Yhw8=\"\r\n }\r\n```\r\n\r\n#### 1.3 缓存\r\n\r\nDiscoveryClient可以将资源相关信息存储于本地，默认存储位置为~/.kube/cache和~/.kube/http-cache。缓存可以减轻client-go对Kubernetes API Server的访问压力。默认每10分钟与Kubernetes API Server同步一次，同步周期较长，因为资源组、资源版本、资源信息一般很少变动。本地缓存的DiscoveryClient如图5-4所示。\r\n\r\nDiscoveryClient第一次获取资源组、资源版本、资源信息时，首先会查询本地缓存，如果数据不存在（没有命中）则请求Kubernetes API Server接口（回源），Cache将Kubernetes API Server响应的数据在本地存储一份并返回给DiscoveryClient。当下一次DiscoveryClient再次获取资源信息时，会将数据直接从本地缓存返回（命中）给DiscoveryClient。本地缓存的默认存储周期为10分钟。代码示例如下：\r\n\r\nstaging/src/k8s.io/client-go/discovery/cached/disk/cached_discovery.go\r\n\r\n```\r\nfunc (d *CachedDiscoveryClient) getCachedFile(filename string) ([]byte, error) {\r\n\t// after invalidation ignore cache files not created by this process\r\n\td.mutex.Lock()\r\n\t_, ourFile := d.ourFiles[filename]\r\n\tif d.invalidated && !ourFile {\r\n\t\td.mutex.Unlock()\r\n\t\treturn nil, errors.New(\"cache invalidated\")\r\n\t}\r\n\td.mutex.Unlock()\r\n\r\n\tfile, err := os.Open(filename)\r\n\tif err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\tdefer file.Close()\r\n\r\n\tfileInfo, err := file.Stat()\r\n\tif err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\r\n\tif time.Now().After(fileInfo.ModTime().Add(d.ttl)) {\r\n\t\treturn nil, errors.New(\"cache expired\")\r\n\t}\r\n\r\n\t// the cache is present and its valid.  
Try to read and use it.\r\n\tcachedBytes, err := ioutil.ReadAll(file)\r\n\tif err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\r\n\td.mutex.Lock()\r\n\tdefer d.mutex.Unlock()\r\n\td.fresh = d.fresh && ourFile\r\n\r\n\treturn cachedBytes, nil\r\n}\r\n```\r\n\r\n#### 1.4 Example\r\n\r\n```\r\npackage main\r\n\r\nimport (\r\n    \"fmt\"\r\n\r\n    \"k8s.io/apimachinery/pkg/runtime/schema\"\r\n    \"k8s.io/client-go/discovery\"\r\n    \"k8s.io/client-go/tools/clientcmd\"\r\n)\r\n\r\nfunc main() {\r\n    // Load the kubeconfig file and build the config object\r\n    config, err := clientcmd.BuildConfigFromFlags(\"\", \"D:\\\\coding\\\\config\")\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    // discovery.NewDiscoveryClientForConfig instantiates the discoveryClient object from the config\r\n    discoveryClient, err := discovery.NewDiscoveryClientForConfig(config)\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    // discoveryClient.ServerGroupsAndResources returns the resource groups, versions, and resource information supported by the API Server\r\n    _, APIResourceList, err := discoveryClient.ServerGroupsAndResources()\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    // Print every resource\r\n    for _, list := range APIResourceList {\r\n        gv, err := schema.ParseGroupVersion(list.GroupVersion)\r\n        if err != nil {\r\n            panic(err)\r\n        }\r\n\r\n        for _, resource := range list.APIResources {\r\n            fmt.Printf(\"NAME: %v, GROUP: %v, VERSION: %v \\n\", resource.Name, gv.Group, gv.Version)\r\n        }\r\n    }\r\n}\r\n\r\n\r\n// Test run\r\n go run .\\discoveryClient-example.go\r\nNAME: bindings, GROUP: , VERSION: v1 \r\nNAME: componentstatuses, GROUP: , VERSION: v1 \r\nNAME: configmaps, GROUP: , VERSION: v1\r\nNAME: endpoints, GROUP: , VERSION: v1\r\nNAME: events, GROUP: , VERSION: v1\r\nNAME: limitranges, GROUP: , VERSION: v1\r\nNAME: namespaces, GROUP: , VERSION: v1\r\nNAME: namespaces/finalize, GROUP: , VERSION: v1\r\nNAME: namespaces/status, GROUP: , VERSION: v1\r\nNAME: nodes, GROUP: , VERSION: v1\r\nNAME: 
nodes/proxy, GROUP: , VERSION: v1\r\nNAME: nodes/status, GROUP: , VERSION: v1\r\nNAME: persistentvolumeclaims, GROUP: , VERSION: v1\r\nNAME: persistentvolumeclaims/status, GROUP: , VERSION: v1\r\nNAME: persistentvolumes, GROUP: , VERSION: v1\r\nNAME: persistentvolumes/status, GROUP: , VERSION: v1\r\nNAME: pods, GROUP: , VERSION: v1\r\nNAME: pods/attach, GROUP: , VERSION: v1\r\nNAME: pods/binding, GROUP: , VERSION: v1\r\nNAME: pods/eviction, GROUP: , VERSION: v1\r\nNAME: pods/exec, GROUP: , VERSION: v1\r\nNAME: pods/log, GROUP: , VERSION: v1\r\nNAME: pods/portforward, GROUP: , VERSION: v1\r\nNAME: pods/proxy, GROUP: , VERSION: v1\r\nNAME: pods/status, GROUP: , VERSION: v1\r\nNAME: podtemplates, GROUP: , VERSION: v1\r\nNAME: replicationcontrollers, GROUP: , VERSION: v1\r\nNAME: replicationcontrollers/scale, GROUP: , VERSION: v1\r\nNAME: replicationcontrollers/status, GROUP: , VERSION: v1\r\nNAME: resourcequotas, GROUP: , VERSION: v1\r\nNAME: resourcequotas/status, GROUP: , VERSION: v1\r\nNAME: secrets, GROUP: , VERSION: v1\r\nNAME: serviceaccounts, GROUP: , VERSION: v1\r\nNAME: services, GROUP: , VERSION: v1\r\nNAME: services/proxy, GROUP: , VERSION: v1\r\nNAME: services/status, GROUP: , VERSION: v1\r\nNAME: apiservices, GROUP: apiregistration.k8s.io, VERSION: v1\r\nNAME: apiservices/status, GROUP: apiregistration.k8s.io, VERSION: v1\r\nNAME: apiservices, GROUP: apiregistration.k8s.io, VERSION: v1beta1 \r\nNAME: apiservices/status, GROUP: apiregistration.k8s.io, VERSION: v1beta1\r\nNAME: ingresses, GROUP: extensions, VERSION: v1beta1\r\nNAME: ingresses/status, GROUP: extensions, VERSION: v1beta1\r\nNAME: controllerrevisions, GROUP: apps, VERSION: v1\r\nNAME: daemonsets, GROUP: apps, VERSION: v1\r\nNAME: daemonsets/status, GROUP: apps, VERSION: v1\r\nNAME: deployments, GROUP: apps, VERSION: v1\r\nNAME: deployments/scale, GROUP: apps, VERSION: v1\r\nNAME: deployments/status, GROUP: apps, VERSION: v1\r\nNAME: replicasets, GROUP: apps, VERSION: v1\r\nNAME: 
replicasets/scale, GROUP: apps, VERSION: v1\r\nNAME: replicasets/status, GROUP: apps, VERSION: v1\r\nNAME: statefulsets, GROUP: apps, VERSION: v1\r\nNAME: statefulsets/scale, GROUP: apps, VERSION: v1\r\nNAME: statefulsets/status, GROUP: apps, VERSION: v1\r\nNAME: events, GROUP: events.k8s.io, VERSION: v1beta1\r\nNAME: tokenreviews, GROUP: authentication.k8s.io, VERSION: v1\r\nNAME: tokenreviews, GROUP: authentication.k8s.io, VERSION: v1beta1\r\nNAME: localsubjectaccessreviews, GROUP: authorization.k8s.io, VERSION: v1\r\nNAME: selfsubjectaccessreviews, GROUP: authorization.k8s.io, VERSION: v1\r\nNAME: selfsubjectrulesreviews, GROUP: authorization.k8s.io, VERSION: v1\r\nNAME: subjectaccessreviews, GROUP: authorization.k8s.io, VERSION: v1\r\nNAME: localsubjectaccessreviews, GROUP: authorization.k8s.io, VERSION: v1beta1\r\nNAME: selfsubjectaccessreviews, GROUP: authorization.k8s.io, VERSION: v1beta1\r\nNAME: selfsubjectrulesreviews, GROUP: authorization.k8s.io, VERSION: v1beta1\r\nNAME: subjectaccessreviews, GROUP: authorization.k8s.io, VERSION: v1beta1\r\nNAME: horizontalpodautoscalers, GROUP: autoscaling, VERSION: v1\r\nNAME: horizontalpodautoscalers/status, GROUP: autoscaling, VERSION: v1\r\nNAME: horizontalpodautoscalers, GROUP: autoscaling, VERSION: v2beta1\r\nNAME: horizontalpodautoscalers/status, GROUP: autoscaling, VERSION: v2beta1\r\nNAME: horizontalpodautoscalers, GROUP: autoscaling, VERSION: v2beta2\r\nNAME: horizontalpodautoscalers/status, GROUP: autoscaling, VERSION: v2beta2\r\nNAME: jobs, GROUP: batch, VERSION: v1\r\nNAME: jobs/status, GROUP: batch, VERSION: v1\r\nNAME: cronjobs, GROUP: batch, VERSION: v1beta1\r\nNAME: cronjobs/status, GROUP: batch, VERSION: v1beta1\r\nNAME: certificatesigningrequests, GROUP: certificates.k8s.io, VERSION: v1beta1\r\nNAME: certificatesigningrequests/approval, GROUP: certificates.k8s.io, VERSION: v1beta1\r\nNAME: certificatesigningrequests/status, GROUP: certificates.k8s.io, VERSION: v1beta1\r\nNAME: networkpolicies, GROUP: 
networking.k8s.io, VERSION: v1\r\nNAME: ingressclasses, GROUP: networking.k8s.io, VERSION: v1beta1\r\nNAME: ingresses, GROUP: networking.k8s.io, VERSION: v1beta1\r\nNAME: ingresses/status, GROUP: networking.k8s.io, VERSION: v1beta1\r\nNAME: poddisruptionbudgets, GROUP: policy, VERSION: v1beta1\r\nNAME: poddisruptionbudgets/status, GROUP: policy, VERSION: v1beta1\r\nNAME: podsecuritypolicies, GROUP: policy, VERSION: v1beta1\r\nNAME: clusterrolebindings, GROUP: rbac.authorization.k8s.io, VERSION: v1\r\nNAME: clusterroles, GROUP: rbac.authorization.k8s.io, VERSION: v1\r\nNAME: rolebindings, GROUP: rbac.authorization.k8s.io, VERSION: v1\r\nNAME: roles, GROUP: rbac.authorization.k8s.io, VERSION: v1\r\nNAME: clusterrolebindings, GROUP: rbac.authorization.k8s.io, VERSION: v1beta1\r\nNAME: clusterroles, GROUP: rbac.authorization.k8s.io, VERSION: v1beta1\r\nNAME: rolebindings, GROUP: rbac.authorization.k8s.io, VERSION: v1beta1\r\nNAME: roles, GROUP: rbac.authorization.k8s.io, VERSION: v1beta1\r\nNAME: csidrivers, GROUP: storage.k8s.io, VERSION: v1\r\nNAME: csinodes, GROUP: storage.k8s.io, VERSION: v1\r\nNAME: storageclasses, GROUP: storage.k8s.io, VERSION: v1\r\nNAME: volumeattachments, GROUP: storage.k8s.io, VERSION: v1\r\nNAME: volumeattachments/status, GROUP: storage.k8s.io, VERSION: v1 \r\nNAME: csidrivers, GROUP: storage.k8s.io, VERSION: v1beta1\r\nNAME: csinodes, GROUP: storage.k8s.io, VERSION: v1beta1\r\nNAME: storageclasses, GROUP: storage.k8s.io, VERSION: v1beta1\r\nNAME: volumeattachments, GROUP: storage.k8s.io, VERSION: v1beta1\r\nNAME: mutatingwebhookconfigurations, GROUP: admissionregistration.k8s.io, VERSION: v1\r\nNAME: validatingwebhookconfigurations, GROUP: admissionregistration.k8s.io, VERSION: v1\r\nNAME: mutatingwebhookconfigurations, GROUP: admissionregistration.k8s.io, VERSION: v1beta1\r\nNAME: validatingwebhookconfigurations, GROUP: admissionregistration.k8s.io, VERSION: v1beta1\r\nNAME: customresourcedefinitions, GROUP: apiextensions.k8s.io, VERSION: 
v1\r\nNAME: customresourcedefinitions/status, GROUP: apiextensions.k8s.io, VERSION: v1\r\nNAME: customresourcedefinitions, GROUP: apiextensions.k8s.io, VERSION: v1beta1\r\nNAME: customresourcedefinitions/status, GROUP: apiextensions.k8s.io, VERSION: v1beta1\r\nNAME: priorityclasses, GROUP: scheduling.k8s.io, VERSION: v1\r\nNAME: priorityclasses, GROUP: scheduling.k8s.io, VERSION: v1beta1\r\nNAME: leases, GROUP: coordination.k8s.io, VERSION: v1\r\nNAME: leases, GROUP: coordination.k8s.io, VERSION: v1beta1\r\nNAME: runtimeclasses, GROUP: node.k8s.io, VERSION: v1beta1\r\nNAME: endpointslices, GROUP: discovery.k8s.io, VERSION: v1beta1\r\n```\r\n\r\n\r\n### 2. The RESTClient client\r\n\r\nThe rest.RESTClientFor function instantiates a RESTClient object from the kubeconfig configuration. RESTClient builds the HTTP request parameters: for example, the Get function sets the request method to GET, and Post, Put, Delete, Patch, List, Watch, and other request methods are supported as well.\r\n\r\nSince the rest package is the foundation of the other three clients, it is covered in a bit more detail here.\r\n\r\n```\r\nThe rest directory is laid out as follows, with each file's purpose annotated; the code is not shown file by file.\r\n\r\n│  BUILD\r\n│  client.go            initializes the RESTClient; the initialization shows that a token-bucket rate limiter is used. Also implements Get, Put, etc., which simply set the verb field of the HTTP request\r\n\r\n│  client_test.go\r\n│  config.go            helper functions for handling kubeconfig\r\n│  config_test.go\r\n│  OWNERS\r\n│  plugin.go            plugins; judging from the code, only an auth plugin exists so far\r\n│  plugin_test.go\r\n│  request.go           functions for building and sending HTTP requests; Get, List, etc. all live here\r\n│  request_test.go \r\n│  transport.go         more HTTP request plumbing: the HTTP transport\r\n│  urlbackoff.go        backoff handling\r\n│  urlbackoff_test.go\r\n│  url_utils.go         URL handling; defines the default URL\r\n│  url_utils_test.go   \r\n│  zz_generated.deepcopy.go\r\n└─watch\r\n        BUILD\r\n        decoder.go          decodes watch event objects\r\n        decoder_test.go\r\n        encoder.go          encodes watch event objects\r\n        encoder_test.go     \r\n```\r\n\r\nRESTClient does not expose per-resource create/get interfaces directly; you have to determine the URL yourself to access a resource, as in the following example:\r\n\r\n```\r\npackage main\r\n\r\nimport (\r\n    \"fmt\"\r\n\r\n    corev1 \"k8s.io/api/core/v1\"\r\n    metav1 \"k8s.io/apimachinery/pkg/apis/meta/v1\"\r\n    \"k8s.io/client-go/kubernetes/scheme\"\r\n    \"k8s.io/client-go/rest\"\r\n    
\"k8s.io/client-go/tools/clientcmd\"\r\n)\r\n\r\nfunc main() {\r\n    // Load the kubeconfig file and build the config object\r\n    config, err := clientcmd.BuildConfigFromFlags(\"\", \"D:\\\\coding\\\\config\")\r\n\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n    // Configure the API path and the requested resource group/version\r\n    config.APIPath = \"api\"\r\n    config.GroupVersion = &corev1.SchemeGroupVersion\r\n    config.NegotiatedSerializer = scheme.Codecs\r\n\r\n    // rest.RESTClientFor() builds the RESTClient object; internally it rate-limits requests with a token-bucket algorithm.\r\n    restClient, err := rest.RESTClientFor(config)\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    // Build the request with RESTClient: list all pod resources in the default namespace\r\n    result := &corev1.PodList{}\r\n    err = restClient.Get().\r\n        Namespace(\"default\").\r\n        Resource(\"pods\").\r\n        VersionedParams(&metav1.ListOptions{Limit: 500}, scheme.ParameterCodec).\r\n        Do().\r\n        Into(result)\r\n\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    for _, d := range result.Items {\r\n        fmt.Printf(\"NAMESPACE:%v \\t NAME: %v \\t STATUS: %v\\n\", d.Namespace, d.Name, d.Status.Phase)\r\n    }\r\n}\r\n\r\n// Test run\r\ngo run .\\restClient-example.go\r\nNAMESPACE:default        NAME: nginx-deployment-6b474476c4-lpld7         STATUS: Running\r\nNAMESPACE:default        NAME: nginx-deployment-6b474476c4-t6xl4         STATUS: Running\r\n```\r\n\r\nTaking this example: typical usage is restClient.Get().XX.XX.Do().Into(result), which always ends up in the Do and Into functions.\r\n\r\nThe intermediate calls (the XX above), such as the VersionedParams function, add query options (e.g. Limit, TimeoutSeconds) to the request parameters. The Do function then executes the request and obtains the result.\r\n\r\nInto decodes the response body and stores it in the result object.\r\n\r\n```\r\n// Do formats and executes the request. 
Returns a Result object for easy response\r\n// processing.\r\n//\r\n// Error type:\r\n//  * If the server responds with a status: *errors.StatusError or *errors.UnexpectedObjectError\r\n//  * http.Client.Do errors are returned directly.\r\nfunc (r *Request) Do() Result {\r\n\tif err := r.tryThrottle(); err != nil {\r\n\t\treturn Result{err: err}\r\n\t}\r\n\r\n\tvar result Result\r\n\terr := r.request(func(req *http.Request, resp *http.Response) {\r\n\t\tresult = r.transformResponse(resp, req)\r\n\t})\r\n\tif err != nil {\r\n\t\treturn Result{err: err}\r\n\t}\r\n\treturn result\r\n}\r\n\r\n\r\n// Into stores the result into obj, if possible. If obj is nil it is ignored.\r\n// If the returned object is of type Status and has .Status != StatusSuccess, the\r\n// additional information in Status will be used to enrich the error.\r\nfunc (r Result) Into(obj runtime.Object) error {\r\n\tif r.err != nil {\r\n\t\t// Check whether the result has a Status object in the body and prefer that.\r\n\t\treturn r.Error()\r\n\t}\r\n\tif r.decoder == nil {\r\n\t\treturn fmt.Errorf(\"serializer for %s doesn't exist\", r.contentType)\r\n\t}\r\n\tif len(r.body) == 0 {\r\n\t\treturn fmt.Errorf(\"0-length response with status code: %d and content type: %s\",\r\n\t\t\tr.statusCode, r.contentType)\r\n\t}\r\n\r\n\tout, _, err := r.decoder.Decode(r.body, nil, obj)\r\n\tif err != nil || out == obj {\r\n\t\treturn err\r\n\t}\r\n\t// if a different object is returned, see if it is Status and avoid double decoding\r\n\t// the object.\r\n\tswitch t := out.(type) {\r\n\tcase *metav1.Status:\r\n\t\t// any status besides StatusSuccess is considered an error.\r\n\t\tif t.Status != metav1.StatusSuccess {\r\n\t\t\treturn errors.FromObject(t)\r\n\t\t}\r\n\t}\r\n\treturn nil\r\n}\r\n```\r\n\r\n<br>\r\n\r\n### 
3. The ClientSet client\r\n\r\nRESTClient is the most basic kind of client: to use it you must specify the Resource and Version information yourself, which means knowing in advance which Group and corresponding Version a Resource belongs to. ClientSet is more convenient by comparison, and it is what developers normally use when building on top of Kubernetes.\r\n\r\nClientSet corresponds to the client-go/kubernetes directory\r\n\r\nIts core files and subdirectories are:\r\n\r\n```\r\n│  BUILD\r\n│  clientset.go      functions that define and initialize the clientset    \r\n│  typed directory    defines Get, List, etc. for every built-in resource\r\n│  scheme            \r\n```\r\n\r\n#### 3.1 The Clientset definition\r\n\r\n```\r\n// Clientset contains the clients for groups. Each group has exactly one\r\n// version included in a Clientset.\r\ntype Clientset struct {\r\n\t*discovery.DiscoveryClient\r\n\tadmissionregistrationV1      *admissionregistrationv1.AdmissionregistrationV1Client\r\n\tadmissionregistrationV1beta1 *admissionregistrationv1beta1.AdmissionregistrationV1beta1Client\r\n\tappsV1                       *appsv1.AppsV1Client\r\n    ...\r\n\tcoreV1                       *corev1.CoreV1Client\r\n\t...\r\n}\r\n\r\nCoreV1Client is really just a rest client underneath\r\n// CoreV1Client is used to interact with features provided by the  group.\r\ntype CoreV1Client struct {\r\n\trestClient rest.Interface\r\n}\r\n\r\nIt merely wraps a number of extra helper functions\r\nfunc (c *CoreV1Client) Pods(namespace string) PodInterface {\r\n\treturn newPods(c, namespace)\r\n}\r\n```\r\n\r\n<br>\r\n\r\nstaging/src/k8s.io/client-go/kubernetes/typed/core/v1/pod.go\r\n\r\nLook at one concrete resource file under the typed directory, taking Get as an example; it is clearly just a packaged-up version of the RESTClient style.\r\n\r\n```\r\n// Get takes name of the pod, and returns the corresponding pod object, and an error if there is any.\r\nfunc (c *pods) Get(name string, options metav1.GetOptions) (result *v1.Pod, err error) {\r\n\tresult = &v1.Pod{}\r\n\terr = c.client.Get().\r\n\t\tNamespace(c.ns).\r\n\t\tResource(\"pods\").\r\n\t\tName(name).\r\n\t\tVersionedParams(&options, scheme.ParameterCodec).\r\n\t\tDo().\r\n\t\tInto(result)\r\n\treturn\r\n}\r\n```\r\n\r\nThe benefit is that each call site becomes a little simpler.\r\n\r\nAs the following example shows, ClientSet builds a client via NewForConfig and is much more convenient to use.\r\n\r\n```\r\npackage main\r\n\r\nimport (\r\n    
\"fmt\"\r\n\r\n    apiv1 \"k8s.io/api/core/v1\"\r\n    metav1 \"k8s.io/apimachinery/pkg/apis/meta/v1\"\r\n    \"k8s.io/client-go/kubernetes\"\r\n    \"k8s.io/client-go/tools/clientcmd\"\r\n)\r\n\r\nfunc main() {\r\n    // Load the kubeconfig file and build the config object\r\n    config, err := clientcmd.BuildConfigFromFlags(\"\", \"D:\\\\coding\\\\config\")\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    // kubernetes.NewForConfig instantiates the ClientSet object from the config\r\n    clientset, err := kubernetes.NewForConfig(config)\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    // Request the Pods resource object under the core resource group, version v1\r\n    podClient := clientset.CoreV1().Pods(apiv1.NamespaceDefault)\r\n    // Set the list options\r\n    list, err := podClient.List(metav1.ListOptions{Limit: 500})\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    for _, d := range list.Items {\r\n        fmt.Printf(\"NAMESPACE: %v \\t NAME:%v \\t STATUS: %+v\\n\", d.Namespace, d.Name, d.Status.Phase)\r\n    }\r\n}\r\n\r\n// Test run\r\ngo run .\\clientSet-example.go\r\n\r\nNAMESPACE: default       NAME:nginx-deployment-6b474476c4-lpld7          STATUS: Running\r\nNAMESPACE: default       NAME:nginx-deployment-6b474476c4-t6xl4          STATUS: Running\r\n```\r\n\r\n<br>\r\n\r\n### 4. The DynamicClient client\r\n\r\nDynamicClient is a dynamic client that can perform RESTful operations on arbitrary Kubernetes resources, including CRD custom resources. It is used much like ClientSet: it likewise wraps RESTClient and likewise provides Create, Update, Delete, Get, List, Watch, Patch, and other methods. The biggest difference from ClientSet is that ClientSet can only access the built-in Kubernetes resources (the resources compiled into the client set) and cannot access CRD custom resources directly: ClientSet pre-generates the operations for every Resource and Version, and all of its data is structured (the data structures are known in advance). DynamicClient, on the other hand, is built on Unstructured, which handles non-structured data (data whose structure cannot be known in advance); this is the key to DynamicClient's ability to handle CRD custom resources.\r\n\r\nThe dynamic directory is laid out as follows:\r\n\r\n```\r\n│  BUILD\r\n│  client_test.go\r\n│  interface.go\r\n│  scheme.go\r\n│  simple.go                arguably better named dynamicClient.go: it defines and initializes the dynamic client, then implements the Update, Get, etc. functions\r\n│\r\n├─dynamicinformer\r\n│      BUILD\r\n│      informer.go          defines the Informer for the dynamic type; the built-in resources all have theirs defined in the informer directory\r\n│      
informer_test.go\r\n│      interface.go\r\n│\r\n├─dynamiclister\r\n│      BUILD\r\n│      interface.go\r\n│      lister.go            defines the lister for the dynamic type; the built-in resources all have theirs defined in the lister directory\r\n│      lister_test.go\r\n│      shim.go\r\n│\r\n└─fake\r\n        BUILD\r\n        simple.go\r\n        simple_test.go\r\n```\r\n\r\nstaging/src/k8s.io/client-go/dynamic/simple.go\r\n\r\nTaking Get as an example, here is how it is implemented; it works just like ClientSet.\r\n\r\n```\r\nfunc (c *dynamicResourceClient) Get(name string, opts metav1.GetOptions, subresources ...string) (*unstructured.Unstructured, error) {\r\n   if len(name) == 0 {\r\n      return nil, fmt.Errorf(\"name is required\")\r\n   }\r\n   // Assemble the REST URL\r\n   result := c.client.client.Get().AbsPath(append(c.makeURLSegments(name), subresources...)...).SpecificallyVersionedParams(&opts, dynamicParameterCodec, versionV1).Do()\r\n   if err := result.Error(); err != nil {\r\n      return nil, err\r\n   }\r\n   retBytes, err := result.Raw()\r\n   if err != nil {\r\n      return nil, err\r\n   }\r\n   // The result is always received as an unstructured.Unstructured\r\n   uncastObj, err := runtime.Decode(unstructured.UnstructuredJSONScheme, retBytes)\r\n   if err != nil {\r\n      return nil, err\r\n   }\r\n   return uncastObj.(*unstructured.Unstructured), nil\r\n}\r\n```\r\n\r\nInformers and listers are covered separately in the next section\r\n\r\n**Notes:**\r\n\r\n* Everything DynamicClient returns is a generic object; it is stored as unstructured\r\n\r\n* DynamicClient is not type-safe, so take special care when accessing CRD custom resources; mishandling a pointer, for example, can crash the program.\r\n\r\n* To use an informer with DynamicClient, you must use NewFilteredDynamicSharedInformerFactory\r\n\r\n  ```\r\n  \tf := dynamicinformer.NewFilteredDynamicSharedInformerFactory(dc, 0, v1.NamespaceAll, nil)\r\n  ```\r\n\r\n```\r\npackage main\r\n\r\nimport (\r\n    \"fmt\"\r\n\r\n    apiv1 \"k8s.io/api/core/v1\"\r\n    corev1 \"k8s.io/api/core/v1\"\r\n    metav1 \"k8s.io/apimachinery/pkg/apis/meta/v1\"\r\n\r\n    \"k8s.io/apimachinery/pkg/runtime\"\r\n    \"k8s.io/apimachinery/pkg/runtime/schema\"\r\n    \"k8s.io/client-go/dynamic\"\r\n    
\"k8s.io/client-go/tools/clientcmd\"\r\n)\r\n\r\nfunc main() {\r\n    // Load the kubeconfig file and build the config object\r\n    config, err := clientcmd.BuildConfigFromFlags(\"\", \"D:\\\\coding\\\\config\")\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    // dynamic.NewForConfig instantiates the dynamicClient object from the config\r\n    dynamicClient, err := dynamic.NewForConfig(config)\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    // schema.GroupVersionResource specifies the requested group/version/resource; set the namespace and list options to obtain the pod list as an unstructured.UnstructuredList pointer\r\n    gvr := schema.GroupVersionResource{Version: \"v1\", Resource: \"pods\"}\r\n    unstructObj, err := dynamicClient.Resource(gvr).Namespace(apiv1.NamespaceDefault).List(metav1.ListOptions{Limit: 500})\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    // runtime.DefaultUnstructuredConverter converts the unstructured.UnstructuredList into a PodList\r\n    podList := &corev1.PodList{}\r\n    err = runtime.DefaultUnstructuredConverter.FromUnstructured(unstructObj.UnstructuredContent(), podList)\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    for _, d := range podList.Items {\r\n        fmt.Printf(\"NAMESPACE: %v NAME:%v \\t STATUS: %+v\\n\", d.Namespace, d.Name, d.Status.Phase)\r\n    }\r\n}\r\n\r\n// Test run\r\ngo run .\\dynamicClient-example.go\r\nNAMESPACE: default NAME:nginx-deployment-6b474476c4-lpld7        STATUS: Running\r\nNAMESPACE: default NAME:nginx-deployment-6b474476c4-t6xl4        STATUS: Running\r\n```\r\n\r\n<br>"
  },
  {
    "path": "k8s/client-go/3. apiserver中的list-watch机制.md",
"content": "Table of Contents\r\n=================\r\n\r\n  * [1. Background](#1-background)\r\n  * [2. The list-watch mechanism](#2-the-list-watch-mechanism)\r\n     * [2.1 How real-time delivery is achieved](#21-how-real-time-delivery-is-achieved)\r\n     * [2.2 How ordering is guaranteed](#22-how-ordering-is-guaranteed)\r\n     * [2.3 How message reliability is achieved](#23-how-message-reliability-is-achieved)\r\n     * [2.4 How the performance problem is solved](#24-how-the-performance-problem-is-solved)\r\n  * [3. Summary](#3-summary)\r\n\r\n\r\n### 1. Background\r\n\r\nclient-go is really just a client. We hear about list-watch all the time, but it is actually implemented by the apiserver: when the apiserver registers the create, update, delete, etc. handlers for a resource object, it registers the list-watch implementation as well.\r\n\r\nSo before studying how client-go handles list and watch, let's first understand the apiserver's list-watch mechanism\r\n\r\n### 2. The list-watch mechanism\r\n\r\n`List-watch` is the unified asynchronous message-handling mechanism of `K8S`. It guarantees real-time delivery, reliability, ordering, performance, and so on, laying a solid foundation for the declarative-style `API`. It is an elegant way to communicate and the essence of the `K8S` architecture.\r\n\r\n#### 2.1 How real-time delivery is achieved\r\n\r\nClient/server synchronization generally falls into two broad categories: either the client polls the server, or the server actively pushes notifications.\r\n\r\nk8s uses the second approach, server-initiated notification, implemented here with the watch mechanism.\r\n\r\nlist and watch are really just special GET interfaces.\r\n\r\nTo GET all pods in the default namespace:  curl http://xxx:port/api/v1/namespaces/default/pods\r\n\r\nTo watch all pods in the default namespace:  curl http://XXX/api/v1/namespaces/default/pods?watch=true   the only difference is the extra watch=true parameter\r\n\r\n```\r\nroot:/# curl http://XXX/api/v1/namespaces/default/pods?watch=true\r\n{\"type\":\"ADDED\",\"object\":{\"kind\":\"Pod\",\"apiVersion\":\"v1\",\"metadata\":{\"name\":\"zx-vpa-786d4b8bb5-xv5zw\",\"generateName\":\"zx-vpa-786d4b8bb5-\",\"namespace\":\"default\",\"selfLink\":\"/api/v1/namespaces/default/pods/zx-vpa-786d4b8bb5-xv5zw\",\"uid\":\"639944b7-3495-4fbb-a21d-cbc7f4d6f7a5\",\"resourceVersion\":\"157197390\",\"creationTimestamp\":\"2021-11-12T10:59:39Z\",\"labels\":{\"app\":\"zx-vpa-test\",\"pod-template-hash\":\"786d4b8bb5\"},\"annotations\":{\"v2-fixed-ip\":\"\",\"v2-subnet\":\"faf7c8b0-55c3-42c7-ba27-ad90290a9cd9\",\"v2-tenant\":\"\",\"v2-vpc\":\"6af350be-c456-44bc-909d-4b92c48b3b54\",\"vpaObservedContainers\":\"zx-vpa, zx-vpa2\",\"vpaUpdates\":\"Pod resources updated by hamster-vpa: container 0: memory request, cpu request, memory limit, cpu limit; container 1: cpu request, memory request, cpu limit, memory 
limit\"},\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"kind\":\"ReplicaSet\",\"name\":\"zx-vpa-786d4b8bb5\",\"uid\":\"8199639c-40fc-4dc5-81c3-d3faff7f6b4c\",\"controller\":true,\"blockOwnerDeletion\":true}]},\"spec\":{\"volumes\":[{\"name\":\"default-token-dbxf8\",\"secret\":{\"secretName\":\"default-token-dbxf8\",\"defaultMode\":420}}],\"containers\":[{\"name\":\"zx-vpa\",\"image\":\"dockerhub.nie.netease.com/fanqihong/ubuntu:stress\",\"command\":[\"sleep\",\"36000\"],\"resources\":{\"limits\":{\"cpu\":\"12m\",\"memory\":\"131072k\"},\"requests\":{\"cpu\":\"12m\",\"memory\":\"131072k\"}},\"volumeMounts\":[{\"name\":\"default-token-dbxf8\",\"readOnly\":true,\"mountPath\":\"/var/run/secrets/kubernetes.io/serviceaccount\"}],\"terminationMessagePath\":\"/dev/termination-log\",\"terminationMessagePolicy\":\"File\",\"imagePullPolicy\":\"IfNotPresent\"},{\"name\":\"zx-vpa2\",\"image\":\"ncr.nie.netease.com/zouxiang/testcpu:v1\",\"command\":[\"sleep\",\"36000\"],\"resources\":{\"limits\":{\"cpu\":\"12m\",\"memory\":\"131072k\"},\"requests\":{\"cpu\":\"12m\",\"memory\":\"131072k\"}},\"volumeMounts\":[{\"name\":\"default-token-dbxf8\",\"readOnly\":true,\"mountPath\":\"/var/run/secrets/kubernetes.io/serviceaccount\"}],\"terminationMessagePath\":\"/dev/termination-log\",\"terminationMessagePolicy\":\"File\",\"imagePullPolicy\":\"IfNotPresent\"}],\"restartPolicy\":\"Always\",\"terminationGracePeriodSeconds\":5,\"dnsPolicy\":\"ClusterFirst\",\"serviceAccountName\":\"default\",\"serviceAccount\":\"default\",\"nodeName\":\"7.34.19.14\",\"hostNetwork\":true,\"securityContext\":{},\"schedulerName\":\"default-scheduler\",\"enableServiceLinks\":true},\"status\":{\"phase\":\"Running\",\"conditions\":[{\"type\":\"Initialized\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2021-11-12T10:59:39Z\"},{\"type\":\"Ready\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2021-11-14T02:59:47Z\"},{\"type\":\"ContainersReady\",\"status\":\"True\",\"las
tProbeTime\":null,\"lastTransitionTime\":\"2021-11-14T02:59:47Z\"},{\"type\":\"PodScheduled\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2021-11-12T10:59:39Z\"}],\"hostIP\":\"7.34.19.14\",\"podIP\":\"7.34.19.14\",\"podIPs\":[{\"ip\":\"7.34.19.14\"}],\"startTime\":\"2021-11-12T10:59:39Z\",\"containerStatuses\":[{\"name\":\"zx-vpa\",\"state\":{\"running\":{\"startedAt\":\"2021-11-14T02:59:46Z\"}},\"lastState\":{\"terminated\":{\"exitCode\":0,\"reason\":\"Completed\",\"startedAt\":\"2021-11-13T16:59:45Z\",\"finishedAt\":\"2021-11-14T02:59:45Z\",\"containerID\":\"docker://87a70d2061b7fb37b0f97be3a4f9d44b345fbd54be3dcc4d8a61879dd5c6a127\"}},\"ready\":true,\"restartCount\":4,\"image\":\"dockerhub.nie.netease.com/fanqihong/ubuntu:stress\",\"imageID\":\"docker-pullable://dockerhub.nie.netease.com/fanqihong/ubuntu@sha256:ac49d16f9686c2acd351d436ed7154311e4dba50ed8c18b6abaa578dde696440\",\"containerID\":\"docker://bc586f53f363e9afb08c7a214eef06c8c1202f72439fc972d4c7d6177cfb8e63\",\"started\":true},{\"name\":\"zx-vpa2\",\"state\":{\"running\":{\"startedAt\":\"2021-11-14T02:59:47Z\"}},\"lastState\":{\"terminated\":{\"exitCode\":0,\"reason\":\"Completed\",\"startedAt\":\"2021-11-13T16:59:46Z\",\"finishedAt\":\"2021-11-14T02:59:46Z\",\"containerID\":\"docker://37d8dd54be6d27ed9f055049e700f12fa4aa30ec29f2fd16fd5176218b2acce9\"}},\"ready\":true,\"restartCount\":4,\"image\":\"ncr.nie.netease.com/zouxiang/testcpu:v1\",\"imageID\":\"docker-pullable://ncr.nie.netease.com/zouxiang/testcpu@sha256:4560824247d61f92c0d4b62224fdb3efc47560339ff05c92f73d6c731eba2717\",\"containerID\":\"docker://9ccb7968bee2c155c472e03d56a5987c9cf7e6833a4cb125084ceb19158474ed\",\"started\":true}],\"qosClass\":\"Guaranteed\"}}}\r\n{\"type\":\"ADDED\",\"object\":{\"kind\":\"Pod\",\"apiVersion\":\"v1\",\"metadata\":{\"name\":\"zx-vpa-786d4b8bb5-mw6mr\",\"generateName\":\"zx-vpa-786d4b8bb5-\",\"namespace\":\"default\",\"selfLink\":\"/api/v1/namespaces/default/pods/zx-vpa-786d4b8bb5-mw6mr\",
\"uid\":\"4e7f3a44-7483-434d-a917-52b37c0eae33\",\"resourceVersion\":\"157192079\",\"creationTimestamp\":\"2021-11-12T10:52:37Z\",\"labels\":{\"app\":\"zx-vpa-test\",\"pod-template-hash\":\"786d4b8bb5\"},\"annotations\":{\"v2-fixed-ip\":\"\",\"v2-subnet\":\"faf7c8b0-55c3-42c7-ba27-ad90290a9cd9\",\"v2-tenant\":\"\",\"v2-vpc\":\"6af350be-c456-44bc-909d-4b92c48b3b54\",\"vpaObservedContainers\":\"zx-vpa, zx-vpa2\",\"vpaUpdates\":\"Pod resources updated by hamster-vpa: container 0: cpu request, memory request, cpu limit, memory limit; container 1: memory request, cpu request, cpu limit, memory limit\"},\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"kind\":\"ReplicaSet\",\"name\":\"zx-vpa-786d4b8bb5\",\"uid\":\"8199639c-40fc-4dc5-81c3-d3faff7f6b4c\",\"controller\":true,\"blockOwnerDeletion\":true}]},\"spec\":{\"volumes\":[{\"name\":\"default-token-dbxf8\",\"secret\":{\"secretName\":\"default-token-dbxf8\",\"defaultMode\":420}}],\"containers\":[{\"name\":\"zx-vpa\",\"image\":\"dockerhub.nie.netease.com/fanqihong/ubuntu:stress\",\"command\":[\"sleep\",\"36000\"],\"resources\":{\"limits\":{\"cpu\":\"12m\",\"memory\":\"131072k\"},\"requests\":{\"cpu\":\"12m\",\"memory\":\"131072k\"}},\"volumeMounts\":[{\"name\":\"default-token-dbxf8\",\"readOnly\":true,\"mountPath\":\"/var/run/secrets/kubernetes.io/serviceaccount\"}],\"terminationMessagePath\":\"/dev/termination-log\",\"terminationMessagePolicy\":\"File\",\"imagePullPolicy\":\"IfNotPresent\"},{\"name\":\"zx-vpa2\",\"image\":\"ncr.nie.netease.com/zouxiang/testcpu:v1\",\"command\":[\"sleep\",\"36000\"],\"resources\":{\"limits\":{\"cpu\":\"12m\",\"memory\":\"131072k\"},\"requests\":{\"cpu\":\"12m\",\"memory\":\"131072k\"}},\"volumeMounts\":[{\"name\":\"default-token-dbxf8\",\"readOnly\":true,\"mountPath\":\"/var/run/secrets/kubernetes.io/serviceaccount\"}],\"terminationMessagePath\":\"/dev/termination-log\",\"terminationMessagePolicy\":\"File\",\"imagePullPolicy\":\"IfNotPresent\"}],\"restartPolicy\":\"Always\",\"termination
GracePeriodSeconds\":5,\"dnsPolicy\":\"ClusterFirst\",\"serviceAccountName\":\"default\",\"serviceAccount\":\"default\",\"nodeName\":\"10.90.67.175\",\"hostNetwork\":true,\"securityContext\":{},\"schedulerName\":\"default-scheduler\",\"enableServiceLinks\":true},\"status\":{\"phase\":\"Running\",\"conditions\":[{\"type\":\"Initialized\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2021-11-12T10:52:37Z\"},{\"type\":\"Ready\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2021-11-14T02:52:54Z\"},{\"type\":\"ContainersReady\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2021-11-14T02:52:54Z\"},{\"type\":\"PodScheduled\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2021-11-12T10:52:37Z\"}],\"hostIP\":\"10.90.67.175\",\"podIP\":\"10.90.67.175\",\"podIPs\":[{\"ip\":\"10.90.67.175\"}],\"startTime\":\"2021-11-12T10:52:37Z\",\"containerStatuses\":[{\"name\":\"zx-vpa\",\"state\":{\"running\":{\"startedAt\":\"2021-11-14T02:52:50Z\"}},\"lastState\":{\"terminated\":{\"exitCode\":0,\"reason\":\"Completed\",\"startedAt\":\"2021-11-13T16:52:47Z\",\"finishedAt\":\"2021-11-14T02:52:47Z\",\"containerID\":\"docker://ddf625ee9c90ba70ba5f1d27caa4d61ded938143a724dccbfada898271ac7fd0\"}},\"ready\":true,\"restartCount\":4,\"image\":\"dockerhub.nie.netease.com/fanqihong/ubuntu:stress\",\"imageID\":\"docker-pullable://dockerhub.nie.netease.com/fanqihong/ubuntu@sha256:ac49d16f9686c2acd351d436ed7154311e4dba50ed8c18b6abaa578dde696440\",\"containerID\":\"docker://854091543fc0ba88d3dc4a839f8014d21ecbaecf4e40221d2ce9d6a343ddbe29\",\"started\":true},{\"name\":\"zx-vpa2\",\"state\":{\"running\":{\"startedAt\":\"2021-11-14T02:52:53Z\"}},\"lastState\":{\"terminated\":{\"exitCode\":0,\"reason\":\"Completed\",\"startedAt\":\"2021-11-13T16:52:50Z\",\"finishedAt\":\"2021-11-14T02:52:50Z\",\"containerID\":\"docker://0241d05357e1f6b8eec73810341bddf14479a168aaa7958ee580855eb2f4300f\"}},\"ready\":true,\"restartCount\":4,\
"image\":\"ncr.nie.netease.com/zouxiang/testcpu:v1\",\"imageID\":\"docker-pullable://ncr.nie.netease.com/zouxiang/testcpu@sha256:4560824247d61f92c0d4b62224fdb3efc47560339ff05c92f73d6c731eba2717\",\"containerID\":\"docker://ad0d09158c0db8b10d2c282f2749c1c1e97fdb707e10392b9033aa619b162450\",\"started\":true}],\"qosClass\":\"Guaranteed\"}}}\r\n\r\n\r\n\r\n\r\n\r\n// As soon as an object changes, an event is received. There are three event types: MODIFIED, ADDED, DELETED\r\n{\"type\":\"MODIFIED\",\"object\":{\"kind\":\"Pod\",\"apiVersion\":\"v1\",\"metadata\":{\"name\":\"zx-vpa-786d4b8bb5-xv5zw\",\"generateName\":\"zx-vpa-786d4b8bb5-\",\"namespace\":\"default\",\"selfLink\":\"/api/v1/namespaces/default/pods/zx-vpa-786d4b8bb5-xv5zw\",\"uid\":\"639944b7-3495-4fbb-a21d-cbc7f4d6f7a5\",\"resourceVersion\":\"157401529\",\"creationTimestamp\":\"2021-11-12T10:59:39Z\",\"labels\":{\"app\":\"zx-vpa-test\",\"pod-template-hash\":\"786d4b8bb5\"},\"annotations\":{\"v2-fixed-ip\":\"\",\"v2-subnet\":\"faf7c8b0-55c3-42c7-ba27-ad90290a9cd9\",\"v2-tenant\":\"\",\"v2-vpc\":\"6af350be-c456-44bc-909d-4b92c48b3b54\",\"vpaObservedContainers\":\"zx-vpa, zx-vpa2\",\"vpaUpdates\":\"11111111111Pod resources updated by hamster-vpa: container 0: memory request, cpu request, memory limit, cpu limit; container 1: cpu request, memory request, cpu limit, memory 
limit\"},\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"kind\":\"ReplicaSet\",\"name\":\"zx-vpa-786d4b8bb5\",\"uid\":\"8199639c-40fc-4dc5-81c3-d3faff7f6b4c\",\"controller\":true,\"blockOwnerDeletion\":true}]},\"spec\":{\"volumes\":[{\"name\":\"default-token-dbxf8\",\"secret\":{\"secretName\":\"default-token-dbxf8\",\"defaultMode\":420}}],\"containers\":[{\"name\":\"zx-vpa\",\"image\":\"dockerhub.nie.netease.com/fanqihong/ubuntu:stress\",\"command\":[\"sleep\",\"36000\"],\"resources\":{\"limits\":{\"cpu\":\"12m\",\"memory\":\"131072k\"},\"requests\":{\"cpu\":\"12m\",\"memory\":\"131072k\"}},\"volumeMounts\":[{\"name\":\"default-token-dbxf8\",\"readOnly\":true,\"mountPath\":\"/var/run/secrets/kubernetes.io/serviceaccount\"}],\"terminationMessagePath\":\"/dev/termination-log\",\"terminationMessagePolicy\":\"File\",\"imagePullPolicy\":\"IfNotPresent\"},{\"name\":\"zx-vpa2\",\"image\":\"ncr.nie.netease.com/zouxiang/testcpu:v1\",\"command\":[\"sleep\",\"36000\"],\"resources\":{\"limits\":{\"cpu\":\"12m\",\"memory\":\"131072k\"},\"requests\":{\"cpu\":\"12m\",\"memory\":\"131072k\"}},\"volumeMounts\":[{\"name\":\"default-token-dbxf8\",\"readOnly\":true,\"mountPath\":\"/var/run/secrets/kubernetes.io/serviceaccount\"}],\"terminationMessagePath\":\"/dev/termination-log\",\"terminationMessagePolicy\":\"File\",\"imagePullPolicy\":\"IfNotPresent\"}],\"restartPolicy\":\"Always\",\"terminationGracePeriodSeconds\":5,\"dnsPolicy\":\"ClusterFirst\",\"serviceAccountName\":\"default\",\"serviceAccount\":\"default\",\"nodeName\":\"7.34.19.14\",\"hostNetwork\":true,\"securityContext\":{},\"schedulerName\":\"default-scheduler\",\"enableServiceLinks\":true},\"status\":{\"phase\":\"Running\",\"conditions\":[{\"type\":\"Initialized\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2021-11-12T10:59:39Z\"},{\"type\":\"Ready\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2021-11-14T02:59:47Z\"},{\"type\":\"ContainersReady\",\"status\":\"True\",\"las
tProbeTime\":null,\"lastTransitionTime\":\"2021-11-14T02:59:47Z\"},{\"type\":\"PodScheduled\",\"status\":\"True\",\"lastProbeTime\":null,\"lastTransitionTime\":\"2021-11-12T10:59:39Z\"}],\"hostIP\":\"7.34.19.14\",\"podIP\":\"7.34.19.14\",\"podIPs\":[{\"ip\":\"7.34.19.14\"}],\"startTime\":\"2021-11-12T10:59:39Z\",\"containerStatuses\":[{\"name\":\"zx-vpa\",\"state\":{\"running\":{\"startedAt\":\"2021-11-14T02:59:46Z\"}},\"lastState\":{\"terminated\":{\"exitCode\":0,\"reason\":\"Completed\",\"startedAt\":\"2021-11-13T16:59:45Z\",\"finishedAt\":\"2021-11-14T02:59:45Z\",\"containerID\":\"docker://87a70d2061b7fb37b0f97be3a4f9d44b345fbd54be3dcc4d8a61879dd5c6a127\"}},\"ready\":true,\"restartCount\":4,\"image\":\"dockerhub.nie.netease.com/fanqihong/ubuntu:stress\",\"imageID\":\"docker-pullable://dockerhub.nie.netease.com/fanqihong/ubuntu@sha256:ac49d16f9686c2acd351d436ed7154311e4dba50ed8c18b6abaa578dde696440\",\"containerID\":\"docker://bc586f53f363e9afb08c7a214eef06c8c1202f72439fc972d4c7d6177cfb8e63\",\"started\":true},{\"name\":\"zx-vpa2\",\"state\":{\"running\":{\"startedAt\":\"2021-11-14T02:59:47Z\"}},\"lastState\":{\"terminated\":{\"exitCode\":0,\"reason\":\"Completed\",\"startedAt\":\"2021-11-13T16:59:46Z\",\"finishedAt\":\"2021-11-14T02:59:46Z\",\"containerID\":\"docker://37d8dd54be6d27ed9f055049e700f12fa4aa30ec29f2fd16fd5176218b2acce9\"}},\"ready\":true,\"restartCount\":4,\"image\":\"ncr.nie.netease.com/zouxiang/testcpu:v1\",\"imageID\":\"docker-pullable://ncr.nie.netease.com/zouxiang/testcpu@sha256:4560824247d61f92c0d4b62224fdb3efc47560339ff05c92f73d6c731eba2717\",\"containerID\":\"docker://9ccb7968bee2c155c472e03d56a5987c9cf7e6833a4cb125084ceb19158474ed\",\"started\":true}],\"qosClass\":\"Guaranteed\"}}}\r\n\r\n\r\n\r\n//删除一个pod，发现会进入\r\nMODIFIED （设置deletiontimestamp）-> \r\nADDED  \"status\":{\"phase\":\"Pending\",\"qosClass\":\"Guaranteed\"}}} 新pod pending\r\nMODIFIED podScheduled\r\nMODIFIED ContainerCreating\r\nMODIFIED.. 
到pod running\r\nDELETED  删除旧Pod\r\n\r\n\r\n curl http://7.34.19.44:58201/api/v1/watch/namespaces/default/pods\r\n 看起来也是一样的效果\r\n```\r\n\r\n通过上面的实践可以发现：\r\n\r\n（1）watch其实就是一种特殊的get\r\n\r\n（2）可以看到删除操作后，对象的整个变化过程\r\n\r\n（3）watch每次都会返回type，和**完整**的对象信息\r\n\r\n#### 2.2 如何实现顺序性\r\n\r\n`K8S`在每个资源的事件中都带一个`resourceVersion`字段，这个值是单调递增的（客户端不应把它解析成具体数字，只应用它来比较事件的新旧），所以当客户端并发处理同一个资源的事件时，它就可以对比`resourceVersion`来保证最终的状态和最新的事件所期望的状态保持一致。\r\n\r\n#### 2.3 如何实现消息可靠性\r\n\r\n`list`和`watch`一起保证了消息的可靠性，避免因消息丢失而造成状态不一致的场景。具体而言，`list API`可以查询当前的资源及其对应的状态(即期望的状态)，客户端通过拿`期望的状态`和`实际的状态`进行对比，纠正状态不一致的资源。`Watch API`和`apiserver`保持一个`长连接`，接收资源的`状态变更事件`并做相应处理。如果仅调用`watch API`，若某个时间点连接中断，就有可能导致消息丢失，所以需要通过`list API`解决`消息丢失`的问题。从另一个角度出发，我们可以认为`list API`获取全量数据，`watch API`获取增量数据。虽然仅仅通过轮询`list API`也能达到同步资源状态的效果，但是存在开销大、实时性不足的问题。\r\n\r\n#### 2.4 如何解决性能问题\r\n\r\n（1）list-watch机制的结合本身就是apiserver端的一种性能优化：watch只在对象变化时推送事件，客户端无需反复全量list。（一个可以进一步思考的问题：watch的时候能否只传递更新了的字段，而不是完整对象）\r\n\r\n（2）client-go的tools/cache包在客户端做了相应的性能优化（本地缓存+索引）\r\n\r\n### 3. 总结\r\n\r\n本节主要从apiserver端探究了一下list-watch机制。接下来从client-go端源码看看具体是如何实现的。\r\n\r\n"
  },
  {
    "path": "k8s/client-go/4. client informer机制简介.md",
"content": "Table of Contents\r\n=================\r\n\r\n  * [1. informer机制简介](#1-informer机制简介)\r\n     * [1.2. informer机制 example介绍](#12-informer机制-example介绍)\r\n  * [2. informer](#2-informer)\r\n     * [2.1 shared informer](#21-shared-informer)\r\n     * [2.2 shared informer是如何实现的](#22-shared-informer是如何实现的)\r\n     * [2.3 informer和reflector的关系](#23-informer和reflector的关系)\r\n  * [3. Reflector](#3-reflector)\r\n  * [4. listAndwatcher](#4-listandwatcher)\r\n     * [4.1 list](#41-list)\r\n     * [4.2 watcher](#42-watcher)\r\n  * [5. DeltaFIFO](#5-deltafifo)\r\n     * [5.1 生产者](#51-生产者)\r\n     * [5.2 消费者](#52-消费者)\r\n     * [5.3 Resync](#53-resync)\r\n  * [6.Indexer](#6indexer)\r\n     * [6.1 Indexer索引器](#61-indexer索引器)\r\n  * [7. 总结](#7-总结)\r\n  * [8.附录](#8附录)\r\n  * [9.参考](#9参考)\r\n\r\n### 1. informer机制简介\r\n\r\n在Kubernetes系统中，组件之间通过HTTP协议进行通信。在不依赖任何中间件的情况下，消息的实时性、可靠性、顺序性是通过list-watch机制来保证的。\r\n\r\n作为客户端，client-go也实现了一套对应的list-watch逻辑，用来处理对象的变化。这套机制在client-go中就是informer机制。\r\n\r\nKubernetes的其他组件（kcm, kubelet等等）都是通过client-go的Informer机制与Kubernetes API Server进行通信的。\r\n\r\nInformer机制运行原理如图：\r\n\r\n![informer](../images/informer.png)\r\n\r\n大体流程如下：\r\n\r\n（1）new一个informer，new的时候指定了ListerWatcher（它负责从apiserver获取数据）\r\n\r\n（2）informer.Run的时候，会new一个Reflector对象。Reflector包含了ListerWatcher，接下来的工作基本都由Reflector完成\r\n\r\n（3）Reflector对list-watch来的数据进行处理：watch到的数据先进入DeltaFIFO队列，再由HandleDeltas函数一个个地处理\r\n\r\n（4）具体的处理逻辑分为两部分：第一部分是通过操作cache.indexer更新本地缓存+索引；第二部分是将数据分发给Informer上注册的自定义处理函数\r\n\r\n本节就先总结一下informer机制的大概流程，然后简单介绍一下流程中出现的几个概念。后面的章节再一个一个进行详细研究。\r\n\r\n<br>\r\n\r\n#### 1.2. 
informer机制 example介绍\r\n\r\n直接阅读Informer机制代码会比较晦涩，通过Informers Example代码示例来理解Informer，印象会更深刻。Informers Example代码示例如下：\r\n\r\n```go\r\npackage main\r\n\r\nimport (\r\n\t\"log\"\r\n\t\"time\"\r\n\r\n\t\"k8s.io/apimachinery/pkg/apis/meta/v1\"\r\n\t\"k8s.io/client-go/informers\"\r\n\t\"k8s.io/client-go/kubernetes\"\r\n\t\"k8s.io/client-go/tools/cache\"\r\n\t\"k8s.io/client-go/tools/clientcmd\"\r\n)\r\n\r\nfunc main() {\r\n\tconfig, err := clientcmd.BuildConfigFromFlags(\"\", \"/root/.kube/config\")\r\n\tif err != nil {\r\n\t\tpanic(err)\r\n\t}\r\n\r\n\tclientset, err := kubernetes.NewForConfig(config)\r\n\tif err != nil {\r\n\t\tpanic(err)\r\n\t}\r\n\r\n\tstopCh := make(chan struct{})\r\n\tdefer close(stopCh)\r\n\tsharedInformers := informers.NewSharedInformerFactory(clientset, time.Minute)\r\n\tinformer := sharedInformers.Core().V1().Pods().Informer()\r\n\r\n\tinformer.AddEventHandler(cache.ResourceEventHandlerFuncs{\r\n\t\tAddFunc: func(obj interface{}) {\r\n\t\t\tmObj := obj.(v1.Object)\r\n\t\t\tlog.Printf(\"New Pod Added to Store: %s\", mObj.GetName())\r\n\t\t},\r\n\t\tUpdateFunc: func(oldObj, newObj interface{}) {\r\n\t\t\toObj := oldObj.(v1.Object)\r\n\t\t\tnObj := newObj.(v1.Object)\r\n\t\t\tlog.Printf(\"%s Pod Updated to %s\", oObj.GetName(), nObj.GetName())\r\n\t\t},\r\n\t\tDeleteFunc: func(obj interface{}) {\r\n\t\t\tmObj := obj.(v1.Object)\r\n\t\t\tlog.Printf(\"Pod Deleted from Store: %s\", mObj.GetName())\r\n\t\t},\r\n\t})\r\n\r\n\tinformer.Run(stopCh)\r\n}\r\n```\r\n\r\n首先通过kubernetes.NewForConfig创建clientset对象，Informer需要通过ClientSet与Kubernetes API Server进行交互。另外，创建stopCh对象，该对象用于在程序进程退出之前通知Informer提前退出，因为Informer是一个持久运行的goroutine。informers.NewSharedInformerFactory函数实例化了SharedInformer对象，它接收两个参数：第1个参数clientset是用于与Kubernetes API Server交互的客户端，第2个参数time.Minute用于设置多久进行一次resync（重新同步），resync会周期性地执行List操作，将所有的资源存放在Informer Store中，如果该参数为0，则禁用resync功能。\r\n\r\n在Informers 
Example代码示例中，通过sharedInformers.Core().V1().Pods().Informer()可以得到具体Pod资源的informer对象。通过informer.AddEventHandler函数可以为Pod资源添加资源事件回调方法，支持3种资源事件回调方法，分别介绍如下。\r\n\r\n● AddFunc：当创建Pod资源对象时触发的事件回调方法。\r\n\r\n● UpdateFunc：当更新Pod资源对象时触发的事件回调方法。\r\n\r\n● DeleteFunc：当删除Pod资源对象时触发的事件回调方法。\r\n\r\n在正常的情况下，Kubernetes的其他组件在使用Informer机制时，会在资源事件回调方法中将资源对象推送到WorkQueue或其他队列中(**实际过程中大都是这样的**)；在Informers Example代码示例中，我们直接输出触发的资源事件。最后通过informer.Run函数运行当前的Informer，内部为Pod资源类型创建Informer。\r\n\r\n<br>\r\n\r\n### 2. informer\r\n\r\n每一个Kubernetes资源上都实现了Informer机制。每一个Informer上都会实现Informer和Lister方法，例如PodInformer，代码示例如下：\r\n\r\n```go\r\n// PodInformer provides access to a shared informer and lister for\r\n// Pods.\r\ntype PodInformer interface {\r\n\tInformer() cache.SharedIndexInformer\r\n\tLister() v1.PodLister\r\n}\r\n```\r\n\r\n使用不同资源的Informer，代码示例如下：\r\n\r\n```go\r\npodInformer := sharedInformers.Core().V1().Pods().Informer()\r\nnodeInformer := sharedInformers.Node().V1beta1().RuntimeClasses().Informer()\r\n```\r\n\r\n定义不同资源的Informer，允许监控不同资源的资源事件，例如，监听Node资源对象，当Kubernetes集群中有新的节点（Node）加入时，client-go能够及时收到资源对象的变更信息。\r\n\r\n<br>\r\n\r\n#### 2.1 shared informer\r\n\r\n可以认为client-go里的informer都是shared informer。\r\n\r\nInformer也被称为Shared Informer，它是可以共享使用的。在用client-go编写代码程序时，若同一资源的Informer被实例化了多次，每个Informer使用一个Reflector，那么会运行过多相同的ListAndWatch，太多重复的序列化和反序列化操作会导致Kubernetes API Server负载过重。Shared Informer可以使同一类资源Informer共享一个Reflector，这样可以节约很多资源。共享机制通过map数据结构实现：sharedInformerFactory定义了一个informers字段（map类型），用于存放所有的Informer，代码示例如下：\r\n\r\n```go\r\ntype sharedInformerFactory struct {\r\n\tclient           kubernetes.Interface\r\n\tnamespace        string\r\n\ttweakListOptions internalinterfaces.TweakListOptionsFunc\r\n\tlock             sync.Mutex\r\n\tdefaultResync    time.Duration\r\n\tcustomResync     map[reflect.Type]time.Duration\r\n\r\n\tinformers map[reflect.Type]cache.SharedIndexInformer\r\n\t// startedInformers is used for tracking which informers have been started.\r\n\t// This allows Start() to be called multiple times 
safely.\r\n\tstartedInformers map[reflect.Type]bool\r\n}\r\n```\r\n\r\ninformers字段中存储了资源类型和对应于SharedIndexInformer的映射关系。InformerFor函数添加了不同资源的Informer，在添加过程中如果已经存在同类型的资源Informer，则返回当前Informer，不再继续添加。最后通过Shared Informer的Start方法使f.informers中的每个informer通过goroutine持久运行。\r\n\r\n同一个factory定义的sharedInformer可以被复用。\r\n\r\n#### 2.2 shared informer是如何实现的\r\n\r\n从下面的接口定义可以看出来：GetStore()返回一个Store，这里就是保存从apiserver同步过来的数据的本地缓存。\r\n\r\n还有一个Run()函数，调用链为：controller.Run -> Reflector.Run -> ListAndWatch()\r\n\r\n而ListAndWatch()就是从apiserver获取数据。\r\n\r\n```go\r\n// SharedInformer has a shared data cache and is capable of distributing notifications for changes\r\n// to the cache to multiple listeners who registered via AddEventHandler. If you use this, there is\r\n// one behavior change compared to a standard Informer.  When you receive a notification, the cache\r\n// will be AT LEAST as fresh as the notification, but it MAY be more fresh.  You should NOT depend\r\n// on the contents of the cache exactly matching the notification you've received in handler\r\n// functions.  If there was a create, followed by a delete, the cache may NOT have your item.  This\r\n// has advantages over the broadcaster since it allows us to share a common cache across many\r\n// controllers. Extending the broadcaster would have required us keep duplicate caches for each\r\n// watch.\r\ntype SharedInformer interface {\r\n\t// AddEventHandler adds an event handler to the shared informer using the shared informer's resync\r\n\t// period.  Events to a single handler are delivered sequentially, but there is no coordination\r\n\t// between different handlers.\r\n\tAddEventHandler(handler ResourceEventHandler)\r\n\t// AddEventHandlerWithResyncPeriod adds an event handler to the shared informer using the\r\n\t// specified resync period.  
Events to a single handler are delivered sequentially, but there is\r\n\t// no coordination between different handlers.\r\n\tAddEventHandlerWithResyncPeriod(handler ResourceEventHandler, resyncPeriod time.Duration)\r\n\t// GetStore returns the Store.\r\n\tGetStore() Store\r\n\t// GetController gives back a synthetic interface that \"votes\" to start the informer\r\n\tGetController() Controller\r\n\t// Run starts the shared informer, which will be stopped when stopCh is closed.\r\n\tRun(stopCh <-chan struct{})\r\n\t// HasSynced returns true if the shared informer's store has synced.\r\n\tHasSynced() bool\r\n\t// LastSyncResourceVersion is the resource version observed when last synced with the underlying\r\n\t// store. The value returned is not synchronized with access to the underlying store and is not\r\n\t// thread-safe.\r\n\tLastSyncResourceVersion() string\r\n}\r\n```\r\n\r\nEventHandler：这是一个回调函数，当一个`Informer`/`SharedInformer`要分发一个对象到控制器时，会调用此函数。例如：将对象的`Key`放在`WorkQueue`中并等待后续的处理。\r\n\r\n这里先简单介绍整体的逻辑。后面再详细介绍。\r\n\r\n<br>\r\n\r\n#### 2.3 informer和reflector的关系\r\n\r\n再使用informer的时候，一般都是：\r\n\r\n（1）new 一个sharedInformerFactory对象\r\n\r\n（2）根据sharedInformerFactory生成一个informer\r\n\r\n（3）定义informer的 addFunc, deleteFunc, updateFunc函数\r\n\r\n（4）informer.Run(stopCh) 运行起来\r\n\r\n```\r\n// 在informer的Run函数中调用了controller.Run\r\nfunc (s *sharedIndexInformer) Run(stopCh <-chan struct{}) {\r\n\t\r\n\r\n\tfunc() {\r\n\t\ts.startedLock.Lock()\r\n\t\tdefer s.startedLock.Unlock()\r\n\r\n\t\ts.controller = New(cfg)\r\n\t\ts.controller.(*controller).clock = s.clock\r\n\t\ts.started = true\r\n\t}()\r\n    \r\n\ts.controller.Run(stopCh)\r\n}\r\n\r\n// run函数生成了一个reflector对象\r\n// Run begins processing items, and will continue until a value is sent down stopCh.\r\n// It's an error to call Run more than once.\r\n// Run blocks; call via go.\r\nfunc (c *controller) Run(stopCh <-chan struct{}) {\r\n    ....\r\n\tr := 
NewReflector(\r\n\t\tc.config.ListerWatcher,\r\n\t\tc.config.ObjectType,\r\n\t\tc.config.Queue,\r\n\t\tc.config.FullResyncPeriod,\r\n\t)\r\n   ...\r\n}\r\n```\r\n\r\n**可以看出来：**\r\n\r\n（1）一个informer对应一个reflector\r\n\r\n（2）reflector是informer run的时候才生成的，并且Informer list-watch都是由reflector完成的. informer只管定义add, update, del处理事件即可\r\n\r\n### 3. Reflector\r\n\r\nInformer可以对Kubernetes API Server的资源执行监控（Watch）操作，资源类型可以是Kubernetes内置资源，也可以是CRD自定义资源，其中最核心的功能是Reflector。Reflector用于监控指定资源的Kubernetes资源，当监控的资源发生变化时，触发相应的变更事件，例如Added（资源添加）事件、Updated（资源更新）事件、Deleted（资源删除）事件，并将其资源对象存放到本地缓存DeltaFIFO中。通过NewReflector实例化Reflector对象，实例化过程中须传入ListerWatcher数据接口对象，它拥有List和Watch方法，用于获取及监控资源列表。只要实现了List和Watch方法的对象都可以称为ListerWatcher。Reflector对象通过Run函数启动监控并处理监控事件。而在Reflector源码实现中，其中最主要的是ListAndWatch函数，它负责获取资源列表（List）和监控（Watch）指定的Kubernetes API Server资源。\r\n\r\n```\r\n// Reflector watches a specified resource and causes all changes to be reflected in the given store.\r\ntype Reflector struct {\r\n\t// name identifies this reflector. By default it will be a file:line if possible.\r\n\tname string\r\n\r\n\t// The name of the type we expect to place in the store. The name\r\n\t// will be the stringification of expectedGVK if provided, and the\r\n\t// stringification of expectedType otherwise. 
It is for display\r\n\t// only, and should not be used for parsing or comparison.\r\n\texpectedTypeName string\r\n\t// The type of object we expect to place in the store.\r\n\texpectedType reflect.Type\r\n\t// The GVK of the object we expect to place in the store if unstructured.\r\n\texpectedGVK *schema.GroupVersionKind\r\n\t// The destination to sync up with the watch source\r\n\tstore Store                     // store对象\r\n\t// listerWatcher is used to perform lists and watches.\r\n\tlisterWatcher ListerWatcher     // listwatcher对象\r\n\t// period controls timing between one watch ending and\r\n\t// the beginning of the next one.\r\n\tperiod       time.Duration\r\n\tresyncPeriod time.Duration\r\n\tShouldResync func() bool\r\n\t// clock allows tests to manipulate time\r\n\tclock clock.Clock\r\n\t// lastSyncResourceVersion is the resource version token last\r\n\t// observed when doing a sync with the underlying store\r\n\t// it is thread safe, but not synchronized with the underlying store\r\n\tlastSyncResourceVersion string\r\n\t// lastSyncResourceVersionMutex guards read/write access to lastSyncResourceVersion\r\n\tlastSyncResourceVersionMutex sync.RWMutex\r\n\t// WatchListPageSize is the requested chunk size of initial and resync watch lists.\r\n\t// Defaults to pager.PageSize.\r\n\tWatchListPageSize int64\r\n}\r\n```\r\n\r\n**reflector包含了listwatch对象**\r\n\r\n<br>\r\n\r\n### 4. 
listAndwatcher\r\n\r\n#### 4.1 list\r\n\r\n```\r\n// ListAndWatch first lists all items and get the resource version at the moment of call,\r\n// and then use the resource version to watch.\r\n// It returns error if ListAndWatch didn't even try to initialize watch.\r\nfunc (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error {\r\n\tklog.V(3).Infof(\"Listing and watching %v from %s\", r.expectedTypeName, r.name)\r\n\tvar resourceVersion string\r\n\r\n\t// Explicitly set \"0\" as resource version - it's fine for the List()\r\n\t// to be served from cache and potentially be delayed relative to\r\n\t// etcd contents. Reflector framework will catch up via Watch() eventually.\r\n\t// resourceVersion=0 表示 list所有资源\r\n\toptions := metav1.ListOptions{ResourceVersion: \"0\"}\r\n    \r\n    // \r\n\tif err := func() error {\r\n\t\tinitTrace := trace.New(\"Reflector ListAndWatch\", trace.Field{\"name\", r.name})\r\n\t\tdefer initTrace.LogIfLong(10 * time.Second)\r\n\t\tvar list runtime.Object\r\n\t\tvar err error\r\n\t\tlistCh := make(chan struct{}, 1)\r\n\t\tpanicCh := make(chan interface{}, 1)\r\n\t\tgo func() {\r\n\t\t\tdefer func() {\r\n\t\t\t\tif r := recover(); r != nil {\r\n\t\t\t\t\tpanicCh <- r\r\n\t\t\t\t}\r\n\t\t\t}()\r\n\t\t\t// 判断是否chunks一段一段的list\r\n\t\t\t// Attempt to gather list in chunks, if supported by listerWatcher, if not, the first\r\n\t\t\t// list request will return the full response.\r\n\t\t\tpager := pager.New(pager.SimplePageFunc(func(opts metav1.ListOptions) (runtime.Object, error) {\r\n\t\t\t\treturn r.listerWatcher.List(opts)\r\n\t\t\t}))\r\n\t\t\tif r.WatchListPageSize != 0 {\r\n\t\t\t\tpager.PageSize = r.WatchListPageSize\r\n\t\t\t}\r\n\t\t\t// Pager falls back to full list if paginated list calls fail due to an \"Expired\" error.\r\n\t\t\tlist, err = pager.List(context.Background(), options)\r\n\t\t\tclose(listCh)\r\n\t\t}()\r\n\t\tselect {\r\n\t\tcase <-stopCh:\r\n\t\t\treturn nil\r\n\t\tcase r := <-panicCh:\r\n\t\t\tpanic(r)\r\n\t\tcase 
<-listCh:\r\n\t\t}\r\n\t\tif err != nil {\r\n\t\t\treturn fmt.Errorf(\"%s: Failed to list %v: %v\", r.name, r.expectedTypeName, err)\r\n\t\t}\r\n\t\tinitTrace.Step(\"Objects listed\")\r\n\t\tlistMetaInterface, err := meta.ListAccessor(list)\r\n\t\tif err != nil {\r\n\t\t\treturn fmt.Errorf(\"%s: Unable to understand list result %#v: %v\", r.name, list, err)\r\n\t\t}\r\n\t\tresourceVersion = listMetaInterface.GetResourceVersion()\r\n\t\tinitTrace.Step(\"Resource version extracted\")\r\n\t\titems, err := meta.ExtractList(list)\r\n\t\tif err != nil {\r\n\t\t\treturn fmt.Errorf(\"%s: Unable to understand list result %#v (%v)\", r.name, list, err)\r\n\t\t}\r\n\t\tinitTrace.Step(\"Objects extracted\")\r\n\t\tif err := r.syncWith(items, resourceVersion); err != nil {\r\n\t\t\treturn fmt.Errorf(\"%s: Unable to sync list result: %v\", r.name, err)\r\n\t\t}\r\n\t\tinitTrace.Step(\"SyncWith done\")\r\n\t\tr.setLastSyncResourceVersion(resourceVersion)\r\n\t\tinitTrace.Step(\"Resource version updated\")\r\n\t\treturn nil\r\n\t}(); err != nil {\r\n\t\treturn err\r\n\t}\r\n\r\n\tresyncerrc := make(chan error, 1)\r\n\tcancelCh := make(chan struct{})\r\n\tdefer close(cancelCh)\r\n\tgo func() {\r\n\t\tresyncCh, cleanup := r.resyncChan()\r\n\t\tdefer func() {\r\n\t\t\tcleanup() // Call the last one written into cleanup\r\n\t\t}()\r\n\t\tfor {\r\n\t\t\tselect {\r\n\t\t\tcase <-resyncCh:\r\n\t\t\tcase <-stopCh:\r\n\t\t\t\treturn\r\n\t\t\tcase <-cancelCh:\r\n\t\t\t\treturn\r\n\t\t\t}\r\n\t\t\tif r.ShouldResync == nil || r.ShouldResync() {\r\n\t\t\t\tklog.V(4).Infof(\"%s: forcing resync\", r.name)\r\n\t\t\t\tif err := r.store.Resync(); err != nil {\r\n\t\t\t\t\tresyncerrc <- err\r\n\t\t\t\t\treturn\r\n\t\t\t\t}\r\n\t\t\t}\r\n\t\t\tcleanup()\r\n\t\t\tresyncCh, cleanup = r.resyncChan()\r\n\t\t}\r\n\t}()\r\n\r\n\tfor {\r\n\t\t// give the stopCh a chance to stop the loop, even in case of continue statements further down on errors\r\n\t\tselect {\r\n\t\tcase <-stopCh:\r\n\t\t\treturn 
nil\r\n\t\tdefault:\r\n\t\t}\r\n\r\n\t\ttimeoutSeconds := int64(minWatchTimeout.Seconds() * (rand.Float64() + 1.0))\r\n\t\toptions = metav1.ListOptions{\r\n\t\t\tResourceVersion: resourceVersion,\r\n\t\t\t// We want to avoid situations of hanging watchers. Stop any wachers that do not\r\n\t\t\t// receive any events within the timeout window.\r\n\t\t\tTimeoutSeconds: &timeoutSeconds,\r\n\t\t\t// To reduce load on kube-apiserver on watch restarts, you may enable watch bookmarks.\r\n\t\t\t// Reflector doesn't assume bookmarks are returned at all (if the server do not support\r\n\t\t\t// watch bookmarks, it will ignore this field).\r\n\t\t\tAllowWatchBookmarks: true,\r\n\t\t}\r\n\r\n\t\tw, err := r.listerWatcher.Watch(options)\r\n\t\tif err != nil {\r\n\t\t\tswitch err {\r\n\t\t\tcase io.EOF:\r\n\t\t\t\t// watch closed normally\r\n\t\t\tcase io.ErrUnexpectedEOF:\r\n\t\t\t\tklog.V(1).Infof(\"%s: Watch for %v closed with unexpected EOF: %v\", r.name, r.expectedTypeName, err)\r\n\t\t\tdefault:\r\n\t\t\t\tutilruntime.HandleError(fmt.Errorf(\"%s: Failed to watch %v: %v\", r.name, r.expectedTypeName, err))\r\n\t\t\t}\r\n\t\t\t// If this is \"connection refused\" error, it means that most likely apiserver is not responsive.\r\n\t\t\t// It doesn't make sense to re-list all objects because most likely we will be able to restart\r\n\t\t\t// watch where we ended.\r\n\t\t\t// If that's the case wait and resend watch request.\r\n\t\t\tif utilnet.IsConnectionRefused(err) {\r\n\t\t\t\ttime.Sleep(time.Second)\r\n\t\t\t\tcontinue\r\n\t\t\t}\r\n\t\t\treturn nil\r\n\t\t}\r\n\r\n\t\tif err := r.watchHandler(w, &resourceVersion, resyncerrc, stopCh); err != nil {\r\n\t\t\tif err != errorStopRequested {\r\n\t\t\t\tswitch {\r\n\t\t\t\tcase apierrs.IsResourceExpired(err):\r\n\t\t\t\t\tklog.V(4).Infof(\"%s: watch of %v ended with: %v\", r.name, r.expectedTypeName, err)\r\n\t\t\t\tdefault:\r\n\t\t\t\t\tklog.Warningf(\"%s: watch of %v ended with: %v\", r.name, r.expectedTypeName, 
err)\r\n\t\t\t\t}\r\n\t\t\t}\r\n\t\t\treturn nil\r\n\t\t}\r\n\t}\r\n}\r\n```\r\n\r\nListAndWatch List在程序第一次运行时获取该资源下所有的对象数据并将其存储至DeltaFIFO中。以Informers Example代码示例为例，在其中，我们获取的是所有Pod的资源数据。ListAndWatch List流程图如下所示。\r\n\r\n![list-and-watcher](../images/list-and-watcher.png)\r\n\r\n（1）r.listerWatcher.List用于获取资源下的所有对象的数据，例如，获取所有Pod的资源数据。获取资源数据是由options的ResourceVersion（资源版本号)参数控制的，如果ResourceVersion为0，则表示获取所有Pod的资源数据；如果ResourceVersion非0，则表示根据资源版本号继续获取，功能有些类似于文件传输过程中的“断点续传”，当传输过程中遇到网络故障导致中断，下次再连接时，会根据资源版本号继续传输未完成的部分。可以使本地缓存中的数据与Etcd集群中的数据保持一致。\r\n\r\n（2）listMetaInterface.GetResourceVersion用于获取资源版本号，ResourceVersion （资源版本号）非常重要，Kubernetes中所有的资源都拥有该字段，它标识当前资源对象的版本号。每次修改当前资源对象时，Kubernetes API Server都会更改ResourceVersion，使得client-go执行Watch操作时可以根据ResourceVersion来确定当前资源对象是否发生变化。更多关于ResourceVersion资源版本号的内容，请参考6.5.2节“ResourceVersion资源版本号”。\r\n\r\n（3）meta.ExtractList用于将资源数据转换成资源对象列表，将runtime.Object对象转换成[]runtime.Object对象。因为r.listerWatcher.List获取的是资源下的所有对象的数据，例如所有的Pod资源数据，所以它是一个资源列表。\r\n\r\n（4） r.syncWith用于将资源对象列表中的资源对象和资源版本号存储至DeltaFIFO中，并会替换已存在的对象。\r\n\r\n（5）r.setLastSyncResourceVersion用于设置最新的资源版本号。\r\n\r\nr.listerWatcher.List函数实际调用了Pod Informer下的ListFunc函数（NewFilteredListWatchFromClient），它通过ClientSet客户端与Kubernetes API Server交互并获取Pod资源列表数据.\r\n\r\n<br>\r\n\r\n#### 4.2 watcher\r\n\r\nWatch（监控）操作通过HTTP协议与Kubernetes API Server建立长连接，接收Kubernetes API Server发来的资源变更事件。Watch操作的实现机制使用HTTP协议的分块传输编码（Chunked Transfer Encoding）。当client-go调用Kubernetes API Server时，Kubernetes API Server在Response的HTTPHeader中设置Transfer-Encoding的值为chunked，表示采用分块传输编码，客户端收到该信息后，便与服务端进行连接，并等待下一个数据块（即资源的事件信息）。\r\n\r\nListAndWatch Watch代码示例如下：\r\n\r\n```go\r\nfor {\r\n\t\ttimeoutSeconds := int64(minWatchTimeout.Seconds() * (rand.Float64() + 1.0))\r\n\t\t\r\n\t\t// 列出要watcher的资源和timeout时间\r\n\t\toptions = metav1.ListOptions{\r\n\t\t\tResourceVersion: resourceVersion,\r\n\t\t\t// We want to avoid situations of hanging watchers. 
Stop any wachers that do not\r\n\t\t\t// receive any events within the timeout window.\r\n\t\t\tTimeoutSeconds: &timeoutSeconds,\r\n\t\t}\r\n       \r\n\t\tr.metrics.numberOfWatches.Inc()\r\n\t\t\r\n\t\t// 这个就是reflector提到的watch函数\r\n\t\tw, err := r.listerWatcher.Watch(options)\r\n\t\t\r\n        // 用于处理资源的变更事件。\r\n\t\tif err := r.watchHandler(w, &resourceVersion, resyncerrc, stopCh); err != nil {\r\n\t\t\tif err != errorStopRequested {\r\n\t\t\t\tglog.Warningf(\"%s: watch of %v ended with: %v\", r.name, r.expectedType, err)\r\n\t\t\t}\r\n\t\t\treturn nil\r\n\t\t}\r\n\t}\r\n```\r\n\r\n<br>\r\n\r\nr.watchHandler用于处理资源的变更事件。当触发Added（资源添加）事件、Updated （资源更新）事件、Deleted（资源删除）事件时，将对应的资源对象更新到本地缓存DeltaFIFO中并更新ResourceVersion资源版本号。r.watchHandler代码示例如下：\r\n\r\n```\r\n// watchHandler watches w and keeps *resourceVersion up to date.\r\nfunc (r *Reflector) watchHandler(w watch.Interface, resourceVersion *string, errc chan error, stopCh <-chan struct{}) error {\r\n\tstart := r.clock.Now()\r\n\teventCount := 0\r\n\r\n\t\r\n\t\t\tnewResourceVersion := meta.GetResourceVersion()\r\n\t\t\tswitch event.Type {\r\n\t\t\tcase watch.Added:\r\n\t\t\t\terr := r.store.Add(event.Object)\r\n\t\t\t\tif err != nil {\r\n\t\t\t\t\tutilruntime.HandleError(fmt.Errorf(\"%s: unable to add watch event object (%#v) to store: %v\", r.name, event.Object, err))\r\n\t\t\t\t}\r\n\t\t\tcase watch.Modified:\r\n\t\t\t\terr := r.store.Update(event.Object)\r\n\t\t\t\tif err != nil {\r\n\t\t\t\t\tutilruntime.HandleError(fmt.Errorf(\"%s: unable to update watch event object (%#v) to store: %v\", r.name, event.Object, err))\r\n\t\t\t\t}\r\n\t\t\tcase watch.Deleted:\r\n\t\t\t\t// TODO: Will any consumers need access to the \"last known\r\n\t\t\t\t// state\", which is passed in event.Object? 
If so, may need\r\n\t\t\t\t// to change this.\r\n\t\t\t\terr := r.store.Delete(event.Object)\r\n\t\t\t\tif err != nil {\r\n\t\t\t\t\tutilruntime.HandleError(fmt.Errorf(\"%s: unable to delete watch event object (%#v) from store: %v\", r.name, event.Object, err))\r\n\t\t\t\t}\r\n\t\t\tdefault:\r\n\t\t\t\tutilruntime.HandleError(fmt.Errorf(\"%s: unable to understand watch event %#v\", r.name, event))\r\n\t\t\t}\r\n\t\t\t// 每处理完一个事件，就把resourceVersion更新为该事件对象携带的resourceVersion（并不是简单地+1）\r\n\t\t\t*resourceVersion = newResourceVersion\r\n\t\t\tr.setLastSyncResourceVersion(newResourceVersion)\r\n\t\t\teventCount++\r\n\t\t}\r\n\t}\r\n\r\n\twatchDuration := r.clock.Now().Sub(start)\r\n\tif watchDuration < 1*time.Second && eventCount == 0 {\r\n\t\tr.metrics.numberOfShortWatches.Inc()\r\n\t\treturn fmt.Errorf(\"very short watch: %s: Unexpected watch close - watch lasted less than a second and no items received\", r.name)\r\n\t}\r\n\tglog.V(4).Infof(\"%s: Watch close - %v total %v items received\", r.name, r.expectedType, eventCount)\r\n\treturn nil\r\n}\r\n```\r\n\r\n<br>\r\n\r\n### 5. 
DeltaFIFO\r\n\r\nDeltaFIFO可以分开理解，FIFO是一个先进先出的队列，它拥有队列操作的基本方法，例如Add、Update、Delete、List、Pop、Close等，而Delta是一个资源对象存储，它可以保存资源对象的操作类型，例如Added（添加）操作类型、Updated（更新）操作类型、Deleted（删除）操作类型、Sync（同步）操作类型等。DeltaFIFO结构代码示例如下：\r\n\r\n```\r\ntype DeltaFIFO struct {\r\n\t// lock/cond protects access to 'items' and 'queue'.\r\n\tlock sync.RWMutex\r\n\tcond sync.Cond\r\n\r\n\t// We depend on the property that items in the set are in\r\n\t// the queue and vice versa, and that all Deltas in this\r\n\t// map have at least one Delta.\r\n\titems map[string]Deltas\r\n\tqueue []string\r\n\r\n\t// populated is true if the first batch of items inserted by Replace() has been populated\r\n\t// or Delete/Add/Update was called first.\r\n\tpopulated bool\r\n\t// initialPopulationCount is the number of items inserted by the first call of Replace()\r\n\tinitialPopulationCount int\r\n\r\n\t// keyFunc is used to make the key used for queued item\r\n\t// insertion and retrieval, and should be deterministic.\r\n\tkeyFunc KeyFunc\r\n\r\n\t// knownObjects list keys that are \"known\", for the\r\n\t// purpose of figuring out which items have been deleted\r\n\t// when Replace() or Delete() is called.\r\n\tknownObjects KeyListerGetter\r\n\r\n\t// Indication the queue is closed.\r\n\t// Used to indicate a queue is closed so a control loop can exit when a queue is empty.\r\n\t// Currently, not used to gate any of CRED operations.\r\n\tclosed     bool\r\n\tclosedLock sync.Mutex\r\n}\r\n```\r\n\r\nDeltaFIFO与其他队列最大的不同之处是，它会保留所有关于资源对象（obj）的操作类型，队列中会存在拥有不同操作类型的同一个资源对象，消费者在处理该资源对象时能够了解该资源对象所发生的事情。queue字段存储资源对象的key，该key通过KeyOf函数计算得到。items字段通过map数据结构的方式存储，value存储的是对象的Deltas数组。DeltaFIFO存储结构如下图所示。\r\n\r\n![delta](../images/delta.png)\r\n\r\nDeltaFIFO本质上是一个先进先出的队列，有数据的生产者和消费者，其中生产者是Reflector调用的Add方法，消费者是Controller调用的Pop方法。\r\n\r\n#### 5.1 生产者\r\n\r\n```\r\n// Add inserts an item, and puts it in the queue. 
The item is only enqueued\r\n// if it doesn't already exist in the set.\r\nfunc (f *DeltaFIFO) Add(obj interface{}) error {\r\n\tf.lock.Lock()\r\n\tdefer f.lock.Unlock()\r\n\tf.populated = true\r\n\treturn f.queueActionLocked(Added, obj)\r\n}\r\n\r\n// Update is just like Add, but makes an Updated Delta.\r\nfunc (f *DeltaFIFO) Update(obj interface{}) error {\r\n\tf.lock.Lock()\r\n\tdefer f.lock.Unlock()\r\n\tf.populated = true\r\n\treturn f.queueActionLocked(Updated, obj)\r\n}\r\n\r\n// Delete is just like Add, but makes an Deleted Delta. If the item does not\r\n// already exist, it will be ignored. (It may have already been deleted by a\r\n// Replace (re-list), for example.\r\nfunc (f *DeltaFIFO) Delete(obj interface{}) error {\r\n\tid, err := f.KeyOf(obj)\r\n\tif err != nil {\r\n\t\treturn KeyError{obj, err}\r\n\t}\r\n\tf.lock.Lock()\r\n\tdefer f.lock.Unlock()\r\n\tf.populated = true\r\n\tif f.knownObjects == nil {\r\n\t\tif _, exists := f.items[id]; !exists {\r\n\t\t\t// Presumably, this was deleted when a relist happened.\r\n\t\t\t// Don't provide a second report of the same deletion.\r\n\t\t\treturn nil\r\n\t\t}\r\n\t} else {\r\n\t\t// We only want to skip the \"deletion\" action if the object doesn't\r\n\t\t// exist in knownObjects and it doesn't have corresponding item in items.\r\n\t\t// Note that even if there is a \"deletion\" action in items, we can ignore it,\r\n\t\t// because it will be deduped automatically in \"queueActionLocked\"\r\n\t\t_, exists, err := f.knownObjects.GetByKey(id)\r\n\t\t_, itemsExist := f.items[id]\r\n\t\tif err == nil && !exists && !itemsExist {\r\n\t\t\t// Presumably, this was deleted when a relist happened.\r\n\t\t\t// Don't provide a second report of the same deletion.\r\n\t\t\treturn nil\r\n\t\t}\r\n\t}\r\n\r\n\treturn f.queueActionLocked(Deleted, obj)\r\n}\r\n```\r\n\r\nDeltaFIFO队列中的资源对象在Added（资源添加）事件、Updated（资源更新）事件、Deleted（资源删除）事件中都调用了queueActionLocked函数，它是DeltaFIFO实现的关键，代码示例如下：\r\n\r\n```\r\n// queueActionLocked appends 
to the delta list for the object.\r\n// Caller must lock first.\r\nfunc (f *DeltaFIFO) queueActionLocked(actionType DeltaType, obj interface{}) error {\r\n\tid, err := f.KeyOf(obj)\r\n\tif err != nil {\r\n\t\treturn KeyError{obj, err}\r\n\t}\r\n\r\n\t// If object is supposed to be deleted (last event is Deleted),\r\n\t// then we should ignore Sync events, because it would result in\r\n\t// recreation of this object.\r\n\tif actionType == Sync && f.willObjectBeDeletedLocked(id) {\r\n\t\treturn nil\r\n\t}\r\n\r\n\tnewDeltas := append(f.items[id], Delta{actionType, obj})\r\n\tnewDeltas = dedupDeltas(newDeltas)\r\n\r\n\t_, exists := f.items[id]\r\n\tif len(newDeltas) > 0 {\r\n\t\tif !exists {\r\n\t\t\tf.queue = append(f.queue, id)\r\n\t\t}\r\n\t\tf.items[id] = newDeltas\r\n\t\tf.cond.Broadcast()\r\n\t} else if exists {\r\n\t\t// We need to remove this from our map (extra items\r\n\t\t// in the queue are ignored if they are not in the\r\n\t\t// map).\r\n\t\tdelete(f.items, id)\r\n\t}\r\n\treturn nil\r\n}\r\n```\r\n\r\nqueueActionLocked代码执行流程如下。\r\n\r\n（1）通过f.KeyOf函数计算出资源对象的key。\r\n\r\n（2）如果操作类型为Sync，则标识该数据来源于Indexer（本地存储）。如果Indexer中的资源对象已经被删除，则直接返回。\r\n\r\n（3）将actionType和资源对象构造成Delta，添加到items中，并通过dedupDeltas函数进行去重操作。\r\n\r\n（4）更新构造后的Delta并通过cond.Broadcast通知所有消费者解除阻塞。\r\n\r\n<br>\r\n\r\n#### 5.2 消费者\r\n\r\nPop方法作为消费者方法使用，从DeltaFIFO的头部取出最早进入队列中的资源对象数据。Pop方法须传入process函数（**而这里的process函数就是后面介绍的HandleDeltas**），用于接收并处理对象的回调方法，代码示例如下：\r\n\r\n```\r\n// Pop blocks until an item is added to the queue, and then returns it.  If\r\n// multiple items are ready, they are returned in the order in which they were\r\n// added/updated. The item is removed from the queue (and the store) before it\r\n// is returned, so if you don't successfully process it, you need to add it back\r\n// with AddIfNotPresent().\r\n// process function is called under lock, so it is safe update data structures\r\n// in it that need to be in sync with the queue (e.g. knownKeys). 
The PopProcessFunc\r\n// may return an instance of ErrRequeue with a nested error to indicate the current\r\n// item should be requeued (equivalent to calling AddIfNotPresent under the lock).\r\n//\r\n// Pop returns a 'Deltas', which has a complete list of all the things\r\n// that happened to the object (deltas) while it was sitting in the queue.\r\nfunc (f *DeltaFIFO) Pop(process PopProcessFunc) (interface{}, error) {\r\n\tf.lock.Lock()\r\n\tdefer f.lock.Unlock()\r\n\tfor {\r\n\t\tfor len(f.queue) == 0 {\r\n\t\t\t// When the queue is empty, invocation of Pop() is blocked until new item is enqueued.\r\n\t\t\t// When Close() is called, the f.closed is set and the condition is broadcasted.\r\n\t\t\t// Which causes this loop to continue and return from the Pop().\r\n\t\t\tif f.IsClosed() {\r\n\t\t\t\treturn nil, FIFOClosedError\r\n\t\t\t}\r\n\r\n\t\t\tf.cond.Wait()\r\n\t\t}\r\n\t\tid := f.queue[0]\r\n\t\tf.queue = f.queue[1:]\r\n\t\titem, ok := f.items[id]\r\n\t\tif f.initialPopulationCount > 0 {\r\n\t\t\tf.initialPopulationCount--\r\n\t\t}\r\n\t\tif !ok {\r\n\t\t\t// Item may have been deleted subsequently.\r\n\t\t\tcontinue\r\n\t\t}\r\n\t\t// 从队列中删除\r\n\t\tdelete(f.items, id)\r\n\t\t// 然后调用 process处理，这里的item还是之前的列表，bojkey1 {“add”,obj1; \"update\",obj1}\r\n\t\terr := process(item)\r\n\t\tif e, ok := err.(ErrRequeue); ok {\r\n\t\t\tf.addIfNotPresent(id, item)\r\n\t\t\terr = e.Err\r\n\t\t}\r\n\t\t// Don't need to copyDeltas here, because we're transferring\r\n\t\t// ownership to the caller.\r\n\t\treturn item, err\r\n\t}\r\n}\r\n```\r\n\r\n<br>\r\n\r\n当队列中没有数据时，通过f.cond.wait阻塞等待数据，只有收到cond.Broadcast时才说明有数据被添加，解除当前阻塞状态。如果队列中不为空，取出f.queue的头部数据，将该对象传入process回调函数，由上层消费者进行处理。如果process回调函数处理出错，则将该对象重新存入队列。Controller的processLoop方法负责从DeltaFIFO队列中取出数据传递给process回调函数。process回调函数代码示例如下：\r\n\r\n```go\r\nfunc (s *sharedIndexInformer) HandleDeltas(obj interface{}) error {\r\n\ts.blockDeltas.Lock()\r\n\tdefer s.blockDeltas.Unlock()\r\n\r\n\t// from oldest to newest\r\n\tfor _, d := 
range obj.(Deltas) {\r\n\t\tswitch d.Type {\r\n\t\tcase Sync, Added, Updated:\r\n\t\t\tisSync := d.Type == Sync\r\n\t\t\ts.cacheMutationDetector.AddObject(d.Object)\r\n\t\t\tif old, exists, err := s.indexer.Get(d.Object); err == nil && exists {\r\n\t\t\t\tif err := s.indexer.Update(d.Object); err != nil {\r\n\t\t\t\t\treturn err\r\n\t\t\t\t}\r\n\t\t\t\ts.processor.distribute(updateNotification{oldObj: old, newObj: d.Object}, isSync)\r\n\t\t\t} else {\r\n\t\t\t\tif err := s.indexer.Add(d.Object); err != nil {\r\n\t\t\t\t\treturn err\r\n\t\t\t\t}\r\n\t\t\t\ts.processor.distribute(addNotification{newObj: d.Object}, isSync)\r\n\t\t\t}\r\n\t\tcase Deleted:\r\n\t\t\tif err := s.indexer.Delete(d.Object); err != nil {\r\n\t\t\t\treturn err\r\n\t\t\t}\r\n\t\t\ts.processor.distribute(deleteNotification{oldObj: d.Object}, false)\r\n\t\t}\r\n\t}\r\n\treturn nil\r\n}\r\n```\r\n\r\nHandleDeltas函数作为process回调函数，当资源对象的操作类型为Added、Updated、Deleted时，将该资源对象存储至Indexer（它是并发安全的存储），并通过distribute函数将资源对象分发至SharedInformer。还记得Informers Example代码示例吗？在Informers Example代码示例中，我们通过informer.AddEventHandler函数添加了对资源事件进行处理的函数，distribute函数则将资源对象分发到该事件处理函数中。\r\n\r\n<br>\r\n\r\n#### 5.3 Resync\r\n\r\nResync机制会将Indexer本地存储中的资源对象同步到DeltaFIFO中，并将这些资源对象设置为Sync的操作类型。Resync函数在Reflector中定时执行，它的执行周期由NewReflector函数传入的resyncPeriod参数设定。Resync→syncKeyLocked代码示例如下：\r\n\r\n```\r\nfunc (f *DeltaFIFO) syncKeyLocked(key string) error {\r\n\tobj, exists, err := f.knownObjects.GetByKey(key)\r\n\tif err != nil {\r\n\t\tglog.Errorf(\"Unexpected error %v during lookup of key %v, unable to queue object for sync\", err, key)\r\n\t\treturn nil\r\n\t} else if !exists {\r\n\t\tglog.Infof(\"Key %v does not exist in known objects store, unable to queue object for sync\", key)\r\n\t\treturn nil\r\n\t}\r\n\r\n\t// If we are doing Resync() and there is already an event queued for that object,\r\n\t// we ignore the Resync for it. 
This is to avoid the race, in which the resync\r\n\t// comes with the previous value of object (since queueing an event for the object\r\n\t// doesn't trigger changing the underlying store <knownObjects>.\r\n\tid, err := f.KeyOf(obj)\r\n\tif err != nil {\r\n\t\treturn KeyError{obj, err}\r\n\t}\r\n\tif len(f.items[id]) > 0 {\r\n\t\treturn nil\r\n\t}\r\n\r\n\tif err := f.queueActionLocked(Sync, obj); err != nil {\r\n\t\treturn fmt.Errorf(\"couldn't queue object: %v\", err)\r\n\t}\r\n\treturn nil\r\n}\r\n\r\n```\r\n\r\nf.knownObjects is the Indexer local store. Through it, DeltaFIFO can reach every resource object that client-go currently caches; the Indexer object is passed in when the NewDeltaFIFO function instantiates the DeltaFIFO object.\r\n\r\n### 6.Indexer\r\n\r\nIndexer is client-go's local store, and the data it holds is kept fully consistent with the data in the etcd cluster. client-go can then conveniently read resource objects from this local store instead of fetching them from the remote etcd cluster on every access, which reduces the load on the Kubernetes API Server and the etcd cluster.\r\n\r\nBefore diving into Indexer, let's first look at ThreadSafeMap. ThreadSafeMap is a concurrency-safe store. As a store, it provides the usual create, update, delete and query methods, such as Add, Update, Delete, List, Get, Replace and Resync. Indexer wraps ThreadSafeMap: it inherits ThreadSafeMap's storage methods and adds index-related functionality on top, such as the Index, IndexKeys and GetIndexers methods, which give ThreadSafeMap its indexing capability. The Indexer storage structure is shown in the figure below.\r\n\r\n![index](../images/index.png)\r\n\r\nThreadSafeMap is an in-memory store; its data is never written to local disk, and every create, update, delete and query operation takes a lock to keep the data consistent. ThreadSafeMap keeps resource objects in a map data structure. The threadSafeMap struct is as follows:\r\n\r\n```\r\n// threadSafeMap implements ThreadSafeStore\r\ntype threadSafeMap struct {\r\n\tlock  sync.RWMutex\r\n\titems map[string]interface{}\r\n\r\n\t// indexers maps a name to an IndexFunc\r\n\tindexers Indexers\r\n\t// indices maps a name to an Index\r\n\tindices Indices\r\n}\r\n```\r\n\r\nThe items field stores the resource object data. The key of items is computed by a keyFunc, by default MetaNamespaceKeyFunc, which produces a key of the form <namespace>/<name> from the resource object (or just <name> when <namespace> is empty); the value of items holds the resource object itself.\r\n\r\n<br>\r\n\r\n#### 6.1 Indexer索引器\r\n\r\nEvery time ThreadSafeMap data is added, deleted or updated, the indexes are updated through the updateIndices or deleteFromIndices functions. Indexer is designed so that custom index functions can be plugged in, which fits Kubernetes' emphasis on extensibility. Indexer has four important data structures: Indices, Index, Indexers and IndexFunc. Reading the code directly is fairly opaque, so going through the Indexer Example first makes them much easier to grasp. The Indexer Example is as follows:\r\n\r\n```\r\npackage 
main\r\n\r\nimport (\r\n\t\"fmt\"\r\n\t\"k8s.io/api/core/v1\"\r\n\tmetav1 \"k8s.io/apimachinery/pkg/apis/meta/v1\"\r\n\t\"k8s.io/client-go/tools/cache\"\r\n\t\"strings\"\r\n)\r\n\r\nfunc UsersIndexFunc(obj interface{}) ([]string, error) {\r\n\tpod := obj.(*v1.Pod)\r\n\tuserString := pod.Annotations[\"users\"]\r\n\r\n\treturn strings.Split(userString, \",\"), nil\r\n}\r\n\r\nfunc main() {\r\n\tindex := cache.NewIndexer(cache.MetaNamespaceKeyFunc,cache.Indexers{\"byUser\": UsersIndexFunc})\r\n\r\n\tpod1 := &v1.Pod{ObjectMeta:metav1.ObjectMeta{Name:\"one\",Annotations: map[string]string{\"users\": \"ernie,bert\"}}}\r\n\tpod2 := &v1.Pod{ObjectMeta:metav1.ObjectMeta{Name:\"two\",Annotations: map[string]string{\"users\": \"oscar,bert\"}}}\r\n\tpod3 := &v1.Pod{ObjectMeta:metav1.ObjectMeta{Name:\"tre\",Annotations: map[string]string{\"users\": \"ernie,elmo\"}}}\r\n\r\n    index.Add(pod1)\r\n\tindex.Add(pod2)\r\n\tindex.Add(pod3)\r\n\r\n\terniePods, err := index.ByIndex(\"byUser\", \"ernie\")\r\n\tif err != nil {\r\n\t\tpanic(err)\r\n\t}\r\n\r\n\tfor _, erniePod := range erniePods {\r\n\t\tfmt.Println(erniePod.(*v1.Pod).Name)\r\n\t}\r\n}\r\n\r\n## 输出\r\none\r\ntre\r\n```\r\n\r\n首先定义一个索引器函数UsersIndexFunc，在该函数中，我们定义查询出所有Pod资源下Annotations字段的key为users的Pod。cache.NewIndexer函数实例化了Indexer对象，该函数接收两个参数：第1个参数是KeyFunc，它用于计算资源对象的key，计算默认使用cache.MetaNamespaceKeyFunc函数；第2个参数是cache.Indexers，用于定义索引器，其中key为索引器的名称（即byUser），value为索引器。通过index.Add函数添加3个Pod资源对象。最后通过index.ByIndex函数查询byUser索引器下匹配ernie字段的Pod列表。Indexer Example代码示例最终检索出名称为one和tre的Pod。现在再来理解Indexer的4个重要的数据结构就非常容易了，它们分别是Indexers、IndexFunc、Indices、Index，数据结构如下：\r\n\r\n```\r\n// Index maps the indexed value to a set of keys in the store that match on that value\r\ntype Index map[string]sets.String\r\n\r\n// Indexers maps a name to a IndexFunc\r\ntype Indexers map[string]IndexFunc\r\n\r\n// Indices maps a name to an Index\r\ntype Indices map[string]Index\r\n\r\n// IndexFunc knows how to provide an indexed value for an object.\r\ntype 
IndexFunc func(obj interface{}) ([]string, error)\r\n```\r\n\r\nIndexer数据结构说明如下。\r\n\r\n● Indexers：存储索引器，key为索引器名称，value为索引器的实现函数。\r\n\r\n● IndexFunc：索引器函数，定义为接收一个资源对象，返回检索结果列表。\r\n\r\n● Indices：存储缓存器，key为缓存器名称（在Indexer Example代码示例中，缓存器命名与索引器命名相对应），value为缓存数据。\r\n\r\n● Index：存储缓存数据，其结构为K/V。\r\n\r\n<br>\r\n\r\n### 7. 总结\r\n\r\n目前通过整体的介绍已经大概理清楚client-go informer的大致过程：\r\n\r\n（1）定义好 informerFactory， 然后初始化一个Informer\r\n\r\n（2）定义好add, update, del处理函数\r\n\r\n（3）informer.run运行\r\n\r\n（4）informer.run初始化了一个reflector，里面实现了list-watch\r\n\r\n（5）reflector里面使用了deltaFIFO队列对list watch的数据进行处理。\r\n一方面：通过该队列的数据 使得本地cache和etcd数据一致 （indexer里面的数据）\r\n\r\n另一方面：之前定义好的add ,update ,del就是这些数据的消费者\r\n\r\n当然这个只是大概的运作过程。接下来将详细研究具体每个过程是如何实现的。\r\n\r\n\r\n### 8.附录\r\n\r\n以下的例子对pod的监听。可以看出来步骤为：\r\n\r\n（1）生成clientset客户端\r\n\r\n（2）New一个 listandwatcher对象，这里是pod\r\n\r\n（3）实例化一个informer，在这个informer中，指定ADD，UPDATE，DELETE的处理函数。\r\n\r\n```\r\n// creates the clientset\r\n\tclientset, err := kubernetes.NewForConfig(cfg)\r\n\tif err != nil {\r\n\t\tglog.Errorf(\"can not creates the clientset: %v\\n\", err)\r\n\t\treturn nil, err\r\n\t}\r\n\r\n\t// create the pod watcher, set the func of list and watch\r\n\tpodListWatcher := cache.NewListWatchFromClient(\r\n\t\tclientset.Core().RESTClient(),\r\n\t\t\"pods\",\r\n\t\tv1.NamespaceAll,\r\n\t\tfields.Everything(),\r\n\t)\r\n\r\n\tindexer, informer := cache.NewIndexerInformer(\r\n\t\tpodListWatcher,\r\n\t\t&v1.Pod{},\r\n\t\t0,\r\n\t\tcache.ResourceEventHandlerFuncs{\r\n\t\t\tAddFunc: func(obj interface{}) {},\r\n\t\t\tUpdateFunc: func(old interface{}, new interface{}) {\r\n\t\t\t\tpusher.PushBlackHole(old, new, opt)\r\n\t\t\t},\r\n\t\t\tDeleteFunc: func(obj interface{}) {},\r\n\t\t},\r\n\t\tcache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc},\r\n\t)\r\n\r\n```\r\n\r\n<br>\r\n\r\n### 9.参考\r\n\r\nhttps://zhuanlan.zhihu.com/p/228534306\r\n\r\nhttps://houmin.cc/posts/1f0eb2ff/\r\n\r\n<<k8s源码解剖-郑东旭>>"
  },
  {
    "path": "k8s/client-go/5. SharedInformerFactory机制.md",
    "content": "Table of Contents\n=================\n\n  * [1.章节介绍](#1章节介绍)\n  * [2. SharedInformerFactory](#2-sharedinformerfactory)\n     * [2.1 SharedInformerFactory实例介绍](#21-sharedinformerfactory实例介绍)\n     * [2.2 sharedInformerFactory结构体](#22-sharedinformerfactory结构体)\n     * [2.3 sharedInformerFactory成员函数](#23-sharedinformerfactory成员函数)\n     * [2.4 总结](#24-总结)\n  * [3. podInformer](#3-podinformer)\n     * [3.1 PodInformer结构体](#31-podinformer结构体)\n     * [3.2 PodInformer成员函数](#32-podinformer成员函数)\n  * [4.总结](#4总结)\n\n### 1.章节介绍\n\n本章首先介绍SharedInformerFactory，了解其组成和作用。\n\n然后以Podinformer为例，了解一个资源实例的Informer应该需要实现哪些函数。\n\n本节并没有设计到具体图中的informer机制，只是从大的入口入手，看看SharedInformerFactory到底是什么\n\n![informer](../images/informer.png)\n\n<br>\n\n### 2. SharedInformerFactory\n\nSharedInformerFactory封装了NewSharedIndexInformer方法。字如其名，SharedInformerFactory使用的是工厂模式来生成各类的Informer。无论是k8s控制器，还是自定义控制器, SharedInformerFactory都是非常重要的一环。所以首先分析SharedInformerFactory。这里以一个实例入手分析SharedInformerFactory。\n\n#### 2.1 SharedInformerFactory实例介绍\n\n```\npackage main\n\nimport (\n    \"fmt\"\n    clientset \"k8s.io/client-go/kubernetes\"\n    \"k8s.io/client-go/rest\"\n    \"k8s.io/client-go/informers\"\n    \"k8s.io/client-go/tools/cache\"\n    \"k8s.io/api/core/v1\"\n    \"k8s.io/apimachinery/pkg/labels\"\n    \"time\"\n)\n\nfunc main()  {\n    config := &rest.Config{\n        Host: \"http://172.21.0.16:8080\",\n    }\n    client := clientset.NewForConfigOrDie(config)\n    // 生成SharedInformerFactory\n    factory := informers.NewSharedInformerFactory(client, 5 * time.Second)\n    // 生成PodInformer\n    podInformer := factory.Core().V1().Pods()\n    // 获得一个cache.SharedIndexInformer 单例模式\n    sharedInformer := podInformer.Informer()\n\n    //注册add, update, del处理事件\n    sharedInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{\n        AddFunc:    func(obj interface{}) {fmt.Printf(\"add: %v\\n\", obj.(*v1.Pod).Name)},\n        UpdateFunc: func(oldObj, newObj interface{}) {fmt.Printf(\"update: 
%v\\n\", newObj.(*v1.Pod).Name)},\n        DeleteFunc: func(obj interface{}){fmt.Printf(\"delete: %v\\n\", obj.(*v1.Pod).Name)},\n    })\n\n    stopCh := make(chan struct{})\n\n    // 第一种方式\n    // 可以这样启动  也可以按照下面的方式启动\n    // go sharedInformer.Run(stopCh)\n    // time.Sleep(2 * time.Second)\n\n    // 第二种方式，这种方式是启动factory下面所有的informer\n    factory.Start(stopCh)\n    factory.WaitForCacheSync(stopCh)\n\n    pods, _ := podInformer.Lister().Pods(\"default\").List(labels.Everything())\n\n    for _, p := range pods {\n        fmt.Printf(\"list pods: %v\\n\", p.Name)\n    }\n    <- stopCh\n}\n```\n\n#### 2.2 sharedInformerFactory结构体\n\n```\ntype sharedInformerFactory struct {\n  // client客户端\n\tclient           kubernetes.Interface            \n\t// sharedInformerFactory是没有namespaces限制的。不过可以设置namespaces限制该factory后面的informer都是指定namespaces的\n\tnamespace        string          \n  // TweakListOptionsFunc其实就是ListOptions，这个是针对所有Informer List生效的 （WithTweakListOptions可以看出来）\n\ttweakListOptions internalinterfaces.TweakListOptionsFunc\n\tlock             sync.Mutex\n\t// 这个是list默认定期同步的时间间隔\n\tdefaultResync    time.Duration\n\t// 每种informer还可以自定义\n\tcustomResync     map[reflect.Type]time.Duration\n  \n  // 属于该factory下面的所有的informer\n\tinformers map[reflect.Type]cache.SharedIndexInformer\n\t// startedInformers is used for tracking which informers have been started.\n\t// This allows Start() to be called multiple times safely.\n\t// 判断informer是否已经 Run起来了\n\tstartedInformers map[reflect.Type]bool   \n}\n```\n\n<br>\n\n#### 2.3 sharedInformerFactory成员函数\n\n```\n定义customResync\n// WithCustomResyncConfig sets a custom resync period for the specified informer types.\nfunc WithCustomResyncConfig(resyncConfig map[v1.Object]time.Duration) SharedInformerOption \n\n定义tweakListOptions\n// WithTweakListOptions sets a custom filter on all listers of the configured SharedInformerFactory.\nfunc WithTweakListOptions(tweakListOptions internalinterfaces.TweakListOptionsFunc) SharedInformerOption 
\n\n定义namespaces\n// WithNamespace limits the SharedInformerFactory to the specified namespace.\nfunc WithNamespace(namespace string) SharedInformerOption \n\n// start所有的informer\n// Start initializes all requested informers.\nfunc (f *sharedInformerFactory) Start(stopCh <-chan struct{}) {\n\tf.lock.Lock()\n\tdefer f.lock.Unlock()\n\n\tfor informerType, informer := range f.informers {\n\t\tif !f.startedInformers[informerType] {\n\t\t\tgo informer.Run(stopCh)\n\t\t\tf.startedInformers[informerType] = true\n\t\t}\n\t}\n}\n\n// WaitForCacheSync让所有的informers同步cache。一般informer.run函数中都有一个这样的语句。先等cache同步。这个的含义就是等list完了的数据，全部转换到cache中去。\n\t// Wait for all involved caches to be synced, before processing items from the queue is started\n\tif !cache.WaitForCacheSync(stopCh, ctrl.Informer.HasSynced) {\n\t\truntime.HandleError(fmt.Errorf(\"Timed out waiting for caches to sync\"))\n\t\treturn\n\t}\n\t\n// WaitForCacheSync waits for all started informers' cache were synced.\nfunc (f *sharedInformerFactory) WaitForCacheSync(stopCh <-chan struct{}) map[reflect.Type]bool {\n\tinformers := func() map[reflect.Type]cache.SharedIndexInformer {\n\t\tf.lock.Lock()\n\t\tdefer f.lock.Unlock()\n\n\t\tinformers := map[reflect.Type]cache.SharedIndexInformer{}\n\t\tfor informerType, informer := range f.informers {\n\t\t\tif f.startedInformers[informerType] {\n\t\t\t\tinformers[informerType] = informer\n\t\t\t}\n\t\t}\n\t\treturn informers\n\t}()\n\n\tres := map[reflect.Type]bool{}\n\tfor informType, informer := range informers {\n\t\tres[informType] = cache.WaitForCacheSync(stopCh, informer.HasSynced)\n\t}\n\treturn res\n}\n\n\n// InternalInformerFor returns the SharedIndexInformer for obj using an internal\n// client.\nfunc (f *sharedInformerFactory) InformerFor(obj runtime.Object, newFunc internalinterfaces.NewInformerFunc) cache.SharedIndexInformer {\n\tf.lock.Lock()\n\tdefer f.lock.Unlock()\n\n\tinformerType := reflect.TypeOf(obj)\n\tinformer, exists := f.informers[informerType]\n\t// 
If an informer of this type already exists, return it directly instead of creating a new one. The type here is e.g. Pod or Deployment.\n\tif exists {\n\t\treturn informer\n\t}\n\n\tresyncPeriod, exists := f.customResync[informerType]\n\tif !exists {\n\t\tresyncPeriod = f.defaultResync\n\t}\n\n\tinformer = newFunc(f.client, resyncPeriod)\n\tf.informers[informerType] = informer\n\n\treturn informer\n}\n\n// SharedInformerFactory provides shared informers for resources in all known\n// API group versions.\ntype SharedInformerFactory interface {\n\tinternalinterfaces.SharedInformerFactory\n\tForResource(resource schema.GroupVersionResource) (GenericInformer, error)\n\tWaitForCacheSync(stopCh <-chan struct{}) map[reflect.Type]bool\n\n\t// Accessors for the built-in Kubernetes API groups; each method below exposes the informers of one group\n\tAdmissionregistration() admissionregistration.Interface\n\tApps() apps.Interface\n\tAutoscaling() autoscaling.Interface\n\tBatch() batch.Interface\n\tCertificates() certificates.Interface\n\tCoordination() coordination.Interface\n\tCore() core.Interface\n\tEvents() events.Interface\n\tExtensions() extensions.Interface\n\tNetworking() networking.Interface\n\tPolicy() policy.Interface\n\tRbac() rbac.Interface\n\tScheduling() scheduling.Interface\n\tSettings() settings.Interface\n\tStorage() storage.Interface\n}\n\n// For example, resources under the core group: f.Core().V1().Pods() resolves through here\nfunc (f *sharedInformerFactory) Core() core.Interface {\n\treturn core.New(f, f.namespace, f.tweakListOptions)\n}\n```\n\n<br>\n\n#### 2.4 总结\n\nFrom the members and methods of sharedInformerFactory we can see that:\n\n（1）The factory is simply an entry point for constructing informers, and it holds a collection of Informers.\n\n（2）Informers for the same resource type share a single Informer, which avoids wasting resources. For example, in kube-controller-manager both the ReplicaSet controller and the garbage collector need to watch Pods; through the factory mechanism they use the same Pod informer.\n\n（3）However, watching the same resource type with different list options does not appear to work through one factory: for example, one Informer watching running Pods and another watching failed Pods would require separate factories.\n\n### 3. podInformer\n\nAs shown above, sharedInformerFactory is only an entry point. Next, taking podInformer as an example, let's see what a concrete resource Informer has to implement.\n\n#### 3.1 PodInformer结构体\n\n```\n// PodInformer provides access to a shared informer and lister for\n// Pods.\n// Only the Informer and Lister methods need to be implemented\ntype PodInformer interface {\n\tInformer() cache.SharedIndexInformer\n\tLister() v1.PodLister\n}\n\ntype podInformer struct {\n\tfactory          internalinterfaces.SharedInformerFactory // which factory produced this informer\n\ttweakListOptions internalinterfaces.TweakListOptionsFunc   // which filters apply\n\tnamespace        string                                    // the namespace\n}\n```\n\n#### 3.2 PodInformer成员函数\n\nFrom the method definitions we can see that an informer is really a cache.SharedIndexInformer.\n\nWhen the SharedIndexInformer is created, its ListWatch functions are specified:\n\nListFunc: client.CoreV1().Pods(namespace).List(options)\n\nWatchFunc: client.CoreV1().Pods(namespace).Watch(options)\n\nSo from the structs we can infer:\n\n（1）Every informer is ultimately a cache.SharedIndexInformer, but cache.SharedIndexInformer first needs its list and watch functions defined.\n\n（2）The index inside cache.SharedIndexInformer is the storage plus query layer; its data is kept up to date from the defined list and watch functions.\n\nNext, let's look at how cache.SharedIndexInformer is implemented.\n\n```\n// NewPodInformer constructs a new informer for Pod type.\n// Always prefer using an informer factory to get a shared informer instead of getting an independent\n// one. This reduces memory footprint and number of connections to the server.\nfunc NewPodInformer(client kubernetes.Interface, namespace string, resyncPeriod time.Duration, indexers cache.Indexers) cache.SharedIndexInformer {\n\treturn NewFilteredPodInformer(client, namespace, resyncPeriod, indexers, nil)\n}\n\n// NewFilteredPodInformer constructs a new informer for Pod type.\n// Always prefer using an informer factory to get a shared informer instead of getting an independent\n// one. 
This reduces memory footprint and number of connections to the server.\nfunc NewFilteredPodInformer(client kubernetes.Interface, namespace string, resyncPeriod time.Duration, indexers cache.Indexers, tweakListOptions internalinterfaces.TweakListOptionsFunc) cache.SharedIndexInformer {\n\treturn cache.NewSharedIndexInformer(\n\t\t&cache.ListWatch{\n\t\t\tListFunc: func(options metav1.ListOptions) (runtime.Object, error) {\n\t\t\t\tif tweakListOptions != nil {\n\t\t\t\t\ttweakListOptions(&options)\n\t\t\t\t}\n\t\t\t\treturn client.CoreV1().Pods(namespace).List(options)\n\t\t\t},\n\t\t\tWatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {\n\t\t\t\tif tweakListOptions != nil {\n\t\t\t\t\ttweakListOptions(&options)\n\t\t\t\t}\n\t\t\t\treturn client.CoreV1().Pods(namespace).Watch(options)\n\t\t\t},\n\t\t},\n\t\t&corev1.Pod{},\n\t\tresyncPeriod,\n\t\tindexers,\n\t)\n}\n\n// 默认只有namespaces这个indexer\nfunc (f *podInformer) defaultInformer(client kubernetes.Interface, resyncPeriod time.Duration) cache.SharedIndexInformer {\n\treturn NewFilteredPodInformer(client, f.namespace, resyncPeriod, cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc}, f.tweakListOptions)\n}\n\n\nfunc (f *podInformer) Informer() cache.SharedIndexInformer {\n\treturn f.factory.InformerFor(&corev1.Pod{}, f.defaultInformer)\n}\n\n// 返回Lister数据, 这里是从index里面获取，而不是从apiserver中获取\nfunc (f *podInformer) Lister() v1.PodLister {\n\treturn v1.NewPodLister(f.Informer().GetIndexer())\n}\n\ncache中的index定义\nk8s.io/client-go/tools/cache/index.go\n// Indexer is a storage interface that lets you list objects using multiple indexing functions\ntype Indexer interface {\n\tStore\n\t// Retrieve list of objects that match on the named indexing function\n\tIndex(indexName string, obj interface{}) ([]interface{}, error)\n\t// IndexKeys returns the set of keys that match on the named indexing function.\n\tIndexKeys(indexName, indexKey string) ([]string, error)\n\t// ListIndexFuncValues returns the 
list of generated values of an Index func\n\tListIndexFuncValues(indexName string) []string\n\t// ByIndex lists object that match on the named indexing function with the exact key\n\tByIndex(indexName, indexKey string) ([]interface{}, error)\n\t// GetIndexer return the indexers\n\tGetIndexers() Indexers\n\n\t// AddIndexers adds more indexers to this store.  If you call this after you already have data\n\t// in the store, the results are undefined.\n\tAddIndexers(newIndexers Indexers) error\n}\n```\n\n### 4.总结\n\n（1）The factory is simply an entry point for constructing informers, and it holds a collection of Informers.\n\n（2）Informers for the same resource type share a single Informer, which avoids wasting resources. For example, in kube-controller-manager both the ReplicaSet controller and the garbage collector need to watch Pods; through the factory mechanism they use the same Pod informer.\n\n（3）However, watching the same resource type with different list options does not appear to work through one factory: for example, one Informer watching running Pods and another watching failed Pods would require separate factories.\n\n（4）The factory itself does not implement the Informer mechanism shown in the diagram. It is cache.SharedIndexInformer that collects all the parameters and implements the Informer mechanism in the diagram above. The next section introduces cache.SharedIndexInformer."
  },
  {
    "path": "k8s/client-go/6. informer机制之cache.indexer机制.md",
"content": "Table of Contents\n=================\n\n  * [1. 背景](#1-背景)\n  * [2. Indexer结构说明](#2-indexer结构说明)\n  * [3 store结构说明](#3-store结构说明)\n  * [4. cache](#4-cache)\n     * [4.1 cache结构说明](#41-cache结构说明)\n     * [4.2 ThreadSafeStore结构说明](#42-threadsafestore结构说明)\n     * [4.3 举例说明](#43-举例说明)\n     * [4.4 Cache总结](#44-cache总结)\n  * [5. cache.index在informer中的应用](#5-cacheindex在informer中的应用)\n\n### 1. 背景\n\nThe Indexer in tools/cache is the mechanism that gives the informer its local cache, together with rich indexing.\n\nindex is the index implementation. Like a database index, an index speeds up lookups.\n\nThe goal of this section is to work out how the indexer in cache is implemented.\n\nThe content studied here sits in the red-circled area of the overall informer mechanism shown below.\n\n![informer-indexer](../images/informer-indexer.png)\n\nIn short: how to store + how to index.\n\n<br>\n\n### 2. Indexer结构说明\n\nIndexer is an interface made up of two parts:\n\n（1）Store. From its definition, Store is the piece that actually holds the data. Store itself is also an interface; a concrete store has to implement its methods.\n\n（2）Index-related functions such as Index, IndexKeys, ListIndexFuncValues, ByIndex, GetIndexers and AddIndexers.\n\n```\n// IndexFunc knows how to provide an indexed value for an object.\ntype IndexFunc func(obj interface{}) ([]string, error)\n\n// Index maps the indexed value to a set of keys in the store that match on that value\ntype Index map[string]sets.String\n\n// Indexers maps a name to a IndexFunc\ntype Indexers map[string]IndexFunc\n\n// Indices maps a name to an Index\ntype Indices map[string]Index\n\n\n// The Indexer interface is for adding and querying indexes. The comments may be\n// confusing at first; the example later in this section makes them clear.\ntype Indexer interface {\n\tStore\n\t// Look up the index function by indexName, feed the obj (a pod) into it to get all of its\n\t// index values, then return every object (pod) that shares any of those index values.\n\t// Example: pod1 run through the byUser function yields the two index values ernie and bert,\n\t// so Index(\"byUser\", pod1) returns pod1, pod2 (contains bert) and pod3 (contains ernie).\n\t// Retrieve list of objects that match on the named indexing function\n\tIndex(indexName string, obj interface{}) ([]interface{}, error)\n\t\n\t// Given an index function name (byUser) plus a concrete index value (bert),\n\t// return the matching object keys (ns/podName).\n\t// IndexKeys returns the set of keys that match on the named indexing function.\n\tIndexKeys(indexName, indexKey string) ([]string, error)\n\t\n\t// Given an index function name (byUser), return all of its index values;\n\t// for byUser that is ernie, bert, elmo, oscar.\n\t// ListIndexFuncValues returns the list of generated 
values of an Index func\n\tListIndexFuncValues(indexName string) []string\n\t\n\t// 通过索引函数的名字（byUser）+具体的值(bert)，获得pod对象\n\t// ByIndex lists object that match on the named indexing function with the exact key\n\tByIndex(indexName, indexKey string) ([]interface{}, error)\n\t\n\t// 返回所有的索引函数\n\t// GetIndexer return the indexers\n\tGetIndexers() Indexers\n\t\n\t// AddIndexers adds more indexers to this store.  If you call this after you already have data\n\t// in the store, the results are undefined.\n\t// 添加 索引函数。每个索引函数都有一个唯一的名字，那就是 indexName \n\tAddIndexers(newIndexers Indexers) error\n}\n```\n\n<br>\n\nStore是一个存储的接口，后面结合具体存储实现再讲。这里先讲一下 Index, Indexers, Indices的关系。\n\nIndexFunc：索引函数。输入对象，输出对象在该索引函数下匹配的字段（索引值）列表。\n\nIndex： 索引表。 map结构，key索引值， value是对象名（初始化Indexer的时候需要指定，默认是ns+metadata.name表示一个对象）\n\nIndexers：索引函数表。 map结构，索引函数可以有多个，所以每个索引函数需要起一个名字来表示。map的key是一个索引函数的名称，value是一个个的索引函数。\n\nIndices：Index的复数形式。每个索引函数名对应一个索引函数，每个索引函数对应很多索引值。每个索引值会对应很多实际的对象。\n\nindex只能知道索引值对应对象。\n\nIndices可以通过函数索引名，知道每个索引值对应的对象。\n\n<br>\n\n### 3 store结构说明\n\nstore可以认为只是一个父类，它只是一个接口，说明了要想实现存储，必须要实现这些函数。\n\n```\n// Store is a generic object storage interface. Reflector knows how to watch a server\n// and update a store. 
A generic store is provided, which allows Reflector to be used\n// as a local caching system, and an LRU store, which allows Reflector to work like a\n// queue of items yet to be processed.\n//\n// Store makes no assumptions about stored object identity; it is the responsibility\n// of a Store implementation to provide a mechanism to correctly key objects and to\n// define the contract for obtaining objects by some arbitrary key type.\ntype Store interface {\n\tAdd(obj interface{}) error         //往存储增加，更新，删除元素\n\tUpdate(obj interface{}) error  \n\tDelete(obj interface{}) error\n\tList() []interface{}              \n\tListKeys() []string\n\tGet(obj interface{}) (item interface{}, exists bool, err error)\n\tGetByKey(key string) (item interface{}, exists bool, err error)\n\n\t// Replace will delete the contents of the store, using instead the\n\t// given list. Store takes ownership of the list, you should not reference\n\t// it after calling this function.\n\tReplace([]interface{}, string) error\n\tResync() error\n}\n```\n\n### 4. cache\n\n#### 4.1 cache结构说明\n\ncache结构体本身只有 cacheStorage + keyFunc两个元素。\n\n```\n// cache responsibilities are limited to:\n// 1. Computing keys for objects via keyFunc\n//  2. 
Invoking methods of a ThreadSafeStorage interface\ntype cache struct {\n   // cacheStorage bears the burden of thread safety for the cache\n   cacheStorage ThreadSafeStore\n   // keyFunc is used to make the key for objects stored in and retrieved from items, and\n   // should be deterministic.\n   keyFunc KeyFunc\n}\n\n```\n\ncacheStorage是真正的存储结构。\n\nkeyFunc 就是如何通过一个 String 定位到一个对象（例如pod）\n\n查看k8s.io/client-go/tools/cache/store.go 中的函数定义。\n\n可以发现 cache即实现了 indexer的所有函数，又实现了store的所有函数。但是cache结构的所有方法都是调用了成员变量cacheStorage的方法。如下：\n\n```\n// Add inserts an item into the cache.\nfunc (c *cache) Add(obj interface{}) error {\n\tkey, err := c.keyFunc(obj)\n\tif err != nil {\n\t\treturn KeyError{obj, err}\n\t}\n\tc.cacheStorage.Add(key, obj)\n\treturn nil\n}\n```\n\n所以`ThreadSafeStore`才是真正实现了 缓存+索引 功能的结构体。\n\n#### 4.2 ThreadSafeStore结构说明\n\nThreadSafeStore本身就是一个接口，定义了 store + indexer的所有函数。threadSafeMap是真正的实现类。\n\n在thread_safe_store.go文件一看就非常清楚\n\nk8s.io/client-go/tools/cache/thread_safe_store.go\n\n```\n// threadSafeMap implements ThreadSafeStore\ntype threadSafeMap struct {\n\tlock  sync.RWMutex\n\titems map[string]interface{}     //真正的存储，存储所有的元数据\n\n\t// indexers maps a name to an IndexFunc\n\tindexers Indexers\n\t// indices maps a name to an Index\n\tindices Indices\n}\n```\n\n#### 4.3 举例说明\n\nthreadSafeMap的实现都非常简单。看看代码就明白了。但是结合上面对Indexer的文字描述太过于枯燥，所以这里以一个例子说明cache.indexer是如何实现 `存储+索引` 的。该例子来源于 k8s.io/client-go/tools/cache/index_test.go 具体如下：\n\n```\n// 1. 先定义一个IndexFunc\n// testUsersIndexFunc 就是上面提到的索引函数\n// 从函数的实现可以看出来。这个就是想根据 pod Annotations中users的名字做索引\nfunc testUsersIndexFunc(obj interface{}) ([]string, error) {\n\tpod := obj.(*v1.Pod)\n\tusersString := pod.Annotations[\"users\"]\n\n\treturn strings.Split(usersString, \",\"), nil\n}\n\n// 2. 
Create the Indexer with NewIndexer\n// NewIndexer must be given a KeyFunc, i.e. a function that can represent a pod object\n// as a single string; here it is MetaNamespaceKeyFunc, which represents a pod as ns/name.\n// It is also given an Indexers map; here it holds a single index function,\n// testUsersIndexFunc, registered under the name byUser.\nindex := NewIndexer(MetaNamespaceKeyFunc, Indexers{\"byUser\": testUsersIndexFunc})\n\n// Looking at the definition of NewIndexer, it simply builds a cache struct:\n// NewIndexer returns an Indexer implemented simply with a map and a lock.\nfunc NewIndexer(keyFunc KeyFunc, indexers Indexers) Indexer {\n\treturn &cache{\n\t\tcacheStorage: NewThreadSafeStore(indexers, Indices{}),\n\t\tkeyFunc:      keyFunc,\n\t}\n}\n\n\n// 3. Define three pods\n// pod1 -> ernie,bert\n// pod2 -> bert,oscar\n// pod3 -> ernie,elmo\n\tpod1 := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: \"one\", Annotations: map[string]string{\"users\": \"ernie,bert\"}}}\n\tpod2 := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: \"two\", Annotations: map[string]string{\"users\": \"bert,oscar\"}}}\n\tpod3 := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: \"tre\", Annotations: map[string]string{\"users\": \"ernie,elmo\"}}}\n\n// 4. Add the three pods to the index\n\tindex.Add(pod1)\n\tindex.Add(pod2)\n\tindex.Add(pod3)\n\n// Pause here and look at what IndexFunc, Index, Indexers and Indices now contain:\nIndexFunc: testUsersIndexFunc\nthreadSafeMap.Indexers: {\n   \"byUser\": testUsersIndexFunc\n}\n\nthreadSafeMap.Indices: {\n   \"byUser\": {\n       \"ernie\": [\"one\",\"tre\"],\n       \"bert\": [\"one\",\"two\"],\n       \"oscar\": [\"two\"],\n       \"elmo\": [\"tre\"],\n   }\n}\n\n// Index is one entry of the Indices above, namely byUser (there is only one index function):\n\"byUser\": {\n       \"ernie\": [\"one\",\"tre\"],\n       \"bert\": [\"one\",\"two\"],\n       \"oscar\": [\"two\"],\n       \"elmo\": [\"tre\"],\n   }\n\n\nthreadSafeMap.items: {\n\t  \"one\" : pod1,\n\t  \"two\" : pod2,\n\t  \"tre\" : pod3\n}\n// pod1, pod2 and pod3 are the pod objects themselves.\n// So threadSafeMap implements storage via items, and indexing via Indices + Indexers.\n\n// Adding an element updates not only items but also the Indices\nfunc (c *threadSafeMap) Add(key string, obj interface{}) {\n\tc.lock.Lock()\n\tdefer c.lock.Unlock()\n\toldObject := c.items[key]\n\tc.items[key] = obj\n\tc.updateIndices(oldObject, obj, key)\n}\n\n\n// Next, how threadSafeMap implements each of the index functions. Instead of\n// pasting the code again, here is what each function returns.\n\n\n\t// indexName is the index function name, obj (pod) is the pod object.\n\t// The function looks up the index function by name, feeds the pod into it to get all index\n\t// values, then returns every object that matches any of those values.\n\t// Index(\"byUser\", pod1) returns [pod1, pod2, pod3].\n\t// Reason: pod1 run through byUser yields the two index values ernie and bert,\n\t// and pod1, pod2 and pod3 each contain ernie or bert, so all three match.\n\tIndex(indexName string, obj interface{}) ([]interface{}, error)\n\t\n\t\n\t// Given an index function name plus an index value, return the matching object keys.\n\t// Example: IndexKeys(\"byUser\", \"bert\") returns [\"one\",\"two\"].\n\t// IndexKeys returns the set of keys that match on the named indexing function.\n\tIndexKeys(indexName, indexKey string) ([]string, error)\n\t\n\t\n\t// Given an index function name, return all of its index values.\n\t// Example: ListIndexFuncValues(\"byUser\") returns ernie, bert, elmo, oscar.\n\t// ListIndexFuncValues returns the list of generated values of an Index func\n\tListIndexFuncValues(indexName string) []string\n\t\n\t// Given an index function name plus an index value, return the matching objects themselves.\n\t// Example: ByIndex(\"byUser\", \"bert\") returns [pod1, pod2].\n\t// (IndexKeys returns the object keys; ByIndex returns the objects.)\n\t// ByIndex lists object that match on the named indexing function with the exact key\n\tByIndex(indexName, indexKey string) ([]interface{}, error)\n\t\n\t\n\t// Return all index functions\n\t// GetIndexer return the indexers\n\tGetIndexers() Indexers\n\t\n\t// AddIndexers adds more indexers to this store.  If you call this after you already have data\n\t// in the store, the results are undefined.\n\t// Adds index functions; each index function has a unique name, the indexName\n\tAddIndexers(newIndexers Indexers) error\n```\n\n<br>\n\n#### 4.4 Cache总结\n\n（1）cache provides storage + indexing, ultimately implemented by threadSafeMap.\n\n（2）In threadSafeMap, items implements the storage, while indexers + Indices implement the indexing.\n\n（3）Adding, deleting or updating an element updates not only the items map but also indexers + Indices.\n\n（4）A small gripe: indexers, Indices and index are poorly chosen names; at first glance they are baffling.\n\n<br>\n\n### 5. 
cache.index在informer中的应用\n\n以podinformer为例介绍cache这一套在informer中的应用。本节只是介绍podinformer是如何生成cache的。具体cache的更新，结合list watcher再做说明。\n\n<br>k8s.io/client-go/informers/core/v1/pod.go\n\n（1）defaultInformer传入的是cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc}\n\nindexers就是一个map。因为索引函数有很多，所以就需要一个名字来区分不同的索引函数。\n\n比如MetaNamespaceIndexFunc，就是根据对象的namespace来做索引\n\n```\nfunc (f *podInformer) defaultInformer(client kubernetes.Interface, resyncPeriod time.Duration) cache.SharedIndexInformer {\n\treturn NewFilteredPodInformer(client, f.namespace, resyncPeriod, cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc}, f.tweakListOptions)\n}\n\nkey是一个string\nconst (\n\tNamespaceIndex string = \"namespace\"\n)\n\ntype IndexFunc func(obj interface{}) ([]string, error)\n// MetaNamespaceIndexFunc is a default index function that indexes based on an object's namespace\nfunc MetaNamespaceIndexFunc(obj interface{}) ([]string, error) {\n\tmeta, err := meta.Accessor(obj)\n\tif err != nil {\n\t\treturn []string{\"\"}, fmt.Errorf(\"object has no meta: %v\", err)\n\t}\n\treturn []string{meta.GetNamespace()}, nil\n}\n```\n\n（2）cache.Indexers是一个参数，传入到了SharedIndexInformer的实例化\n\n```\nfunc NewFilteredPodInformer(client kubernetes.Interface, namespace string, resyncPeriod time.Duration, indexers cache.Indexers, tweakListOptions internalinterfaces.TweakListOptionsFunc) cache.SharedIndexInformer {\n\treturn cache.NewSharedIndexInformer(\n\t\t&cache.ListWatch{\n\t\t\tListFunc: func(options metav1.ListOptions) (runtime.Object, error) {\n\t\t\t\tif tweakListOptions != nil {\n\t\t\t\t\ttweakListOptions(&options)\n\t\t\t\t}\n\t\t\t\treturn client.CoreV1().Pods(namespace).List(options)   //直接调用apiserver的list接口\n\t\t\t},\n\t\t\tWatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {\n\t\t\t\tif tweakListOptions != nil {\n\t\t\t\t\ttweakListOptions(&options)\n\t\t\t\t}\n\t\t\t\treturn client.CoreV1().Pods(namespace).Watch(options)  // 
直接调用apiserver的watch接口\n\t\t\t},\n\t\t},\n\t\t&corev1.Pod{},   //说明是pod对象\n\t\tresyncPeriod,\n\t\tindexers,    //指定indexer\n\t)\n}\n```\n\n从这里可以看出来，cache只管做缓存+索引。数据来源都定义好了，不用管。\n\n<br>\n\n(3) 实例化时调用了 NewIndexer(DeletionHandlingMetaNamespaceKeyFunc, indexers)\n\n```\nfunc NewSharedIndexInformer(lw ListerWatcher, objType runtime.Object, defaultEventHandlerResyncPeriod time.Duration, indexers Indexers) SharedIndexInformer {\n\trealClock := &clock.RealClock{}\n\tsharedIndexInformer := &sharedIndexInformer{\n\t\tprocessor:                       &sharedProcessor{clock: realClock},\n\t\tindexer:                         NewIndexer(DeletionHandlingMetaNamespaceKeyFunc, indexers),\n\t\tlisterWatcher:                   lw,\n\t\tobjectType:                      objType,\n\t\tresyncCheckPeriod:               defaultEventHandlerResyncPeriod,\n\t\tdefaultEventHandlerResyncPeriod: defaultEventHandlerResyncPeriod,\n\t\tcacheMutationDetector:           NewCacheMutationDetector(fmt.Sprintf(\"%T\", objType)),\n\t\tclock: realClock,\n\t}\n\treturn sharedIndexInformer\n}\n\n\n// NewIndexer returns an Indexer implemented simply with a map and a lock.\nfunc NewIndexer(keyFunc KeyFunc, indexers Indexers) Indexer {\n\treturn &cache{\n\t\tcacheStorage: NewThreadSafeStore(indexers, Indices{}),\n\t\tkeyFunc:      keyFunc,\n\t}\n}\n\nDeletionHandlingMetaNamespaceKeyFunc 最终调用了 MetaNamespaceKeyFunc\n所以 ns/podname 就能代表一个 Pod实例\n// MetaNamespaceKeyFunc is a convenient default KeyFunc which knows how to make\n// keys for API objects which implement meta.Interface.\n// The key uses the format <namespace>/<name> unless <namespace> is empty, then\n// it's just <name>.\n//\n// TODO: replace key-as-string with a key-as-struct so that this\n// packing/unpacking won't be necessary.\nfunc MetaNamespaceKeyFunc(obj interface{}) (string, error) {\n\tif key, ok := obj.(ExplicitKey); ok {\n\t\treturn string(key), nil\n\t}\n\tmeta, err := meta.Accessor(obj)\n\tif err != nil {\n\t\treturn \"\", fmt.Errorf(\"object 
has no meta: %v\", err)\n\t}\n\tif len(meta.GetNamespace()) > 0 {\n\t\treturn meta.GetNamespace() + \"/\" + meta.GetName(), nil\n\t}\n\treturn meta.GetName(), nil\n}\n```\n\n（4）informer.indexer 最终就是一个熟悉的cache结构体\n\n```\n// \n// cache responsibilities are limited to:\n//\t1. Computing keys for objects via keyFunc\n//  2. Invoking methods of a ThreadSafeStorage interface\ntype cache struct {\n\t// cacheStorage bears the burden of thread safety for the cache\n\tcacheStorage ThreadSafeStore\n\t// keyFunc is used to make the key for objects stored in and retrieved from items, and\n\t// should be deterministic.\n\tkeyFunc KeyFunc\n}\n```\n\n<br>\n\n总结：\n\n到这里就可以看出来一个Informer是如何定义本地存储+索引的。至于整个系统如何运转，看后面的Informer分析。"
  },
  {
    "path": "k8s/client-go/7. informer机制详解.md",
    "content": "Table of Contents\n=================\n\n  * [1.章节介绍](#1章节介绍)\n  * [2. cache.SharedIndexInformer结构介绍](#2-cachesharedindexinformer结构介绍)\n  * [3. sharedIndexInformer.Run](#3-sharedindexinformerrun)\n     * [3.1 NewDeltaFIFO](#31-newdeltafifo)\n        * [3.1.1 DeltaFIFO的定位](#311-deltafifo的定位)\n        * [3.1.2  DeltaFIFO结构介绍](#312--deltafifo结构介绍)\n        * [3.1.3 举例说明deltaFifo核心结构](#313-举例说明deltafifo核心结构)\n     * [3.2 sharedIndexInformer生产数据](#32-sharedindexinformer生产数据)\n        * [3.2.1 controller结构](#321-controller结构)\n        * [3.2.2 controller.run](#322-controllerrun)\n        * [3.2.3 Reflector实例](#323-reflector实例)\n        * [3.2.4 Reflector.run](#324-reflectorrun)\n        * [3.2.5 ListAndWatch](#325-listandwatch)\n           * [知识补充](#知识补充)\n           * [源码分析](#源码分析)\n        * [3.2.6 c.processLoop](#326-cprocessloop)\n           * [HandleDeltas函数](#handledeltas函数)\n           * [理解listeners和syncingListeners的区别](#理解listeners和syncinglisteners的区别)\n     * [3.3 s.processor.run消费数据](#33-sprocessorrun消费数据)\n           * [processorListener结构](#processorlistener结构)\n           * [pod and run](#pod-and-run)\n  * [4 参考](#4-参考)\n\n### 1.章节介绍\n\n从上一章节可以知道。利用informer机制可以非常简单地实现一个资源对象的控制器，具体步骤为\n\n（1）new SharedInformerFactory实例，然后指定indexer,listWatch参数，就可以生成一个 cache.SharedIndexInformer 对象\n\n（2）cache.SharedIndexInformer 实际是完成了下图中的informer机制\n\n![informer.png](../images/informer.png)\n\n这一章节开始从SharedIndexInformer入手研究informer机制。\n\n### 2. 
cache.SharedIndexInformer结构介绍\n\n```\ntype sharedIndexInformer struct {\n\tindexer    Indexer        //  本地的缓存+索引机制，上一篇文章详解介绍了\n\tcontroller Controller     // 控制器，启动reflector, 这个controller包含reflector：根据用户定义的ListWatch方法获取对象并更新增量队列DeltaFIFO\n\n\tprocessor             *sharedProcessor       // 注册了add,update,del事件的listener集合\n\tcacheMutationDetector CacheMutationDetector   // 突变检测器\n\n\t// This block is tracked to handle late initialization of the controller\n\tlisterWatcher ListerWatcher           // 定义了list, watch函数, 看podinformer那里就可以知道，是直接调用了client往apiserver发送了请求\n\tobjectType    runtime.Object          // 定义要List watch的对象类型。如果是Podinfomer，就是要传入core.v1.pod\n\n\t// resyncCheckPeriod is how often we want the reflector's resync timer to fire so it can call\n\t// shouldResync to check if any of our listeners need a resync.\n\tresyncCheckPeriod time.Duration            // 给自己的controller的reflector每隔多少s<尝试>调用listener的shouldResync方法\n\t// defaultEventHandlerResyncPeriod is the default resync period for any handlers added via\n\t// AddEventHandler (i.e. 
they don't specify one and just want to use the shared informer's default\n\t// value).\n\tdefaultEventHandlerResyncPeriod time.Duration  // 通过AddEventHandler注册的handler的默认同步值\n\t// clock allows for testability\n\tclock clock.Clock\n\n\tstarted, stopped bool\n\tstartedLock      sync.Mutex\n\n\t// blockDeltas gives a way to stop all event distribution so that a late event handler\n\t// can safely join the shared informer.\n\tblockDeltas sync.Mutex\n}\n```\n\nSharedIndexInformer主要包括以下对象：\n\n（1）indexer\n\n图中右下角的indexer。上一节已经分析了具体的实现。\n\n（2）Controller \n\n图中左边的Controller，启动reflector, list-watch那一套机制。接下来重点分析\n\n（3）processor\n\n图中最下面的listeners，所有往 informer注册了 ResourceEventHandler的都是一个listener。\n\n因为是共享informer，所以存在一个inforemr实例化了多次，然后注册了多个ResourceEventHandler。一般情况下，一个Informer一个listener\n\n```\ntype sharedProcessor struct {\n\tlistenersStarted bool\n\tlistenersLock    sync.RWMutex\n\tlisteners        []*processorListener      // 记录了informer添加的所有listener\n\tsyncingListeners []*processorListener      // 记录了informer中哪些listener处于sync状态。由resyncCheckPeriod参数控制。每隔resyncCheckPeriod秒，listener都需要重新同步一下，同步就是将listener变成syncingListeners。\n\tclock            clock.Clock\n\twg               wait.Group\n}\n```\n\nResourceEventHandler结构体如下。这个就是定义Informer，add, update, del的处理事件。\n\n```\ntype ResourceEventHandler interface {\n   OnAdd(obj interface{})\n   OnUpdate(oldObj, newObj interface{})\n   OnDelete(obj interface{})\n}\n```\n\n（4）CacheMutationDetector\n\n突变检测器，用来检测内存中对象是否发生了突变。测试的时候用，默认不开启。这个先不深入了解\n\n<br>\n\n### 3. sharedIndexInformer.Run\n\nk8s.io/client-go/tools/cache/shared_informer.go\n\n在使用informer的时候，定义好sharedIndexInformer后，就直接运行了sharedIndexInformer.Run函数开始了整个Informer机制。\n\n整个informer的运转逻辑就是：\n\n（1）deltaFIFO接收listAndWatch的全量/增量数据，然后通过pop函数发送到HandleDeltas函数中  （生产）\n\n（2）HandleDeltas将一个一个的事件发送到自定义的handlers 和  更新indexer缓存   （消费）\n\n现在就沿着 Run这个函数入手，看看具体是如何实现的。sharedIndexInformer.Run主要逻辑如下：\n\n1. new一个 deltafifo对象，并且指定对象的keyfun为 MetaNamespaceKeyFunc，就是用 ns/name 来当对象的key\n2. 
生成config，利用config 生成一个controller\n3. 运行用户自定义handler的处理逻辑，s.processor.run    （开启消费）\n4. 运行controller.run，实现整体的运作逻辑                          （开启生产）\n\n```\nfunc (s *sharedIndexInformer) Run(stopCh <-chan struct{}) {\n\tdefer utilruntime.HandleCrash()\n  \n \n  // 1. new一个 deltafifo对象，并且指定对象的keyfun为 MetaNamespaceKeyFunc，就是用 ns/name 来当对象的key\n\tfifo := NewDeltaFIFO(MetaNamespaceKeyFunc, s.indexer)\n\n  // 2. 生成config\n\tcfg := &Config{\n\t\tQueue:            fifo,                \n\t\tListerWatcher:    s.listerWatcher,\n\t\tObjectType:       s.objectType,\n\t\tFullResyncPeriod: s.resyncCheckPeriod,          // 同步周期\n\t\tRetryOnError:     false,       \n\t\tShouldResync:     s.processor.shouldResync,     // 这是个函数，用于判断自定义的handler是否需要同步\n\n\t\tProcess: s.HandleDeltas,                        // listwatch来了数据，如何处理的函数\n\t}\n\n\tfunc() {\n\t\ts.startedLock.Lock()\n\t\tdefer s.startedLock.Unlock()\n    \n    // 3. 利用config 生成一个controller\n\t\ts.controller = New(cfg)\n\t\ts.controller.(*controller).clock = s.clock\n\t\ts.started = true\n\t}()\n\n\t// Separate stop channel because Processor should be stopped strictly after controller\n\tprocessorStopCh := make(chan struct{})\n\tvar wg wait.Group\n\tdefer wg.Wait()              // Wait for Processor to stop\n\tdefer close(processorStopCh) // Tell Processor to stop\n\t// 内存突变检测，忽略\n\twg.StartWithChannel(processorStopCh, s.cacheMutationDetector.Run)\n\t// 4. 
运行用户自定义handler的处理逻辑\n\twg.StartWithChannel(processorStopCh, s.processor.run)\n  \n\tdefer func() {\n\t\ts.startedLock.Lock()\n\t\tdefer s.startedLock.Unlock()\n\t\ts.stopped = true // Don't want any new listeners\n\t}()\n\t\n  // 5.运行controller\n\ts.controller.Run(stopCh)\n}\n```\n\n#### 3.1 NewDeltaFIFO\n\n##### 3.1.1 DeltaFIFO的定位\n\n在apisever中的list-watch机制介绍中，就可以知道。直接使用list，watch api就可以获得全量和增量数据。\n\n如果让我写一个最简单的client-go客户端，我实现的方式是：\n\n（1）定义一个本地存储cache，list的时候将数据放到cache中\n\n（2）然后watch的时候就更新cache数据，然后再将对象发送到自定义的add, update, del handler函数中\n\n需要cache的原因：本地缓存一份etcd数据，这样控制器需要访问数据的话，直接从本地拿。\n\n<br>\n\n以上可以实现一个很简陋的客户端，但是还远远达不到informer机制的要求。\n\ninformer机制为啥需要DeltaFIFO？\n\n（1）为啥需要FIFO队列？\n\n很容易理解，FIFO是保障有序，不有序就会导致数据错乱。 队列是为了缓冲，如果更新的数据太多，informer机制可能就扛不住了\n\n（2）为啥需要delta？\n\nFIFO队列的元素总共就两个去向。第一用于同步本地缓存。第二用于发送给自定义的add, update, del handler函数。\n\n假设某个极短的时间内，某一个对象做了大量的update，最后被删除了。这样的话，FIFO队列其实是堆积了很多数据。\n\n一个一个发送给handler函数没有问题，因为用户就想知道这个过程。但是如果是一个一个的更新本地缓存，最后又delete了，那前面的update就浪费了。\n\n所以这个时候DeltaFIFO队列出现了。它解决了这个问题。\n\n##### 3.1.2  DeltaFIFO结构介绍\n\nDeltaFIFO可以认为是一个特殊的FIFO队列。Delta就是k8s系统中对象的变化(增、删、改、同步)的一个标记。\n\n增、删、改肯定是需要的，因为就算我们自己实现一个队列也需要当前是做了什么操作。\n\n同步是重新List apiserver的时候需要的\n\n```\n// 有着四种类型\n// Change type definition\nconst (\n\tAdded   DeltaType = \"Added\"\n\tUpdated DeltaType = \"Updated\"\n\tDeleted DeltaType = \"Deleted\"\n\t// The other types are obvious. You'll get Sync deltas when:\n\t//  * A watch expires/errors out and a new list/watch cycle is started.\n\t//  * You've turned on periodic syncs.\n\t// (Anything that trigger's DeltaFIFO's Replace() method.)\n\tSync DeltaType = \"Sync\"\n)\n\n// Delta is the type stored by a DeltaFIFO. 
It tells you what change\n// happened, and the object's state after* that change.\n//\n// [*] Unless the change is a deletion, and then you'll get the final\n//     state of the object before it was deleted.\ntype Delta struct {\n\tType   DeltaType\n\tObject interface{}    //k8s中的对象\n}\n\n// Deltas is a list of one or more 'Delta's to an individual object.\n// The oldest delta is at index 0, the newest delta is the last one.\ntype Deltas []Delta\n\n\ntype DeltaFIFO struct {\n\t// lock/cond protects access to 'items' and 'queue'.\n\tlock sync.RWMutex\n\tcond sync.Cond   \n\n\t// We depend on the property that items in the set are in\n\t// the queue and vice versa, and that all Deltas in this\n\t// map have at least one Delta.\n\titems map[string]Deltas     \n\tqueue []string\n\n   //  populated和initialPopulationCount 是用来判断 process是否同步的\n\t// populated is true if the first batch of items inserted by Replace() has been populated\n\t// or Delete/Add/Update was called first.\n\tpopulated bool      //队列的元素开始被消费\n\t// initialPopulationCount is the number of items inserted by the first call of Replace()\n\tinitialPopulationCount int \n\n\t// keyFunc is used to make the key used for queued item\n\t// insertion and retrieval, and should be deterministic.\n\tkeyFunc KeyFunc\n\n\t// knownObjects list keys that are \"known\", for the\n\t// purpose of figuring out which items have been deleted\n\t// when Replace() or Delete() is called.\n\tknownObjects KeyListerGetter\n\n\t// Indication the queue is closed.\n\t// Used to indicate a queue is closed so a control loop can exit when a queue is empty.\n\t// Currently, not used to gate any of CRED operations.\n\tclosed     bool\n\tclosedLock sync.Mutex\n}\n```\n\nDeltaFIFO最关键的是， items, queue, 和knownObjects。\n\nitems: 对象的变化过程列表\n\nQueue: 表示对象的key。\n\nknownObjects：从下面的初始化可以看出来，就是 cache.indexer\n\n```\nfifo := NewDeltaFIFO(MetaNamespaceKeyFunc, s.indexer)\n\nfunc NewDeltaFIFO(keyFunc KeyFunc, knownObjects KeyListerGetter) *DeltaFIFO 
{\n\tf := &DeltaFIFO{\n\t\titems:        map[string]Deltas{},\n\t\tqueue:        []string{},\n\t\tkeyFunc:      keyFunc,\n\t\tknownObjects: knownObjects,\n\t}\n\tf.cond.L = &f.lock\n\treturn f\n}\n```\n\n##### 3.1.3 举例说明deltaFifo核心结构\n\n假设监听了 default命名空间的所有Pod。最开始该命名空间没有Pod，监听了一会后，创建了三个Pod, 分别为:\n\n```\npod1 := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: \"one\", Annotations: map[string]string{\"users\": \"ernie,bert\"}}}\npod2 := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: \"two\", Annotations: map[string]string{\"users\": \"bert,oscar\"}}}\npod3 := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: \"tre\", Annotations: map[string]string{\"users\": \"ernie,elmo\"}}}\n```\n\n那么watch函数依次会产生如下的事件：\n\npod1-1：表示pod1对应的第一个阶段 （pending状态）\n\npod1-2：表示pod1对应的第二个阶段 （scheduled状态）\n\npod1-3：表示pod1对应的第三个阶段 （running状态）\n\n```\nADD: pod1-1(省略写法，其实是整个pod的元数据，{ObjectMeta: metav1.ObjectMeta{Name: \"one\", Annotations: map[string]string{\"users\": \"ernie,bert\"}}})\n\nADD： pod2-1\n\nMODIFIED: pod1-2\nADD： pod3-1\nMODIFIED: pod2-2\nMODIFIED: pod3-2\nMODIFIED: pod1-3\nMODIFIED: pod3-3\nMODIFIED: pod2-3\n```\n\n这个时候 deltaFIFO结构体的对象为：\n\n```\ndeltaFIFO {\n    queue: [\"one\", \"two\", \"tre\"],\n    items: {\n        \"one\": [{\"add\", pod1-1}, {\"update\", pod1-2}, {\"update\", pod1-3}],\n        \"two\": [{\"add\", pod2-1}, {\"update\", pod2-2}, {\"update\", pod2-3}],\n        \"tre\": [{\"add\", pod3-1}, {\"update\", pod3-2}, {\"update\", pod3-3}],\n    }\n}\n```\n\n这样的好处就是：\n\n（1）每次是以一个对象为单位进行发送，比如这里一次就将 \"one\": [{\"add\", pod1-1}, {\"update\", pod1-2}, {\"update\", pod1-3}] 三个事件发送给了 handler方\n\n（2）indexer可以知道当前对象的最终状态。比如 \"one\": [{\"add\", pod1-1}, {\"update\", pod1-2}, {\"update\", pod1-3}] 这个，能跳过pod1-1, pod1-2状态，直接将pod1-3状态更新到缓存中去。\n\n<br>\n\n#### 3.2 sharedIndexInformer生产数据\n\n都知道数据产生方来自 apiserver的listAndWatch。接下来看看是如何使用的。这里直接从 controller.run入手。\n\n##### 3.2.1 
controller结构\n\ncontroller结构本身非常简单，主要就是一个config，然后根据config实现的一些生产数据相关的函数\n\n```\n// New makes a new Controller from the given Config.\nfunc New(c *Config) Controller {\n\tctlr := &controller{\n\t\tconfig: *c,\n\t\tclock:  &clock.RealClock{},\n\t}\n\treturn ctlr\n}\n\n// Config contains all the settings for a Controller.\ntype Config struct {\n\t// The queue for your objects - has to be a DeltaFIFO due to\n\t// assumptions in the implementation. Your Process() function\n\t// should accept the output of this Queue's Pop() method.\n\t// 弄一个数据缓存\n\tQueue\n\n\t// 从aipserver接收数据\n\tListerWatcher\n\n\t// Something that can process your objects.\n\t// 如何处理接收到的数据\n\tProcess ProcessFunc\n\n\t// The type of your objects.\n\t// 数据是什么类型，Pod? deploy?\n\tObjectType runtime.Object\n\n  \n\tFullResyncPeriod time.Duration\n\n  // 是否需要同步\n\tShouldResync ShouldResyncFunc\n\n  //是否错误重试\n\tRetryOnError bool\n}\n```\n\n##### 3.2.2 controller.run\n\n1. 实例化 NewReflector\n2. 通过List-watch获得生产数据\n3. 处理生产数据，不断执行processLoop，这个方法其实就是从DeltaFIFO pop出对象，再调用reflector的Process（其实是shareIndexInformer的HandleDeltas方法）处理\n\n```\nfunc (c *controller) Run(stopCh <-chan struct{}) {\n\tdefer utilruntime.HandleCrash()\n\tgo func() {\n\t\t<-stopCh\n\t\tc.config.Queue.Close()\n\t}()\n\t\n\t// 1.实例化 NewReflector\n\tr := NewReflector(\n\t\tc.config.ListerWatcher,\n\t\tc.config.ObjectType,\n\t\tc.config.Queue,    \n\t\tc.config.FullResyncPeriod,\n\t)\n\tr.ShouldResync = c.config.ShouldResync\n\tr.clock = c.clock\n\n\tc.reflectorMutex.Lock()\n\tc.reflector = r\n\tc.reflectorMutex.Unlock()\n\n\tvar wg wait.Group\n\tdefer wg.Wait()\n  \n  // 2. 通过List-watch获得生产数据\n\twg.StartWithChannel(stopCh, r.Run)\n  // 3. 
处理生产数据\n  // 不断执行processLoop，这个方法其实就是从DeltaFIFO pop出对象，再调用reflector的Process（其实是shareIndexInformer的HandleDeltas方法）处理\n\twait.Until(c.processLoop, time.Second, stopCh)\n}\n```\n\n##### 3.2.3 Reflector实例\n\nReflector核心结构，可以看出来基本都是从config基础下来的。\n\n```\n// Reflector watches a specified resource and causes all changes to be reflected in the given store.\ntype Reflector struct {\n\t// name identifies this reflector. By default it will be a file:line if possible.\n\tname string\n\n\t// The name of the type we expect to place in the store. The name\n\t// will be the stringification of expectedGVK if provided, and the\n\t// stringification of expectedType otherwise. It is for display\n\t// only, and should not be used for parsing or comparison.\n\texpectedTypeName string\n\t// The type of object we expect to place in the store.\n\texpectedType reflect.Type\n\t// The GVK of the object we expect to place in the store if unstructured.\n\texpectedGVK *schema.GroupVersionKind\n\t// The destination to sync up with the watch source\n\tstore Store          //获得数据存放哪里，就是deltaFIFO队列\n\t// listerWatcher is used to perform lists and watches.\n\tlisterWatcher ListerWatcher\n\t// period controls timing between one watch ending and\n\t// the beginning of the next one.\n\tperiod       time.Duration\n\tresyncPeriod time.Duration\n\tShouldResync func() bool\n\t// clock allows tests to manipulate time\n\tclock clock.Clock\n\t// lastSyncResourceVersion is the resource version token last\n\t// observed when doing a sync with the underlying store\n\t// it is thread safe, but not synchronized with the underlying store\n\tlastSyncResourceVersion string         \n\t// lastSyncResourceVersionMutex guards read/write access to lastSyncResourceVersion\n\tlastSyncResourceVersionMutex sync.RWMutex\n\t// WatchListPageSize is the requested chunk size of initial and resync watch lists.\n\t// Defaults to pager.PageSize.\n\tWatchListPageSize int64\n}\n```\n\n<br>\n\n##### 3.2.4 
Reflector.run\n\n就是上面的 r.Run。就做一件事：运行listAndWatch函数。\n\n注意：ListAndWatch退出后，隔1s就会再运行一次哟。\n\n所以relist并不是listAndWatch干的。ListAndWatch只是进行一轮list 和 watch(正常情况会一直保持watch)\n\n当ListAndWatch因为异常/错误或者其他原因退出了，Reflector会自动再次执行listAndWatch\n\n```\n// Run starts a watch and handles watch events. Will restart the watch if it is closed.\n// Run will exit when stopCh is closed.\nfunc (r *Reflector) Run(stopCh <-chan struct{}) {\n\tklog.V(3).Infof(\"Starting reflector %v (%s) from %s\", r.expectedTypeName, r.resyncPeriod, r.name)\n\twait.Until(func() {\n\t\tif err := r.ListAndWatch(stopCh); err != nil {\n\t\t\tutilruntime.HandleError(err)\n\t\t}\n\t}, r.period, stopCh)\n}\n\n\nNewReflector定义了period是1s\n// NewReflector creates a new Reflector object which will keep the given store up to\n// date with the server's contents for the given resource. Reflector promises to\n// only put things in the store that have the type of expectedType, unless expectedType\n// is nil. If resyncPeriod is non-zero, then lists will be executed after every\n// resyncPeriod, so that you can use reflectors to periodically process everything as\n// well as incrementally processing the things that change.\nfunc NewReflector(lw ListerWatcher, expectedType interface{}, store Store, resyncPeriod time.Duration) *Reflector {\n\treturn NewNamedReflector(naming.GetNameFromCallsite(internalPackages...), lw, expectedType, store, resyncPeriod)\n}\n\n\n// NewNamedReflector same as NewReflector, but with a specified name for logging\nfunc NewNamedReflector(name string, lw ListerWatcher, expectedType interface{}, store Store, resyncPeriod time.Duration) *Reflector {\n\tr := &Reflector{\n\t\tname:          name,\n\t\tlisterWatcher: lw,\n\t\tstore:         store,\n\t\tperiod:        time.Second,    // period是1s\n\t\tresyncPeriod:  resyncPeriod,\n\t\tclock:         &clock.RealClock{},\n\t}\n\tr.setExpectedType(expectedType)\n\treturn r\n}\n```\n\n##### 3.2.5 ListAndWatch\n\n###### 知识补充\n\nlistAndWatch核心思路就是：将apiserver 
list/watch到的数据发送到deltaFIFO队列中去。\n\n在看代码之前，先通过curl kube-apiserver来看看 list-watch的特性。\n\n（1）podList可以认为是一个新的对象，它也是有资源版本的说法\n\n（2）list默认是用来chunk(分段传输)的，chunk的介绍和好处  https://zh.wikipedia.org/wiki/%E5%88%86%E5%9D%97%E4%BC%A0%E8%BE%93%E7%BC%96%E7%A0%81\n\n（3）v1.19 及以上版本的 API 服务器支持 `resourceVersionMatch` 参数，用以确定如何对 LIST 调用应用 resourceVersion 值。 强烈建议在为 LIST 调用设置了 `resourceVersion` 时也设置 `resourceVersionMatch`。 如果 `resourceVersion` 未设置，则 `resourceVersionMatch` 是不允许设置的。 为了向后兼容，客户端必须能够容忍服务器在某些场景下忽略 `resourceVersionMatch` 的行为：\n\n- 当设置 `resourceVersionMatch=NotOlderThan` 且指定了 `limit` 时，客户端必须能够 处理 HTTP 410 \"Gone\" 响应。例如，客户端可以使用更新一点的 `resourceVersion` 来重试，或者回退到 `resourceVersion=\"\"` （即允许返回任何版本）。\n- 当设置了 `resourceVersionMatch=Exact` 且未指定 `limit` 时，客户端必须验证 响应数据中 `ListMeta` 的 `resourceVersion` 与所请求的 `resourceVersion` 匹配， 并处理二者可能不匹配的情况。例如，客户端可以重试设置了 `limit` 的请求。\n\n除非你对一致性有着非常强烈的需求，使用 `resourceVersionMatch=NotOlderThan` 同时为 `resourceVersion` 设定一个已知值是优选的交互方式，因为与不设置 `resourceVersion` 和 `resourceVersionMatch` 相比，这种配置可以取得更好的 集群性能和可扩缩性。后者需要提供带票选能力的读操作。\n\n参考：https://kubernetes.io/zh/docs/reference/using-api/api-concepts/\n\n| resourceVersionMatch 参数             | 分页参数                    | resourceVersion 未设置  | resourceVersion=\"0\"                   | resourceVersion=\"<非零值>\"            |\n| ------------------------------------- | --------------------------- | ----------------------- | ------------------------------------- | ------------------------------------- |\n| resourceVersionMatch 未设置           | limit 未设置                | 最新版本                | 任意版本                              | 不老于指定版本                        |\n| resourceVersionMatch 未设置           | limit=<n>, continue 未设置  | 最新版本                | 任意版本                              | 精确匹配                              |\n| resourceVersionMatch 未设置           | limit=<n>, continue=<token> | 从 token 开始、精确匹配 | 非法请求，视为从 token 开始、精确匹配 | 非法请求，返回 HTTP `400 Bad Request` |\n| resourceVersionMatch=Exact [1]        | limit 未设置                
| 非法请求                | 非法请求                              | 精确匹配                              |\n| resourceVersionMatch=Exact [1]        | limit=<n>, continue 未设置  | 非法请求                | 非法请求                              | 精确匹配                              |\n| resourceVersionMatch=NotOlderThan [1] | limit 未设置                | 非法请求                | 任意版本                              | 不老于指定版本                        |\n| resourceVersionMatch=NotOlderThan [1] | limit=<n>, continue 未设置  | 非法请求                | 任意版本                              | 不老于指定版本                        |\n\n```\n// curl http://7.34.19.44:58201/api/v1/namespaces/test-test/pods -i\nHTTP/1.1 200 OK\nAudit-Id: 4ff9e833-e3e0-4001-9e1a-d83c9a9b1937\nCache-Control: no-cache, private\nContent-Type: application/json\nDate: Sat, 20 Nov 2021 02:10:48 GMT\nTransfer-Encoding: chunked\n\n{\n  \"kind\": \"PodList\",\n  \"apiVersion\": \"v1\",\n  \"metadata\": {\n    \"selfLink\": \"/api/v1/namespaces/test-test/pods\",\n    \"resourceVersion\": \"163916927\"\n  },\n  \"items\": [\n        \n\nroot@cld-kmaster1-1051:/home/zouxiang# curl http://7.34.19.44:58201/api/v1/namespaces/test-test/pods?limit=1 -i\nHTTP/1.1 200 OK\nAudit-Id: 17d0d42f-a122-4c5a-9659-70224a22522a\nCache-Control: no-cache, private\nContent-Type: application/json\nDate: Sat, 20 Nov 2021 02:09:32 GMT\nTransfer-Encoding: chunked   //chunked传输\n\n{\n  \"kind\": \"PodList\",\n  \"apiVersion\": \"v1\",\n  \"metadata\": {\n    \"selfLink\": \"/api/v1/namespaces/test-test/pods\",\n    \"resourceVersion\": \"163915936\",\n    // 注意这continue\n    \"continue\":      \"eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTYzOTE1OTM2LCJzdGFydCI6ImFwcC1pc3Rpb3ZlcnNpb24tdGVzdC01NDZkZmZmNTYtNnQ2MnBcdTAwMDAifQ\",\n    \"remainingItemCount\": 23    //表示当前还有23个没有展示处理\n  },\n  \"items\": [\n    {\n      \"metadata\": {\n        \"name\": \"app-istioversion-test-546dfff56-6t62p\",\n        \"generateName\": \"app-istioversion-test-546dfff56-\",\n\n\n// 
加上这个continue参数，会把剩下的23个展示出来。\ncurl http://7.34.19.44:58201/api/v1/namespaces/test-test/pods?continue=eyJ2IjoibWV0YS5rOHMuaW8vdjEiLCJydiI6MTYzOTE1OTM2LCJzdGFydCI6ImFwcC1pc3Rpb3ZlcnNpb24tdGVzdC01NDZkZmZmNTYtNnQ2MnBcdTAwMDAifQ\n```\n\n<br>\n\nwatch很简单，就是一个长链接，chunked\n\n```\nroot@cld-kmaster1-1051:/home/zouxiang# curl http://7.34.19.44:58201/api/v1/namespaces/default/pods?watch=true -i\nHTTP/1.1 200 OK\nCache-Control: no-cache, private\nContent-Type: application/json\nDate: Sat, 20 Nov 2021 01:32:06 GMT\nTransfer-Encoding: chunked\n```\n\n<br>\n\n###### 源码分析\n\n```\n// ListAndWatch first lists all items and get the resource version at the moment of call,\n// and then use the resource version to watch.\n// It returns error if ListAndWatch didn't even try to initialize watch.\nfunc (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error {\n\tklog.V(3).Infof(\"Listing and watching %v from %s\", r.expectedTypeName, r.name)\n\tvar resourceVersion string\n\n\t// Explicitly set \"0\" as resource version - it's fine for the List()\n\t// to be served from cache and potentially be delayed relative to\n\t// etcd contents. 
Reflector framework will catch up via Watch() eventually.\n\t// 以版本号ResourceVersion=0开始首次list\n\toptions := metav1.ListOptions{ResourceVersion: \"0\"}\n\n\tif err := func() error {\n\t\tinitTrace := trace.New(\"Reflector ListAndWatch\", trace.Field{\"name\", r.name})\n\t\tdefer initTrace.LogIfLong(10 * time.Second)\n\t\tvar list runtime.Object\n\t\tvar err error\n\t\tlistCh := make(chan struct{}, 1)\n\t\tpanicCh := make(chan interface{}, 1)\n\t\tgo func() {\n\t\t\tdefer func() {\n\t\t\t\tif r := recover(); r != nil {\n\t\t\t\t\tpanicCh <- r\n\t\t\t\t}\n\t\t\t}()\n\t\t\t// Attempt to gather list in chunks, if supported by listerWatcher, if not, the first\n\t\t\t// list request will return the full response.\n\t\t\t// 开始list数据，分页\n\t\t\tpager := pager.New(pager.SimplePageFunc(func(opts metav1.ListOptions) (runtime.Object, error) {\n\t\t\t\treturn r.listerWatcher.List(opts)\n\t\t\t}))\n\t\t\tif r.WatchListPageSize != 0 {\n\t\t\t\tpager.PageSize = r.WatchListPageSize\n\t\t\t}\n\t\t\t// Pager falls back to full list if paginated list calls fail due to an \"Expired\" error.\n\t\t\t// 获取list的数据\n\t\t\tlist, err = pager.List(context.Background(), options)\n\t\t\tclose(listCh)\n\t\t}()\n\t\tselect {\n\t\tcase <-stopCh:\n\t\t\treturn nil\n\t\tcase r := <-panicCh:\n\t\t\tpanic(r)\n\t\tcase <-listCh:\n\t\t}\n\t\tif err != nil {\n\t\t\treturn fmt.Errorf(\"%s: Failed to list %v: %v\", r.name, r.expectedTypeName, err)\n\t\t}\n\t\tinitTrace.Step(\"Objects listed\")\n\t\tlistMetaInterface, err := meta.ListAccessor(list)\n\t\tif err != nil {\n\t\t\treturn fmt.Errorf(\"%s: Unable to understand list result %#v: %v\", r.name, list, err)\n\t\t}\n\t\tresourceVersion = listMetaInterface.GetResourceVersion()\n\t\tinitTrace.Step(\"Resource version extracted\")\n\t\t// 提取list\n\t\titems, err := meta.ExtractList(list)\n\t\tif err != nil {\n\t\t\treturn fmt.Errorf(\"%s: Unable to understand list result %#v (%v)\", r.name, list, err)\n\t\t}\n\t\tinitTrace.Step(\"Objects extracted\")\n\t\t// 
提取list的数据\n\t\tif err := r.syncWith(items, resourceVersion); err != nil {\n\t\t\treturn fmt.Errorf(\"%s: Unable to sync list result: %v\", r.name, err)\n\t\t}\n\t\tinitTrace.Step(\"SyncWith done\")\n\t\t// 设置下一次list的resourceVersion\n\t\tr.setLastSyncResourceVersion(resourceVersion)\n\t\tinitTrace.Step(\"Resource version updated\")\n\t\treturn nil\n\t}(); err != nil {\n\t\treturn err\n\t}\n\n\tresyncerrc := make(chan error, 1)\n\tcancelCh := make(chan struct{})\n\tdefer close(cancelCh)\n\tgo func() {\n\t\tresyncCh, cleanup := r.resyncChan()\n\t\tdefer func() {\n\t\t\tcleanup() // Call the last one written into cleanup\n\t\t}()\n\t\tfor {\n\t\t\tselect {\n\t\t\tcase <-resyncCh:\n\t\t\tcase <-stopCh:\n\t\t\t\treturn\n\t\t\tcase <-cancelCh:\n\t\t\t\treturn\n\t\t\t}\n\t\t\tif r.ShouldResync == nil || r.ShouldResync() {\n\t\t\t\tklog.V(4).Infof(\"%s: forcing resync\", r.name)\n\t\t\t\t// 进行deltaFIFo的同步\n\t\t\t\tif err := r.store.Resync(); err != nil {\n\t\t\t\t\tresyncerrc <- err\n\t\t\t\t\treturn\n\t\t\t\t}\n\t\t\t}\n\t\t\tcleanup()\n\t\t\tresyncCh, cleanup = r.resyncChan()\n\t\t}\n\t}()\n\n\tfor {\n\t\t// give the stopCh a chance to stop the loop, even in case of continue statements further down on errors\n\t\tselect {\n\t\tcase <-stopCh:\n\t\t\treturn nil\n\t\tdefault:\n\t\t}\n\n\t\ttimeoutSeconds := int64(minWatchTimeout.Seconds() * (rand.Float64() + 1.0))\n\t\toptions = metav1.ListOptions{\n\t\t\tResourceVersion: resourceVersion,\n\t\t\t// We want to avoid situations of hanging watchers. 
Stop any wachers that do not\n\t\t\t// receive any events within the timeout window.\n\t\t\tTimeoutSeconds: &timeoutSeconds,\n\t\t\t// To reduce load on kube-apiserver on watch restarts, you may enable watch bookmarks.\n\t\t\t// Reflector doesn't assume bookmarks are returned at all (if the server do not support\n\t\t\t// watch bookmarks, it will ignore this field).\n\t\t\tAllowWatchBookmarks: true,\n\t\t}\n   // 开始Watch\n\t\tw, err := r.listerWatcher.Watch(options)\n\t\tif err != nil {\n\t\t\tswitch err {\n\t\t\tcase io.EOF:\n\t\t\t\t// watch closed normally\n\t\t\tcase io.ErrUnexpectedEOF:\n\t\t\t\tklog.V(1).Infof(\"%s: Watch for %v closed with unexpected EOF: %v\", r.name, r.expectedTypeName, err)\n\t\t\tdefault:\n\t\t\t\tutilruntime.HandleError(fmt.Errorf(\"%s: Failed to watch %v: %v\", r.name, r.expectedTypeName, err))\n\t\t\t}\n\t\t\t// If this is \"connection refused\" error, it means that most likely apiserver is not responsive.\n\t\t\t// It doesn't make sense to re-list all objects because most likely we will be able to restart\n\t\t\t// watch where we ended.\n\t\t\t// If that's the case wait and resend watch request.\n\t\t\tif utilnet.IsConnectionRefused(err) {\n\t\t\t\ttime.Sleep(time.Second)\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\treturn nil\n\t\t}\n    \n    // 处理watch的事件\n\t\tif err := r.watchHandler(w, &resourceVersion, resyncerrc, stopCh); err != nil {\n\t\t\tif err != errorStopRequested {\n\t\t\t\tswitch {\n\t\t\t\tcase apierrs.IsResourceExpired(err):\n\t\t\t\t\tklog.V(4).Infof(\"%s: watch of %v ended with: %v\", r.name, r.expectedTypeName, err)\n\t\t\t\tdefault:\n\t\t\t\t\tklog.Warningf(\"%s: watch of %v ended with: %v\", r.name, r.expectedTypeName, err)\n\t\t\t\t}\n\t\t\t}\n\t\t\treturn nil\n\t\t}\n\t}\n}\n```\n\n结合知识补充大概的流程很清楚。回答以下几个问题\n\n（1）list操作为什么需要resourceVersion?\n\nA: list机制本来就有resourceVersion，resourceVersion不同的值有不同的含义。每次list的时候记录了resourceVersion，可以保证数据最少是上一次list后的（实际基本都是最新的）\n\n（2）为什么list会分页？\n\n如果设置了limit就会分页\n\n（3）如何提取list的数据\n\n先是通过 items, err 
:= meta.ExtractList(list) ，将list数据保存到items数组中\n\n然后通过syncWith将list数据保存到 deltaFIFO 队列中去\n\nsyncWith的逻辑如下：\n\n（1）遍历所有list到的数据，通过 queueActionLocked(Sync, item) 将所有的数据，以(Sync, item)的方式追加到 deltaFIFO 的items里面\n\n（2）遍历knownObjects（即indexer）中的数据，判断是否存在 indexer 有、但是最新list没有的数据。如果存在这种情况，说明漏掉了delete事件，所以封装一个(Deleted, DeletedFinalStateUnknown) 追加到 deltaFIFO 的items里面。\n\n为什么是DeletedFinalStateUnknown呢？\n\n因为Replace方法可能是reflector发生re-list的时候再次调用，这个时候就会出现knownObjects中存在的对象不在Replace list的情况（比如watch的delete事件丢失了），这个时候是把这些对象筛选出来，封装成DeletedFinalStateUnknown对象以Deleted类型再次加入到deltaFIFO中，这样最终从deltaFIFO处理这个DeletedFinalStateUnknown增量时就可以更新本地缓存并且触发reconcile。因为这个对象最终的状态确实找不到了，所以只能用knownObjects里面的记录来封装delta，所以叫做FinalStateUnknown。\n\n```\n// syncWith replaces the store's items with the given list.\nfunc (r *Reflector) syncWith(items []runtime.Object, resourceVersion string) error {\n\tfound := make([]interface{}, 0, len(items))\n\tfor _, item := range items {\n\t\tfound = append(found, item)\n\t}\n\treturn r.store.Replace(found, resourceVersion)\n}\n\n\n// Replace will delete the contents of 'f', using instead the given map.\n// 'f' takes ownership of the map, you should not reference the map again\n// after calling this function. 
f's queue is reset, too; upon return, it\n// will contain the items in the map, in no particular order.\nfunc (f *DeltaFIFO) Replace(list []interface{}, resourceVersion string) error {\n\tf.lock.Lock()\n\tdefer f.lock.Unlock()\n\tkeys := make(sets.String, len(list))\n  \n  // 第一次遍历list到的数据\n\tfor _, item := range list {\n\t\tkey, err := f.KeyOf(item)\n\t\tif err != nil {\n\t\t\treturn KeyError{item, err}\n\t\t}\n\t\tkeys.Insert(key)\n\t\t// 2.将数据同步到fifo队列中去。这个就是往fifo的items加入元素。可以看出来，list到的都是同步的数据\n\t\t// items的delta有四种类型：add, update, del, sync （这里都是sync）\n\t\tif err := f.queueActionLocked(Sync, item); err != nil {\n\t\t\treturn fmt.Errorf(\"couldn't enqueue object: %v\", err)\n\t\t}\n\t}\n\n  // sharedIndexInformer场景下 f.knownObjects 就是 indexer，不为 nil，不会走这个分支\n\tif f.knownObjects == nil {\n\t\t// Do deletion detection against our own list.\n\t}\n\n\t// Detect deletions not already in the queue.\n\tknownKeys := f.knownObjects.ListKeys()\n\tqueuedDeletions := 0\n\t\n\t// 第二次遍历knownObjects（indexer）中的数据\n\tfor _, k := range knownKeys {\n\t  // 如果indexer中的数据，list也有，那就不用管，因为上面的for循环已经处理了\n\t\tif keys.Has(k) {\n\t\t\tcontinue\n\t\t}\n    \n    // 如果indexer中的数据，list没有，那就是该数据已经删除了，但是由于某些原因没有收到delete事件，所以要删除这个对象\n\t\tdeletedObj, exists, err := f.knownObjects.GetByKey(k)\n\t\tif err != nil {\n\t\t\tdeletedObj = nil\n\t\t\tklog.Errorf(\"Unexpected error %v during lookup of key %v, placing DeleteFinalStateUnknown marker without object\", err, k)\n\t\t} else if !exists {\n\t\t\tdeletedObj = nil\n\t\t\tklog.Infof(\"Key %v does not exist in known objects store, placing DeleteFinalStateUnknown marker without object\", k)\n\t\t}\n\t\tqueuedDeletions++\n\t\t// 发送的是Deleted的delta，注意这里封装的是DeletedFinalStateUnknown（原因见上文）\n\t\tif err := 
f.queueActionLocked(Deleted, DeletedFinalStateUnknown{k, deletedObj}); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\n\tif !f.populated {\n\t\tf.populated = true\n\t\tf.initialPopulationCount = len(list) + queuedDeletions\n\t}\n\n\treturn nil\n}\n```\n\n##### 3.2.6 c.processLoop\n\nlist, watch将从apiserver获取的数据最终都保存到了 deltaFIFO 队列中。\n\nprocessLoop对这些数据进行分发处理：将元素一个个取出来，交给process函数处理。\n\n```\nfunc (c *controller) processLoop() {\n\tfor {\n\t  // for循环的方式将fifo队列中的元素发送到 PopProcessFunc函数中去\n\t  // 在 New config 的时候指定了 Process = s.HandleDeltas\n\t\tobj, err := c.config.Queue.Pop(PopProcessFunc(c.config.Process))\n\t\tif err != nil {\n\t\t\tif err == ErrFIFOClosed {\n\t\t\t\treturn\n\t\t\t}\n\t\t\tif c.config.RetryOnError {\n\t\t\t\t// This is the safe way to re-enqueue.\n\t\t\t\tc.config.Queue.AddIfNotPresent(obj)\n\t\t\t}\n\t\t}\n\t}\n}\n\n\nfunc (f *DeltaFIFO) Pop(process PopProcessFunc) (interface{}, error) {\n\tf.lock.Lock()\n\tdefer f.lock.Unlock()\n\tfor {\n\t  // 1.队列为空，判断是否关闭，如果没有关闭就等待，否则返回\n\t\tfor len(f.queue) == 0 {\n\t\t\t// When the queue is empty, invocation of Pop() is blocked until new item is enqueued.\n\t\t\t// When Close() is called, the f.closed is set and the condition is broadcasted.\n\t\t\t// Which causes this loop to continue and return from the Pop().\n\t\t\tif f.IsClosed() {\n\t\t\t\treturn nil, ErrFIFOClosed\n\t\t\t}\n\n\t\t\tf.cond.Wait()\n\t\t}\n\t\t\n\t\t// 2.取出来第一个元素， 注意是 queue里面的一个元素，对应的是Items里面的一个 map key-value对\n\t\tid := f.queue[0]\n\t\tf.queue = f.queue[1:]\n\t\tif f.initialPopulationCount > 0 {\n\t\t\tf.initialPopulationCount--\n\t\t}\n\t\titem, ok := f.items[id]\n\t\tif !ok {\n\t\t\t// Item may have been deleted subsequently.\n\t\t\tcontinue\n\t\t}\n\t\tdelete(f.items, id)\n\t\t\n\t\t// 3.调用process进行处理\n\t\terr := process(item)    \n\t\tif e, ok := err.(ErrRequeue); ok {\n\t\t\tf.addIfNotPresent(id, item)\n\t\t\terr = e.Err\n\t\t}\n\t\t// Don't need to copyDeltas here, because we're transferring\n\t\t// ownership to the 
caller.\n\t\treturn item, err\n\t}\n}\n```\n\n###### HandleDeltas函数\n\n终于出现了HandleDeltas, 如图中HandleDeltas功能所示：\n\nHandleDeltas就是干两件事情：\n\n（1）更新Indexer （这里很奇怪，没有一次性更新Indexer到位，就是如果Deltas最后一个是del事件，还是会先update后再删除）\n\n（2）将事件进行distribute发送\n\n```\nfunc (s *sharedIndexInformer) HandleDeltas(obj interface{}) error {\n\ts.blockDeltas.Lock()\n\tdefer s.blockDeltas.Unlock()\n\n\t// from oldest to newest\n\tfor _, d := range obj.(Deltas) {\n\t\tswitch d.Type {\n\t\t// 同步就是relist的时候，fifo replace函数发出来的事件\n\t\tcase Sync, Added, Updated:\n\t\t\tisSync := d.Type == Sync\n\t\t\ts.cacheMutationDetector.AddObject(d.Object)\n\t\t\tif old, exists, err := s.indexer.Get(d.Object); err == nil && exists {\n\t\t\t\tif err := s.indexer.Update(d.Object); err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\t\t\t\ts.processor.distribute(updateNotification{oldObj: old, newObj: d.Object}, isSync)\n\t\t\t} else {\n\t\t\t\tif err := s.indexer.Add(d.Object); err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\t\t\t\ts.processor.distribute(addNotification{newObj: d.Object}, isSync)\n\t\t\t}\n\t\tcase Deleted:\n\t\t\tif err := s.indexer.Delete(d.Object); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\ts.processor.distribute(deleteNotification{oldObj: d.Object}, false)\n\t\t}\n\t}\n\treturn nil\n}\n```\n\n<br>\n\ndistribute就很简单，将事件进行发送，这里有一个很简单的逻辑：\n\n就是注册resourceHandler的时候，可以指定是否需要同步。比如我New一个informer，然后指定不同步。\n\n这个时候我对应的resourceHandler就不是syncingListeners.\n\n###### 理解listeners和syncingListeners的区别\n\nprocessor可以支持listener的维度配置是否需要resync：一个informer可以配置多个EventHandler，而一个EventHandler对应processor中的一个listener，每个listener可以配置需不需要resync，如果某个listener需要resync，那么添加到deltaFIFO的Sync增量最终也只会回到对应的listener\n\nreflector中会定时判断每一个listener是否需要进行resync，判断的依据是看配置EventHandler的时候指定的resyncPeriod，0代表该listener不需要resync，否则就每隔resyncPeriod看看是否到时间了\n\n- listeners：记录了informer添加的所有listener\n- 
syncingListeners：记录了informer中哪些listener处于sync状态\n\nsyncingListeners是listeners的子集，syncingListeners记录那些开启了resync且时间已经到达了的listener，把它们放在一个独立的slice是避免下面分析的distribute方法中把obj增加到了还不需要resync的listener中\n\n```\nfunc (p *sharedProcessor) distribute(obj interface{}, sync bool) {\n   p.listenersLock.RLock()\n   defer p.listenersLock.RUnlock()\n\n   if sync {\n      for _, listener := range p.syncingListeners {\n         listener.add(obj)\n      }\n   } else {\n      for _, listener := range p.listeners {\n         listener.add(obj)\n      }\n   }\n}\n\n// add 就是往 addCh 这个chan发送数据\n// 虽然p.addCh是一个无缓冲的channel，但是因为listener中存在ring buffer，所以这里并不会一直阻塞\nfunc (p *processorListener) add(notification interface{}) {\n\tp.addCh <- notification\n}\n```\n\n#### 3.3 s.processor.run消费数据\n\nsharedIndexInformer.Run指定了controller.run进行数据生产：就是将List, watch到的数据，以delta的方式保存到了deltaFIFO中\n\n然后HandleDeltas 通过 distribute 函数将 delta事件发送到每一个 listener中去。\n\n接下来分析s.processor.run是如何消费数据的。\n\ns.processor.run的逻辑很清楚：启动每一个listener的run和pop协程。\n\n```\nfunc (p *sharedProcessor) run(stopCh <-chan struct{}) {\n\tfunc() {\n\t\tp.listenersLock.RLock()\n\t\tdefer p.listenersLock.RUnlock()\n\t\tfor _, listener := range p.listeners {\n\t\t\tp.wg.Start(listener.run)\n\t\t\tp.wg.Start(listener.pop)\n\t\t}\n\t\tp.listenersStarted = true\n\t}()\n\t<-stopCh\n\tp.listenersLock.RLock()\n\tdefer p.listenersLock.RUnlock()\n\tfor _, listener := range p.listeners {\n\t\tclose(listener.addCh) // Tell .pop() to stop. 
.pop() will tell .run() to stop\n\t}\n\tp.wg.Wait() // Wait for all .pop() and .run() to stop\n}\n```\n\n###### processorListener结构\n\n```\ntype processorListener struct {\n\tnextCh chan interface{}          // 发送给handler的对象\n\taddCh  chan interface{}          // distribute发送下来的对象\n\n\thandler ResourceEventHandler     //定义informer时候的 add, update, del函数\n\n\t// pendingNotifications is an unbounded ring buffer that holds all notifications not yet distributed.\n\t// There is one per listener, but a failing/stalled listener will have infinite pendingNotifications\n\t// added until we OOM.\n\t// TODO: This is no worse than before, since reflectors were backed by unbounded DeltaFIFOs, but\n\t// we should try to do something better.\n\tpendingNotifications buffer.RingGrowing    // 缓存器，避免distribute发送的太快或者 hanlder处理的太慢\n\n\t// requestedResyncPeriod is how frequently the listener wants a full resync from the shared informer\n\trequestedResyncPeriod time.Duration        // 同步周期\n\t// resyncPeriod is how frequently the listener wants a full resync from the shared informer. 
This\n\t// value may differ from requestedResyncPeriod if the shared informer adjusts it to align with the\n\t// informer's overall resync check period.\n\tresyncPeriod time.Duration           \n\t// nextResync is the earliest time the listener should get a full resync\n\tnextResync time.Time\n\t// resyncLock guards access to resyncPeriod and nextResync\n\tresyncLock sync.Mutex\n}\n```\n\n###### pop and run\n\npop就是将addCh 的对象发送到 nextCh。如果nextCh发送不出去（阻塞）的话，就先保存在pendingNotifications中\n\nrun就是将nextCh的对象发送到 handler中去处理。\n\n```\nfunc (p *processorListener) pop() {\n   defer utilruntime.HandleCrash()\n   defer close(p.nextCh) // Tell .run() to stop\n\n   var nextCh chan<- interface{}\n   var notification interface{}\n   for {\n      select {\n      case nextCh <- notification:\n         // Notification dispatched\n         var ok bool\n         notification, ok = p.pendingNotifications.ReadOne()\n         if !ok { // Nothing to pop\n            nextCh = nil // Disable this select case\n         }\n      case notificationToAdd, ok := <-p.addCh:\n         if !ok {\n            return\n         }\n         if notification == nil { // No notification to pop (and pendingNotifications is empty)\n            // Optimize the case - skip adding to pendingNotifications\n            notification = notificationToAdd\n            nextCh = p.nextCh\n         } else { // There is already a notification waiting to be dispatched\n            p.pendingNotifications.WriteOne(notificationToAdd)\n         }\n      }\n   }\n}\n\nfunc (p *processorListener) run() {\n   // this call blocks until the channel is closed.  When a panic happens during the notification\n   // we will catch it, **the offending item will be skipped!**, and after a short delay (one second)\n   // the next notification will be attempted.  
This is usually better than the alternative of never\n   // delivering again.\n   stopCh := make(chan struct{})\n   wait.Until(func() {\n      // this gives us a few quick retries before a long pause and then a few more quick retries\n      err := wait.ExponentialBackoff(retry.DefaultRetry, func() (bool, error) {\n         for next := range p.nextCh {\n            switch notification := next.(type) {\n            case updateNotification:\n               p.handler.OnUpdate(notification.oldObj, notification.newObj)\n            case addNotification:\n               p.handler.OnAdd(notification.newObj)\n            case deleteNotification:\n               p.handler.OnDelete(notification.oldObj)\n            default:\n               utilruntime.HandleError(fmt.Errorf(\"unrecognized notification: %T\", next))\n            }\n         }\n         // the only way to get here is if the p.nextCh is empty and closed\n         return true, nil\n      })\n\n      // the only way to get here is if the p.nextCh is empty and closed\n      if err == nil {\n         close(stopCh)\n      }\n   }, 1*time.Minute, stopCh)\n}\n```\n\n### 4. 总结\n\n（1）使用sharedInformerFactory机制可以共享informer\n\n（2）Informer的核心就是下面的reflector机制，运转流程为：\n\n* 通过kube-apiserver的listAndWatch，监听到etcd的资源变化\n\n* 内部通过deltaFIFO队列更好地分发处理这些资源变化\n\n  * deltaFIFO除了原封不动地继承kube-apiserver 的add/update/delete事件(这是数据库元素的变化)外，还会增加一个sync动作。这是重新list的时候，FIFO通过Replace函数加的。\n\n* 核心的处理函数是HandleDeltas函数，它对这些资源变化进行处理分发，核心逻辑如下：\n\n  * informer本身会自带indexer，不管你使不使用，这是一个本地的缓存\n\n  * 对于一个资源来说，HandleDeltas会首先更新本地的indexer缓存，然后再将资源变化发给每个listener。注意：\n\n    （1）kube-apiserver 的add/update/delete事件，不一定是listener看到的事件。比如一个apiserver update事件，如果indexer没有数据，那么下发给listener的时候就是一个add事件\n\n    （2）通过指定resyncPeriod，indexer中的数据会定期以sync事件的形式重新下发一遍。sync事件只会下发给需要sync的listener。\n\n![informer.png](../images/informer.png)\n\n### 5.参考\n\nhttps://jimmysong.io/kubernetes-handbook/develop/client-go-informer-sourcecode-analyse.html\n\n"
  },
  {
    "path": "k8s/client-go/8. client-go的workqueue详解.md",
    "content": "Table of Contents\n=================\n\n  * [1. 章节介绍](#1-章节介绍)\n  * [2. workerqueue介绍](#2-workerqueue介绍)\n     * [2.1 queue](#21-queue)\n        * [2.1.1 queue接口](#211-queue接口)\n           * [add](#add)\n           * [get](#get)\n           * [done](#done)\n     * [2.2 DelayingQueue-延迟队列](#22-delayingqueue-延迟队列)\n        * [2.2.1 waitFor](#221-waitfor)\n        * [2.2. 2 NewNamedDelayingQueue](#22-2-newnameddelayingqueue)\n        * [2.2.3 waitingLoop](#223-waitingloop)\n        * [2.2.4](#224)\n        * [2.2.5 总结](#225-总结)\n     * [2.3 RateLimitingQueue-限速队列](#23-ratelimitingqueue-限速队列)\n        * [2.3.1 RateLimiting结构体](#231-ratelimiting结构体)\n        * [2.3.2 限速器类型](#232-限速器类型)\n           * [BucketRateLimiter](#bucketratelimiter)\n           * [ItemExponentialFailureRateLimiter](#itemexponentialfailureratelimiter)\n           * [ItemFastSlowRateLimiter](#itemfastslowratelimiter)\n           * [MaxOfRateLimiter](#maxofratelimiter)\n           * [WithMaxWaitRateLimiter](#withmaxwaitratelimiter)\n  * [3.总结](#3总结)\n  * [4. 参考文档](#4-参考文档)\n\n### 1. 
章节介绍\n\n在介绍完Informer机制后，可以发现如果想自定义控制器非常简单：直接注册handler就行。但是绝大部分k8s原生控制器中，handler并没有直接处理，而是统一遵守一套：\n\nAdd, Update, Del -> queue -> Run -> runWorker -> syncHandler 的处理模式。\n\n例如 namespaces控制器中：\n\n```\n// 1.先是定义了一个限速队列\nqueue:                      workqueue.NewNamedRateLimitingQueue(nsControllerRateLimiter(), \"namespace\"),\n\n\n// 2.然后add, update都是入队列\n// configure the namespace informer event handlers\n\tnamespaceInformer.Informer().AddEventHandlerWithResyncPeriod(\n\t\tcache.ResourceEventHandlerFuncs{\n\t\t\tAddFunc: func(obj interface{}) {\n\t\t\t\tnamespace := obj.(*v1.Namespace)\n\t\t\t\tnamespaceController.enqueueNamespace(namespace)\n\t\t\t},\n\t\t\tUpdateFunc: func(oldObj, newObj interface{}) {\n\t\t\t\tnamespace := newObj.(*v1.Namespace)\n\t\t\t\tnamespaceController.enqueueNamespace(namespace)\n\t\t\t},\n\t\t},\n\t\tresyncPeriod,\n\t)\n\t\n// 3.然后controller.run，启动多个协程\n// Run starts observing the system with the specified number of workers.\nfunc (nm *NamespaceController) Run(workers int, stopCh <-chan struct{}) {\n  \n\tfor i := 0; i < workers; i++ {\n\t\tgo wait.Until(nm.worker, time.Second, stopCh)\n\t}\n\t<-stopCh\n}\n\n// 4. worker处理一个个数据（简化后的逻辑）\nfunc (nm *NamespaceController) worker() {\n\tworkFunc := func() bool {\n\t\t// 得到对象\n\t\tkey, quit := nm.queue.Get()\n\t\tif quit {\n\t\t\treturn true\n\t\t}\n\t\t// 处理完对象\n\t\tdefer nm.queue.Done(key)\n\n\t\terr := nm.syncNamespaceFromKey(key.(string))\n\t\tif err == nil {\n\t\t\t// no error, forget this entry and return\n\t\t\tnm.queue.Forget(key)\n\t\t\treturn false\n\t\t}\n\t\t// 失败则限速重试\n\t\tnm.queue.AddRateLimited(key)\n\t\treturn false\n\t}\n\tfor {\n\t\tif quit := workFunc(); quit {\n\t\t\treturn\n\t\t}\n\t}\n}\n```\n\n可以看出来这一套的一个好处：\n\n（1）利用了Indexer本地缓存机制，queue里面只需要包括 key就行，数据indexer都有\n\n（2）workqueue除了一个缓冲机制外，还有着错误重试的机制\n\n因此这一节分析一下，client-go提供了哪些workqueue\n\n### 2. 
workerqueue介绍\n\nclient-go 的 `util/workqueue` 包里主要有三个队列，分别是普通队列，延时队列，限速队列，后一个队列以前一个队列的实现为基础，层层添加新功能，我们按照 Queue、DelayingQueue、RateLimitingQueue 的顺序层层拨开来看限速队列是如何实现的。\n\n#### 2.1 queue\n\n##### 2.1.1 queue接口\n\n```\ntype Interface interface {\n   Add(item interface{})  // 添加一个元素\n   Len() int              // 元素个数\n   Get() (item interface{}, shutdown bool) // 获取一个元素，第二个返回值和 channel 类似，标记队列是否关闭了\n   Done(item interface{}) // 标记一个元素已经处理完\n   ShutDown()             // 关闭队列\n   ShuttingDown() bool    // 是否正在关闭\n}\n\n\ntype Type struct {\n   queue []t            // 定义元素的处理顺序，里面所有元素都应该在 dirty set 中有，而不能出现在 processing set 中\n   dirty set            // 标记所有需要被处理的元素\n   processing set       // 当前正在被处理的元素，当处理完后需要检查该元素是否在 dirty set 中，如果有则添加到 queue 里\n\n   cond *sync.Cond      // 条件锁\n   shuttingDown bool    // 是否正在关闭\n   metrics queueMetrics\n   unfinishedWorkUpdatePeriod time.Duration\n   clock                      clock.Clock\n}\n```\n\n这个 Queue 的工作逻辑大致是这样，里面的三个属性 queue、dirty、processing 都保存 items，但是含义有所不同：\n\n- queue：这是一个 []t 类型，也就是一个切片，因为其有序，所以这里当作一个列表来存储 item 的处理顺序。\n- dirty：这是一个 set 类型，也就是一个集合，这个集合存储的是所有需要处理的 item，这些 item 也会保存在 queue 中，但是 set 里是无序的，set 的特性是唯一。可以认为dirty就是queue的不同实现， queue是为了有序，set是为了保证元素唯一。\n- processing：这也是一个 set，存放的是当前正在处理的 item，也就是说这个 item 来自 queue 出队的元素，同时这个元素会被从 dirty 中删除。\n\n目前看这些还有些懵，直接看看queue的核心函数。\n\n###### add \n\n从这里就可以看出来，Add函数进行了去重过滤。比如我更新了pod1三次。\n\n```\npod1 := &v1.Pod{ObjectMeta: metav1.ObjectMeta{Name: \"one\", Annotations: map[string]string{\"users\": \"ernie,bert\"}}}\n```\n\ninformer的distribute函数会发送三个更新事件，queue也会收到三个更新事件，但是queue里面只会有一个 one(pod1的key)。\n\n为什么只需要保留一个就行？\n\n因为indexer已经更新了，indexer的数据是最新的。所以从这里也可以看出来，使用这一套逻辑，就没有update, add, delete等区别了。\n\n如果我想统计一下，每个Pod变化了多少次，那就不能使用 workqueue了，必须在handler那里直接实现。\n\n```\n// Add marks item as needing processing.\nfunc (q *Type) Add(item interface{}) {\n\tq.cond.L.Lock()\n\tdefer q.cond.L.Unlock()\n\tif q.shuttingDown {\n\t\treturn\n\t}\n\t\n\t// dirty set 中已经有了该 item，则返回\n\tif q.dirty.has(item) {  
 \n\t\treturn\n\t}\n\n\tq.metrics.add(item)\n  \n  \n\tq.dirty.insert(item)\n\t// 如果正在处理，也直接返回\n\tif q.processing.has(item) {\n\t\treturn\n\t}\n  \n  // 否则就扔进queue队列\n\tq.queue = append(q.queue, item)\n\tq.cond.Signal()\n}\n```\n\n###### get \n\nGet会将元素从queue队列中取出，表示这个元素正在处理中。\n\ndirty和queue保持一致，也会删除这个元素。\n\n```\n// get是从 queue队列中取出一个元素(queue中删除，dirty中删除)\n// 并且标记它正在处理\nfunc (q *Type) Get() (item interface{}, shutdown bool) {\n\tq.cond.L.Lock()\n\tdefer q.cond.L.Unlock()\n\tfor len(q.queue) == 0 && !q.shuttingDown {\n\t\tq.cond.Wait()\n\t}\n\tif len(q.queue) == 0 {\n\t\t// We must be shutting down.\n\t\treturn nil, true\n\t}\n\n\titem, q.queue = q.queue[0], q.queue[1:]\n\n\tq.metrics.get(item)\n\n\tq.processing.insert(item)\n\tq.dirty.delete(item)\n\n\treturn item, false\n}\n```\n\n###### done\n\nDone表明这个元素被处理完了，将其从processing集合中删除。这里加了一个判断：如果dirty中还存在该元素，还要将其重新加入 queue。\n\n为什么需要这个判断呢？\n\n原因在于有一种情况：itemA 正在处理、还没Done的时候，又来了一次 itemA。\n\n这时Add逻辑会直接返回，不会把itemA添加到queue，所以Done这里要重新添加一次。\n\n```\n\n// Done marks item as done processing, and if it has been marked as dirty again\n// while it was being processed, it will be re-added to the queue for\n// re-processing.\nfunc (q *Type) Done(item interface{}) {\n\tq.cond.L.Lock()\n\tdefer q.cond.L.Unlock()\n\n\tq.metrics.done(item)\n\n\tq.processing.delete(item)\n\t// 判断dirty是否有该元素\n\tif q.dirty.has(item) {\n\t\tq.queue = append(q.queue, item)\n\t\tq.cond.Signal()\n\t}\n}\n```\n\n<br>\n\n#### 2.2 DelayingQueue-延迟队列\n\n```\n// delayingType wraps an Interface and provides delayed re-enquing\ntype delayingType struct {\n\tInterface                     //上面的通用队列\n\tclock clock.Clock             // 时钟，用于获取时间\n\tstopCh chan struct{}          // 延时就意味着异步，就要有另一个协程处理，所以需要退出信号\n\tstopOnce sync.Once            // 用来确保 ShutDown() 方法只执行一次\n\theartbeat clock.Ticker        // 定时器，在没有任何数据操作时可以定时的唤醒处理协程\n\twaitingForAddCh chan *waitFor // 所有延迟添加的元素封装成waitFor放到chan中\n\tmetrics retryMetrics\n}\n\ntype DelayingInterface interface {\n\tInterface\n\t// 
AddAfter adds an item to the workqueue after the indicated duration has passed\n\tAddAfter(item interface{}, duration time.Duration)\n}\n```\n\n##### 2.2.1 waitFor\n\n```\ntype waitFor struct {\n   data    t          // 准备添加到队列中的数据\n   readyAt time.Time  // 应该被加入队列的时间\n   index int          // 在 heap 中的索引\n}\n```\n\nwaitForPriorityQueue是一个数组，实现了最小堆，对比的就是延迟的时间。\n\n```\ntype waitForPriorityQueue []*waitFor\n// heap需要实现的接口，告知队列长度\nfunc (pq waitForPriorityQueue) Len() int {\n    return len(pq)\n}\n// heap需要实现的接口，告知第i个元素是否比第j个元素小\nfunc (pq waitForPriorityQueue) Less(i, j int) bool {\n    return pq[i].readyAt.Before(pq[j].readyAt) // 此处对比的就是时间，所以排序按照时间排序\n}\n// heap需要实现的接口，实现第i和第j个元素换\nfunc (pq waitForPriorityQueue) Swap(i, j int) {\n    // 这种语法好牛逼，有没有，C/C++程序猿没法理解~\n    pq[i], pq[j] = pq[j], pq[i]\n    pq[i].index = i                            // 因为heap没有所以，所以需要自己记录索引，这也是为什么waitFor定义索引参数的原因\n    pq[j].index = j\n}\n// heap需要实现的接口，用于向队列中添加数据\nfunc (pq *waitForPriorityQueue) Push(x interface{}) {\n    n := len(*pq)                       \n    item := x.(*waitFor)\n    item.index = n                             // 记录索引值\n    *pq = append(*pq, item)                    // 放到了数组尾部\n}\n// heap需要实现的接口，用于从队列中弹出最后一个数据\nfunc (pq *waitForPriorityQueue) Pop() interface{} {\n    n := len(*pq)\n    item := (*pq)[n-1]\n    item.index = -1\n    *pq = (*pq)[0:(n - 1)]                     // 缩小数组，去掉了最后一个元素\n    return item\n}\n// 返回第一个元素\nfunc (pq waitForPriorityQueue) Peek() interface{} {\n    return pq[0]\n}\n```\n\n到这里就可以大概猜出来延迟队列的实现了。\n\n就是所有添加的元素，有一个延迟时间，根据延迟时间构造一个最小堆。然后每次时间一到，从堆里面拿出来当前应该加入队列的时间。\n\n<br>\n\n##### 2.2. 
2 NewNamedDelayingQueue\n\n```go\n// 这里可以传递一个名字\nfunc NewNamedDelayingQueue(name string) DelayingInterface {\n   return NewDelayingQueueWithCustomClock(clock.RealClock{}, name)\n}\n\n// 上面一个函数只是调用当前函数，附带一个名字，这里加了一个指定 clock 的能力\nfunc NewDelayingQueueWithCustomClock(clock clock.Clock, name string) DelayingInterface {\n  return newDelayingQueue(clock, NewNamed(name), name) // 注意这里的 NewNamed() 函数\n}\n\nfunc newDelayingQueue(clock clock.Clock, q Interface, name string) *delayingType {\n   ret := &delayingType{\n      Interface:       q,\n      clock:           clock,\n      heartbeat:       clock.NewTicker(maxWait), // 10s 一次心跳\n      stopCh:          make(chan struct{}),\n      waitingForAddCh: make(chan *waitFor, 1000),\n      metrics:         newRetryMetrics(name),\n   }\n\n   go ret.waitingLoop() // 核心就是运行 waitingLoop\n   return ret\n}\n```\n\n##### 2.2.3 waitingLoop\n\n```\nfunc (q *delayingType) waitingLoop() {\n   defer utilruntime.HandleCrash()\n   // 队列里没有 item 时实现等待用的\n   never := make(<-chan time.Time)\n   var nextReadyAtTimer clock.Timer\n   // 构造一个优先级队列\n   waitingForQueue := &waitForPriorityQueue{}\n   heap.Init(waitingForQueue) // 这一行其实是多余的，等下提个 pr 给它删掉\n\n   // 这个 map 用来处理重复添加逻辑的，下面会讲到\n   waitingEntryByData := map[t]*waitFor{}\n   // 无限循环\n   for {\n      // 这个地方 Interface 是多余的，等下也提个 pr 把它删掉吧\n      if q.Interface.ShuttingDown() {\n         return\n      }\n\n      now := q.clock.Now()\n      // 队列里有 item 就开始循环\n      for waitingForQueue.Len() > 0 {\n         // 获取第一个 item\n         entry := waitingForQueue.Peek().(*waitFor)\n         // 时间还没到，先不处理\n         if entry.readyAt.After(now) {\n            break\n         }\n        // 时间到了，pop 出第一个元素；注意 waitingForQueue.Pop() 是最后一个 item，heap.Pop() 是第一个元素\n         entry = heap.Pop(waitingForQueue).(*waitFor)\n         // 将数据加到延时队列里\n         q.Add(entry.data)\n         // map 里删除已经加到延时队列的 item\n         delete(waitingEntryByData, entry.data)\n      }\n\n      // 如果队列中有 item，就用第一个 item 的等待时间初始化计时器，如果为空则一直等待\n  
    nextReadyAt := never\n      if waitingForQueue.Len() > 0 {\n         if nextReadyAtTimer != nil {\n            nextReadyAtTimer.Stop()\n         }\n         entry := waitingForQueue.Peek().(*waitFor)\n         nextReadyAtTimer = q.clock.NewTimer(entry.readyAt.Sub(now))\n         nextReadyAt = nextReadyAtTimer.C()\n      }\n\n      select {\n      case <-q.stopCh:\n         return\n      case <-q.heartbeat.C(): // 心跳时间是 10s，到了就继续下一轮循环\n      case <-nextReadyAt: // 第一个 item 的等到时间到了，继续下一轮循环\n      case waitEntry := <-q.waitingForAddCh: // waitingForAddCh 收到新的 item\n         // 如果时间没到，就加到优先级队列里，如果时间到了，就直接加到延时队列里\n         if waitEntry.readyAt.After(q.clock.Now()) {\n            insert(waitingForQueue, waitingEntryByData, waitEntry)\n         } else {\n            q.Add(waitEntry.data)\n         }\n         // 下面的逻辑就是将 waitingForAddCh 中的数据处理完\n         drained := false\n         for !drained {\n            select {\n            case waitEntry := <-q.waitingForAddCh:\n               if waitEntry.readyAt.After(q.clock.Now()) {\n                  insert(waitingForQueue, waitingEntryByData, waitEntry)\n               } else {\n                  q.Add(waitEntry.data)\n               }\n            default:\n               drained = true\n            }\n         }\n      }\n   }\n}\n```\n\n##### 2.2.4 \n\n这个方法的作用是在指定的延时到达之后，在 work queue 中添加一个元素，源码如下：\n\n```\nfunc (q *delayingType) AddAfter(item interface{}, duration time.Duration) {\n   if q.ShuttingDown() { // 已经在关闭中就直接返回\n      return\n   }\n\n   q.metrics.retry()\n\n   if duration <= 0 { // 如果时间到了，就直接添加\n      q.Add(item)\n      return\n   }\n\n   select {\n   case <-q.stopCh:\n     // 构造 waitFor{}，丢到 waitingForAddCh\n   case q.waitingForAddCh <- &waitFor{data: item, readyAt: q.clock.Now().Add(duration)}:\n   }\n}\n\n其实就是一个往堆加入元素的过程\nfunc insert(q *waitForPriorityQueue, knownEntries map[t]*waitFor, entry *waitFor) {\n   // 这里的主要逻辑是看一个 entry 是否存在，如果已经存在，新的 entry 的 ready 时间更短，就更新时间\n   existing, exists := 
knownEntries[entry.data]\n   if exists {\n      if existing.readyAt.After(entry.readyAt) {\n         existing.readyAt = entry.readyAt // 如果存在就只更新时间\n         heap.Fix(q, existing.index)\n      }\n\n      return\n   }\n   // 如果不存在就丢到 q 里，同时在 map 里记录一下，用于查重\n   heap.Push(q, entry)\n   knownEntries[entry.data] = entry\n}\n```\n\n<br>\n\n##### 2.2.5 总结\n\n（1）延迟队列的核心就是，根据元素应该加入队列的时间，构造一个最小堆，然后在到达时间点后，将其加入queue中\n\n（2）上述判断是否到时间点，不仅仅是一个for循环，还利用了心跳、channel机制\n\n（3）当某个对象处理失败时，可以利用延迟队列的思想，等一会再重试，因为马上重试大概率还是失败\n\n#### 2.3 RateLimitingQueue-限速队列\n\n##### 2.3.1 RateLimiting结构体\n\n```\ntype RateLimitingInterface interface {\n\tDelayingInterface     //延迟队列\n\n\tAddRateLimited(item interface{})     // 以限速方式，往队列添加一个元素\n\n\t// 标记结束重试，清除该元素的重试记录\n\tForget(item interface{})\n  \n  // 重试了几次\n\tNumRequeues(item interface{}) int\n}\n\n\n// rateLimitingType wraps an Interface and provides rateLimited re-enquing\ntype rateLimitingType struct {\n\tDelayingInterface    \n\n\trateLimiter RateLimiter   //多了一个限速器\n}\n```\n\n##### 2.3.2 限速器类型\n\n可以看出来，限速队列的用法和延迟队列几乎一模一样。\n\n延迟队列是由调用方决定某个元素延迟多久。\n\n而限速队列是由限速器决定某个元素延迟多久。\n\n```\ntype RateLimiter interface {\n\t// 输入一个对象，判断延迟多久\n\tWhen(item interface{}) time.Duration\n\t\n\t// 标记结束重试，清除该元素的重试记录\n\tForget(item interface{})\n\t\n\t// 重试了几次\n\tNumRequeues(item interface{}) int\n}\n```\n\n这个接口有五个实现，分别为：\n\n1. *BucketRateLimiter*\n2. *ItemExponentialFailureRateLimiter*\n3. *ItemFastSlowRateLimiter*\n4. *MaxOfRateLimiter*\n5. 
*WithMaxWaitRateLimiter*\n\n###### BucketRateLimiter\n\n这个限速器可说的不多，用了 golang 标准库的 `golang.org/x/time/rate.Limiter` 实现。BucketRateLimiter 实例化的时候比如传递一个 `rate.NewLimiter(rate.Limit(10), 100)` 进去，表示令牌桶里最多有 100 个令牌，每秒发放 10 个令牌。\n\n所有元素都是一样的，来几次都是一样，所以NumRequeues，Forget都没有意义。\n\n```\ntype BucketRateLimiter struct {\n   *rate.Limiter\n}\n\nvar _ RateLimiter = &BucketRateLimiter{}\n\nfunc (r *BucketRateLimiter) When(item interface{}) time.Duration {\n   return r.Limiter.Reserve().Delay() // 过多久后给当前 item 发放一个令牌\n}\n\nfunc (r *BucketRateLimiter) NumRequeues(item interface{}) int {\n   return 0\n}\n\n// \nfunc (r *BucketRateLimiter) Forget(item interface{}) {\n}\n```\n\n###### ItemExponentialFailureRateLimiter\n\nExponential 是指数的意思，从这个限速器的名字大概能猜到是失败次数越多，限速越长而且是指数级增长的一种限速器。\n\n结构体定义如下，属性含义基本可以望文生义\n\n```\nfunc (r *ItemExponentialFailureRateLimiter) When(item interface{}) time.Duration {\n   r.failuresLock.Lock()\n   defer r.failuresLock.Unlock()\n\n   exp := r.failures[item]\n   r.failures[item] = r.failures[item] + 1 // 失败次数加一\n\n   // 每调用一次，exp 也就加了1，对应到这里时 2^n 指数爆炸\n   backoff := float64(r.baseDelay.Nanoseconds()) * math.Pow(2, float64(exp))\n   if backoff > math.MaxInt64 { // 如果超过了最大整型，就返回最大延时，不然后面时间转换溢出了\n      return r.maxDelay\n   }\n\n   calculated := time.Duration(backoff)\n   if calculated > r.maxDelay { // 如果超过最大延时，则返回最大延时\n      return r.maxDelay\n   }\n\n   return calculated\n}\n\nfunc (r *ItemExponentialFailureRateLimiter) NumRequeues(item interface{}) int {\n   r.failuresLock.Lock()\n   defer r.failuresLock.Unlock()\n\n   return r.failures[item]\n}\n\nfunc (r *ItemExponentialFailureRateLimiter) Forget(item interface{}) {\n   r.failuresLock.Lock()\n   defer r.failuresLock.Unlock()\n\n   delete(r.failures, item)\n}\n```\n\n###### ItemFastSlowRateLimiter\n\n快慢限速器，也就是先快后慢，定义一个阈值，超过了就慢慢重试。先看类型定义：\n\n```\ntype ItemFastSlowRateLimiter struct {\n   failuresLock sync.Mutex\n   failures     map[interface{}]int\n\n   maxFastAttempts int            // 快速重试的次数\n   fastDelay   
    time.Duration  // 快重试间隔\n   slowDelay       time.Duration  // 慢重试间隔\n}\n\nfunc (r *ItemFastSlowRateLimiter) When(item interface{}) time.Duration {\n   r.failuresLock.Lock()\n   defer r.failuresLock.Unlock()\n\n   r.failures[item] = r.failures[item] + 1 // 标识重试次数 + 1\n\n   if r.failures[item] <= r.maxFastAttempts { // 如果快重试次数没有用完，则返回 fastDelay\n      return r.fastDelay\n   }\n\n   return r.slowDelay // 反之返回 slowDelay\n}\n\nfunc (r *ItemFastSlowRateLimiter) NumRequeues(item interface{}) int {\n   r.failuresLock.Lock()\n   defer r.failuresLock.Unlock()\n\n   return r.failures[item]\n}\n\nfunc (r *ItemFastSlowRateLimiter) Forget(item interface{}) {\n   r.failuresLock.Lock()\n   defer r.failuresLock.Unlock()\n\n   delete(r.failures, item)\n}\n```\n\n######  MaxOfRateLimiter\n\n组合限速器，内部放多个限速器，然后返回限速最慢的一个延时：\n\n```\ntype MaxOfRateLimiter struct {\n   limiters []RateLimiter\n}\n\nfunc (r *MaxOfRateLimiter) When(item interface{}) time.Duration {\n   ret := time.Duration(0)\n   for _, limiter := range r.limiters {\n      curr := limiter.When(item)\n      if curr > ret {\n         ret = curr\n      }\n   }\n\n   return ret\n}\n```\n\n<br>\n\n###### WithMaxWaitRateLimiter\n\n这个限速器也很简单，就是在其他限速器上包装一个最大延迟的属性，如果超过了最大延时，则直接返回最大延时。这样就能避免延迟时间不可控：万一一个对象失败了很多次，后续的延迟时间会越来越长。\n\n```\ntype WithMaxWaitRateLimiter struct {\n   limiter  RateLimiter   // 其他限速器\n   maxDelay time.Duration // 最大延时\n}\n\nfunc NewWithMaxWaitRateLimiter(limiter RateLimiter, maxDelay time.Duration) RateLimiter {\n   return &WithMaxWaitRateLimiter{limiter: limiter, maxDelay: maxDelay}\n}\n\nfunc (w WithMaxWaitRateLimiter) When(item interface{}) time.Duration {\n   delay := w.limiter.When(item)\n   if delay > w.maxDelay {\n      return w.maxDelay // 已经超过了最大延时，直接返回最大延时\n   }\n\n   return delay\n}\n```\n\n### 3.总结\n\n（1）workqueue适用于只关注最终状态的处理场景。比如统计一个Pod update了多少次这种关注过程的需求就不能用，因为workqueue会对相同key的事件进行合并\n\n（2）workqueue实现了多种限速机制，可以根据情况酌情使用\n\n### 4. 
参考文档\n\nhttps://blog.csdn.net/weixin_42663840/article/details/81482553\n\nhttps://www.danielhu.cn/post/k8s/client-go-workqueue/"
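2.1 节的 queue/dirty/processing 语义可以用一个极简的单线程 sketch 来验证（假设性示例，非 client-go 实现，省略了锁、条件变量和 metrics）：

```go
package main

import "fmt"

// queue 复刻 workqueue.Type 的三个核心字段的语义（单线程版）
type queue struct {
	order      []string        // 对应 Type.queue，保证处理顺序
	dirty      map[string]bool // 待处理集合，保证去重
	processing map[string]bool // 正在处理集合
}

func newQueue() *queue {
	return &queue{dirty: map[string]bool{}, processing: map[string]bool{}}
}

func (q *queue) Add(item string) {
	if q.dirty[item] {
		return // 已在待处理集合：重复 Add 被合并
	}
	q.dirty[item] = true
	if q.processing[item] {
		return // 正在处理：不入队，等 Done() 时重新入队
	}
	q.order = append(q.order, item)
}

func (q *queue) Get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	item := q.order[0]
	q.order = q.order[1:]
	q.processing[item] = true
	delete(q.dirty, item)
	return item, true
}

func (q *queue) Done(item string) {
	delete(q.processing, item)
	if q.dirty[item] { // 处理期间又被 Add 过：重新入队
		q.order = append(q.order, item)
	}
}

func main() {
	q := newQueue()
	q.Add("pod1")
	q.Add("pod1")
	fmt.Println(len(q.order)) // 1：重复 Add 被合并成一个
	item, _ := q.Get()
	q.Add("pod1")
	fmt.Println(len(q.order)) // 0：pod1 正在处理，只进 dirty 不入队
	q.Done(item)
	fmt.Println(len(q.order)) // 1：Done 发现 dirty 里还有，重新入队
}
```

这正好演示了上文 Add/Get/Done 三个函数的配合：同一个 key 在任意时刻最多排队一份，且处理期间到来的变更不会丢。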
  },
  {
    "path": "k8s/client-go/9.从0到1使用kubebuilder创建crd.md",
    "content": "- [0. 下载kubebuilder](#0---kubebuilder)\n- [1. 创建目录](#1-----)\n- [2. 初始化项目](#2------)\n- [3. 创建api和controller](#3---api-controller)\n- [4. 实现自己的crd和控制器逻辑](#4------crd------)\n- [5. make manifests, 创建crd的相关yaml](#5-make-manifests----crd---yaml)\n- [6. 在集群中部署crd](#6-------crd)\n- [7. 部署controller](#7---controller)\n\n**简介**\n\n从0到1，手把手教会如何使用kubebuilder创建crd, 并且定制自己的控制器。\n\n代码：https://github.com/zoux86/operator-example\n\n### 0. 下载kubebuilder\n\n```bash\n# download kubebuilder and install locally.\ncurl -L -o kubebuilder https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH)\nchmod +x kubebuilder && mv kubebuilder /usr/local/bin/\n```\n\n### 1. 创建目录\n\n~/go/src 是我的go src目录\n\ngithub.com/zoux86/operator-example是想自定义的crd项目\n\n```\n// 可以看出来go mod init 指定的字符串就是mod文件里面的module目录\n~/go/src/github.com/zoux86/operator-example#  go mod init github.com/zoux86/operator-example\ngo: creating new go.mod: module github.com/zoux86/operator-example\n\n ~/go/src/github.com/zoux86/operator-example # ls\ngo.mod\n\n ~/go/src/github.com/zoux86/operator-example # cat go.mod\nmodule github.com/zoux86/operator-example   \n\ngo 1.18\n```\n\n### 2. 
初始化项目\n\n执行kubebuilder init这一条命令就行了\n\n```\n ~/go/src/github.com/zoux86/operator-example # ~/kubebuilder init --domain github.com --license apache2 --owner \"zoux86\"\nWriting kustomize manifests for you to edit...\nWriting scaffold for you to edit...\nGet controller runtime:\n$ go get sigs.k8s.io/controller-runtime@v0.11.2\ngo: downloading sigs.k8s.io/controller-runtime v0.11.2\ngo: downloading k8s.io/apimachinery v0.23.5\ngo: downloading k8s.io/client-go v0.23.5\ngo: downloading k8s.io/utils v0.0.0-20211116205334-6203023598ed\ngo: downloading k8s.io/component-base v0.23.5\ngo: downloading k8s.io/api v0.23.5\ngo: downloading k8s.io/apiextensions-apiserver v0.23.5\ngo: downloading sigs.k8s.io/json v0.0.0-20211020170558-c049b76a60c6\ngo: downloading golang.org/x/net v0.0.0-20211209124913-491a49abca63\ngo: downloading golang.org/x/oauth2 v0.0.0-20210819190943-2bc19b11175f\nUpdate dependencies:\n$ go mod tidy\ngo: downloading github.com/Azure/go-autorest/autorest v0.11.18\ngo: downloading github.com/Azure/go-autorest/autorest/adal v0.9.13\ngo: downloading github.com/Azure/go-autorest/tracing v0.6.0\ngo: downloading github.com/Azure/go-autorest/autorest/mocks v0.4.1\ngo: downloading github.com/Azure/go-autorest/autorest/date v0.3.0\ngo: downloading github.com/Azure/go-autorest/logger v0.2.1\ngo: downloading golang.org/x/crypto v0.0.0-20210817164053-32db794688a5\nNext: define a resource with:\n$ kubebuilder create api\n```\n\n<br>\n\n**查看文件目录**\n\nk8s apis通常有三个组件`Resource, Controller, Manager`，它们分别定义/实现在以下的三个package当中：\n\n- **cmd/...**：主流程程序`Manager`入口，负责初始化依赖包、启停`Controller`。用户通常不需要编辑此包，可以依赖脚手架。通过`kubebuilder init`自动创建生成。\n- **pkg/apis/...**：包含API资源的定义。编辑`*_types.go`文件来修改资源定义。每个资源的定义文件存在于`pkg/apis/<api-group-name>/<api-version-name>/<api-kind-name>_types.go`中。通过`kubebuilder create api`自动创建生成。\n- **pkg/controller/...**：包含Controller的实现。编辑`*_controller.go`实现Controller。通过`kubebuilder create api`自动创建生成。\n\n```\n ~/go/src/github.com/zoux86/operator-example  tree\n.\n├── 
Dockerfile\n├── Makefile\n├── PROJECT\n├── README.md\n├── config\n│   ├── default\n│   │   ├── kustomization.yaml\n│   │   ├── manager_auth_proxy_patch.yaml\n│   │   └── manager_config_patch.yaml\n│   ├── manager\n│   │   ├── controller_manager_config.yaml\n│   │   ├── kustomization.yaml\n│   │   └── manager.yaml\n│   ├── prometheus\n│   │   ├── kustomization.yaml\n│   │   └── monitor.yaml\n│   └── rbac\n│       ├── auth_proxy_client_clusterrole.yaml\n│       ├── auth_proxy_role.yaml\n│       ├── auth_proxy_role_binding.yaml\n│       ├── auth_proxy_service.yaml\n│       ├── kustomization.yaml\n│       ├── leader_election_role.yaml\n│       ├── leader_election_role_binding.yaml\n│       ├── role_binding.yaml\n│       └── service_account.yaml\n├── go.mod\n├── go.sum\n├── hack\n│   └── boilerplate.go.txt\n└── main.go\n```\n\n<br>\n\n### 3. 创建api和controller\n\n其实从create api 后的输出我们可以看出来：我们修改逻辑后就可以部署了\n\n```\n ~/go/src/github.com/zoux86/operator-example # ~/kubebuilder create api --group zouxapp --kind PodCount --version v1\nCreate Resource [y/n]\ny\nCreate Controller [y/n]\ny\nWriting kustomize manifests for you to edit...     
// 先修改这2个文件\nWriting scaffold for you to edit...\napi/v1/podcount_types.go\ncontrollers/podcount_controller.go\nUpdate dependencies:\n$ go mod tidy                                   \nRunning make:\n$ make generate                                     \nmkdir -p /Users/game-netease/go/src/github.com/zoux86/operator-example/bin\nGOBIN=/Users/game-netease/go/src/github.com/zoux86/operator-example/bin go install sigs.k8s.io/controller-tools/cmd/controller-gen@v0.8.0\ngo: downloading sigs.k8s.io/controller-tools v0.8.0\ngo: downloading github.com/spf13/cobra v1.2.1\ngo: downloading golang.org/x/tools v0.1.6-0.20210820212750-d4cc65f0b2ff\ngo: downloading github.com/fatih/color v1.12.0\ngo: downloading k8s.io/api v0.23.0\ngo: downloading k8s.io/apimachinery v0.23.0\ngo: downloading github.com/gobuffalo/flect v0.2.3\ngo: downloading k8s.io/apiextensions-apiserver v0.23.0\ngo: downloading github.com/mattn/go-colorable v0.1.8\ngo: downloading github.com/mattn/go-isatty v0.0.12\ngo: downloading golang.org/x/sys v0.0.0-20210831042530-f4d43177bf5e\ngo: downloading golang.org/x/mod v0.4.2\n/Users/game-netease/go/src/github.com/zoux86/operator-example/bin/controller-gen object:headerFile=\"hack/boilerplate.go.txt\" paths=\"./...\"\nNext: implement your new API and generate the manifests (e.g. CRDs,CRs) with:\n$ make manifests\n```\n\n<br>\n\n执行create api后，生成以下文件：\n\n```\napi/v1/groupversion_info.go\napi/v1/podcount_types.go                      // 需要修改这个文件中crd的定义\napi/v1/zz_generated.deepcopy.go\nconfig/crd/kustomization.yaml\nconfig/crd/kustomizeconfig.yaml\nconfig/crd/patches/cainjection_in_podcounts.yaml\nconfig/crd/patches/webhook_in_podcounts.yaml\nconfig/rbac/podcount_editor_role.yaml\nconfig/rbac/podcount_viewer_role.yaml\nconfig/samples/zouxapp_v1_podcount.yaml\ncontrollers/podcount_controller.go            // 需要修改这个文件的controller运行逻辑\ncontrollers/suite_test.go\ngo.mod\nmain.go\n```\n\n### 4. 
实现自己的crd和控制器逻辑\n\n根据实际情况而定，这里的控制器逻辑很简单，就是创建同步podCount的spec.count到status里面。\n\n```\nfunc (r *PodCountReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {\n\trlog := log.FromContext(ctx)\n\trlog.Info(\"start to reconciling podCount %s\", req.Name)\n\tpodCount := &zouxappv1.PodCount{}\n\terr := r.Client.Get(ctx, req.NamespacedName, podCount)\n\tif err != nil {\n\t\trlog.Error(err, fmt.Sprintf(\"get podcount %s/%s err during reconcile.\", req.Namespace, req.Name))\n\t\treturn ctrl.Result{}, nil\n\t}\n\tpodCountCopy := podCount.DeepCopy()\n\tif podCount.Spec.Count <= 0 {\n\t\tpodCountCopy.Status.Count = 0\n\t} else {\n\t\tpodCountCopy.Status.Count = podCount.Spec.Count\n\t}\n\n\terr = r.Client.Status().Update(ctx, podCountCopy)\n\tif err != nil {\n\t\trlog.Error(err, fmt.Sprintf(\"update crd podcount status error %s/%s  during reconcile.\", req.Namespace, req.Name))\n\t}\n\t//r.Status().Update(ctx, podCountCopy, metav1.UpdateOptions{})\n\t// TODO(user): your logic here\n\n\treturn ctrl.Result{}, err\n}\n```\n\n### 5. 
make manifests, 创建crd的相关yaml\n\n```\n ~/go/src/github.com/zoux86/operator-example # make manifests\n/Users/game-netease/go/src/github.com/zoux86/operator-example/bin/controller-gen rbac:roleName=manager-role crd webhook paths=\"./...\" output:crd:artifacts:config=config/crd/bases\n```\n\n执行make manifests之后，我们会得到2个文件。\n\n```\nconfig/crd/bases/\nconfig/rbac/role.yaml\n\n\n# cat config/rbac/role.yaml\n---\napiVersion: rbac.authorization.k8s.io/v1\nkind: ClusterRole\nmetadata:\n  creationTimestamp: null\n  name: manager-role\nrules:\n- apiGroups:\n  - zouxapp.github.com\n  resources:\n  - podcounts\n  verbs:\n  - create\n  - delete\n  - get\n  - list\n  - patch\n  - update\n  - watch\n- apiGroups:\n  - zouxapp.github.com\n  resources:\n  - podcounts/finalizers\n  verbs:\n  - update\n- apiGroups:\n  - zouxapp.github.com\n  resources:\n  - podcounts/status\n  verbs:\n  - get\n  - patch\n  - update\n\n# cat config/crd/bases/zouxapp.github.com_podcounts.yaml\n---\napiVersion: apiextensions.k8s.io/v1\nkind: CustomResourceDefinition\nmetadata:\n  annotations:\n    controller-gen.kubebuilder.io/version: v0.8.0\n  creationTimestamp: null\n  name: podcounts.zouxapp.github.com\nspec:\n  group: zouxapp.github.com\n  names:\n    kind: PodCount\n    listKind: PodCountList\n    plural: podcounts\n    singular: podcount\n  scope: Namespaced\n  versions:\n  - name: v1\n    schema:\n      openAPIV3Schema:\n        description: PodCount is the Schema for the podcounts API\n        properties:\n          apiVersion:\n            description: 'APIVersion defines the versioned schema of this representation\n              of an object. Servers should convert recognized schemas to the latest\n              internal value, and may reject unrecognized values. 
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'\n            type: string\n          kind:\n            description: 'Kind is a string value representing the REST resource this\n              object represents. Servers may infer this from the endpoint the client\n              submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'\n            type: string\n          metadata:\n            type: object\n          spec:\n            description: PodCountSpec defines the desired state of PodCount\n            properties:\n              count:\n                description: Foo is an example field of PodCount. Edit podcount_types.go\n                  to remove/update\n                type: integer\n            type: object\n          status:\n            description: PodCountStatus defines the observed state of PodCount\n            properties:\n              count:\n                description: 'INSERT ADDITIONAL STATUS FIELD - define observed state\n                  of cluster Important: Run \"make\" to regenerate code after modifying\n                  this file'\n                type: integer\n            type: object\n        type: object\n    served: true\n    storage: true\n    subresources:\n      status: {}\nstatus:\n  acceptedNames:\n    kind: \"\"\n    plural: \"\"\n  conditions: []\n  storedVersions: []\n```\n\n### 6. 
在集群中部署crd\n\n```\n ~/go/src/github.com/zoux86/operator-example # kubectl --kubeconfig=kubeconfig create -f config/crd/bases\ncustomresourcedefinition.apiextensions.k8s.io/podcounts.zouxapp.github.com created\n\n ~/go/src/github.com/zoux86/operator-example # kubectl --kubeconfig=kubeconfig create -f config/samples/zouxapp_v1_podcount.yaml \npodcount.zouxapp.github.com/podcount-sample created\n```\n\n<br>\n\n上集群验证，可以看到创建成功了，但是没有status.count，这是因为集群里还没部署控制器\n\n```\nroot# kubectl get crd   | grep podc\npodcounts.zouxapp.github.com                            2022-08-25T06:57:09Z\n\n\nroot # kubectl get podcounts.zouxapp.github.com\nNAME              AGE\npodcount-sample   11s\n\nroot # kubectl get podcounts.zouxapp.github.com -oyaml\napiVersion: v1\nitems:\n- apiVersion: zouxapp.github.com/v1\n  kind: PodCount\n  metadata:\n    creationTimestamp: \"2022-08-25T07:01:16Z\"\n    generation: 1\n    name: podcount-sample\n    namespace: default\n    resourceVersion: \"467368378\"\n    selfLink: /apis/zouxapp.github.com/v1/namespaces/default/podcounts/podcount-sample\n    uid: a8b42a4c-1ebd-430a-890f-b0238f4ad125\n  spec:\n    count: 3\nkind: List\nmetadata:\n  resourceVersion: \"\"\n  selfLink: \"\"\n```\n\n### 7. 
部署controller\n\n之前 CRD 并不会完成任何工作，只是在 ETCD 中创建了一条记录。所以我们需要部署写的controller。\n\n运行CRD controller\n\n```\n ~/go/src/github.com/zoux86/operator-example ## go run ./main.go\nI0825 15:35:12.827589   63628 request.go:665] Waited for 1.000074041s due to client-side throttling, not priority and fairness, request: GET:https://xxx/apis/apiextensions.k8s.io/v1?timeout=32s\n1.6614129137223601e+09  INFO    controller-runtime.metrics      Metrics server is starting to listen    {\"addr\": \":8080\"}\n1.6614129137230568e+09  INFO    setup   starting manager\n1.661412913723448e+09   INFO    Starting server {\"path\": \"/metrics\", \"kind\": \"metrics\", \"addr\": \"[::]:8080\"}\n1.661412913723448e+09   INFO    Starting server {\"kind\": \"health probe\", \"addr\": \"[::]:8081\"}\n1.661412913723506e+09   INFO    controller.podcount     Starting EventSource    {\"reconciler group\": \"zouxapp.github.com\", \"reconciler kind\": \"PodCount\", \"source\": \"kind source: *v1.PodCount\"}\n1.661412913723542e+09   INFO    controller.podcount     Starting Controller     {\"reconciler group\": \"zouxapp.github.com\", \"reconciler kind\": \"PodCount\"}\n1.661412913825421e+09   INFO    controller.podcount     Starting workers        {\"reconciler group\": \"zouxapp.github.com\", \"reconciler kind\": \"PodCount\", \"worker count\": 1}\nI0825 15:35:13.825542   63628 podcount_controller.go:50] start to reconciling podCount podcount-sample\nI0825 15:35:13.868618   63628 podcount_controller.go:50] start to reconciling podCount podcount-sample\n\n```\n\n**查看发现生效了**\n\n```\nroot# kubectl get podcounts.zouxapp.github.com -oyaml\napiVersion: v1\nitems:\n- apiVersion: zouxapp.github.com/v1\n  kind: PodCount\n  metadata:\n    creationTimestamp: \"2022-08-25T07:01:16Z\"\n    generation: 1\n    name: podcount-sample\n    namespace: default\n    resourceVersion: \"467385745\"\n    selfLink: /apis/zouxapp.github.com/v1/namespaces/default/podcounts/podcount-sample\n    uid: a8b42a4c-1ebd-430a-890f-b0238f4ad125\n 
 spec:\n    count: 3\n  status:\n    count: 3\nkind: List\nmetadata:\n  resourceVersion: \"\"\n  selfLink: \"\"\n```\n\n"
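补充一点：第 4 步中除了 Reconcile，还需要修改 api/v1/podcount_types.go 里的 Spec/Status 定义。下面是一个示意性的写法（字段名是根据上文 CRD yaml 反推的假设，省略了脚手架里的 metav1.TypeMeta/ObjectMeta 等内容，并把 Reconcile 的核心同步规则抽成了一个可独立运行的小函数）：

```go
package main

import "fmt"

// PodCountSpec 对应 api/v1/podcount_types.go 中需要修改的部分（示意）
type PodCountSpec struct {
	// Count 是期望值，对应 CRD yaml 里的 spec.count
	Count int32 `json:"count,omitempty"`
}

// PodCountStatus 是观测到的状态，对应 status.count
type PodCountStatus struct {
	Count int32 `json:"count,omitempty"`
}

// syncStatus 即第4步 Reconcile 的核心逻辑：把 spec.count 同步到 status.count，非正数归零
func syncStatus(spec PodCountSpec) PodCountStatus {
	if spec.Count <= 0 {
		return PodCountStatus{Count: 0}
	}
	return PodCountStatus{Count: spec.Count}
}

func main() {
	fmt.Println(syncStatus(PodCountSpec{Count: 3}).Count)  // 3
	fmt.Println(syncStatus(PodCountSpec{Count: -1}).Count) // 0
}
```

改完 types 文件后记得 make generate / make manifests 重新生成 deepcopy 和 CRD yaml。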
  },
  {
    "path": "k8s/cni/0.章节介绍.md",
    "content": "本章节主要了解cni的相关知识，章节安排如下：\n\n（1）网路基础知识介绍\n\n（2）容器云用到的网络知识\n\n（3）kubelet中的cni介绍\n\n（4）flannel原理分析\n\n（5）cacilo原理分析\n\n（6）如何订制cni \n"
  },
  {
    "path": "k8s/cni/1. 网络基础知识.md",
    "content": "* [1\\. 网络基础知识](#1-网络基础知识)\n  * [1\\.1 基础概念](#11-基础概念)\n  * [1\\.2  一个宿主是如何处理数据包的](#12--一个宿主是如何处理数据包的)\n* [2\\. 物理层工作原理](#2-物理层工作原理)\n* [3\\. 链路层工作原理](#3-链路层工作原理)\n  * [3\\.1  这个包是发给谁的？谁应该接收？](#31--这个包是发给谁的谁应该接收)\n  * [3\\.2 大家都在发，会不会产生混乱？有没有谁先发、谁后发的规则？](#32-大家都在发会不会产生混乱有没有谁先发谁后发的规则)\n  * [3\\.3  如果发送的时候出现了错误，怎么办？](#33--如果发送的时候出现了错误怎么办)\n* [4\\. 网络层工作原理](#4-网络层工作原理)\n  * [4\\.1 ping的流程](#41-ping的流程)\n  * [4\\.2 不同子网之间的ip访问](#42-不同子网之间的ip访问)\n    * [4\\.2\\.1\\. 路由器是如何工作的](#421-路由器是如何工作的)\n    * [4\\.2\\.2 不同子网的ip通信流程](#422-不同子网的ip通信流程)\n    * [4\\.2\\.3 路由是如何设置的](#423-路由是如何设置的)\n* [5\\. 网卡介绍](#5-网卡介绍)\n  * [5\\.1  查看网卡](#51--查看网卡)\n    * [5\\.1\\.1  ifconfig介绍](#511--ifconfig介绍)\n    * [5\\.1\\.2 其他方式查看网卡](#512-其他方式查看网卡)\n  * [5\\.2 虚拟网卡](#52-虚拟网卡)\n    * [5\\.2\\.1 虚拟网卡介绍](#521-虚拟网卡介绍)\n    * [5\\.2\\.2 云计算中的网络计算\\-虚拟网卡](#522-云计算中的网络计算-虚拟网卡)\n\n本节主要重新学习一下非常基础的网络知识，为后面cni的学习做基础。\n\n参考：\n\n* 网络是怎样连接的-[日]户根勤\n* 趣谈网络协议\n\n### 1. 网络基础知识\n\n#### 1.1 基础概念\n\n数据包（packet）：IP 协议传送数据的单位；\n帧（frame）：链路层传送数据的单位；\n节点（node）：实现了 IP 协议的设备；\n路由器（router）：可以转发不是发给自己的 IP 包的设备；\n主机（host）：不是路由器的节点；\n链路（link）：一种通信机制或介质，节点可以通过它在链路层通信。比如以太网、PPP 链接，也包括隧道；\n接口（interface）：节点与链路的连接（可以理解为抽象的“网卡”）；\n链路层地址（link-layer address）：接口的链路层标识符（如以太网的 mac 地址）\n\n#### 1.2  一个宿主是如何处理数据包的\n\n* 网卡收到包之后会判断mac是不是自己的，如果是自己的会触发硬中断、软中断通知cpu收包（链路层）\n* 之后数据包进入内核网络协议栈，做四层处理，iptables、nat之类的（网络层）\n* 然后送到对应的socket缓冲区（传输层）\n* 最后送到用户空间进程（应用层）\n\n![image-20220323175725888](../images/wangluoxieyi.png)\n\n网络协议栈：是操作系统中对网络相关做处理的逻辑。解封包、iptables、route、netns、vxlan、tunnel、等等都是这里面的一块逻辑（hook）\n\n所以不同的network namespaces会有自己不同的网络协议栈，比如有不同的路由规则等等。这样就达到了隔离的作用。\n\n<br>\n\n### 2. 
物理层工作原理\n\n**发送过程**：网卡驱动从 IP 模块获取包之后，会将其复制到网卡内的缓冲区中，然后向 MAC 模块发送发送包的命令。接下来就轮到 MAC 模块进行工作了。 首先，MAC 模块会将包从缓冲区中取出，并在开头加上报头和起始帧 分界符，在末尾加上用于检测错误的帧校验序列。\n\n报头是一串像 10101010…这样 1 和 0 交替出现的比特序列，长度为 56 比特。当这些 1010 的比特序列被转换成电 信号后，会形成如图 高低电平的波形。然后通过光缆或者网线传输出去。\n\n<br>\n\n**集线器**，也叫做**Hub**。这种设备有多个口，可以将多台电脑连接起来。但是和交换机不同，集线器没有大脑，它完全在物理层工作。它会将自己收到的每一个字节，都复制到其他端口上去。这是第一层物理层联通的方案。\n\n### 3. 链路层工作原理\n\n有了物理层的基础就能做到，不同主机直接可以发送数据。但是还有几个问题需要解决\n\n#### 3.1  这个包是发给谁的？谁应该接收？\n\n* 在发送数据的时候，链路层会包装头部，指定目标地址的mac地址。\n\n* 接受时通过mac地址来确定谁应该接受包\n\n* 最开始可能会不知道某个ip的MAC地址，通过 arp 协议在局域网里面广播一下，ip XXX你的mac地址是啥\n\n####  3.2 大家都在发，会不会产生混乱？有没有谁先发、谁后发的规则？\n\n使用的是随机接入协议：有事儿先出门，发现特堵，就回去。错过高峰再出（退避算法）。\n\n#### 3.3  如果发送的时候出现了错误，怎么办？\n\n**循环冗余检测**。通过 XOR 异或的算法，来计算整个包是否在发送的过程中出现了错误。\n\n<br>\n\n链路层工作的设备是交换机。交换机是在局域网工作的，本身不需要ip\n\n交换机的作用就是根据 mac地址进行端口转发。交换机有学习功能，举例如下：\n\n如果机器 1 只知道机器 4 的 IP 地址，当它想要访问机器 4，把包发出去的时候，它必须要知道机器 4 的 MAC 地址。\n\n于是机器 1 发起广播，机器 2 收到这个广播，但是这不是找它的，所以没它什么事。交换机 A 一开始是不知道任何拓扑信息的，在它收到这个广播后，采取的策略是，除了广播包来的方向外，它还要转发给其他所有的网口。于是机器 3 也收到广播信息了，但是这和它也没什么关系。\n\n当然，交换机 B 也是能够收到广播信息的，但是这时候它也是不知道任何拓扑信息的，因而也是进行广播的策略，将包转发到局域网三。这个时候，机器 4 和机器 5 都收到了广播信息。机器 4 主动响应说，这是找我的，这是我的 MAC 地址。于是一个 ARP 请求就成功完成了。\n\n![image-20220324150807288](../images/jiaohuanji.png)\n\n\n\n这里可以会有一个问题，就是可能局域网的机器太多，交换机数量也多，然后就会出现回路。这个时候可能就会出现广播风暴。解决办法就是通过STP 协议解决\n\n### 4. 
网络层工作原理\n\n#### 4.1 ping的流程\n\nping 是基于 ICMP 协议工作的。**ICMP**全称**Internet Control Message Protocol**，就是**互联网控制报文协议**。\n\n假定主机 A 的 IP 地址是 192.168.1.1，主机 B 的 IP 地址是 192.168.1.2，它们都在同一个子网。那当你在主机 A 上运行“ping 192.168.1.2”后，会发生什么呢?\n\n（1）ping 命令执行的时候，源主机首先会构建一个 ICMP 请求数据包，ICMP 数据包内包含多个字段。最重要的是两个，第一个是**类型字段**，对于请求数据包而言该字段为 8；另外一个是**顺序号**，主要用于区分连续 ping 的时候发出的多个数据包。每发出一个请求数据包，顺序号会自动加 1。为了能够计算往返时间 RTT，它会在报文的数据部分插入发送时间。\n\n（2）然后，由 ICMP 协议将这个数据包连同地址 192.168.1.2 一起交给 IP 层。IP 层将以 192.168.1.2 作为目的地址，本机 IP 地址作为源地址，加上一些其他控制信息，构建一个 IP 数据包。\n\n（3）接下来，需要加入 MAC 头。如果在本节 ARP 映射表中查找出 IP 地址 192.168.1.2 所对应的 MAC 地址，则可以直接使用；如果没有，则需要发送 ARP 协议查询 MAC 地址，获得 MAC 地址后，由数据链路层构建一个数据帧，目的地址是 IP 层传过来的 MAC 地址，源地址则是本机的 MAC 地址；还要附加上一些控制信息，依据以太网的介质访问规则，将它们传送出去。\n\n（4）主机 B 收到这个数据帧后，先检查它的目的 MAC 地址，并和本机的 MAC 地址对比，如符合，则接收，否则就丢弃。接收后检查该数据帧，将 IP 数据包从帧中提取出来，交给本机的 IP 层。同样，IP 层检查后，将有用的信息提取后交给 ICMP 协议。\n\n（5）主机 B 会构建一个 ICMP 应答包，应答数据包的类型字段为 0，顺序号为接收到的请求数据包中的顺序号，然后再发送出去给主机 A。\n\n（6）在规定的时候间内，源主机如果没有接到 ICMP 的应答包，则说明目标主机不可达；如果接收到了 ICMP 应答包，则说明目标主机可达。此时，源主机会检查，用当前时刻减去该数据包最初从源主机上发出的时刻，就是 ICMP 数据包的时间延迟。\n\n![image-20220324153428137](../images/jiaohuanji-1.png)\n\n<br>\n\n#### 4.2 不同子网之间的ip访问\n\n##### 4.2.1. 
路由器是如何工作的\n\n**路由器是一台设备，它有每个网口或者网卡，都连着局域网。每只手的 IP 地址都和局域网的 IP 地址相同的网段，每只手都是它握住的那个局域网的网关。**\n\n其实就是路由器有多个端口，每个端口配置了ip，端口A配置了一个子网A的ip。\n\n端口B配置了一个子网B的ip。\n\n所以子网AB通过路由器就可以通信了。\n\n<br>\n\nGateway 的地址一定是和源 IP 地址是一个网段的。往往不是第一个，就是第二个。\n\n例如 192.168.1.0/24 这个网段，Gateway 往往会是 192.168.1.1/24 或者 192.168.1.2/24。\n\n网关主要是用来连接两种不同的网络，同时，网关它还能够同时与两边的主机之间进行通信。但是两边的主机是不能够直接进行通信，是必须要经过网关才能进行通信。网关的工作是在应用层当中。简单来说，网关它就是为了管理不同网段的IP，我们一般在交换机上做VLAN的时候，就需要在默认的VLAN接口之下做一个IP，而这个IP它就是我们所说的网关。\n\n<br>\n\n##### 4.2.2 不同子网的ip通信流程\n\n![mac-header](../images/mac.png)\n\n**mac头部如上所示**：在 MAC 头里面，先是目标 MAC 地址，然后是源 MAC 地址，然后有一个协议类型，用来说明里面是 IP 协议。IP 头里面的版本号，目前主流的还是 IPv4，服务类型 TOS 在第三节讲 ip addr 命令的时候讲过，TTL 在第 7 节讲 ICMP 协议的时候讲过。另外，还有 8 位标识协议。这里到了下一层的协议，也就是，是 TCP 还是 UDP。最重要的就是源 IP 和目标 IP。先是源 IP 地址，然后是目标 IP 地址。\n\n在任何一台机器上，当要访问另一个 IP 地址的时候，都会先判断，这个目标 IP 地址，和当前机器的 IP 地址，是否在同一个网段。怎么判断同一个网段呢？需要 CIDR 和子网掩码，这个在第三节的时候也讲过了。\n\n**如果是同一个网段**，例如，你访问你旁边的兄弟的电脑，那就没网关什么事情，直接将源地址和目标地址放入 IP 头中，然后通过 ARP 获得 MAC 地址，将源 MAC 和目的 MAC 放入 MAC 头中，发出去就可以了。\n\n**如果不是同一网段**，例如，你要访问你们校园网里面的 BBS，该怎么办？这就需要发往默认网关 Gateway。Gateway 的地址一定是和源 IP 地址是一个网段的。往往不是第一个，就是第二个。例如 192.168.1.0/24 这个网段，Gateway 往往会是 192.168.1.1/24 或者 192.168.1.2/24。\n\n**举例说明：**\n\n![image-20220324162338881](../images/luyou.png)\n\n服务器A属于子网： 192.168.1.101/24\n\n服务器B属于子网：192.168.4.101/24\n\nA服务器需要访问B服务器。访问的过程如下：\n\n（1）服务器A配置了mac包信息\n\n- 源 MAC：服务器 A 的 MAC\n- 目标 MAC：192.168.1.1 **这个网口的 MAC **    //注意这里是吓一跳的mac地址，而不是目标地址的mac地址\n- 源 IP：192.168.1.101\n- 目标 IP：192.168.4.101\n\n这里为什么会知道mac地址，是因为服务器A会通过自己的路由设置，判断这个包下一跳是路由器 192.168.1.1 。由于192.168.1.1 和服务器A是一个子网。所以是可以知道mac地址，并且可以通过mac地址将包发送给路由器的。\n\n（2）包到达 192.168.1.1 这个网口，发现 MAC 一致，将包收进来，开始思考往哪里转发。\n\n在路由器 A 中配置了静态路由之后，要想访问 192.168.4.0/24，要从 192.168.56.1 这个口出去，下一跳为 192.168.56.2。\n\n这个时候mac地址变程了192.168.56.2的mac\n\n（3）包到达 192.168.56.2 这个网口，发现 MAC 一致，将包收进来，开始思考往哪里转发。\n\n在路由器 B 中配置了静态路由，要想访问 192.168.4.0/24，要从 192.168.4.1 这个口出去，没有下一跳了。因为我右手这个网卡，就是这个网段的，我是最后一跳了。\n\n于是，路由器 B 思考的时候，匹配上了这条路由，要从 192.168.4.1 这个口发出去，发给 192.168.4.101。那 
192.168.4.101 的 MAC 地址是多少呢？路由器 B 发送 ARP 获取 192.168.4.101 的 MAC 地址，然后发送包。\n\n通过这个过程可以看出，每到一个新的局域网，MAC 都是要变的，但是 IP 地址都不变。在 IP 头里面，不会保存任何网关的 IP 地址。**所谓的下一跳是，某个 IP 要将这个 IP 地址转换为 MAC 放入 MAC 头。**\n\n<br>\n\n有的时候是不同的私有网络访问，可能2个子网都是一致的，这个时候路由器/网络会有NAT 转换功能。\n\n<br>\n\n##### 4.2.3 路由是如何设置的\n\n（1）静态路由可以通过手动配置，修改iptables规则等\n\n（2）动态路由通过链路状态路由算法等等动态设置\n\n<br>\n\n### 5. 网卡介绍\n\n网卡的作用是负责接收网络上的数据包，通过和自己本身的物理地址相比较决定是否为本机应接信息，解包后将数据通过主板上的总线传输给本地计算机，另一方面将本地计算机上的数据打包后送出网络。 网卡是一块被设计用来允许计算机在计算机网络上进行通讯的计算机硬件。 由于其拥有MAC地址，因此属于OSI模型的第2层。\n\n<br>\n\n#### 5.1  查看网卡\n\n##### 5.1.1  ifconfig介绍\n\n```\nroot@onlinegame:/home/zouxiang# ifconfig\nbr-15db8aed13ee: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500\n        inet 172.18.0.1  netmask 255.255.0.0  broadcast 172.18.255.255   // 桥接\n        ether 02:42:e7:49:61:ba  txqueuelen 0  (Ethernet)\n        RX packets 24  bytes 2202 (2.1 KiB)\n        RX errors 0  dropped 0  overruns 0  frame 0\n        TX packets 29  bytes 1965 (1.9 KiB)\n        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0\n\nbr-837c9a286528: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500\n        inet 172.19.0.1  netmask 255.255.0.0  broadcast 172.19.255.255\n        ether 02:42:6a:d2:3e:4b  txqueuelen 0  (Ethernet)\n        RX packets 0  bytes 0 (0.0 B)\n        RX errors 0  dropped 0  overruns 0  frame 0\n        TX packets 0  bytes 0 (0.0 B)\n        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0\n\ndocker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500\n        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255\n        ether 02:42:17:8d:1d:41  txqueuelen 0  (Ethernet)\n        RX packets 180476  bytes 12421319 (11.8 MiB)\n        RX errors 0  dropped 0  overruns 0  frame 0\n        TX packets 289194  bytes 417833816 (398.4 MiB)\n        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0\n\neth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1400\n        inet 10.212.31.96  netmask 255.255.255.0  broadcast 10.212.31.255\n        ether 
52:54:00:2d:09:10  txqueuelen 1000  (Ethernet)\n        RX packets 31059650  bytes 6131166764 (5.7 GiB)\n        RX errors 0  dropped 0  overruns 0  frame 0\n        TX packets 29660001  bytes 6106589785 (5.6 GiB)\n        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0\n\nlo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536               // loop设备\n        inet 127.0.0.1  netmask 255.0.0.0\n        loop  txqueuelen 1  (Local Loopback)\n        RX packets 10494509  bytes 790297025 (753.6 MiB)\n        RX errors 0  dropped 0  overruns 0  frame 0\n        TX packets 10494509  bytes 790297025 (753.6 MiB)\n        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0\n```\n\neth0 表示第一块网卡， 其中 ether表示网卡的mac地址，可以看到目前这个网卡的物理地址(MAC地址）是 52:54:00:2d:09:10\n\ninet addr 用来表示网卡的IP地址，此网卡的 IP地址是 192.168.120.204，广播地址， Bcast:192.168.120.255，掩码地址Mask:255.255.255.0\n\nlo 是表示主机的回环地址，这个一般是用来测试一个网络程序，但又不想让局域网或外网的用户能够查看，只能在此台主机上运行和查看所用的网络接口。比如把 HTTPD服务器的指定到回坏地址，在浏览器输入 127.0.0.1 就能看到你所架WEB网站了。但只是您能看得到，局域网的其它主机或用户无从知道。\n\n第一行：连接类型：Ethernet（以太网）HWaddr（硬件mac地址）\n\n第二行：网卡的IP地址、子网、掩码\n\n第三行：UP（代表网卡开启状态）RUNNING（代表网卡的网线被接上）MULTICAST（支持组播）MTU:1500（最大传输单元）：1500字节\n\n第四、五行：接收、发送数据包情况统计\n\nRX packets: errors:0 dropped:0 overruns:0 frame:0 接受包数量/出错数量/丢失数量…\n\nTX packets: errors:0 dropped:0 overruns:0 carrier:0 发送包数量/出错数量/丢失数量…\n\n**loop设备**\n\nlo 是loop设备的意思，地址是127.0.0.1即本机回送地址，一般网站服务本地测试的时候时候这个ip进行本地测试\n\n第七行：接收、发送数据字节数统计信息。\n\n**桥接**\n\n真实主机中安装的虚拟主机，需要和外界主机进行通讯的时候，数据需要通过真实主机的网卡进行传输，但是虚拟主机内核无法对真实主机的网卡进行控制，一般情况下需要将虚拟主机先将数据包发送给真实主机的内核，再由真实主机内核将该数据通过真实物理网卡发送出去，该过程成为NAT（网络地址转换），虽然可以实现该功能，但是数据传数度较慢。\n\n怎么办呢？\nlinux内核支持网络接口的桥接，什么意思？就是说可以由真实主机的内核虚拟出来一个接口br0，同时这个也是一个对外的虚拟网卡设备，通过该接口可以将虚拟主机网卡和真实主机网卡直接连接起来，进行正常的数据通讯，提升数据传输效率。该过程就是桥接。（目前只支持以太网接口，linux内核是通过一个虚拟的网桥设备来实现虚拟桥接接口的，这个虚拟设备可以绑定若干个以太网接口设备，从而将它们桥接起来）\n\n##### 5.1.2 其他方式查看网卡\n\n```\nroot@# ip link\n1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\n    link/loopback 00:00:00:00:00:00 brd 
00:00:00:00:00:00\n24578: usb0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n    link/ether 3a:68:dd:49:76:07 brd ff:ff:ff:ff:ff:ff\n2: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n    link/ether 04:3f:72:ed:d5:8a brd ff:ff:ff:ff:ff:ff\n3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 10000\n    link/ether 04:3f:72:ed:d5:8b brd ff:ff:ff:ff:ff:ff\n4: eth3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n    link/ether 04:3f:72:ed:d5:9a brd ff:ff:ff:ff:ff:ff\n5: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n    link/ether 04:3f:72:ed:d5:9b brd ff:ff:ff:ff:ff:ff\n6: eth4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n    link/ether 04:3f:72:ed:d5:be brd ff:ff:ff:ff:ff:ff\n7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n    link/ether 04:3f:72:ed:d5:bf brd ff:ff:ff:ff:ff:ff\n9: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n    link/ether 16:00:1c:04:75:4c brd ff:ff:ff:ff:ff:ff\n10: acc-int: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n    link/ether 82:ca:f2:02:d4:4b brd ff:ff:ff:ff:ff:ff\n12: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default\n    link/ether 02:42:78:6b:b0:54 brd ff:ff:ff:ff:ff:ff\n17: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN mode DEFAULT group default qlen 1000\n    link/ether 32:35:2a:28:fd:ab brd ff:ff:ff:ff:ff:ff\n71699: qvo_d888ac@if71700: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000\n    link/ether fe:54:00:dd:11:4f brd 
ff:ff:ff:ff:ff:ff link-netns ns_network\n71703: qvo_8f5819@if71704: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000\n    link/ether fe:54:00:4e:d0:7b brd ff:ff:ff:ff:ff:ff link-netns ns_network\n71451: qvo_11a63f@if71452: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000\n    link/ether fe:54:00:10:56:1e brd ff:ff:ff:ff:ff:ff link-netns ns_network\n71731: qvo_0d237c@if71732: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000\n    link/ether fe:54:00:fb:d9:ff brd ff:ff:ff:ff:ff:ff link-netns ns_network\n71514: veth14feb1d@if71513: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP mode DEFAULT group default\n    link/ether 8e:f8:a2:35:53:ba brd ff:ff:ff:ff:ff:ff link-netnsid 7\n71783: qvo_87db25@if71784: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000\n    link/ether fe:54:00:36:a2:4a brd ff:ff:ff:ff:ff:ff link-netns ns_network\n71607: qvo_759d55@if71608: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000\n    link/ether fe:54:00:41:28:b0 brd ff:ff:ff:ff:ff:ff link-netns ns_network\n71611: qvo_988854@if71612: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000\n    link/ether fe:54:00:f1:56:1e brd ff:ff:ff:ff:ff:ff link-netns ns_network\n71387: qvo_9b9a37@if71388: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000\n    link/ether fe:54:00:02:cc:4f brd ff:ff:ff:ff:ff:ff link-netns ns_network\n\n\n// 查看所有的网卡，这个和上面的ip link是一样的\nroot# cd /sys/class/net\nroot# ls\nacc-int  docker0  eth0\teth1  eth2  eth3  eth4\teth5  lo  ovs-system  qvo_0d237c  qvo_11a63f 
 qvo_759d55  qvo_87db25  qvo_8f5819  qvo_988854  qvo_9b9a37  qvo_d888ac  usb0  veth14feb1d  vxlan_sys_4789\n```\n\n#### 5.2 虚拟网卡\n\n##### 5.2.1 虚拟网卡介绍\n\n虚拟网卡简单来说就是通过软件模拟出来的电脑网卡。在虚拟化中经常用到。\n\n```\n// 查看/sys/devices/virtual/net/这个目录，可以判断出哪些是虚拟网卡\nroot# ls /sys/devices/virtual/net/\nacc-int  docker0  lo  ovs-system  qvo_0d237c  qvo_11a63f  qvo_16bb20  qvo_759d55  qvo_8f5819  qvo_988854  qvo_9b9a37  qvo_d888ac  veth14feb1d  vxlan_sys_4789\nroot@cld-dnode1-1051:/sys/class/net#\n```\n\n<br>\n\n虚拟网卡的实际工作原理就是：\n\n协议栈处理完的数据包会从网卡送出，这些网卡可能是虚拟网卡，虚拟网卡最终会通过IO将数据送到物理网卡（NIC），然后发送出去。\n\n虚拟网卡和物理网卡的连接方式有很多种，比如桥接（通过bridge连接虚拟网卡和物理网卡）。\n\n不同namespaces经常通过veth-pair连接。\n\n例如，docker内部其实就是用的veth-pair：一个虚拟网卡在容器内部，一个在宿主机上，两者之间进行通信。\n\nveth-pair就是一对虚拟网卡设备，往一个网卡发送数据，另一个网卡就能收到。\n\n```\nbash-5.1$ route\nKernel IP routing table\nDestination     Gateway         Genmask         Flags Metric Ref    Use Iface\ndefault         7.53.64.65      0.0.0.0         UG    0      0        0 eth0\n7.53.64.64      *               255.255.255.192 U     0      0        0 eth0\nbash-5.1$\nbash-5.1$ ifconfig\neth0      Link encap:Ethernet  HWaddr 52:54:00:BA:9F:2D\n          inet addr:7.53.64.112  Bcast:0.0.0.0  Mask:255.255.255.192\n          UP BROADCAST RUNNING MULTICAST  MTU:1400  Metric:1\n          RX packets:132698569 errors:0 dropped:0 overruns:0 frame:0\n          TX packets:129715622 errors:0 dropped:0 overruns:0 carrier:0\n          collisions:0 txqueuelen:1000\n          RX bytes:11982615680 (11.1 GiB)  TX bytes:13782715446 (12.8 GiB)\n\nlo        Link encap:Local Loopback\n          inet addr:127.0.0.1  Mask:255.0.0.0\n          UP LOOPBACK RUNNING  MTU:65536  Metric:1\n          RX packets:4567760 errors:0 dropped:0 overruns:0 frame:0\n          TX packets:4567760 errors:0 dropped:0 overruns:0 carrier:0\n          collisions:0 txqueuelen:1000\n          RX bytes:332750268 (317.3 MiB)  TX bytes:332750268 (317.3 MiB)\n```\n\n##### 5.2.2 
云计算中的网络计算-虚拟网卡\n\n虚拟网卡的作用有很多，当前云计算技术中就离不开虚拟网卡。\n\n云计算中的网络有以下的点需要实现：\n\n（1）**共享**：尽管每个虚拟机都会有一个或者多个虚拟网卡，但是物理机上可能只有有限的网卡。那这么多虚拟网卡如何共享同一个出口？\n\n**通过网桥解决共享问题**\n\n![image-20220325154103687](../images/wangka.png)\n\n（2）**隔离**：分两个方面，一个是安全隔离，两个虚拟机可能属于两个用户，那怎么保证一个用户的数据不被另一个用户窃听？一个是流量隔离，两个虚拟机，如果有一个疯狂下片，会不会导致另外一个上不了网？\n\n有一个命令**vconfig**，可以基于物理网卡 eth0 创建带 VLAN 的虚拟网卡，所有从这个虚拟网卡出去的包，都带这个 VLAN，如果这样，跨物理机的互通和隔离就可以通过这个网卡来实现。\n\n不同的用户由于网桥不通，不能相互通信，一旦出了网桥，由于 VLAN 不同，也不会将包转发到另一个网桥上。另外，出了物理机，也是带着 VLAN ID 的。只要物理交换机也是支持 VLAN 的，到达另一台物理机的时候，VLAN ID 依然在，它只会将包转发给相同 VLAN 的网卡和网桥，所以跨物理机，不同的 VLAN 也不会相互通信。\n\n![image-20220325155034439](../images/wangka-2.png)\n\n（3） **互通**：分两个方面，一个是如果同一台机器上的两个虚拟机，属于同一个用户的话，这两个如何相互通信？另一个是如果不同物理机上的两个虚拟机，属于同一个用户的话，这两个如何相互通信？\n\n如上\n\n（4）**灵活**：虚拟机和物理不同，会经常创建、删除，从一个机器漂移到另一台机器，有的互通、有的不通等等，灵活性比物理网络要好得多，需要能够灵活配置。\n\n通过OpenvSwitch 配置\n\n<br>\n\n虚拟网卡的介绍：https://keenjin.github.io/2019/06/virtual-net/\n"
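上文 4.2.2 提到，主机发包前会先判断目标 IP 和自己是否在同一网段：同网段直接 ARP 拿对方的 MAC，不同网段则把包发给默认网关。这个判断可以用 Go 标准库 net 包写一个最小示意（IP 取自上文的例子）：

```go
package main

import (
	"fmt"
	"net"
)

// sameSubnet 判断目标 IP 是否与本机网卡在同一网段：
// 同网段直接二层可达（ARP 拿对方 MAC），不同网段需要把包交给默认网关
func sameSubnet(localCIDR, dstIP string) bool {
	// ParseCIDR 返回网卡 IP 和所在子网（如 192.168.1.0/24）
	_, subnet, err := net.ParseCIDR(localCIDR)
	if err != nil {
		panic(err)
	}
	return subnet.Contains(net.ParseIP(dstIP))
}

func main() {
	fmt.Println(sameSubnet("192.168.1.101/24", "192.168.1.1"))   // true：直接 ARP
	fmt.Println(sameSubnet("192.168.1.101/24", "192.168.4.101")) // false：走网关
}
```

这正是 CIDR 和子网掩码在发包路径里的作用：对目标 IP 按掩码做一次匹配，决定下一跳是目标本身还是网关。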
  },
  {
    "path": "k8s/cni/2. docker 4种 网络模式.md",
    "content": "* [1\\. 介绍](#1-介绍)\n* [2 bridge模式](#2-bridge模式)\n* [3 host模式](#3-host模式)\n* [4\\. none模式](#4-none模式)\n* [5 container模式](#5-container模式)\n\n### 1. 介绍\n\ndocker run创建Docker容器时，可以用–net选项指定容器的网络模式，Docker有以下4种网络模式：\n\n（1）bridge模式：使用–net =bridge指定，默认设置；\n\n（2）host模式：使用–net =host指定；\n\n（3）none模式：使用–net =none指定；\n\n（4）container模式：使用–net =container:NAMEorID指定。\n\n\n### 2 bridge模式\n\nbridge模式是Docker默认的网络设置，此模式会为每一个容器分配Network Namespace、设置IP等，并将并将一个主机上的Docker容器\n\n连接到一个虚拟网桥上。当Docker server启动时，会在主机上创建一个名为docker0的虚拟网桥，此主机上启动的Docker容器会连接到\n\n这个虚拟网桥上。虚拟网桥的工作方式和物理交换机类似，这样主机上的所有容器就通过交换机连在了一个二层网络中。接下来就要为容\n\n器分配IP了，Docker会从RFC1918所定义的私有IP网段中，选择一个和宿主机不同的IP地址和子网分配给docker0，连接到docker0的容\n\n器就从这个子网中选择一个未占用的IP使用。如一般Docker会使用172.17.0.0/16这个网段，并将172.17.42.1/16分配给docker0网桥（在\n\n主机上使用ifconfig命令是可以看到docker0的，可以认为它是网桥的管理端口，在宿主机上作为一块虚拟网卡使用）\n\n可以看到容器内部，有eth0, 并且可以ping 通外网\n\n```\nroot@k8s-node:~# docker run -it -u root curlimages/curl:7.75.0 sh\n/ # ping www.baidu.com\nPING www.baidu.com (183.232.231.172): 56 data bytes\n64 bytes from 183.232.231.172: seq=0 ttl=55 time=1.892 ms\n64 bytes from 183.232.231.172: seq=1 ttl=55 time=1.833 ms\n64 bytes from 183.232.231.172: seq=2 ttl=55 time=1.834 ms\n^Z[1]+  Stopped                    ping www.baidu.com\n\n/ # ip addr\n1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000\n    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00\n    inet 127.0.0.1/8 scope host lo\n       valid_lft forever preferred_lft forever\n53: eth0@if54: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP\n    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff\n    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0\n       valid_lft forever preferred_lft forever\n```\n\n<br>\n\n### 3 host模式\n\n如果启动容器的时候使用host模式，那么这个容器将不会获得一个独立的Network Namespace，而是和宿主机共用一个Network Namespace。容器将不会虚拟出自己的网卡，配置自己的IP等，而是使用宿主机的IP和端口。\n\n使用host模式启动容器后可以发现，使用ip 
addr查看网络环境时，看到的都是宿主机上的信息。这种方式创建出来的容器，可以看到host上的所有网络设备。就是继承了宿主的网络\n\n```\nroot@k8s-node:~# docker run -it -u root --net=host curlimages/curl:7.75.0 sh\n/ # ip addr\n1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000\n    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00\n    inet 127.0.0.1/8 scope host lo\n       valid_lft forever preferred_lft forever\n    inet6 ::1/128 scope host\n       valid_lft forever preferred_lft forever\n2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000\n    link/ether fa:28:00:0d:3c:2f brd ff:ff:ff:ff:ff:ff\n    inet 172.16.16.5/20 brd 172.16.31.255 scope global eth0\n       valid_lft forever preferred_lft forever\n    inet6 fe80::f828:ff:fe0d:3c2f/64 scope link\n       valid_lft forever preferred_lft forever\n3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN\n    link/ether 02:42:f5:b6:cc:ca brd ff:ff:ff:ff:ff:ff\n    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0\n       valid_lft forever preferred_lft forever\n    inet6 fe80::42:f5ff:feb6:ccca/64 scope link\n       valid_lft forever preferred_lft forever\n4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN\n    link/ether b6:fa:84:04:82:55 brd ff:ff:ff:ff:ff:ff\n    inet 10.244.1.0/32 brd 10.244.1.0 scope global flannel.1\n       valid_lft forever preferred_lft forever\n    inet6 fe80::b4fa:84ff:fe04:8255/64 scope link\n       valid_lft forever preferred_lft forever\n5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP qlen 1000\n    link/ether a2:34:ac:2b:00:a3 brd ff:ff:ff:ff:ff:ff\n    inet 10.244.1.1/24 brd 10.244.1.255 scope global cni0\n       valid_lft forever preferred_lft forever\n    inet6 fe80::a034:acff:fe2b:a3/64 scope link\n       valid_lft forever preferred_lft forever\n42: veth0@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN qlen 1000\n    link/ether 62:bb:3c:a3:ac:31 brd ff:ff:ff:ff:ff:ff\n43: 
veth1@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN qlen 1000\n    link/ether 32:9d:76:89:b4:7a brd ff:ff:ff:ff:ff:ff\n44: veth2@veth3: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN qlen 1000\n    link/ether 4e:11:eb:21:3a:16 brd ff:ff:ff:ff:ff:ff\n45: veth3@veth2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN qlen 1000\n    link/ether 1a:74:4a:dd:98:2d brd ff:ff:ff:ff:ff:ff\n46: veth4@veth5: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN qlen 1000\n    link/ether 1e:f5:74:3f:ae:00 brd ff:ff:ff:ff:ff:ff\n47: veth5@veth4: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN qlen 1000\n    link/ether 36:a6:e6:d8:49:53 brd ff:ff:ff:ff:ff:ff\n48: veth8625ade0@docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP\n    link/ether f6:57:1a:a5:65:f7 brd ff:ff:ff:ff:ff:ff\n    inet6 fe80::f457:1aff:fea5:65f7/64 scope link\n       valid_lft forever preferred_lft forever\n```\n\n### 4. none模式\n\n在none模式下，Docker容器拥有自己的Network Namespace，但是，并不为Docker容器进行任何网络配置。也就是说，这个Docker容器没有网卡、IP、路由等信息。需要我们自己为Docker容器添加网卡、配置IP等。\n\n```\nroot@k8s-node:~# docker run -it -u root --net=none curlimages/curl:7.75.0 sh\n/ # ip addr\n1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000\n    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00\n    inet 127.0.0.1/8 scope host lo\n       valid_lft forever preferred_lft forever\n/ #\n```\n\n### 5 container模式\n\n\n这个模式指定新创建的容器和已经存在的一个容器共享一个Network Namespace，而不是和宿主机共享。新创建的容器不会创建自己的网卡，配置自己的IP，而是和一个指定的容器共享IP、端口范围等。同样，两个容器除了网络方面，其他的如文件系统、进程列表等还是隔离的。两个容器的进程可以通过lo网卡设备通信。\n\n```\nd66875e6adc3是一个bridge的容器\nroot@k8s-node:~# docker run -it -u root --net=container:d66875e6adc3 curlimages/curl:7.75.0 sh\n/ # ip addr\n1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000\n    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00\n    inet 127.0.0.1/8 scope host lo\n       valid_lft forever preferred_lft forever\n3: 
eth0@if48: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue state UP\n    link/ether 32:db:65:58:d2:29 brd ff:ff:ff:ff:ff:ff\n    inet 10.244.1.10/24 brd 10.244.1.255 scope global eth0\n       valid_lft forever preferred_lft forever\n/ # exit\n\n\n622ee25b7390 是一个hostNetwork的容器\nroot@k8s-node:~# docker run -it -u root --net=container:622ee25b7390  curlimages/curl:7.75.0 sh\n/ # ip addr\n1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000\n    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00\n    inet 127.0.0.1/8 scope host lo\n       valid_lft forever preferred_lft forever\n    inet6 ::1/128 scope host\n       valid_lft forever preferred_lft forever\n2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000\n    link/ether fa:28:00:0d:3c:2f brd ff:ff:ff:ff:ff:ff\n    inet 172.16.16.5/20 brd 172.16.31.255 scope global eth0\n       valid_lft forever preferred_lft forever\n    inet6 fe80::f828:ff:fe0d:3c2f/64 scope link\n       valid_lft forever preferred_lft forever\n3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN\n    link/ether 02:42:f5:b6:cc:ca brd ff:ff:ff:ff:ff:ff\n    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0\n       valid_lft forever preferred_lft forever\n    inet6 fe80::42:f5ff:feb6:ccca/64 scope link\n       valid_lft forever preferred_lft forever\n4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN\n    link/ether b6:fa:84:04:82:55 brd ff:ff:ff:ff:ff:ff\n    inet 10.244.1.0/32 brd 10.244.1.0 scope global flannel.1\n       valid_lft forever preferred_lft forever\n    inet6 fe80::b4fa:84ff:fe04:8255/64 scope link\n       valid_lft forever preferred_lft forever\n5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP qlen 1000\n    link/ether a2:34:ac:2b:00:a3 brd ff:ff:ff:ff:ff:ff\n    inet 10.244.1.1/24 brd 10.244.1.255 scope global cni0\n       valid_lft forever preferred_lft forever\n    
inet6 fe80::a034:acff:fe2b:a3/64 scope link\n       valid_lft forever preferred_lft forever\n42: veth0@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN qlen 1000\n    link/ether 62:bb:3c:a3:ac:31 brd ff:ff:ff:ff:ff:ff\n43: veth1@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN qlen 1000\n    link/ether 32:9d:76:89:b4:7a brd ff:ff:ff:ff:ff:ff\n44: veth2@veth3: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN qlen 1000\n    link/ether 4e:11:eb:21:3a:16 brd ff:ff:ff:ff:ff:ff\n45: veth3@veth2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN qlen 1000\n    link/ether 1a:74:4a:dd:98:2d brd ff:ff:ff:ff:ff:ff\n46: veth4@veth5: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN qlen 1000\n    link/ether 1e:f5:74:3f:ae:00 brd ff:ff:ff:ff:ff:ff\n47: veth5@veth4: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN qlen 1000\n    link/ether 36:a6:e6:d8:49:53 brd ff:ff:ff:ff:ff:ff\n48: veth8625ade0@docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP\n    link/ether f6:57:1a:a5:65:f7 brd ff:ff:ff:ff:ff:ff\n    inet6 fe80::f457:1aff:fea5:65f7/64 scope link\n       valid_lft forever preferred_lft forever\n```\n\n"
  },
  {
    "path": "k8s/cni/3. docker容器网络的底层实现.md",
    "content": "* [1\\. 背景](#1-背景)\n* [2\\. 如何理解network namespaces](#2-如何理解network-namespaces)\n* [3\\. 不同namespaces之间是如何通信的](#3-不同namespaces之间是如何通信的)\n  * [3\\.1 创建network namespace](#31-创建network-namespace)\n  * [3\\.2 两个networknamespaces之间的通信](#32-两个networknamespaces之间的通信)\n* [4\\. 多个namespaces之间的通信](#4-多个namespaces之间的通信)\n  * [4\\.1 创建3个namespaces](#41-创建3个namespaces)\n  * [4\\.2 创建bridge](#42-创建bridge)\n  * [4\\.3 创建 veth pair](#43-创建-veth-pair)\n  * [4\\.4 将 veth pair 的一头挂到 namespace 中，一头挂到 bridge 上，并设 IP 地址](#44-将-veth-pair-的一头挂到-namespace-中一头挂到-bridge-上并设-ip-地址)\n  * [4\\.5 验证多Namespaces互通](#45-验证多namespaces互通)\n* [5\\. 补充](#5-补充)\n  * [5\\.1 如何查看容器内和 宿主的 veth pair对](#51-如何查看容器内和-宿主的-veth-pair对)\n\n### 1. 背景\n\n上文提到的docker 4中网络模式，核心就是network namespaces的不同，比如可以共享宿主的network(hostnetwork)。容器网络模式的核心就是：\n\n通过network namespaces隔离了各个容器，然后通过设置veth pari, bride等虚拟网络设备来达到   容器networks 与宿主hostnetwork的通信。最终是通过宿主的物理网卡进行了传输。\n\n本节就是说明一些docker 网络到底是如何实现的。\n\n<br>\n\n### 2. 如何理解network namespaces\n\n摘抄自：容器实战高手课-极客时间\n\n对于 Network Namespace，我们从字面上去理解的话，可以知道它是在一台 Linux 节点上对网络的隔离，不过它具体到底隔离了哪部分的网络资源呢？\n\n我们还是先来看看操作手册，在Linux Programmer’s Manual里对 [Network Namespace](https://man7.org/linux/man-pages/man7/network_namespaces.7.html) 有一个段简短的描述，在里面就列出了最主要的几部分资源，它们都是通过 Network Namespace 隔离的。\n\n我把这些资源给你做了一个梳理：\n\n* 第一种，网络设备，这里指的是 lo，eth0 等网络设备。你可以通过 ip link命令看到它们。\n\n* 第二种是 IPv4 和 IPv6 协议栈。从这里我们可以知道，IP 层以及上面的 TCP 和 UDP 协议栈也是每个 Namespace 独立工作的。所以 IP、TCP、UDP 的很多协议，它们的相关参数也是每个 Namespace 独立的，这些参数大多数都在 /proc/sys/net/ 目录下面，同时也包括了 TCP 和 UDP 的 port 资源。\n* 第三种，IP 路由表，这个资源也是比较好理解的，你可以在不同的 Network Namespace 运行 ip route 命令，就能看到不同的路由表了。\n* 第四种是防火墙规则，其实这里说的就是 iptables 规则了，每个 Namespace 里都可以独立配置 iptables 规则。\n* 最后一种是网络的状态信息，这些信息你可以从 /proc/net 和 /sys/class/net 里得到，这里的状态基本上包括了前面 4 种资源的的状态信息。\n\n<br>\n\n再结合前面笔记，关于协议栈部分的介绍。network namesapces 隔离了 网络设备，网络参数，协议栈配置等等信息。这样network namespaces里面的包发送自然会受到限制，从而达到了隔离的作用。\n\n### 3. 
不同namespaces之间是如何通信的\n\nnamespaces的隔离效果很直接：一个新的namespace里什么网络设备都没有，只有一个lo，只能访问自己。\n\n通信就需要网络设备了，这里介绍一下docker常用的veth pair和bridge。\n\n#### 3.1 创建network namespace\n\n参考：https://www.cnblogs.com/bakari/p/10443484.html\n\n（1）创建namespaces\n\n（2）每个 namespace 在创建的时候会自动创建一个回环接口 lo ，默认不启用，可以通过 ip link set lo up 启用\n\n```\n// 1.创建namespaces\nroot@k8s-node:~#ip netns add netns1\n\nroot@k8s-node:~# ip netns ls\nnetns1\n\nroot@k8s-node:~# ip netns  exec netns1 ip addr\n1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000\n    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00\n\n// 2.每个 namespace 在创建的时候会自动创建一个回环接口 lo ，默认不启用，可以通过 ip link set lo up 启用。\nroot@k8s-node:~# ip netns  exec netns1 bash\nroot@k8s-node:~# ping www.baidu.com\nping: www.baidu.com: Temporary failure in name resolution\nroot@k8s-node:~# ping 127.0.0.1\nconnect: Network is unreachable\nroot@k8s-node:~# ip link set lo up\nroot@k8s-node:~# ping 127.0.0.1\nPING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.\n64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.021 ms\n64 bytes from 127.0.0.1: icmp_seq=2 ttl=64 time=0.016 ms\n64 bytes from 127.0.0.1: icmp_seq=3 ttl=64 time=0.018 ms\n64 bytes from 127.0.0.1: icmp_seq=4 ttl=64 time=0.027 ms\n^Z\n[1]+  Stopped                 ping 127.0.0.1\n```\n\n#### 3.2 两个networknamespaces之间的通信\n\n（1）再创建一个namespace\n\n```\nroot@k8s-node:~#  ip netns add netns0\nroot@k8s-node:~# ip netns ls\nnetns0\nnetns1\n```\n\n（2）生成一对veth pair\n\n```\nroot@k8s-node:~# ip link add type veth\n\nroot@k8s-node:~# ip link\n42: veth0@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n    link/ether 62:bb:3c:a3:ac:31 brd ff:ff:ff:ff:ff:ff\n43: veth1@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n    link/ether 32:9d:76:89:b4:7a brd ff:ff:ff:ff:ff:ff\n```\n\n（3）启动 veth pair 设备\n\n```\n// 进入netns0将veth0设备启动\nroot@k8s-node:~# ip netns exec netns0 ip link set veth0 up\n\n// 
查看已经开启了，UP\nroot@k8s-node:~# ip netns exec netns0 ip addr\n1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000\n    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00\n42: veth0@if43: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN group default qlen 1000\n    link/ether 62:bb:3c:a3:ac:31 brd ff:ff:ff:ff:ff:ff link-netns netns1\n    \n// 进入netns1将veth1设备启动\nroot@k8s-node:~# ip netns exec netns1 ip link set veth1 up\nroot@k8s-node:~# ip netns exec netns1 ip addr\n1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000\n    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00\n    inet 127.0.0.1/8 scope host lo\n       valid_lft forever preferred_lft forever\n    inet6 ::1/128 scope host \n       valid_lft forever preferred_lft forever\n43: veth1@if42: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000\n    link/ether 32:9d:76:89:b4:7a brd ff:ff:ff:ff:ff:ff link-netns netns0\n    inet6 fe80::309d:76ff:fe89:b47a/64 scope link \n       valid_lft forever preferred_lft forever\n```\n\n(4) 给veth pair网卡配置IP\n\n```\nveth0 对应 netns0，配置IP 10.1.1.1/24\nroot@k8s-node:~#  ip netns exec netns0 ip addr add 10.1.1.1/24 dev veth0\n\nveth1 对应 netns1，配置IP 10.1.1.2/24\nroot@k8s-node:~# ip netns exec netns1 ip addr add 10.1.1.2/24 dev veth1\n\nnetns0 ping veth1对应的IP可通\nroot@k8s-node:~#  ip netns exec netns0 ping 10.1.1.2\nPING 10.1.1.2 (10.1.1.2) 56(84) bytes of data.\n64 bytes from 10.1.1.2: icmp_seq=1 ttl=64 time=0.038 ms\n64 bytes from 10.1.1.2: icmp_seq=2 ttl=64 time=0.022 ms\n64 bytes from 10.1.1.2: icmp_seq=3 ttl=64 time=0.023 ms\n^Z\n[1]+  Stopped                 ip netns exec netns0 ping 10.1.1.2\n```\n\n### 4. 
多个namespaces之间的通信\n\n参考：https://www.cnblogs.com/bakari/p/10443484.html\n\n2 个 namespace 之间通信可以借助 `veth pair` ，多个 namespace 之间的通信则可以使用 bridge 来转接，不然每两个 namespace 都去配 `veth pair` 将会是一件麻烦的事。下面就看看如何使用 bridge 来转接。\n\n拓扑图如下：\n\n![image-20220327163813484](../images/docker-net-1.png)\n\n#### 4.1 创建3个namespaces\n\n```\nroot@k8s-node:~# ip netns add net0\nroot@k8s-node:~# ip netns add net1\nroot@k8s-node:~# ip netns add net2\n```\n\n#### 4.2 创建bridge\n\n```\nroot@k8s-node:~# ip link add br0 type bridge\n// 开启这个网络设备\nroot@k8s-node:~# ip link set dev br0 up\n\nroot@k8s-node:~# ip link \n\n57: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000\n    link/ether 5e:3c:aa:99:dc:09 brd ff:ff:ff:ff:ff:ff\n```\n\n####  4.3 创建 veth pair\n\n```\n//（1）创建 3 个 veth pair\n# ip link add type veth\n# ip link add type veth\n# ip link add type veth\n\n\nveth是递增的，所以这三对是 23， 45， 67这三对\nroot@k8s-node:~# ip link\n44: veth2@veth3: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n    link/ether 4e:11:eb:21:3a:16 brd ff:ff:ff:ff:ff:ff\n45: veth3@veth2: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n    link/ether 1a:74:4a:dd:98:2d brd ff:ff:ff:ff:ff:ff\n46: veth4@veth5: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n    link/ether 1e:f5:74:3f:ae:00 brd ff:ff:ff:ff:ff:ff\n47: veth5@veth4: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n    link/ether 36:a6:e6:d8:49:53 brd ff:ff:ff:ff:ff:ff\n48: veth8625ade0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default \n    link/ether f6:57:1a:a5:65:f7 brd ff:ff:ff:ff:ff:ff link-netnsid 0\n55: veth6@veth7: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n    link/ether 5a:37:5e:74:01:0f brd 
ff:ff:ff:ff:ff:ff\n56: veth7@veth6: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000\n    link/ether fe:9c:e3:75:6c:23 brd ff:ff:ff:ff:ff:ff\n```\n\n#### 4.4 将 veth pair 的一头挂到 namespace 中，一头挂到 bridge 上，并设 IP 地址\n\n```\n// 配置第一个ns 和 bridge\n// 将veth2 挂到 net0 这个命名空间下\nroot@k8s-node:~# ip link set dev veth2 netns net0\n\n// 将namespaces ip link看到的veth2 改名为eth0\nroot@k8s-node:~# ip netns exec net0 ip link set dev veth2 name eth0\n\n// 设置ip 10.0.1.1/24\nroot@k8s-node:~# ip netns exec net0 ip addr add 10.0.1.1/24 dev eth0\n\n// 开启网络设备eth0，其实就是veth2\nroot@k8s-node:~# ip netns exec net0 ip link set dev eth0 up\n\n// 将veth3 挂到br0网桥上\nroot@k8s-node:~# ip link set dev veth3 master br0\n\n// 开启bridge的网络设备 veth3\nroot@k8s-node:~# ip link set dev veth3 up\n\n\n\n// 配置第 2 个 net1\n# ip link set dev veth4 netns net1\n# ip netns exec net1 ip link set dev veth4 name eth0\n# ip netns exec net1 ip addr add 10.0.1.2/24 dev eth0\n# ip netns exec net1 ip link set dev eth0 up\n#\n# ip link set dev veth5 master br0\n# ip link set dev veth5 up\n\n\n\n// 配置第 3 个 net2 (这里我配错了一个，重新又生成了1对，所以是veth0,veth1)\n# ip link set dev veth0 netns net2\n# ip netns exec net2 ip link set dev veth0 name eth0\n// 注意net2要配一个新的地址10.0.1.3，不能和net1的10.0.1.2重复\n# ip netns exec net2 ip addr add 10.0.1.3/24 dev eth0\n# ip netns exec net2 ip link set dev eth0 up\n#\n# ip link set dev veth1 master br0\n# ip link set dev veth1 up\n```\n\n#### 4.5 验证多Namespaces互通\n\n这样配置之后，却发现ping不通，经查阅 [参见](https://segmentfault.com/q/1010000010011053/a-1020000010025650)，发现：\n\n> 原因是因为系统为bridge开启了iptables功能，导致所有经过br0的数据包都要受iptables里面规则的限制，而docker为了安全性（我的系统安装了 docker），将iptables里面filter表的FORWARD链的默认策略设置成了drop，于是所有不符合docker规则的数据包都不会被forward，导致你这种情况ping不通。\n>\n> 解决办法有两个，二选一：\n>\n> 1. 关闭系统bridge的iptables功能，这样数据包转发就不受iptables影响了：echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables\n> 2. 
为br0添加一条iptables规则，让经过br0的包能被forward：iptables -A FORWARD -i br0 -j ACCEPT\n>\n> 第一种方法不确定会不会影响docker，建议用第二种方法。\n\n我采用第二种方法解决：\n\n`iptables -A FORWARD -i br0 -j ACCEPT`\n\n\n\n```\nroot@k8s-node:~# ip netns exec net0 ping -c 2 10.0.1.2\nPING 10.0.1.2 (10.0.1.2) 56(84) bytes of data.\n\n\n^X\n^Z\n[2]+  Stopped                 ip netns exec net0 ping -c 2 10.0.1.2\n\n\nroot@k8s-node:~# iptables -A FORWARD -i br0 -j ACCEPT\nroot@k8s-node:~# \n\n\nroot@k8s-node:~# ip netns exec net0 ping -c 2 10.0.1.2\nPING 10.0.1.2 (10.0.1.2) 56(84) bytes of data.\n64 bytes from 10.0.1.2: icmp_seq=1 ttl=64 time=0.061 ms\n64 bytes from 10.0.1.2: icmp_seq=2 ttl=64 time=0.036 ms\n\n--- 10.0.1.2 ping statistics ---\n2 packets transmitted, 2 received, 0% packet loss, time 12ms\nrtt min/avg/max/mdev = 0.036/0.048/0.061/0.014 ms\n\n\n```\n\n<br>\n\n### 5. 补充\n\n#### 5.1 如何查看容器内和 宿主的 veth pair对\n\n可以参考：https://blog.csdn.net/u011563903/article/details/88593251\n\n"
  },
  {
    "path": "k8s/cni/4.k8s pod通信原理介绍.md",
    "content": "* [1\\. 目标](#1-目标)\n* [2\\. 通信原理](#2-通信原理)\n  * [2\\.1 同一个Pod内部不同容器之间的通信](#21-同一个pod内部不同容器之间的通信)\n  * [2\\.2 同一个节点上不同pod之间的通信原理](#22-同一个节点上不同pod之间的通信原理)\n  * [2\\.3 不同节点之间的Pod通信原理](#23-不同节点之间的pod通信原理)\n  * [2\\.4 K8S集群内部访问服务的原理](#24-k8s集群内部访问服务的原理)\n    * [2\\.4\\.1 clusterIp介绍](#241-clusterip介绍)\n    * [2\\.4\\.2 clusterIp原理说明](#242-clusterip原理说明)\n  * [2\\.5 K8S集群外部部访问服务的原理](#25-k8s集群外部部访问服务的原理)\n    * [2\\.5\\.1 LoadBalancer](#251-loadbalancer)\n    * [2\\.5\\.2 NodePort](#252-nodeport)\n* [2\\.6 ingress](#26-ingress)\n\n### 1. 目标\n\n了解以下情况下k8s集群内部网络的通信原理\n\n（1）同一个Pod内部不同容器之间的通信原理\n\n（2）同一个节点上不同pod之间的通信原理\n\n（3）不同节点之间的Pod通信原理\n\n（4）K8S集群内部访问服务的原理\n\n（5）K8S集群外部部访问服务的原理\n\n<br>\n\n### 2. 通信原理\n\n#### 2.1 同一个Pod内部不同容器之间的通信\n\nPods中的多个container共享一个网络栈，每个pod都是有一个pause容器，就是sandbox。所有的业务容器都是加入了这个namespaces。所以他们可以直接通过localhost通信。\n\n#### 2.2 同一个节点上不同pod之间的通信原理\n\n这里可以先了解一下基础知识：\n\n* 不同network namespace之间可以通过veth pair来互相通信\n* 多个network namespace之间可以通过 bridge 来通信\n\n可以参考：https://www.cnblogs.com/bakari/p/10443484.html\n\n而这个bridge就是docker0。在pods的namespace中，pods的虚拟网络接口为veth0；在宿主机上，物理网络的网络接口为eth0。docker bridge作为veth0的默认网关，用于和宿主机网络的通信。\n\n所有pods的veth0所能分配的IP是一个独立的IP地址范围，来自于创建cluster时候kubeadm的--pod-network-cidr参数设定的CIDR，这里看起来像是172.17.0.0/24，是一个B类局域网IP地址段；所有宿主机的网络接口eth0所能分配的IP是实际物理网络的设定，一般来自于实际物理网络中的路由器通过DHCP分配的，这里看起来是10.100.0.0/24，是一个A类局域网IP地址段。\n\n![image-20220319175922049](../images/cni-1.png)\n\n#### 2.3 不同节点之间的Pod通信原理\n\n上面2个其实是docker做的工作。在k8s层，我们部署的往往需要flannel或者calico来打通网络。\n\n下面的docker0的名字被改成了cbr0，意思是custom bridge。由此，如果左侧的pod想访问右侧的pod，则IP包会通过bridge cbr0来到左侧宿主机的eth0，然后查询宿主机上新增的路由信息，继而将IP包送往右侧的宿主机的eth0，继而再送往右侧的bridge cbr0，最后送往右侧的pod。\n\n![image-20220319205533302](../images/cni-2.png)\n\n<br>\n\n以flannel为例子, flannel为pod分配Ip，并且设置路由。所以一个pod的请求达到docker后会被flannel接收，然后进行转发。\n\nFlannel 是 CoreOS 团队针对 Kubernetes 设计的一个网络规划实现。简单来说，它的功能有以下几点：\n\n1、使集群中的不同 Node 主机创建的 Docker 容器都具有全集群唯一的虚拟 IP 地址；\n\n2、建立一个覆盖网络（overlay 
network），这个覆盖网络会将数据包原封不动地传递到目标容器中。覆盖网络是建立在另一个网络之上并由其基础设施支持的虚拟网络。覆盖网络通过将一个分组封装在另一个分组内来将网络服务与底层基础设施分离。在将封装的数据包转发到端点后，将其解封装；\n\n3、创建一个新的虚拟网卡 flannel0 接收 docker 网桥的数据，通过维护路由表，对接收到的数据进行封包和转发（VXLAN）；\n\n4、路由信息一般存放到 etcd 中：多个 Node 上的 Flanneld 依赖一个 etcd cluster 来做集中配置服务，etcd 保证了所有 Node 上 Flannel 所看到的配置是一致的。同时每个 Node 上的 Flannel 都可以监听 etcd 上的数据变化，实时感知集群中 Node 的变化；\n\n5、Flannel 首先会在 Node 上创建一个名为 flannel0 的网桥（VXLAN 类型的设备），并且在每个 Node 上运行一个名为 Flanneld 的代理。每个 Node 上的 Flannel 代理会从 etcd 上为当前 Node 申请一个 CIDR 地址块用来给该 Node 上的 Pod 分配地址；\n\n6、Flannel 致力于给 Kubernetes 集群中的 Node 提供一个三层网络，它并不控制 Node 中的容器是如何进行组网的，仅仅关心流量如何在 Node 之间流转。\n\n\n\n![flannel](../images/cni-3.png)\n\n#### 2.4 K8S集群内部访问服务的原理\n\n正式业务下，Pod可能会被重启或者因为其他原因重建，IP会发生变化；而且一个服务往往由很多个pod提供，客户端怎么确定该访问哪个pod？\n\n这个时候就有了service的概念。在集群内部访问一般使用 headless 或 clusterIP 类型的service。\n\n##### 2.4.1 clusterIp介绍\n\n以clusterIP为例，创建完之后查看svc就有一个 CLUSTER-IP。\n\n这样在集群内的容器或节点上都能够访问Service\n\n```\napiVersion: v1\nkind: Service\nmetadata:\n  labels:\n    app: nginx\n  name: nginx-clusterip\nspec:\n  ports:\n  - name: service0\n    port: 8080                # 访问Service的端口\n    protocol: TCP             # 访问Service的协议，支持TCP和UDP\n    targetPort: 80            # Service访问目标容器的端口，此端口与容器中运行的应用强相关，如本例中nginx镜像默认使用80端口\n  selector:                   # 标签选择器，Service通过标签选择Pod，将访问Service的流量转发给Pod，此处选择带有 app:nginx 标签的Pod\n    app: nginx\n  type: ClusterIP             # Service的类型，ClusterIP表示在集群内访问\n  \n  \n# kubectl get svc\nNAME              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE\nnginx-clusterip   ClusterIP   10.247.74.52   <none>        8080/TCP   14m\n```\n\n<br>\n\nheadless svc是一种特殊的clusterIP类型，它在定义的时候指定了 clusterIP=None。例如：\n\n```\napiVersion: v1\nkind: Service\nmetadata:\n  name: nginx\n  labels:\n    app: nginx\nspec:\n  ports:\n  - port: 80\n    name: nginx-web\n  # clusterIP 设置为 None\n  clusterIP: None\n  selector:\n    app: nginx\n```\n\n`Headless Service`其实就是没有ClusterIP的`Service`。使用场景如下：\n\n- client感知到svc的所有endpoint, 通过查询dns自主选择访问哪个后端\n- `Headless 
Service`对应的每一个`Endpoints`，即每一个`Pod`，都会有对应的`DNS`域名；这样`Pod`之间就可以互相访问。StatefulSets就是使用了headless service\n\n##### 2.4.2 clusterIp原理说明\n\n到了svc这层就需要额外的控制器来处理，社区常见的就是 kube-proxy。这里只是简单说一下原理。\n\nkube-proxy在每个节点上监听svc, ep, pod资源的变化，然后通过iptables来控制访问svc的时候，具体访问哪个pod。\n\niptables可以实现负载均衡：比如通过**--probability**参数设置概率来实现。\n\n可以参考：https://blog.csdn.net/ksj367043706/article/details/89764546\n\n比如： clusterip=10.247.74.52 的svc有两个Pod, kube-proxy会设置每个节点上iptables的规则。\n\n访问ip=10.247.74.52时，第一条规则以50%的概率转发到podA；未命中时落到下一条规则，以100%的概率转发到podB。\n\n<br>\n\n#### 2.5 K8S集群外部访问服务的原理\n\n##### 2.5.1 LoadBalancer\n\n负载均衡( LoadBalancer )可以通过弹性负载均衡从公网访问到工作负载，与弹性IP方式相比提供了高可靠的保障，一般用于系统中需要暴露到公网的服务。\n\n到这里为止其实和k8s的关系不是很大了。一般由各云厂商的网络服务提供LoadBalancer，在定义yaml时指定subnet-id、vpc等信息申请LoadBalancer ip，之后外部直接访问这个ip即可（后面的负载均衡等工作，云厂商的网络服务已经完成了）\n\n```\napiVersion: v1 \nkind: Service \nmetadata: \n  annotations:   \n    kubernetes.io/elb.pass-through: \"true\"\n    kubernetes.io/elb.class: union\n    kubernetes.io/session-affinity-mode: SOURCE_IP\n    kubernetes.io/elb.subnet-id: a9cf6d24-ad43-4f75-94d1-4e0e0464afac\n    kubernetes.io/elb.autocreate: '{\"type\":\"public\",\"bandwidth_name\":\"cce-bandwidth\",\"bandwidth_chargemode\":\"bandwidth\",\"bandwidth_size\":5,\"bandwidth_sharetype\":\"PER\",\"eip_type\":\"5_bgp\",\"name\":\"james\"}'\n  labels: \n    app: nginx \n  name: nginx \nspec: \n  externalTrafficPolicy: Local\n  ports: \n  - name: service0 \n    port: 80\n    protocol: TCP \n    targetPort: 80\n  selector: \n    app: nginx \n  type: LoadBalancer\n```\n\n<br>\n\n##### 2.5.2 NodePort\n\n节点访问 ( NodePort )是指在每个节点的IP上开放一个静态端口，通过静态端口对外暴露服务。节点访问 ( NodePort )会路由到ClusterIP服务，这个ClusterIP服务会自动创建。通过请求 <NodeIP>:<NodePort>，可以从集群的外部访问一个NodePort服务。\n\n部署yaml如下所示：\n\n```\napiVersion: v1\nkind: Service\nmetadata:\n  labels:\n    app: nginx\n  name: nginx-nodeport\nspec:\n  ports:\n  - name: service\n    nodePort: 30000     # 节点端口，取值范围为30000-32767\n    port: 8080          # 访问Service的端口\n    protocol: TCP       # 
访问Service的协议，支持TCP和UDP\n    targetPort: 80      # Service访问目标容器的端口，此端口与容器中运行的应用强相关，如本例中nginx镜像默认使用80端口\n  selector:             # 标签选择器，Service通过标签选择Pod，将访问Service的流量转发给Pod，此处选择带有 app:nginx 标签的Pod\n    app: nginx\n  type: NodePort        # Service的类型，NodePort表示通过节点端口访问\n```\n\nNodePort的核心实现也非常简单：外部客户端访问节点上的某个port，iptables将其转发到对应的服务，这样就将外部访问变成了内部访问。\n\n### 2.6 ingress\n\nkubernetes提供了Ingress资源对象，Ingress只需要一个NodePort或者一个LB就可以满足暴露多个Service的需求，可以做到7层负载均衡。这里只是简单说一下，目前还没有研究ingress，后面再补充。"
  },
  {
    "path": "k8s/cni/5. k8s 容器网络接口介绍.md",
    "content": "* [1\\. 背景](#1-背景)\n* [2\\. Kubelet cni介绍](#2-kubelet-cni介绍)\n  * [2\\.1 kubelet cni相关启动参数介绍](#21-kubelet-cni相关启动参数介绍)\n  * [2\\.2 kubelet 调用cni分配ip的流程](#22-kubelet-调用cni分配ip的流程)\n    * [2\\.2\\.1 核心interface](#221-核心interface)\n    * [2\\.2\\.2 kubelet初始化cni](#222-kubelet初始化cni)\n    * [2\\.2\\.3 kubelet 分配ip](#223-kubelet-分配ip)\n      * [plugin\\.addToNetwork](#pluginaddtonetwork)\n      * [cniNet\\.AddNetworkList](#cninetaddnetworklist)\n      * [ExecPluginWithResult](#execpluginwithresult)\n* [3\\. 总结](#3-总结)\n\n### 1. 背景\n\n在之前kubelet创建pod流程的分析过程中，kubelet 创建Pod 的第一步，就是创建并启动一个 Infra 容器，用来“hold”住这个 Pod 的 Network Namespace。\n\nkubelete直接是调用了SetUpPod这个函数来设置网络。这背后其实是做了很多工作的，这些工作被kubelet弄成了一个cni接口。\n\n```\n\terr = ds.network.SetUpPod(config.GetMetadata().Namespace, config.GetMetadata().Name, cID, config.Annotations, networkOptions)\n```\n\n这样设计的目的就是可插拔，不同的厂商或者使用者，只要实现了cni接口就可以 使用自定义的网络模式。\n\n本文就是梳理一下，kubelet中cni是如何定义的，pod创建和删除过程中cni是如何工作的。为后面的自定义cni打一个基础。\n\n<br>\n\n### 2. 
Kubelet cni介绍\n\n#### 2.1 kubelet cni相关启动参数介绍\n\n（1）network-plugin\n\n指定要使用的网络插件类型，可选值cni、kubenet、\"\"。默认为空串，代表Noop，即不配置网络插件（不构建pod网络）\n\n**kubenet**： Kubenet 是一个非常基本的、简单的网络插件，仅适用于 Linux。 它本身并不实现更高级的功能，如跨节点网络或网络策略。 它通常与云驱动一起使用，云驱动为节点间或单节点环境中的通信设置路由规则。\n\nKubenet 创建名为 `cbr0` 的网桥，并为每个 pod 创建了一个 veth 对， 每个 Pod 的主机端都连接到 `cbr0`。 这个 veth 对的 Pod 端会被分配一个 IP 地址，该 IP 地址隶属于节点所被分配的 IP 地址范围内。节点的 IP 地址范围则通过配置或控制器管理器来设置。 `cbr0` 被分配一个 MTU，该 MTU 匹配主机上已启用的正常接口的最小 MTU。\n\n**cni**：通过给 Kubelet 传递 `--network-plugin=cni` 命令行选项可以选择 CNI 插件。 Kubelet 从 `--cni-conf-dir` （默认是 `/etc/cni/net.d`） 读取文件并使用 该文件中的 CNI 配置来设置各个 Pod 的网络。 CNI 配置文件必须与 [CNI 规约](https://github.com/containernetworking/cni/blob/master/SPEC.md#network-configuration) 匹配，并且配置所引用的所有所需的 CNI 插件都应存在于 `--cni-bin-dir`（默认是 `/opt/cni/bin`）下。\n\n如果这个目录中有多个 CNI 配置文件，kubelet 将会使用按文件名的字典顺序排列 的第一个作为配置文件。\n\n除了配置文件指定的 CNI 插件外，Kubernetes 还需要标准的 CNI [`lo`](https://github.com/containernetworking/plugins/blob/master/plugins/main/loopback/loopback.go) 插件，最低版本是0.2.0。\n\n这部分更多信息详见：https://kubernetes.io/zh/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/\n\n（2）--cni-conf-dir：CNI 配置文件所在路径。默认值：/etc/cni/net.d。 （和第一个参数配合使用）\n\n（3）--cni-bin-dir：CNI 插件的可执行文件所在路径，kubelet 将在此路径中查找 CNI 插件的可执行文件来执行pod的网络操作。默认值：/opt/cni/bin  （和第一个参数配合使用）\n\n#### 2.2 kubelet 调用cni分配ip的流程\n\n##### 2.2.1 核心interface\n\n这里从ds.network.SetUpPod函数开始，SetUpPod主要调用了pm.plugin.SetUpPod。\n\n```\nfunc (pm *PluginManager) SetUpPod(podNamespace, podName string, id kubecontainer.ContainerID, annotations, options map[string]string) error {\n\tdefer recordOperation(\"set_up_pod\", time.Now())\n\tfullPodName := kubecontainer.BuildPodFullName(podName, podNamespace)\n\tpm.podLock(fullPodName).Lock()\n\tdefer pm.podUnlock(fullPodName)\n\n\tklog.V(3).Infof(\"Calling network plugin %s to set up pod %q\", pm.plugin.Name(), fullPodName)\n\tif err := pm.plugin.SetUpPod(podNamespace, podName, id, annotations, options); err != nil {\n\t\treturn fmt.Errorf(\"networkPlugin %s failed to set 
up pod %q network: %v\", pm.plugin.Name(), fullPodName, err)\n\t}\n\n\treturn nil\n}\n```\n\nNetworkPlugin interface声明了kubelet网络插件的一些操作方法，不同类型的网络插件只需要实现这些方法即可，其中最关键的就是SetUpPod与TearDownPod方法，作用分别是构建pod网络与销毁pod网络。\n\n```\n// NetworkPlugin is an interface to network plugins for the kubelet\ntype NetworkPlugin interface {\n\t// Init initializes the plugin.  This will be called exactly once\n\t// before any other methods are called.\n\tInit(host Host, hairpinMode kubeletconfig.HairpinMode, nonMasqueradeCIDR string, mtu int) error\n\n\t// Called on various events like:\n\t// NET_PLUGIN_EVENT_POD_CIDR_CHANGE\n\tEvent(name string, details map[string]interface{})\n\n\t// Name returns the plugin's name. This will be used when searching\n\t// for a plugin by name, e.g.\n\tName() string\n\n\t// Returns a set of NET_PLUGIN_CAPABILITY_*\n\tCapabilities() utilsets.Int\n\n\t// SetUpPod is the method called after the infra container of\n\t// the pod has been created but before the other containers of the\n\t// pod are launched.\n\tSetUpPod(namespace string, name string, podSandboxID kubecontainer.ContainerID, annotations, options map[string]string) error\n\n\t// TearDownPod is the method called before a pod's infra container will be deleted\n\tTearDownPod(namespace string, name string, podSandboxID kubecontainer.ContainerID) error\n\n\t// GetPodNetworkStatus is the method called to obtain the ipv4 or ipv6 addresses of the container\n\tGetPodNetworkStatus(namespace string, name string, podSandboxID kubecontainer.ContainerID) (*PodNetworkStatus, error)\n\n\t// Status returns error if the network plugin is in error state\n\tStatus() error\n}\n```\n\n这里我们针对cni进行分析，cniNetworkPlugin struct实现了NetworkPlugin interface，实现了SetUpPod与TearDownPod等方法。\n\n\t// pkg/kubelet/dockershim/network/cni/cni.go\n\ttype cniNetworkPlugin struct {\n\t\tnetwork.NoopNetworkPlugin\n\t  loNetwork *cniNetwork\n\t\n\t  sync.RWMutex\n\t  defaultNetwork *cniNetwork\n\t\n\t  host        network.Host\n\t  execer      
utilexec.Interface\n\t  nsenterPath string\n\t  confDir     string\n\t  binDirs     []string\n\t  cacheDir    string\n\t  podCidr     string\n\t}\n##### 2.2.2 kubelet初始化cni\n\n这里直接写调用链如下：\nmain （cmd/kubelet/kubelet.go）\n-> NewKubeletCommand （cmd/kubelet/app/server.go）\n-> Run （cmd/kubelet/app/server.go）\n-> run （cmd/kubelet/app/server.go）\n-> RunKubelet （cmd/kubelet/app/server.go）\n-> CreateAndInitKubelet（cmd/kubelet/app/server.go）\n-> kubelet.NewMainKubelet（pkg/kubelet/kubelet.go)\n-> cni.ProbeNetworkPlugins & network.InitNetworkPlugin（pkg/kubelet/network/plugins.go)\n<br>\n\n在cri的时候，如果是docker的话，调用dockershim.NewDockerService函数进行初始化：\n\n```\nswitch containerRuntime {\n\tcase kubetypes.DockerContainerRuntime:\n\t\t// Create and start the CRI shim running as a grpc server.\n\t\tstreamingConfig := getStreamingConfig(kubeCfg, kubeDeps, crOptions)\n\t\tds, err := dockershim.NewDockerService(kubeDeps.DockerClientConfig, crOptions.PodSandboxImage, streamingConfig,\n\t\t\t&pluginSettings, runtimeCgroups, kubeCfg.CgroupDriver, crOptions.DockershimRootDirectory, !crOptions.RedirectContainerStreaming, crOptions.NoJsonLogPath)\n\t\tif err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t\tif crOptions.RedirectContainerStreaming {\n\t\t\tklet.criHandler = ds\n\t\t}\n```\n\n<br>\n\n这里只关心cni相关的函数：\n\n1. 调用cni.ProbeNetworkPlugins，ProbeNetworkPlugins函数就是根据confDir,binDirs等配置，实例化一个cniNetworkPlugin结构体\n2. 调用InitNetworkPlugin初始化，调用的是 cniNetWorkPlugin.Init。 Init函数逻辑为：\n   * 调用platformInit执行nsenter命令，看是否可以进入ns\n   * 启动一个goroutine，每隔5秒，调用一次plugin.syncNetworkConfig，作用就是根据kubelet启动参数配置，去对应的cni conf文件夹下寻找cni配置文件，返回包含cni信息的cniNetwork结构体，赋值给cniNetworkPlugin结构体的defaultNetwork属性，从而达到cni conf以及bin更新后，kubelet也能感知并更新cniNetworkPlugin结构体的效果\n3. 
将上面步骤中获取到的cniNetworkPlugin结构体，赋值给dockerService struct的network属性，待后续创建pod、删除pod时可以调用cniNetworkPlugin的SetUpPod、TearDownPod方法来构建pod的网络、销毁pod的网络\n\n```\n// NewDockerService creates a new `DockerService` struct.\n// NOTE: Anything passed to DockerService should be eventually handled in another way when we switch to running the shim as a different process.\nfunc NewDockerService(config *ClientConfig, podSandboxImage string, streamingConfig *streaming.Config, pluginSettings *NetworkPluginSettings,\n\tcgroupsName string, kubeCgroupDriver string, dockershimRootDir string, startLocalStreamingServer bool, noJsonLogPath string) (DockerService, error) {\n\n\tclient := NewDockerClientFromConfig(config)\n\n\tc := libdocker.NewInstrumentedInterface(client)\n\n\tcheckpointManager, err := checkpointmanager.NewCheckpointManager(filepath.Join(dockershimRootDir, sandboxCheckpointDir))\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\tds := &dockerService{\n\t\tclient:          c,\n\t\tos:              kubecontainer.RealOS{},\n\t\tpodSandboxImage: podSandboxImage,\n\t\tstreamingRuntime: &streamingRuntime{\n\t\t\tclient:      client,\n\t\t\texecHandler: &NativeExecHandler{},\n\t\t},\n\t\tcontainerManager:          cm.NewContainerManager(cgroupsName, client),\n\t\tcheckpointManager:         checkpointManager,\n\t\tstartLocalStreamingServer: startLocalStreamingServer,\n\t\tnetworkReady:              make(map[string]bool),\n\t\tcontainerCleanupInfos:     make(map[string]*containerCleanupInfo),\n\t\tnoJsonLogPath:             noJsonLogPath,\n\t}\n\n\t// check docker version compatibility.\n\tif err = ds.checkVersionCompatibility(); err != nil {\n\t\treturn nil, err\n\t}\n\n\t// create streaming server if configured.\n\tif streamingConfig != nil {\n\t\tvar err error\n\t\tds.streamingServer, err = streaming.NewServer(*streamingConfig, ds.streamingRuntime)\n\t\tif err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n\n\t// Determine the hairpin mode.\n\tif err := 
effectiveHairpinMode(pluginSettings); err != nil {\n\t\t// This is a non-recoverable error. Returning it up the callstack will just\n\t\t// lead to retries of the same failure, so just fail hard.\n\t\treturn nil, err\n\t}\n\tklog.Infof(\"Hairpin mode set to %q\", pluginSettings.HairpinMode)\n  \n  \n  // 1.调用cni.ProbeNetworkPlugins，函数就是根据confDir,binDirs等配置，实例化一个cniNetworkPlugin结构体\n\t// dockershim currently only supports CNI plugins.\n\tpluginSettings.PluginBinDirs = cni.SplitDirs(pluginSettings.PluginBinDirString)\n\tcniPlugins := cni.ProbeNetworkPlugins(pluginSettings.PluginConfDir, pluginSettings.PluginCacheDir, pluginSettings.PluginBinDirs)\n\tcniPlugins = append(cniPlugins, kubenet.NewPlugin(pluginSettings.PluginBinDirs, pluginSettings.PluginCacheDir))\n\tnetHost := &dockerNetworkHost{\n\t\t&namespaceGetter{ds},\n\t\t&portMappingGetter{ds},\n\t}\n\t\n\t// 2.调用InitNetworkPlugin初始化，调用的是 cniNetWorkPlugin.Init\n\tplug, err := network.InitNetworkPlugin(cniPlugins, pluginSettings.PluginName, netHost, pluginSettings.HairpinMode, pluginSettings.NonMasqueradeCIDR, pluginSettings.MTU)\n\tif err != nil {\n\t\treturn nil, fmt.Errorf(\"didn't find compatible CNI plugin with given settings %+v: %v\", pluginSettings, err)\n\t}\n\t\n\t// 3.将上面步骤中获取到的cniNetworkPlugin结构体，赋值给dockerService struct的network属性，待后续创建pod、删除pod时可以调用cniNetworkPlugin的SetUpPod、TearDownPod方法来构建pod的网络、销毁pod的网络\n\tds.network = network.NewPluginManager(plug)\n\tklog.Infof(\"Docker cri networking managed by %v\", plug.Name())\n\n\t// NOTE: cgroup driver is only detectable in docker 1.11+\n\tcgroupDriver := defaultCgroupDriver\n\tdockerInfo, err := ds.client.Info()\n\tklog.Infof(\"Docker Info: %+v\", dockerInfo)\n\tif err != nil {\n\t\tklog.Errorf(\"Failed to execute Info() call to the Docker client: %v\", err)\n\t\tklog.Warningf(\"Falling back to use the default driver: %q\", cgroupDriver)\n\t} else if len(dockerInfo.CgroupDriver) == 0 {\n\t\tklog.Warningf(\"No cgroup driver is set in 
Docker\")\n\t\tklog.Warningf(\"Falling back to use the default driver: %q\", cgroupDriver)\n\t} else {\n\t\tcgroupDriver = dockerInfo.CgroupDriver\n\t}\n\tif len(kubeCgroupDriver) != 0 && kubeCgroupDriver != cgroupDriver {\n\t\treturn nil, fmt.Errorf(\"misconfiguration: kubelet cgroup driver: %q is different from docker cgroup driver: %q\", kubeCgroupDriver, cgroupDriver)\n\t}\n\tklog.Infof(\"Setting cgroupDriver to %s\", cgroupDriver)\n\tds.cgroupDriver = cgroupDriver\n\tds.versionCache = cache.NewObjectCache(\n\t\tfunc() (interface{}, error) {\n\t\t\treturn ds.getDockerVersion()\n\t\t},\n\t\tversionCacheTTL,\n\t)\n\n\t// Register prometheus metrics.\n\tmetrics.Register()\n\n\treturn ds, nil\n}\n\n\n\n// ProbeNetworkPlugins函数就是根据confDir,binDirs等配置，实例化一个cniNetworkPlugin结构体\n// ProbeNetworkPlugins : get the network plugin based on cni conf file and bin file\nfunc ProbeNetworkPlugins(confDir, cacheDir string, binDirs []string) []network.NetworkPlugin {\n\told := binDirs\n\tbinDirs = make([]string, 0, len(binDirs))\n\tfor _, dir := range old {\n\t\tif dir != \"\" {\n\t\t\tbinDirs = append(binDirs, dir)\n\t\t}\n\t}\n\n\tplugin := &cniNetworkPlugin{\n\t\tdefaultNetwork: nil,\n\t\tloNetwork:      getLoNetwork(binDirs),\n\t\texecer:         utilexec.New(),\n\t\tconfDir:        confDir,\n\t\tbinDirs:        binDirs,\n\t\tcacheDir:       cacheDir,\n\t}\n\n\t// sync NetworkConfig in best effort during probing.\n\tplugin.syncNetworkConfig()\n\treturn []network.NetworkPlugin{plugin}\n}\n\n// 这里只是调用一下nsenter命令，看是否可以进入ns\nfunc (plugin *cniNetworkPlugin) platformInit() error {\n\tvar err error\n\tplugin.nsenterPath, err = plugin.execer.LookPath(\"nsenter\")\n\tif err != nil {\n\t\treturn err\n\t}\n\treturn nil\n}\n\n// 启动一个goroutine，每隔5秒，调用一次plugin.syncNetworkConfig，作用就是根据kubelet启动参数配置，去对应的cni conf文件夹下寻找cni配置文件，返回包含cni信息的cniNetwork结构体，赋值给cniNetworkPlugin结构体的defaultNetwork属性，从而达到cni conf以及bin更新后，kubelet也能感知并更新cniNetworkPlugin结构体的效果。\nfunc (plugin *cniNetworkPlugin) Init(host 
network.Host, hairpinMode kubeletconfig.HairpinMode, nonMasqueradeCIDR string, mtu int) error {\n\terr := plugin.platformInit()\n\tif err != nil {\n\t\treturn err\n\t}\n\n\tplugin.host = host\n\n\tplugin.syncNetworkConfig()\n\n\t// start a goroutine to sync network config from confDir periodically to detect network config updates in every 5 seconds\n\tgo wait.Forever(plugin.syncNetworkConfig, defaultSyncConfigPeriod)\n\n\treturn nil\n}\n```\n\n<br>\n\n##### 2.2.3 kubelet 分配ip\n\n在kubelet创建pod的流程分析中，分配Pod ip的调用链路如下：\n\n-> klet.syncPod（pkg/kubelet/kubelet.go）\n-> kl.containerRuntime.SyncPod（pkg/kubelet/kubelet.go）\n-> m.createPodSandbox（pkg/kubelet/kuberuntime/kuberuntime_manager.go）\n-> m.runtimeService.RunPodSandbox （pkg/kubelet/kuberuntime/kuberuntime_sandbox.go）\n-> ds.network.SetUpPod（pkg/kubelet/dockershim/docker_sandbox.go）\n-> pm.plugin.SetUpPod（pkg/kubelet/dockershim/network/plugins.go)\n-> SetUpPod（pkg/kubelet/dockershim/network/cni/cni.go)\n\n这里直接从 SetUpPod 分析，看看kubelet是如何调用cni的。\n\ncniNetworkPlugin.SetUpPod方法是cni网络插件构建pod网络的调用入口。其主要逻辑为：\n\n（1）调用plugin.checkInitialized()：检查网络插件是否已经初始化完成，以及podCIDR是否已经设置等；\n\n（2）调用plugin.host.GetNetNS()：获取容器网络命名空间路径，格式为/proc/${容器PID}/ns/net；\n\n（3）调用context.WithTimeout()：设置调用cni网络插件的超时时间；\n\n（4）调用plugin.addToNetwork()：如果是linux环境，则调用cni网络插件，给pod构建回环网络；\n\n（5）调用plugin.addToNetwork()：调用cni网络插件，给pod构建默认网络。\n\n这里核心是plugin.addToNetwork函数，接着往下看\n\n```\n// pkg/kubelet/dockershim/network/cni/cni.go\nfunc (plugin *cniNetworkPlugin) SetUpPod(namespace string, name string, id kubecontainer.ContainerID, annotations, options map[string]string) error {\n\tif err := plugin.checkInitialized(); err != nil {\n\t\treturn err\n\t}\n\tnetnsPath, err := plugin.host.GetNetNS(id.ID)\n\tif err != nil {\n\t\treturn fmt.Errorf(\"CNI failed to retrieve network namespace path: %v\", err)\n\t}\n\n\t// Todo get the timeout from parent ctx\n\tcniTimeoutCtx, cancelFunc := context.WithTimeout(context.Background(), network.CNITimeoutSec*time.Second)\n\tdefer cancelFunc()\n\t// Windows doesn't have loNetwork. It comes only with Linux\n\tif plugin.loNetwork != nil {\n\t\tif _, err = plugin.addToNetwork(cniTimeoutCtx, plugin.loNetwork, name, namespace, id, netnsPath, annotations, options); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\n\t_, err = plugin.addToNetwork(cniTimeoutCtx, plugin.getDefaultNetwork(), name, namespace, id, netnsPath, annotations, options)\n\treturn err\n}\n```\n\n<br>\n\n###### plugin.addToNetwork\n\nplugin.addToNetwork方法的作用就是调用cni网络插件，给pod构建指定类型的网络，其主要逻辑为：\n\n（1）调用plugin.buildCNIRuntimeConf()：构建调用cni网络插件的配置，包括podCIDR、dns capability配置等\n\n（2）调用cniNet.AddNetworkList()：调用cni网络插件，进行网络构建\n\n这里核心是AddNetworkList函数，接着往下看\n\n```\nfunc (plugin *cniNetworkPlugin) addToNetwork(ctx context.Context, network *cniNetwork, podName string, podNamespace string, podSandboxID kubecontainer.ContainerID, podNetnsPath string, annotations, options map[string]string) (cnitypes.Result, error) {\n\trt, err := plugin.buildCNIRuntimeConf(podName, podNamespace, podSandboxID, podNetnsPath, annotations, options)\n\tif err != nil {\n\t\tklog.Errorf(\"Error adding network when building cni runtime conf: %v\", err)\n\t\treturn nil, err\n\t}\n\n\tpdesc := podDesc(podNamespace, podName, podSandboxID)\n\tnetConf, cniNet := network.NetworkConfig, network.CNIConfig\n\tklog.V(4).Infof(\"Adding %s to network %s/%s netns %q\", pdesc, netConf.Plugins[0].Network.Type, netConf.Name, podNetnsPath)\n\tres, err := cniNet.AddNetworkList(ctx, netConf, rt)\n\tif err != nil {\n\t\tklog.Errorf(\"Error adding %s to network %s/%s: %v\", pdesc, netConf.Plugins[0].Network.Type, netConf.Name, err)\n\t\treturn nil, err\n\t}\n\tklog.V(4).Infof(\"Added %s to network %s: %v\", pdesc, netConf.Name, res)\n\treturn res, nil\n}\n```\n\n###### 
cniNet.AddNetworkList\n\nAddNetworkList方法中主要是调用了addNetwork方法，所以来看下addNetwork方法的逻辑：\n\n（1）调用c.exec.FindInPath()：拼接出cni网络插件可执行文件的绝对路径；\n\n（2）调用buildOneConfig()：构建配置；\n\n（3）调用c.args()：构建调用cni网络插件的参数；\n\n（4）调用invoke.ExecPluginWithResult()：调用cni网络插件进行pod网络的构建操作。\n\n这里的核心就是ExecPluginWithResult，接着往下看\n\n```\n// AddNetworkList executes a sequence of plugins with the ADD command\nfunc (c *CNIConfig) AddNetworkList(ctx context.Context, list *NetworkConfigList, rt *RuntimeConf) (types.Result, error) {\n\tvar err error\n\tvar result types.Result\n\tfor _, net := range list.Plugins {\n\t\tresult, err = c.addNetwork(ctx, list.Name, list.CNIVersion, net, result, rt)\n\t\tif err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n\n\tif err = setCachedResult(result, list.Name, rt); err != nil {\n\t\treturn nil, fmt.Errorf(\"failed to set network %q cached result: %v\", list.Name, err)\n\t}\n\n\treturn result, nil\n}\n\n\n\nfunc (c *CNIConfig) addNetwork(ctx context.Context, name, cniVersion string, net *NetworkConfig, prevResult types.Result, rt *RuntimeConf) (types.Result, error) {\n\tc.ensureExec()\n\tpluginPath, err := c.exec.FindInPath(net.Network.Type, c.Path)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\tnewConf, err := buildOneConfig(name, cniVersion, net, prevResult, rt)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\treturn invoke.ExecPluginWithResult(ctx, pluginPath, newConf.Bytes, c.args(\"ADD\", rt), c.exec)\n}\n```\n\n<br>\n\n###### ExecPluginWithResult\n\ninvoke.ExecPluginWithResult主要是将调用参数变成env，然后调用cni网络插件可执行文件，并获取返回结果。\n\n```\nfunc ExecPluginWithResult(ctx context.Context, pluginPath string, netconf []byte, args CNIArgs, exec Exec) (types.Result, error) {\n   if exec == nil {\n      exec = defaultExec\n   }\n\n   stdoutBytes, err := exec.ExecPlugin(ctx, pluginPath, netconf, args.AsEnv())\n   if err != nil {\n      return nil, err\n   }\n\n   // Plugin must return result in same version as specified in netconf\n   versionDecoder := &version.ConfigDecoder{}\n   
confVersion, err := versionDecoder.Decode(netconf)\n   if err != nil {\n      return nil, err\n   }\n\n   return version.NewResult(confVersion, stdoutBytes)\n}\n```\n\n<br>\n\nc.args方法的作用是构建调用cni网络插件可执行文件时的参数。\n\n从代码中可以看出，参数有Command（命令，ADD代表构建网络，DEL代表销毁网络）、ContainerID（容器ID）、NetNS（容器网络命名空间路径）、IfName（Interface Name即网络接口名称）、PluginArgs（其他参数如pod名称、pod命名空间等）等。\n\n```\nfunc (c *CNIConfig) args(action string, rt *RuntimeConf) *invoke.Args {\n\treturn &invoke.Args{\n\t\tCommand:     action,\n\t\tContainerID: rt.ContainerID,\n\t\tNetNS:       rt.NetNS,\n\t\tPluginArgs:  rt.Args,\n\t\tIfName:      rt.IfName,\n\t\tPath:        strings.Join(c.Path, string(os.PathListSeparator)),\n\t}\n}\n```\n\n\n\n### 3. 总结\n\n总的来说，kubelet中的cni部分就是封装了统一的调用接口，然后根据配置调用cni二进制为pod生成网络（包括pod ip、mac地址、MTU等设置）。\n\n\n"
  },
  {
    "path": "k8s/cni/6.如何订制自己的cni.md",
    "content": "* [1\\. 背景](#1-背景)\n* [2\\. conf如何配置](#2-conf如何配置)\n* [3\\. cni插件如何实现](#3-cni插件如何实现)\n  * [3\\.1 摘抄部分](#31-摘抄部分)\n  * [3\\.2 原创部分](#32-原创部分)\n* [4\\. 参考](#4-参考)\n\n### 1. 背景\n\nCNI的定义可以参照[官方文档](https://github.com/containernetworking/cni)，这里不详细介绍。\n\nCNI插件是由kubelet加载和运行，具体的目录和配置可以由参数`--network-plugin --cni-conf-dir --cni-bin-dir`指定。\n\n参数必须是 --network-plugin = cni， --cni-bin-dir 里面放的是自定义cni的二进制文件。 --cni-conf-dir 是配置文件。\n\n以calico为例：\n\n```\n# CNI和IPAM的二进制文件\n# ls /opt/cni/bin/\ncalico  calico-ipam  loopback\n\n# CNI的配置文件\n# ls /etc/cni/net.d/\n10-calico.conf  calico-kubeconfig\n```\n\n<br>\n\n可以看到关键在于2点：\n\n（1）conf如何配置\n\n（2）二进制代码需要如何实现\n\n### 2. conf如何配置\n\n一般来说，CNI 插件需要在集群的每个节点上运行，在 CNI 的规范里面，实现一个 CNI 插件首先需要一个 JSON 格式的配置文件，配置文件需要放到每个节点的 `/etc/cni/net.d/` 目录，一般命名为 `<数字>-<CNI-plugin>.conf`，而且配置文件至少需要以下几个必须的字段：\n\n1. `cniVersion`: CNI 插件的字符串版本号，要求符合  [Semantic Version 2.0 规范](https://semver.org/)\n2. `name`: 字符串形式的网络名；\n3. `type`: 字符串表示的 CNI 插件的可运行文件；\n\n除此之外，我们也可以增加一些自定义的配置字段，用于传递参数给 CNI 插件，这些配置会在运行时传递给 CNI 插件。在我们的例子里面，需要配置每个宿主机网桥的设备名、网络设备的最大传输单元(MTU)以及每个节点分配的 24 位子网地址，因此，我们的 CNI 插件的配置看起来会像下面这样：\n\n```\n{\n    \"cniVersion\": \"0.1.0\",\n    \"name\": \"minicni\",\n    \"type\": \"minicni\",\n    \"bridge\": \"minicni0\",\n    \"mtu\": 1500,\n    \"subnet\": __NODE_SUBNET__\n}\n```\n\nNote: 确保配置文件放到 `/etc/cni/net.d/` 目录，kubelet 默认此目录寻找 CNI 插件配置；并且，插件的配置可以分为多个插件链的形式来运行，但是为了简单起见，在我们的例子中，只配置一个独立的 CNI 插件，因为配置文件的后缀名为 `.conf`。\n\n### 3. cni插件如何实现\n\n#### 3.1 摘抄部分\n\n本节摘抄自：https://jishuin.proginn.com/p/763bfbd57bc0\n\n接下来就开始看怎么实现 CNI 插件来管理 pod IP 地址以及配置容器网络设备。在此之前，我们需要明确的是，CNI 介入的时机是 kubelet 创建 pause 容器创建对应的网络命名空间之后，同时当 CNI 插件被调用的时候，kubelet 会将相关操作命令以及参数通过环境变量的形式传递给它。这些环境变量包括：\n\n1. `CNI_COMMAND`: CNI 操作命令，包括 ADD, DEL, CHECK 以及 VERSION\n2. `CNI_CONTAINERID`: 容器 ID\n3. `CNI_NETNS`: pod 网络命名空间\n4. `CNI_IFNAME`: pod 网络设备名称\n5. `CNI_PATH`: CNI 插件可执行文件的搜索路径\n6. 
`CNI_ARGS`: 可选的其他参数，形式类似于 `key1=value1,key2=value2...`\n\n在运行时，kubelet 通过 CNI 配置文件寻找 CNI 可执行文件，然后基于上述几个环境变量来执行相关的操作。CNI 插件必须支持的操作包括：\n\n1. ADD: 将 pod 加入到 pod 网络中\n2. DEL: 将 pod 从 pod 网络中删除\n3. CHECK: 检查 pod 网络配置是否正常\n4. VERSION: 返回可选 CNI 插件的版本信息\n\n```\nfunc main() {\n cmd, cmdArgs, err := args.GetArgsFromEnv()\n if err != nil {\n  fmt.Fprintf(os.Stderr, \"getting cmd arguments with error: %v\", err)\n }\n\n fh := handler.NewFileHandler(IPStore)\n\n switch cmd {\n case \"ADD\":\n  err = fh.HandleAdd(cmdArgs)\n case \"DEL\":\n  err = fh.HandleDel(cmdArgs)\n case \"CHECK\":\n  err = fh.HandleCheck(cmdArgs)\n case \"VERSION\":\n  err = fh.HandleVersion(cmdArgs)\n default:\n  err = fmt.Errorf(\"unknown CNI_COMMAND: %s\", cmd)\n }\n if err != nil {\n  fmt.Fprintf(os.Stderr, \"Failed to handle CNI_COMMAND %q: %v\", cmd, err)\n  os.Exit(1)\n }\n}\n```\n\n可以看到，我们首先调用 `GetArgsFromEnv()` 函数将 CNI 插件的操作命令以及相关参数通过环境变量读入，同时从标准输入获取 CNI 插件的 JSON 配置，然后基于不同的 CNI 操作命令执行不同的处理函数。\n\n需要注意的是，我们将处理函数的集合实现为一个**接口**[12]，这样就可以很容易地扩展不同的接口实现。在最基础的版本实现中，我们基于文件存储分配的 IP 信息。但是，这种实现方式存在很多问题，例如，文件存储不可靠，读写可能会发生冲突等，在后续的版本中，我们会实现基于 kubernetes 存储的接口实现，将子网信息以及 IP 信息存储到 apiserver 中，从而实现可靠存储。\n\n接下来，我们就看看基于文件的接口实现是怎么处理这些 CNI 操作命令的。\n\n对于 ADD 命令：\n\n1. 从标准输入获取 CNI 插件的配置信息，最重要的是当前宿主机网桥的设备名、网络设备的最大传输单元(MTU)以及当前节点分配的 24 位子网地址；\n2. 然后从环境变量中找到对应的 CNI 操作参数，包括 pod 容器网络命名空间以及 pod 网络设备名等；\n3. 接下来创建或者更新节点宿主机网桥，从当前节点分配的 24 位子网地址中抽取子网的网关地址，准备分配给节点宿主机网桥；\n4. 接着从文件读取已经分配的 IP 地址列表，遍历 24 位子网地址并从中取出第一个没有被分配的 IP 地址信息，准备分配给 pod 网络设备；pod 网络设备是 veth 设备对，一端在 pod 网络命名空间中，另外一端连接着宿主机上的网桥设备，同时所有的 pod 网络设备将宿主机上的网桥设备当作默认网关；\n5. 最终成功后需要将新的 pod IP 写入到文件中\n\n看起来很简单对吧？其实作为最简单的方式，这种方案可以实现最基础的 ADD 功能：\n\n```\nfunc (fh *FileHandler) HandleAdd(cmdArgs *args.CmdArgs) error {\n cniConfig := args.CNIConfiguration{}\n if err := json.Unmarshal(cmdArgs.StdinData, &cniConfig); err != nil {\n  return err\n }\n allIPs, err := nettool.GetAllIPs(cniConfig.Subnet)\n if err != nil {\n  return err\n }\n gwIP := allIPs[0]\n\n // open or create the file that stores all the reserved IPs\n f, err := os.OpenFile(fh.IPStore, os.O_RDWR|os.O_CREATE, 0600)\n if err != nil {\n  return fmt.Errorf(\"failed to open file that stores reserved IPs %v\", err)\n }\n defer f.Close()\n\n // get all the reserved IPs from file\n content, err := ioutil.ReadAll(f)\n if err != nil {\n  return err\n }\n reservedIPs := strings.Split(strings.TrimSpace(string(content)), \"\\n\")\n \n podIP := \"\"\n for _, ip := range allIPs[1:] {\n  reserved := false\n  for _, rip := range reservedIPs {\n   if ip == rip {\n    reserved = true\n    break\n   }\n  }\n  if !reserved {\n   podIP = ip\n   reservedIPs = append(reservedIPs, podIP)\n   break\n  }\n }\n if podIP == \"\" {\n  return fmt.Errorf(\"no IP available\")\n }\n\n // Create or update bridge\n brName := cniConfig.Bridge\n if brName == \"\" {\n  // fall back to default bridge name: minicni0\n  brName = \"minicni0\"\n }\n mtu := cniConfig.MTU\n if mtu == 0 {\n  // fall back to default MTU: 1500\n  mtu = 1500\n }\n br, err := nettool.CreateOrUpdateBridge(brName, gwIP, mtu)\n if err != nil {\n  return err\n }\n\n netns, err := ns.GetNS(cmdArgs.Netns)\n if err != nil {\n  return err\n }\n\n if err := nettool.SetupVeth(netns, br, cmdArgs.IfName, podIP, gwIP, mtu); err != nil {\n  return err\n }\n\n // write reserved IPs back into file\n if err := ioutil.WriteFile(fh.IPStore, []byte(strings.Join(reservedIPs, \"\\n\")), 0600); err != nil {\n  return fmt.Errorf(\"failed to write reserved IPs into file: %v\", err)\n }\n\n return nil\n}\n```\n\n一个关键的问题是如何选择合适的 Go 语言库函数来操作 Linux 网络设备，如创建网桥设备、网络命名空间以及连接 
veth 设备对。在我们的例子中，选择了比较成熟的 **netlink**[13]，实际上，所有基于 iproute2 工具包的命令在 netlink 库中都有对应的 API，例如 `ip link add` 可以通过调用 `AddLink()` 函数来实现。\n\n还有一个问题需要格外小心，那就是处理网络命名空间切换、Go 协程与线程调度问题。在 Linux 中，不同的操作系统线程可能会设置不同的网络命名空间，而 Go 语言的协程会基于操作系统线程的负载以及其他信息动态地在不同的操作系统线程之间切换，这样可能会导致 Go 协程在意想不到的情况下切换到不同的网络命名空间中。\n\n比较稳妥的做法是，利用 Go 语言提供的 `runtime.LockOSThread()` 函数保证特定的 Go 协程绑定到当前的操作系统线程中。\n\n对于 ADD 操作的返回，确保操作成功之后向标准输出中写入 ADD 操作的返回信息：\n\n```\naddCmdResult := &AddCmdResult{\n  CniVersion: cniConfig.CniVersion,\n  IPs: &nettool.AllocatedIP{\n   Version: \"IPv4\",\n   Address: podIP,\n   Gateway: gwIP,\n  },\n }\n addCmdResultBytes, err := json.Marshal(addCmdResult)\n if err != nil {\n  return err\n }\n\n // kubelet expects json format from stdout if success\n fmt.Print(string(addCmdResultBytes))\n\n    return nil\n```\n\n其他三个 CNI 操作命令的处理就更简单了。DEL 操作只需要回收分配的 IP 地址，从文件中删除对应的条目，我们不需要处理 pod 网络设备的删除，原因是 kubelet 在删除 pod 网络命名空间之后这些 pod 网络设备也会自动被删除；CHECK 命令检查之前创建的网络设备与配置，暂时是可选的；VERSION 命令以 JSON 形式输出 CNI 版本信息到标准输出。\n\n```\nfunc (fh *FileHandler) HandleDel(cmdArgs *args.CmdArgs) error {\n netns, err := ns.GetNS(cmdArgs.Netns)\n if err != nil {\n  return err\n }\n ip, err := nettool.GetVethIPInNS(netns, cmdArgs.IfName)\n if err != nil {\n  return err\n }\n\n // open or create the file that stores all the reserved IPs\n f, err := os.OpenFile(fh.IPStore, os.O_RDWR|os.O_CREATE, 0600)\n if err != nil {\n  return fmt.Errorf(\"failed to open file that stores reserved IPs %v\", err)\n }\n defer f.Close()\n\n // get all the reserved IPs from file\n content, err := ioutil.ReadAll(f)\n if err != nil {\n  return err\n }\n reservedIPs := strings.Split(strings.TrimSpace(string(content)), \"\\n\")\n\n for i, rip := range reservedIPs {\n  if rip == ip {\n   reservedIPs = append(reservedIPs[:i], reservedIPs[i+1:]...)\n   break\n  }\n }\n\n // write reserved IPs back into file\n if err := ioutil.WriteFile(fh.IPStore, []byte(strings.Join(reservedIPs, \"\\n\")), 0600); err != nil {\n  return fmt.Errorf(\"failed to write 
reserved IPs into file: %v\", err)\n }\n\n return nil\n}\n\nfunc (fh *FileHandler) HandleCheck(cmdArgs *args.CmdArgs) error {\n // to be implemented\n return nil\n}\n\nfunc (fh *FileHandler) HandleVersion(cmdArgs *args.CmdArgs) error {\n versionInfo, err := json.Marshal(fh.VersionInfo)\n if err != nil {\n  return err\n }\n fmt.Print(string(versionInfo))\n return nil\n}\n```\n\n#### 3.2 原创部分\n\nkubelet会将`pod_namespace pod_name infra_container_id`连同CNI的配置一起作为参数传递给CNI插件，CNI插件需要完成对`infra container`的网络配置和IP分配，并将结果通过标准输出返回给kubelet。\n\n而在CNI的二进制中，实际上只需要实现两个方法\n\n![image-20220401155049438](../images/cni-0401-1.png)\n\nCNI可以获取到Pod的元数据，我们可以在pod Annotation里面携带vpc信息，实现定制化操作。\n\n<br>\n\n其实cni的核心就是根据kubelet传入的参数初始化网络环境。知道了原理之后，就很容易实现一个自定义的cni。\n\n可以看看这个repo，直接通过shell脚本就实现了一个cni：https://github.com/eranyanay/cni-from-scratch/\n\n### 4. 参考\n\nhttps://jishuin.proginn.com/p/763bfbd57bc0\n\nhttps://github.com/containernetworking/cni/blob/main/SPEC.md\n\nhttps://github.com/eranyanay/cni-from-scratch/\n\n"
  },
  {
    "path": "k8s/cni/7. flannel原理浅析分析.md",
"content": "### 1. 原理简介\n\nFlannel 是 CoreOS 团队针对 Kubernetes 设计的一个网络规划实现。简单来说，它的功能有以下几点：\n\n1、使集群中的不同 Node 主机创建的 Docker 容器都具有全集群唯一的虚拟 IP 地址；\n\n2、建立一个覆盖网络（overlay network），这个覆盖网络会将数据包原封不动地传递到目标容器中。覆盖网络是建立在另一个网络之上并由其基础设施支持的虚拟网络。覆盖网络通过将一个分组封装在另一个分组内来将网络服务与底层基础设施分离。在将封装的数据包转发到端点后，将其解封装；\n\n3、创建一个新的虚拟网络设备（UDP 模式下是 TUN 设备 flannel0，VXLAN 模式下是 VXLAN 设备 flannel.1）接收 docker 网桥的数据，通过维护路由表，对接收到的数据进行封包和转发；\n\n4、路由信息一般存放到 etcd 中：多个 Node 上的 Flanneld 依赖一个 etcd cluster 来做集中配置服务，etcd 保证了所有 Node 上 Flannel 所看到的配置是一致的。同时每个 Node 上的 Flannel 都可以监听 etcd 上的数据变化，实时感知集群中 Node 的变化；\n\n5、Flannel 会在每个 Node 上创建上述虚拟网络设备，并且在每个 Node 上运行一个名为 Flanneld 的代理。每个 Node 上的 Flannel 代理会从 etcd 上为当前 Node 申请一个 CIDR 地址块用来给该 Node 上的 Pod 分配地址；\n\n6、Flannel 致力于给 Kubernetes 集群中的 Node 提供一个三层网络，它并不控制 Node 中的容器是如何进行组网的，仅仅关心流量如何在 Node 之间流转。\n\n![flannel](../images/cni-3.png)\n\n\n\n### 2. 源码分析\n\n待补充"
  },
  {
    "path": "k8s/cni/8. calico原理浅析md.md",
"content": "\n\n这篇文章写得太好了，可以参考：[https://www.cnblogs.com/goldsunshine/p/10701242.html](https://links.jianshu.com/go?to=https%3A%2F%2Fwww.cnblogs.com%2Fgoldsunshine%2Fp%2F10701242.html)\n\ncalico有两种模式：ipip(默认)、bgp。bgp效率相对更高\n\n* 如果宿主机在同一个网段，可以使用bgp模式；\n* 如果宿主机不在同一个网段，pod通过BGP的hostGW是不可能互相通讯的，此时需要使用ipip模式（如果仍想使用bgp模式，除非你在中间路由器上手动添加路由）\n\n\n\nflannel 是overlay类型的。\n\n缺点是：\n\n1. 不支持pod之间的网络隔离。Flannel设计思想是将所有的pod都放在一个大的二层网络中，所以pod之间没有隔离策略。\n2. 设备复杂，效率不高。Flannel模型下有三种设备，数据经过多种设备的封装、解析，势必会造成传输效率的下降。\n\n\n\nCalico是Underlay类型的。\n\n缺点是：\n\n* 复杂\n\n* 1台 Host 上可能虚拟化十几或几十个容器实例，过多的 iptables 规则造成复杂性和不可调试性，同时也存在性能损耗。"
  },
  {
    "path": "k8s/install-k8s-from source code/1-debian二进制安装v1.17 k8s.md",
"content": "Table of Contents\n=================\n\n  * [1. 集群规划](#1-集群规划)\n  * [2.准备工作](#2准备工作)\n     * [2.1 修改主机名](#21-修改主机名)\n     * [2.2 关闭 SElinux 和防火墙](#22-关闭-selinux-和防火墙)\n     * [2.3 同步机器时间](#23-同步机器时间)\n  * [3. etcd集群部署](#3-etcd集群部署)\n     * [2.1 etcd部署前的准备工作](#21-etcd部署前的准备工作)\n        * [2.1.1 准备cfssl证书生成工具](#211-准备cfssl证书生成工具)\n        * [2.1.2 自签证书颁发机构（CA）](#212-自签证书颁发机构ca)\n        * [2.1.3 使用自签CA签发Etcd HTTPS证书](#213-使用自签ca签发etcd-https证书)\n     * [2.2 下载etcd](#22-下载etcd)\n     * [2.3 安装etcd](#23-安装etcd)\n  * [3. node和master 安装docker](#3-node和master-安装docker)\n  * [4. 部署kmaster组件](#4-部署kmaster组件)\n     * [4.1 部署kube-apiserver](#41-部署kube-apiserver)\n        * [4.1.1 生成kube-apiserver证书](#411-生成kube-apiserver证书)\n        * [4.2.1 确定二进制文件和配置文件路径](#421-确定二进制文件和配置文件路径)\n        * [4.2.2 启用 TLS Bootstrapping 机制](#422-启用-tls-bootstrapping-机制)\n        * [4.2.3 systemd管理apiserver](#423-systemd管理apiserver)\n        * [4.2.4 授权kubelet-bootstrap用户允许请求证书](#424-授权kubelet-bootstrap用户允许请求证书)\n     * [4.2 部署kube-controller-manager](#42-部署kube-controller-manager)\n        * [4.2.1 创建配置文件](#421-创建配置文件)\n        * [4.2.2 systemd管理controller-manager](#422-systemd管理controller-manager)\n     * [4.3 部署kube-scheduler](#43-部署kube-scheduler)\n        * [4.3.1 创建配置文件](#431-创建配置文件)\n        * [4.3.2 systemd管理scheduler](#432-systemd管理scheduler)\n        * [4.3.3 启动并设置开机启动](#433-启动并设置开机启动)\n        * [4.3.4 查看集群状态](#434-查看集群状态)\n  * [5.部署dnode节点](#5部署dnode节点)\n     * [5.1 文件和目录准备](#51-文件和目录准备)\n     * [5.2 部署kubelet](#52-部署kubelet)\n        * [5.2.1. 创建配置文件](#521-创建配置文件)\n        * [5.2.2 配置参数文件](#522-配置参数文件)\n        * [5.2.3 生成bootstrap.kubeconfig文件](#523-生成bootstrapkubeconfig文件)\n        * [5.2.4 systemd管理kubelet](#524-systemd管理kubelet)\n        * [5.2.5 批准kubelet证书申请并加入集群](#525-批准kubelet证书申请并加入集群)\n     * [5.3 部署kube-proxy](#53-部署kube-proxy)\n        * [5.3.1 创建配置文件](#531-创建配置文件)\n        * [5.3.2 配置参数文件](#532-配置参数文件)\n        * [5.3.3. 生成kube-proxy.kubeconfig文件](#533-生成kube-proxykubeconfig文件)\n        * [5.3.4. systemd管理kube-proxy](#534-systemd管理kube-proxy)\n     * [5.4 部署网络环境](#54-部署网络环境)\n     * [5.5  授权apiserver访问kubelet](#55--授权apiserver访问kubelet)\n  * [6 新增加Node](#6-新增加node)\n        * [6.1. 拷贝已部署好的Node相关文件到新节点](#61-拷贝已部署好的node相关文件到新节点)\n        * [6.2 删除kubelet证书和kubeconfig文件](#62-删除kubelet证书和kubeconfig文件)\n        * [6.3. 修改主机名](#63-修改主机名)\n        * [6.4. 启动并设置开机启动](#64-启动并设置开机启动)\n        * [6.5. 在Master上批准新Node kubelet证书申请](#65-在master上批准新node-kubelet证书申请)\n        * [6.6. 查看Node状态](#66-查看node状态)\n  * [7.可能遇到的坑](#7可能遇到的坑)\n\n### 1. 集群规划\n\n这里使用了百度云的两台主机来搭建集群。配置如下：\n\n两台机器都是：2核，4GB，40GB， 1M   计算型C3\n\n| IP          | 角色            |\n| ----------- | --------------- |\n| 192.168.0.4 | kmaster & dnode |\n| 192.168.0.5 | dnode           |\n\n\n其中etcd集群：部署在 192.168.0.4,192.168.0.5中\n\n192.168.0.4 节点既当kmaster又当dnode\n192.168.0.5 节点只当dnode\n\n### 2.准备工作\n#### 2.1 修改主机名\n默认的云机器名都是一个字符串，这里我进行了修改\n（1） 在192.168.0.4 使用如下的命令，将主机名修改为 k8s-master\n```\nhostname k8s-master\n```\n（2）在192.168.0.5 使用如下的命令，将主机名修改为 k8s-node\n```\nhostname k8s-node\n```\n\n#### 2.2 关闭 SElinux 和防火墙\n\ndebian 可能没有下面的配置文件，没有就跳过\n\n```\n[root@k8s-master ~]# cat /etc/selinux/config \n\n# This file controls the state of SELinux on the system.\n# SELINUX= can take one of these three values:\n#     enforcing - SELinux security policy is enforced.\n#     permissive - SELinux prints warnings instead of enforcing.\n#     disabled - No SELinux policy is loaded.\nSELINUX=disabled\n# SELINUXTYPE= can take one of three values:\n#     targeted - Targeted processes are protected,\n#     minimum - Modification of targeted policy. Only selected processes are protected. \n#     mls - Multi Level Security protection.\nSELINUXTYPE=targeted\n\n\n[root@k8s-master ~]# \n[root@k8s-master ~]# systemctl stop firewalld\n```\n\n#### 2.3 同步机器时间\n\n一般云主机时间都是对的，像虚拟机一般都要同步一下时间\n\n```\nntpdate time.windows.com\n\n```\n\n<br>\n\n### 3. 
etcd集群部署\n\n#### 2.1 etcd部署前的准备工作\n\n##### 2.1.1 准备cfssl证书生成工具\n\ncfssl是一个开源的证书管理工具，使用json文件生成证书，相比openssl更方便使用。\n\n找任意一台服务器操作，这里用Master节点。\n\n```\nwget https://pkg.cfssl.org/R1.2/cfssl_linux-amd64\nwget https://pkg.cfssl.org/R1.2/cfssljson_linux-amd64\nwget https://pkg.cfssl.org/R1.2/cfssl-certinfo_linux-amd64\nchmod +x cfssl_linux-amd64 cfssljson_linux-amd64 cfssl-certinfo_linux-amd64\nmv cfssl_linux-amd64 /usr/local/bin/cfssl\nmv cfssljson_linux-amd64 /usr/local/bin/cfssljson\nmv cfssl-certinfo_linux-amd64 /usr/bin/cfssl-certinfo\n```\n\n##### 2.1.2 自签证书颁发机构（CA）\n（1） 创建工作目录：\n```\nmkdir -p ~/TLS/{etcd,k8s}\n\ncd TLS/etcd\n```\n\n(2) 自签CA：\n\n```\ncat > ca-config.json << EOF\n{\n  \"signing\": {\n    \"default\": {\n      \"expiry\": \"87600h\"\n    },\n    \"profiles\": {\n      \"www\": {\n         \"expiry\": \"87600h\",\n         \"usages\": [\n            \"signing\",\n            \"key encipherment\",\n            \"server auth\",\n            \"client auth\"\n        ]\n      }\n    }\n  }\n}\nEOF\n\ncat > ca-csr.json << EOF\n{\n    \"CN\": \"etcd CA\",\n    \"key\": {\n        \"algo\": \"rsa\",\n        \"size\": 2048\n    },\n    \"names\": [\n        {\n            \"C\": \"CN\",\n            \"L\": \"Beijing\",\n            \"ST\": \"Beijing\"\n        }\n    ]\n}\nEOF\n```\n\n(3) 生成证书：\n\n```\ncfssl gencert -initca ca-csr.json | cfssljson -bare ca -\n```\n\n查看是否成功，只要有ca-key.pem ca.pem就是成功了\n```\nls *pem\nca-key.pem  ca.pem\n```\n\n##### 2.1.3 使用自签CA签发Etcd HTTPS证书\n\n（1）创建证书申请文件：\n\n```\ncat > server-csr.json << EOF\n{\n    \"CN\": \"etcd\",\n    \"hosts\": [\n    \"192.168.0.4\",\n    \"192.168.0.5\"\n    ],\n    \"key\": {\n        \"algo\": \"rsa\",\n        \"size\": 2048\n    },\n    \"names\": [\n        {\n            \"C\": \"CN\",\n            \"L\": \"BeiJing\",\n            \"ST\": \"BeiJing\"\n        }\n    ]\n}\nEOF\n```\n上述文件hosts字段中IP为所有etcd节点的集群内部通信IP，一个都不能少！为了方便后期扩容可以多写几个预留的IP。\n\n（2）生成证书：\n\n```\ncfssl gencert -ca=ca.pem 
-ca-key=ca-key.pem -config=ca-config.json -profile=www server-csr.json | cfssljson -bare server\n```\n查看是否成功，只要有server-key.pem server.pem就是成功了\n```\nls server*pem\nserver-key.pem  server.pem\n```\n\n#### 2.2 下载etcd\n\n不同的k8s版本对应不同的etcd版本，这个可以在官网的changelog里面看到。这里下载的是3.4.3版本\n\n下载地址：https://github.com/etcd-io/etcd/releases\n\n#### 2.3 安装etcd\n\n\n（1）确定二进制文件和配置文件路径\n\n/opt/etcd/bin 是存放二进制文件的，主要是 ectd, etcdctl\n\n/opt/etcd/cfg 是存放etcd 配置的\n\n/opt/etcd/ssl 是存放ectd 证书的\n\n\n```\nroot@k8s-master:~# mkdir /opt/etcd/{bin,cfg,ssl} -p\n\n[root@k8s-master ]# cd /opt/etcd/\n[root@k8s-master etcd]# ls\nbin  cfg  ssl\n\n// bin目录\ntar zxvf etcd-v3.4.3-linux-amd64.tar.gz\ncp etcd etcdctl /opt/etcd/bin/\n\n[root@k8s-master bin]# ls\netcd  etcdctl\n\n\n// ssl目录  这里的证书就是，上面第二步生成的etcd证书\ncp ~/TLS/etcd/ca*pem ~/TLS/etcd/server*pem /opt/etcd/ssl/\n\n\n[root@k8s-master etcd-cert]# cd /opt/etcd/ssl/\n[root@k8s-master ssl]# ls\nca-key.pem  ca.pem  server-key.pem  server.pem\n\n\n// config目录\netcd会监听俩个接口，2380是集群之间进行通信的，2379是数据接口，get,put等数据的接口\n\ncat > /opt/etcd/cfg/etcd.conf << EOF\n#[Member]\nETCD_NAME=\"etcd01\"\nETCD_DATA_DIR=\"/var/lib/etcd/default.etcd\"\nETCD_LISTEN_PEER_URLS=\"https://192.168.0.4:2380\"\nETCD_LISTEN_CLIENT_URLS=\"https://192.168.0.4:2379\"\n\n#[Clustering]\nETCD_INITIAL_ADVERTISE_PEER_URLS=\"https://192.168.0.4:2380\"\nETCD_ADVERTISE_CLIENT_URLS=\"https://192.168.0.4:2379\"\nETCD_INITIAL_CLUSTER=\"etcd01=https://192.168.0.4:2380,etcd02=https://192.168.0.5:2380\"\nETCD_INITIAL_CLUSTER_TOKEN=\"etcd-cluster\"\nETCD_INITIAL_CLUSTER_STATE=\"new\"\nEOF\n\nETCD_NAME：节点名称，集群中唯一\nETCD_DATA_DIR：数据目录\nETCD_LISTEN_PEER_URLS：集群通信监听地址\nETCD_LISTEN_CLIENT_URLS：客户端访问监听地址\nETCD_INITIAL_ADVERTISE_PEER_URLS：集群通告地址\nETCD_ADVERTISE_CLIENT_URLS：客户端通告地址\nETCD_INITIAL_CLUSTER：集群节点地址\nETCD_INITIAL_CLUSTER_TOKEN：集群Token\nETCD_INITIAL_CLUSTER_STATE：加入集群的当前状态，new是新集群，existing表示加入已有集群\n\n```\n\n(2) systemd管理etcd\n```\ncat > /usr/lib/systemd/system/etcd.service << EOF\n[Unit]\nDescription=Etcd 
Server\nAfter=network.target\nAfter=network-online.target\nWants=network-online.target\n\n[Service]\nType=notify\nEnvironmentFile=/opt/etcd/cfg/etcd.conf\nExecStart=/opt/etcd/bin/etcd \\\n--cert-file=/opt/etcd/ssl/server.pem \\\n--key-file=/opt/etcd/ssl/server-key.pem \\\n--peer-cert-file=/opt/etcd/ssl/server.pem \\\n--peer-key-file=/opt/etcd/ssl/server-key.pem \\\n--trusted-ca-file=/opt/etcd/ssl/ca.pem \\\n--peer-trusted-ca-file=/opt/etcd/ssl/ca.pem \\\n--logger=zap\nRestart=on-failure\nLimitNOFILE=65536\n\n[Install]\nWantedBy=multi-user.target\nEOF\n```\n\n(3) 启动并设置开机启动\n```\nsystemctl daemon-reload\nsystemctl start etcd\nsystemctl enable etcd\n```\n\n第一次启动一般会失败，因为第二个节点还没有启动etcd。\n可以用下面的命令查看etcd服务的最后40行日志，有时候还可以通过 tail -f /var/log/messages 查看哪里出现了问题。\n```\njournalctl -n 40 -u etcd\n```\n\n\n(4) 在其他节点上启动etcd服务\n\n```\n1. 将master的相关配置复制到node节点\nscp -r /opt/etcd/ root@192.168.0.5:/opt/\n\nscp /usr/lib/systemd/system/etcd.service root@192.168.0.5:/usr/lib/systemd/system/\n\n2. 在node修改不一致的地方\nroot@k8s-dnode:~# cat /opt/etcd/cfg/etcd.conf \n#[Member]\nETCD_NAME=\"etcd02\"\nETCD_DATA_DIR=\"/var/lib/etcd/default.etcd\"\nETCD_LISTEN_PEER_URLS=\"https://192.168.0.5:2380\"\nETCD_LISTEN_CLIENT_URLS=\"https://192.168.0.5:2379\"\n\n#[Clustering]\nETCD_INITIAL_ADVERTISE_PEER_URLS=\"https://192.168.0.5:2380\"\nETCD_ADVERTISE_CLIENT_URLS=\"https://192.168.0.5:2379\"\nETCD_INITIAL_CLUSTER=\"etcd01=https://192.168.0.4:2380,etcd02=https://192.168.0.5:2380\"\nETCD_INITIAL_CLUSTER_TOKEN=\"etcd-cluster\"\nETCD_INITIAL_CLUSTER_STATE=\"new\"\n\n3. 启动并设置开机启动\nsystemctl daemon-reload\nsystemctl start etcd\nsystemctl enable etcd\n```\n\n（5）检查etcd集群是否正常运行\n```\nroot@k8s-master:/usr/lib/systemd/system# systemctl enable etcd\nCreated symlink /etc/systemd/system/multi-user.target.wants/etcd.service → /lib/systemd/system/etcd.service.\nroot@k8s-master:/usr/lib/systemd/system# \nroot@k8s-master:/usr/lib/systemd/system# \nroot@k8s-master:/usr/lib/systemd/system# systemctl status etcd\n● etcd.service - 
Etcd Server\n   Loaded: loaded (/lib/systemd/system/etcd.service; enabled; vendor preset: enabled)\n   Active: active (running) since Sat 2021-10-23 15:58:02 CST; 20s ago\n Main PID: 3728 (etcd)\n    Tasks: 10 (limit: 4700)\n   Memory: 23.8M\n   CGroup: /system.slice/etcd.service\n           └─3728 /opt/etcd/bin/etcd --cert-file=/opt/etcd/ssl/server.pem --key-file=/opt/etcd/ssl/server-key.pem --peer-cert-file=/opt/etcd/ssl/server.pem --peer-key-file=/opt/etc\n\nOct 23 15:58:02 k8s-master etcd[3728]: {\"level\":\"info\",\"ts\":\"2021-10-23T15:58:02.698+0800\",\"caller\":\"raft/raft.go:765\",\"msg\":\"5ac283d796e472ba became leader at term 579\"}\nOct 23 15:58:02 k8s-master etcd[3728]: {\"level\":\"info\",\"ts\":\"2021-10-23T15:58:02.698+0800\",\"caller\":\"raft/node.go:325\",\"msg\":\"raft.node: 5ac283d796e472ba elected leader 5ac283d796e\nOct 23 15:58:02 k8s-master etcd[3728]: {\"level\":\"warn\",\"ts\":\"2021-10-23T15:58:02.703+0800\",\"caller\":\"etcdserver/server.go:2045\",\"msg\":\"failed to publish local member to cluster thr\nOct 23 15:58:02 k8s-master etcd[3728]: {\"level\":\"info\",\"ts\":\"2021-10-23T15:58:02.707+0800\",\"caller\":\"etcdserver/server.go:2016\",\"msg\":\"published local member to cluster through raf\nOct 23 15:58:02 k8s-master etcd[3728]: {\"level\":\"info\",\"ts\":\"2021-10-23T15:58:02.709+0800\",\"caller\":\"embed/serve.go:191\",\"msg\":\"serving client traffic securely\",\"address\":\"192.168.\nOct 23 15:58:02 k8s-master systemd[1]: Started Etcd Server.\nOct 23 15:58:02 k8s-master etcd[3728]: {\"level\":\"info\",\"ts\":\"2021-10-23T15:58:02.719+0800\",\"caller\":\"etcdserver/server.go:2501\",\"msg\":\"setting up initial cluster version\",\"cluster-\nOct 23 15:58:02 k8s-master etcd[3728]: {\"level\":\"info\",\"ts\":\"2021-10-23T15:58:02.722+0800\",\"caller\":\"membership/cluster.go:558\",\"msg\":\"set initial cluster version\",\"cluster-id\":\"a8\nOct 23 15:58:02 k8s-master etcd[3728]: 
{\"level\":\"info\",\"ts\":\"2021-10-23T15:58:02.722+0800\",\"caller\":\"api/capability.go:76\",\"msg\":\"enabled capabilities for version\",\"cluster-version\nOct 23 15:58:02 k8s-master etcd[3728]: {\"level\":\"info\",\"ts\":\"2021-10-23T15:58:02.722+0800\",\"caller\":\"etcdserver/server.go:2533\",\"msg\":\"cluster version is updated\",\"cluster-version\"\nroot@k8s-master:/usr/lib/systemd/system#\n\n\n查看集群健康状态\nroot@k8s-master:/usr/lib/systemd/system# ETCDCTL_API=3 /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem --cert=/opt/etcd/ssl/server.pem --key=/opt/etcd/ssl/server-key.pem --endpoints=\"https://192.168.0.4:2379,https://192.168.0.5:2379\" endpoint health\nhttps://192.168.0.4:2379 is healthy: successfully committed proposal: took = 12.092244ms\nhttps://192.168.0.5:2379 is healthy: successfully committed proposal: took = 12.96782m\n\n```\n\n\n<br>\n\n### 3. node和master 安装docker\n\n这里我master节点也想使用docker,所以在每个节点都安装了。\n\n具体步骤如下：\n（1）下载二进制\n下载地址：https://download.docker.com/linux/static/stable/x86_64/docker-19.03.9.tgz\n\n（2）解压二进制包\n```\ntar zxvf docker-19.03.9.tgz\nmv docker/* /usr/bin\n```\n(3) systemd管理docker\n```\ncat > /usr/lib/systemd/system/docker.service << EOF\n[Unit]\nDescription=Docker Application Container Engine\nDocumentation=https://docs.docker.com\nAfter=network-online.target firewalld.service\nWants=network-online.target\n\n[Service]\nType=notify\nExecStart=/usr/bin/dockerd\nExecReload=/bin/kill -s HUP $MAINPID\nLimitNOFILE=infinity\nLimitNPROC=infinity\nLimitCORE=infinity\nTimeoutStartSec=0\nDelegate=yes\nKillMode=process\nRestart=on-failure\nStartLimitBurst=3\nStartLimitInterval=60s\n\n[Install]\nWantedBy=multi-user.target\nEOF\n```\n(4) 创建配置文件\n\nregistry-mirrors 阿里云镜像加速器\n```\nmkdir /etc/docker\ncat > /etc/docker/daemon.json << EOF\n{\n  \"registry-mirrors\": [\"https://b9pmyelo.mirror.aliyuncs.com\"]\n}\nEOF\n```\n(5) 启动并设置开机启动\n```\nsystemctl daemon-reload\nsystemctl start docker\nsystemctl enable docker\n```\n\n\n<br>\n\n### 4. 
部署kmaster组件\n#### 4.1 部署kube-apiserver\n\n##### 4.1.1 生成kube-apiserver证书\n\n(1) 自签证书颁发机构（CA）\n\n在 ~/TLS/k8s目录下生成\n\n```\ncat > ca-config.json << EOF\n{\n  \"signing\": {\n    \"default\": {\n      \"expiry\": \"87600h\"\n    },\n    \"profiles\": {\n      \"kubernetes\": {\n         \"expiry\": \"87600h\",\n         \"usages\": [\n            \"signing\",\n            \"key encipherment\",\n            \"server auth\",\n            \"client auth\"\n        ]\n      }\n    }\n  }\n}\nEOF\ncat > ca-csr.json << EOF\n{\n    \"CN\": \"kubernetes\",\n    \"key\": {\n        \"algo\": \"rsa\",\n        \"size\": 2048\n    },\n    \"names\": [\n        {\n            \"C\": \"CN\",\n            \"L\": \"Beijing\",\n            \"ST\": \"Beijing\",\n            \"O\": \"k8s\",\n            \"OU\": \"System\"\n        }\n    ]\n}\nEOF\n```\n\n(2) 生成ca证书：\n```\nroot@k8s-master:~/TLS/k8s# cfssl gencert -initca ca-csr.json | cfssljson -bare ca -\n2021/10/23 16:27:02 [INFO] generating a new CA key and certificate from CSR\n2021/10/23 16:27:02 [INFO] generate received request\n2021/10/23 16:27:02 [INFO] received CSR\n2021/10/23 16:27:02 [INFO] generating key: rsa-2048\n2021/10/23 16:27:02 [INFO] encoded CSR\n2021/10/23 16:27:02 [INFO] signed certificate with serial number 691553883019556193564185774219449501300204309030\nroot@k8s-master:~/TLS/k8s# ls *pem\nca-key.pem  ca.pem\n```\n\n(3) 使用自签CA签发kube-apiserver HTTPS证书\n```\ncat > server-csr.json << EOF\n{\n    \"CN\": \"kubernetes\",\n    \"hosts\": [\n      \"10.0.0.1\",\n      \"127.0.0.1\",\n      \"192.168.0.4\",\n      \"192.168.0.5\",\n      \"kubernetes\",\n      \"kubernetes.default\",\n      \"kubernetes.default.svc\",\n      \"kubernetes.default.svc.cluster\",\n      \"kubernetes.default.svc.cluster.local\"\n    ],\n    \"key\": {\n        \"algo\": \"rsa\",\n        \"size\": 2048\n    },\n    \"names\": [\n        {\n            \"C\": \"CN\",\n            \"L\": \"BeiJing\",\n            \"ST\": \"BeiJing\",\n         
   \"O\": \"k8s\",\n            \"OU\": \"System\"\n        }\n    ]\n}\nEOF\n```\n\n注：上述文件hosts字段中IP为所有Master/LB/VIP IP，一个都不能少！为了方便后期扩容可以多写几个预留的IP。\n\n(4) 生成证书：\n```\nroot@k8s-master:~/TLS/k8s# cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes server-csr.json | cfssljson -bare server\n2021/10/23 16:30:16 [INFO] generate received request\n2021/10/23 16:30:16 [INFO] received CSR\n2021/10/23 16:30:16 [INFO] generating key: rsa-2048\n2021/10/23 16:30:16 [INFO] encoded CSR\n2021/10/23 16:30:16 [INFO] signed certificate with serial number 85202347845231770518313014605424297876620496751\n2021/10/23 16:30:16 [WARNING] This certificate lacks a \"hosts\" field. This makes it unsuitable for\nwebsites. For more information see the Baseline Requirements for the Issuance and Management\nof Publicly-Trusted Certificates, v.1.1.6, from the CA/Browser Forum (https://cabforum.org);\nspecifically, section 10.2.3 (\"Information Requirements\").\nroot@k8s-master:~/TLS/k8s# ls server*pem\nserver-key.pem  server.pem\n```\n\n##### 4.1.2 确定二进制文件和配置文件路径\n(1) 从Github下载二进制文件\n\n下载地址： https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.17.md\n\n注：打开链接你会发现里面有很多包，下载一个server包就够了，包含了Master和Worker Node二进制文件\n\n\n（2）bin目录\n```\nmkdir -p /opt/kubernetes/{bin,cfg,ssl,logs} \ntar zxvf kubernetes-server-linux-amd64.tar.gz\ncd kubernetes/server/bin\ncp kube-apiserver kube-scheduler kube-controller-manager /opt/kubernetes/bin\ncp kubectl /usr/bin/\n```\n\n\n（3）cfg目录\n```\ncat > /opt/kubernetes/cfg/kube-apiserver.conf << EOF\nKUBE_APISERVER_OPTS=\"--logtostderr=false \\\\\n--v=4 \\\\\n--log-dir=/opt/kubernetes/logs \\\\\n--etcd-servers=https://192.168.0.4:2379,https://192.168.0.5:2379 \\\\\n--bind-address=192.168.0.4 \\\\\n--secure-port=6443 \\\\\n--advertise-address=192.168.0.4 \\\\\n--allow-privileged=true \\\\\n--service-cluster-ip-range=10.0.0.0/24 
\\\\\n--enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,ResourceQuota,NodeRestriction \\\\\n--authorization-mode=RBAC,Node \\\\\n--enable-bootstrap-token-auth=true \\\\\n--token-auth-file=/opt/kubernetes/cfg/token.csv \\\\\n--service-node-port-range=30000-32767 \\\\\n--kubelet-client-certificate=/opt/kubernetes/ssl/server.pem \\\\\n--kubelet-client-key=/opt/kubernetes/ssl/server-key.pem \\\\\n--tls-cert-file=/opt/kubernetes/ssl/server.pem  \\\\\n--tls-private-key-file=/opt/kubernetes/ssl/server-key.pem \\\\\n--client-ca-file=/opt/kubernetes/ssl/ca.pem \\\\\n--service-account-key-file=/opt/kubernetes/ssl/ca-key.pem \\\\\n--etcd-cafile=/opt/etcd/ssl/ca.pem \\\\\n--etcd-certfile=/opt/etcd/ssl/server.pem \\\\\n--etcd-keyfile=/opt/etcd/ssl/server-key.pem \\\\\n--audit-log-maxage=30 \\\\\n--audit-log-maxbackup=3 \\\\\n--audit-log-maxsize=100 \\\\\n--audit-log-path=/opt/kubernetes/logs/k8s-audit.log\"\nEOF\n```\n\n注：上面每行末尾的 \\\\ 中，第一个 \\ 是转义符，第二个 \\ 是续行符；使用转义符是为了在通过EOF写入文件时保留 \\ 和换行符。\n\n--logtostderr：启用日志\n\n--v：日志等级\n\n--log-dir：日志目录\n\n--etcd-servers：etcd集群地址\n\n--bind-address：监听地址\n\n--secure-port：https安全端口\n\n--advertise-address：集群通告地址\n\n--allow-privileged：启用授权\n\n--service-cluster-ip-range：Service虚拟IP地址段\n\n--enable-admission-plugins：准入控制模块\n\n--authorization-mode：认证授权，启用RBAC授权和节点自管理\n\n--enable-bootstrap-token-auth：启用TLS bootstrap机制\n\n--token-auth-file：bootstrap token文件\n\n--service-node-port-range：Service nodeport类型默认分配端口范围\n\n--kubelet-client-xxx：apiserver访问kubelet客户端证书\n\n--tls-xxx-file：apiserver https证书\n\n--etcd-xxxfile：连接Etcd集群证书\n\n--audit-log-xxx：审计日志\n\n（4）ssl目录\n\n把刚才生成的证书拷贝到配置文件中的路径：\n```\ncp ~/TLS/k8s/ca*pem ~/TLS/k8s/server*pem /opt/kubernetes/ssl/\n```\n\n##### 4.1.3 启用 TLS Bootstrapping 机制\nTLS Bootstrapping：Master apiserver启用TLS认证后，Node节点kubelet和kube-proxy要与kube-apiserver进行通信，必须使用CA签发的有效证书才可以，当Node节点很多时，这种客户端证书颁发需要大量工作，同样也会增加集群扩展复杂度。为了简化流程，Kubernetes引入了TLS 
bootstrapping机制来自动颁发客户端证书，kubelet会以一个低权限用户自动向apiserver申请证书，kubelet的证书由apiserver动态签署。所以强烈建议在Node上使用这种方式，目前主要用于kubelet，kube-proxy还是由我们统一颁发一个证书。\n\nTLS bootstrapping 工作流程：\n![bootstraping](../images/bootstraping.png)\n\n\n创建上述配置文件中token文件：\n\n```\ncat > /opt/kubernetes/cfg/token.csv << EOF\nc47ffb939f5ca36231d9e3121a252940,kubelet-bootstrap,10001,\"system:node-bootstrapper\"\nEOF\n```\n格式：token，用户名，UID，用户组\n\ntoken也可用这个命令自行生成替换：\n```\nhead -c 16 /dev/urandom | od -An -t x | tr -d ' '\n```\n\n##### 4.1.4 systemd管理apiserver\n```\ncat > /usr/lib/systemd/system/kube-apiserver.service << EOF\n[Unit]\nDescription=Kubernetes API Server\nDocumentation=https://github.com/kubernetes/kubernetes\n\n[Service]\nEnvironmentFile=/opt/kubernetes/cfg/kube-apiserver.conf\nExecStart=/opt/kubernetes/bin/kube-apiserver \\$KUBE_APISERVER_OPTS\nRestart=on-failure\n\n[Install]\nWantedBy=multi-user.target\nEOF\n```\n\nsystemctl daemon-reload\nsystemctl start kube-apiserver\nsystemctl enable kube-apiserver\n\n这个时候用 systemctl status kube-apiserver 查看，服务是running的，\n并且 kubectl get svc 有输出：\n```\nroot@k8s-master:~/kubernetes/server/bin# kubectl get svc\nNAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE\nkubernetes   ClusterIP   10.0.0.1     <none>        443/TCP   44s\n```\n\n##### 4.1.5 授权kubelet-bootstrap用户允许请求证书\n```\nkubectl create clusterrolebinding kubelet-bootstrap --clusterrole=system:node-bootstrapper --user=kubelet-bootstrap\n```\n\n#### 4.2 部署kube-controller-manager\n\n##### 4.2.1 创建配置文件\n```\ncat > /opt/kubernetes/cfg/kube-controller-manager.conf << EOF\nKUBE_CONTROLLER_MANAGER_OPTS=\"--logtostderr=false \\\\\n--v=4 \\\\\n--log-dir=/opt/kubernetes/logs \\\\\n--leader-elect=true \\\\\n--master=127.0.0.1:8080 \\\\\n--bind-address=127.0.0.1 \\\\\n--allocate-node-cidrs=true \\\\\n--cluster-cidr=10.244.0.0/16 \\\\\n--service-cluster-ip-range=10.0.0.0/24 \\\\\n--cluster-signing-cert-file=/opt/kubernetes/ssl/ca.pem \\\\\n--cluster-signing-key-file=/opt/kubernetes/ssl/ca-key.pem  
\\\\\n--root-ca-file=/opt/kubernetes/ssl/ca.pem \\\\\n--service-account-private-key-file=/opt/kubernetes/ssl/ca-key.pem \\\\\n--experimental-cluster-signing-duration=87600h0m0s\"\nEOF\n```\n--master：通过本地非安全端口8080连接apiserver。\n\n--leader-elect：当该组件启动多个时，自动选举（HA）\n\n--cluster-signing-cert-file/--cluster-signing-key-file：自动为kubelet颁发证书的CA，与apiserver保持一致\n\n##### 4.2.2 systemd管理controller-manager\n```\ncat > /usr/lib/systemd/system/kube-controller-manager.service << EOF\n[Unit]\nDescription=Kubernetes Controller Manager\nDocumentation=https://github.com/kubernetes/kubernetes\n\n[Service]\nEnvironmentFile=/opt/kubernetes/cfg/kube-controller-manager.conf\nExecStart=/opt/kubernetes/bin/kube-controller-manager \\$KUBE_CONTROLLER_MANAGER_OPTS\nRestart=on-failure\n\n[Install]\nWantedBy=multi-user.target\nEOF\n```\n\nsystemctl daemon-reload\nsystemctl start kube-controller-manager\nsystemctl enable kube-controller-manager\n\n这个时候kube-controller-manager状态是running的\n```\nroot@k8s-master:/opt/kubernetes/cfg# systemctl status kube-controller-manager\n● kube-controller-manager.service - Kubernetes Controller Manager\n   Loaded: loaded (/lib/systemd/system/kube-controller-manager.service; enabled; vendor preset: enabled)\n   Active: active (running) since Sat 2021-10-23 17:03:50 CST; 22s ago\n     Docs: https://github.com/kubernetes/kubernetes\n Main PID: 4957 (kube-controller)\n    Tasks: 9 (limit: 4700)\n   Memory: 29.0M\n   CGroup: /system.slice/kube-controller-manager.service\n           └─4957 /opt/kubernetes/bin/kube-controller-manager --logtostderr=false --v=4 --log-dir=/opt/kubernetes/logs --leader-elect=true --master=127.0.0.1:8080 --bind-address=12\n\nOct 23 17:03:50 k8s-master systemd[1]: Started Kubernetes Controller Manager.\nOct 23 17:03:52 k8s-master kube-controller-manager[4957]: E1023 17:03:52.290939    4957 core.go:91] Failed to start service controller: WARNING: no cloud provider provided, service\nOct 23 17:03:52 k8s-master kube-controller-manager[4957]: E1023 17:03:52.545623    4957 
core.go:232] failed to start cloud node lifecycle controller: no cloud provider provided\nOct 23 17:04:02 k8s-master kube-controller-manager[4957]: E1023 17:04:02.670438    4957 clusterroleaggregation_controller.go:180] admin failed with : Operation cannot be fulfilled \nOct 23 17:04:02 k8s-master kube-controller-manager[4957]: E1023 17:04:02.683306    4957 clusterroleaggregation_controller.go:180] admin failed with : Operation cannot be fulfilled \nroot@k8s-master:/opt/kubernetes/cfg# \n```\n\n#### 4.3 部署kube-scheduler\n\n##### 4.3.1 创建配置文件\n```\ncat > /opt/kubernetes/cfg/kube-scheduler.conf << EOF\nKUBE_SCHEDULER_OPTS=\"--logtostderr=false \\\\\n--v=4 \\\\\n--log-dir=/opt/kubernetes/logs \\\\\n--leader-elect \\\\\n--master=127.0.0.1:8080 \\\\\n--bind-address=127.0.0.1\"\nEOF\n```\n\n--master：通过本地非安全端口8080连接apiserver。\n\n--leader-elect：当该组件启动多个时，自动选举（HA）\n\n##### 4.3.2 systemd管理scheduler\n```\ncat > /usr/lib/systemd/system/kube-scheduler.service << EOF\n[Unit]\nDescription=Kubernetes Scheduler\nDocumentation=https://github.com/kubernetes/kubernetes\n\n[Service]\nEnvironmentFile=/opt/kubernetes/cfg/kube-scheduler.conf\nExecStart=/opt/kubernetes/bin/kube-scheduler \\$KUBE_SCHEDULER_OPTS\nRestart=on-failure\n\n[Install]\nWantedBy=multi-user.target\nEOF\n```\n##### 4.3.3 启动并设置开机启动\nsystemctl daemon-reload\nsystemctl start kube-scheduler\nsystemctl enable kube-scheduler\n\n##### 4.3.4 查看集群状态\n如下输出说明Master节点组件运行正常。\n\n\n```\nroot@k8s-master:/opt/kubernetes/cfg# kubectl get cs\nNAME                 STATUS    MESSAGE             ERROR\nscheduler            Healthy   ok                  \ncontroller-manager   Healthy   ok                  \netcd-0               Healthy   {\"health\":\"true\"} \n```\n\n\n### 5. 部署node节点\n\n#### 5.1 文件和目录准备\n\n下面还是在Master节点上操作，即Master同时也作为Worker Node。\n\n**master节点：**\n\n本地拷贝：\n\ncd kubernetes/server/bin\ncp kubelet kube-proxy /opt/kubernetes/bin   # 本地拷贝\n\n**node节点**\n\n在所有worker node创建工作目录：\n\nmkdir -p /opt/kubernetes/{bin,cfg,ssl,logs} 
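\n\n上面的工作目录需要在每个 worker 节点上都创建一遍。下面是一个假设性的小脚本（节点 IP 仅为演示，需替换成自己环境里的节点），先以 dry-run 方式拼接并打印将要执行的命令，确认无误后再改为真正通过 ssh 执行：\n\n```shell\n# 假设性示例：为多个 worker 节点批量生成创建工作目录的命令（IP 仅为演示）\nNODES=\"192.168.0.5 192.168.0.6\"\nCMDS=\"\"\nfor n in $NODES; do\n  # 这里只拼接命令做 dry-run；实际执行时直接运行 ssh root@$n \"mkdir -p /opt/kubernetes/{bin,cfg,ssl,logs}\"\n  CMDS=\"$CMDS ssh root@$n mkdir -p /opt/kubernetes/{bin,cfg,ssl,logs};\"\ndone\necho \"$CMDS\"\n```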
\n\n从master节点拷贝：\nscp -r /root/kubernetes/server/bin/ root@192.168.0.5:/root/kubernetes/server/bin\ncd kubernetes/server/bin\ncp kubelet kube-proxy /opt/kubernetes/bin   # 本地拷贝\n\n#### 5.2 部署kubelet\n\n##### 5.2.1 创建配置文件\n```\ncat > /opt/kubernetes/cfg/kubelet.conf << EOF\nKUBELET_OPTS=\"--logtostderr=false \\\\\n--v=4 \\\\\n--log-dir=/opt/kubernetes/logs \\\\\n--hostname-override=k8s-master \\\\\n--network-plugin=cni \\\\\n--kubeconfig=/opt/kubernetes/cfg/kubelet.kubeconfig \\\\\n--bootstrap-kubeconfig=/opt/kubernetes/cfg/bootstrap.kubeconfig \\\\\n--config=/opt/kubernetes/cfg/kubelet-config.yml \\\\\n--cert-dir=/opt/kubernetes/ssl \\\\\n--pod-infra-container-image=lizhenliang/pause-amd64:3.0\"\nEOF\n```\n\n--hostname-override：显示名称，集群中唯一\n\n--network-plugin：启用CNI\n\n--kubeconfig：空路径，会自动生成，后面用于连接apiserver\n\n--bootstrap-kubeconfig：首次启动向apiserver申请证书\n\n--config：配置参数文件\n\n--cert-dir：kubelet证书生成目录\n\n--pod-infra-container-image：管理Pod网络容器的镜像\n\n\n##### 5.2.2 配置参数文件\n```\ncat > /opt/kubernetes/cfg/kubelet-config.yml << EOF\nkind: KubeletConfiguration\napiVersion: kubelet.config.k8s.io/v1beta1\naddress: 0.0.0.0\nport: 10250\nreadOnlyPort: 10255\ncgroupDriver: cgroupfs\nclusterDNS:\n- 10.0.0.2\nclusterDomain: cluster.local \nfailSwapOn: false\nauthentication:\n  anonymous:\n    enabled: false\n  webhook:\n    cacheTTL: 2m0s\n    enabled: true\n  x509:\n    clientCAFile: /opt/kubernetes/ssl/ca.pem \nauthorization:\n  mode: Webhook\n  webhook:\n    cacheAuthorizedTTL: 5m0s\n    cacheUnauthorizedTTL: 30s\nevictionHard:\n  imagefs.available: 15%\n  memory.available: 100Mi\n  nodefs.available: 10%\n  nodefs.inodesFree: 5%\nmaxOpenFiles: 1000000\nmaxPods: 110\nEOF\n```\n\n##### 5.2.3 生成bootstrap.kubeconfig文件\n```\nKUBE_APISERVER=\"https://192.168.0.4:6443\" # apiserver IP:PORT\nTOKEN=\"c47ffb939f5ca36231d9e3121a252940\" # 与token.csv里保持一致\ncd /opt/kubernetes/cfg/\n\n\n# 生成 kubelet bootstrap kubeconfig 配置文件\nkubectl config set-cluster kubernetes 
--certificate-authority=/opt/kubernetes/ssl/ca.pem --embed-certs=true --server=${KUBE_APISERVER} --kubeconfig=bootstrap.kubeconfig\n\nkubectl config set-credentials \"kubelet-bootstrap\" --token=${TOKEN}  --kubeconfig=bootstrap.kubeconfig\nkubectl config set-context default --cluster=kubernetes --user=\"kubelet-bootstrap\" --kubeconfig=bootstrap.kubeconfig\nkubectl config use-context default --kubeconfig=bootstrap.kubeconfig\n```\n\n##### 5.2.4 systemd管理kubelet\n```\ncat > /usr/lib/systemd/system/kubelet.service << EOF\n[Unit]\nDescription=Kubernetes Kubelet\nAfter=docker.service\n\n[Service]\nEnvironmentFile=/opt/kubernetes/cfg/kubelet.conf\nExecStart=/opt/kubernetes/bin/kubelet \\$KUBELET_OPTS\nRestart=on-failure\nLimitNOFILE=65536\n\n[Install]\nWantedBy=multi-user.target\nEOF\n```\n\n启动并设置开机启动\nsystemctl daemon-reload\n\nsystemctl start kubelet\n\nsystemctl enable kubelet\n\n##### 5.2.5 批准kubelet证书申请并加入集群\n查看kubelet证书请求\n```\nroot@k8s-master:/opt/kubernetes/cfg# kubectl get csr\nNAME                                                   AGE   REQUESTOR           CONDITION\nnode-csr-uYm2cSUxv0HWPXQ4JNj5bYPaR_B2rLbkCM257un0iV4   41s   kubelet-bootstrap   Pending\n```\n\n批准申请\n```\nkubectl certificate approve node-csr-uYm2cSUxv0HWPXQ4JNj5bYPaR_B2rLbkCM257un0iV4\n```\n\n查看节点\n```\nroot@k8s-master:/opt/kubernetes/cfg# kubectl get node\nNAME         STATUS     ROLES    AGE   VERSION\nk8s-master   NotReady   <none>   4s    v1.17.3\n```\n\n注：由于网络插件还没有部署，节点会没有准备就绪 NotReady\n\n\n#### 5.3 部署kube-proxy\n##### 5.3.1 创建配置文件\n```\ncat > /opt/kubernetes/cfg/kube-proxy.conf << EOF\nKUBE_PROXY_OPTS=\"--logtostderr=false \\\\\n--v=2 \\\\\n--log-dir=/opt/kubernetes/logs \\\\\n--config=/opt/kubernetes/cfg/kube-proxy-config.yml\"\nEOF\n```\n\n##### 5.3.2 配置参数文件\n```\ncat > /opt/kubernetes/cfg/kube-proxy-config.yml << EOF\nkind: KubeProxyConfiguration\napiVersion: kubeproxy.config.k8s.io/v1alpha1\nbindAddress: 0.0.0.0\nmetricsBindAddress: 0.0.0.0:10249\nclientConnection:\n  kubeconfig: 
/opt/kubernetes/cfg/kube-proxy.kubeconfig\nhostnameOverride: k8s-master\nclusterCIDR: 10.244.0.0/16\nEOF\n```\n\n注：clusterCIDR 应填写Pod网络地址段（与 kube-controller-manager 的 --cluster-cidr 保持一致），不要误写成Service地址段。\n\n##### 5.3.3 生成kube-proxy.kubeconfig文件\n生成kube-proxy证书：\n\n切换到工作目录\ncd ~/TLS/k8s\n\n(1) 创建证书请求文件\n```\ncat > kube-proxy-csr.json << EOF\n{\n  \"CN\": \"system:kube-proxy\",\n  \"hosts\": [],\n  \"key\": {\n    \"algo\": \"rsa\",\n    \"size\": 2048\n  },\n  \"names\": [\n    {\n      \"C\": \"CN\",\n      \"L\": \"BeiJing\",\n      \"ST\": \"BeiJing\",\n      \"O\": \"k8s\",\n      \"OU\": \"System\"\n    }\n  ]\n}\nEOF\n```\n\n(2) 生成证书\n```\ncfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes kube-proxy-csr.json | cfssljson -bare kube-proxy\nls kube-proxy*pem\nkube-proxy-key.pem  kube-proxy.pem\n```\n将证书拷贝到/opt/kubernetes/ssl/ 目录：  cp kube-proxy-key.pem kube-proxy.pem /opt/kubernetes/ssl/\n\n(3) 生成kubeconfig文件：\n\n```\ncd /opt/kubernetes/cfg/\nKUBE_APISERVER=\"https://192.168.0.4:6443\"\n\nkubectl config set-cluster kubernetes --certificate-authority=/opt/kubernetes/ssl/ca.pem --embed-certs=true --server=${KUBE_APISERVER}  --kubeconfig=kube-proxy.kubeconfig\nkubectl config set-credentials kube-proxy --client-certificate=/opt/kubernetes/ssl/kube-proxy.pem --client-key=/opt/kubernetes/ssl/kube-proxy-key.pem --embed-certs=true --kubeconfig=kube-proxy.kubeconfig\nkubectl config set-context default --cluster=kubernetes --user=kube-proxy --kubeconfig=kube-proxy.kubeconfig\nkubectl config use-context default --kubeconfig=kube-proxy.kubeconfig\n```\n\n##### 5.3.4 
systemd管理kube-proxy\n```\ncat > /usr/lib/systemd/system/kube-proxy.service << EOF\n[Unit]\nDescription=Kubernetes Proxy\nAfter=network.target\n\n[Service]\nEnvironmentFile=/opt/kubernetes/cfg/kube-proxy.conf\nExecStart=/opt/kubernetes/bin/kube-proxy \\$KUBE_PROXY_OPTS\nRestart=on-failure\nLimitNOFILE=65536\n\n[Install]\nWantedBy=multi-user.target\nEOF\n```\n\n启动并设置开机启动\nsystemctl daemon-reload\nsystemctl start kube-proxy\nsystemctl enable kube-proxy\n\n\n#### 5.4 部署网络环境\n\n先准备好CNI二进制文件：\n\n下载地址：https://github.com/containernetworking/plugins/releases/download/v0.8.6/cni-plugins-linux-amd64-v0.8.6.tgz\n\n解压二进制包并移动到默认工作目录：\n\nmkdir -p /opt/cni/bin\ntar zxvf cni-plugins-linux-amd64-v0.8.6.tgz -C /opt/cni/bin\n\n部署CNI网络。默认镜像地址无法访问，先修改为docker hub镜像仓库，再apply：\n```\nwget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml\nsed -i -r \"s#quay.io/coreos/flannel:.*-amd64#lizhenliang/flannel:v0.12.0-amd64#g\" kube-flannel.yml\nkubectl apply -f kube-flannel.yml\n```\n```\nroot@k8s-master:~# kubectl get pod -n kube-system\nNAME                    READY   STATUS    RESTARTS   AGE\nkube-flannel-ds-mwmmn   1/1     Running   0          72s\nroot@k8s-master:~# \nroot@k8s-master:~# \nroot@k8s-master:~# kubectl get node\nNAME         STATUS   ROLES    AGE   VERSION\nk8s-master   Ready    <none>   23m   v1.17.3\n```\n部署好网络插件，Node准备就绪。\n\n\n#### 5.5 授权apiserver访问kubelet\n\n如果没有这个授权，kubectl exec -it <pod> 会报错\n\n```\ncat > apiserver-to-kubelet-rbac.yaml << EOF\napiVersion: rbac.authorization.k8s.io/v1\nkind: ClusterRole\nmetadata:\n  annotations:\n    rbac.authorization.kubernetes.io/autoupdate: \"true\"\n  labels:\n    kubernetes.io/bootstrapping: rbac-defaults\n  name: system:kube-apiserver-to-kubelet\nrules:\n  - apiGroups:\n      - \"\"\n    resources:\n      - nodes/proxy\n      - nodes/stats\n      - nodes/log\n      - nodes/spec\n      - nodes/metrics\n      - pods/log\n    verbs:\n      - \"*\"\n---\napiVersion: rbac.authorization.k8s.io/v1\nkind: ClusterRoleBinding\nmetadata:\n  name: 
system:kube-apiserver\n  namespace: \"\"\nroleRef:\n  apiGroup: rbac.authorization.k8s.io\n  kind: ClusterRole\n  name: system:kube-apiserver-to-kubelet\nsubjects:\n  - apiGroup: rbac.authorization.k8s.io\n    kind: User\n    name: kubernetes\nEOF\n\nkubectl apply -f apiserver-to-kubelet-rbac.yaml\n```\n\n### 6 新增加Node\n\n##### 6.1. 拷贝已部署好的Node相关文件到新节点\n在Master节点将Worker Node涉及文件拷贝到新节点\n\nscp -r /opt/kubernetes root@192.168.0.5:/opt/\nscp -r /usr/lib/systemd/system/{kubelet,kube-proxy}.service root@192.168.0.5:/usr/lib/systemd/system\nscp -r /opt/cni/ root@192.168.0.5:/opt/\nscp /opt/kubernetes/ssl/ca.pem root@192.168.0.5:/opt/kubernetes/ssl\n\n\n##### 6.2 删除kubelet证书和kubeconfig文件\n```\nrm /opt/kubernetes/cfg/kubelet.kubeconfig \nrm -f /opt/kubernetes/ssl/kubelet*\n```\n注：这几个文件是证书申请审批后自动生成的，每个Node不同，必须删除重新生成。\n\n\n\n##### 6.3. 修改主机名\n```\nvi /opt/kubernetes/cfg/kubelet.conf\n--hostname-override=k8s-node1\n\nvi /opt/kubernetes/cfg/kube-proxy-config.yml\nhostnameOverride: k8s-node1\n```\n\n##### 6.4. 启动并设置开机启动\nsystemctl daemon-reload\nsystemctl start kubelet\nsystemctl enable kubelet\nsystemctl start kube-proxy\nsystemctl enable kube-proxy\n\n\n\n##### 6.5. 在Master上批准新Node kubelet证书申请\n```\nroot@k8s-master:~# kubectl get csr\nNAME                                                   AGE   REQUESTOR           CONDITION\nnode-csr-hqhgEI8ez2hjy5Cm0nJ_OeP2s7pPow99b3c8PUDnmIE   32s   kubelet-bootstrap   Pending\nnode-csr-uYm2cSUxv0HWPXQ4JNj5bYPaR_B2rLbkCM257un0iV4   73m   kubelet-bootstrap   Approved,Issued\nroot@k8s-master:~# \nroot@k8s-master:~# kubectl certificate approve node-csr-hqhgEI8ez2hjy5Cm0nJ_OeP2s7pPow99b3c8PUDnmIE\ncertificatesigningrequest.certificates.k8s.io/node-csr-hqhgEI8ez2hjy5Cm0nJ_OeP2s7pPow99b3c8PUDnmIE approved\n```\n\n##### 6.6. 
查看Node状态\n```\nroot@k8s-master:~# kubectl get node\nNAME         STATUS   ROLES    AGE   VERSION\nk8s-master   Ready    <none>   73m   v1.17.3\nk8s-node     Ready    <none>   55s   v1.17.3\n```\n\n正常创建pod测试\n```\nroot@k8s-master:~# kubectl get pod -o wide\nNAME    READY   STATUS    RESTARTS   AGE    IP           NODE       NOMINATED NODE   READINESS GATES\nnginx   1/1     Running   0          114s   10.244.1.2   k8s-node   <none>           <none>\n```\n\n### 7.可能遇到的坑 \nhttps://blog.csdn.net/zhuzhuxiazst/article/details/103887137"
  },
  {
    "path": "k8s/install-k8s-from source code/2.window配置goland环境阅读kubernetes源码.md",
    "content": "Table of Contents\n=================\n\n* [1. 代码下载](#1-代码下载)\n\n### 1. 代码下载\n\n（1）以管理员身份运行 git\n\n（2）使用 -c core.symlinks=true 克隆，以正确保留仓库中的符号链接\n```\ngit clone -c core.symlinks=true https://github.com/kubernetes/kubernetes.git -b v1.17.4\n```\n\n（3）goland 可以使用 eval reset 插件，每次打开时激活30天的免费使用，从而达到白嫖\n\nhttps://blog.csdn.net/qq_37699336/article/details/116528062\n\n（4）goland 配置如下\n\n\nkubernetes 源码要放到 GOPATH/src 目录下\n\n![windows-read-sourcecode](../images/windows-read-sourcecode.png)\n\n这样配置后，代码就不会再变红，跳转也不会到处乱跳了\n\n\n参考链接：https://zhuanlan.zhihu.com/p/52056165"
  },
  {
    "path": "k8s/kcm/0-kcm启动流程.md",
    "content": "Table of Contents\n=================\n\n* [1. 定义-main](#1-定义-main)\n     * [1.1 NewKubeControllerManagerOptions](#11-newkubecontrollermanageroptions)\n     * [1.2 s.config  实例化一个kubecontrollerconfig.Config](#12-sconfig--实例化一个kubecontrollerconfigconfig)\n        * [1.2.1 s.applyTo](#121-sapplyto)\n        * [1.2.2 结构体定义](#122-结构体定义)\n     * [1.3 Run](#13-run)\n     * [1.4 run函数](#14-run函数)\n     * [1.5 StartControllers](#15-startcontrollers)\n     * [1.6 总结](#16-总结)\n        * [1.6.1 整体流程](#161-整体流程)\n        * [1.6.2 一些思考](#162-一些思考)\n  * [2. 附录](#2-附录)\n     * [2.1 cobra实践](#21-cobra实践)\n     * [2.2 k8s中的选举机制](#22-k8s中的选举机制)\n\n### 1. 定义-main\n\ncmd\\kube-controller-manager\\controller-manager.go\n\n```\nfunc main() {\n\trand.Seed(time.Now().UTC().UnixNano())\n\n\tcommand := app.NewControllerManagerCommand()\n\n\t// TODO: once we switch everything over to Cobra commands, we can go back to calling\n\t// utilflag.InitFlags() (by removing its pflag.Parse() call). For now, we have to set the\n\t// normalize func and add the go flag set by hand.\n\tpflag.CommandLine.SetNormalizeFunc(utilflag.WordSepNormalizeFunc)\n\tpflag.CommandLine.AddGoFlagSet(goflag.CommandLine)\n\t// utilflag.InitFlags()\n\tlogs.InitLogs()\n\tdefer logs.FlushLogs()\n\n\tif err := command.Execute(); err != nil {\n\t\tfmt.Fprintf(os.Stderr, \"%v\\n\", err)\n\t\tos.Exit(1)\n\t}\n}\n```\n\n<br>\n\n```go\n// NewControllerManagerCommand creates a *cobra.Command object with default parameters\nfunc NewControllerManagerCommand() *cobra.Command {\n    // 1.初始化config配置。包括每个controller的配置，例如hpacontroller的 HorizontalPodAutoscalerSyncPeriod\n    // 详见 cmd\\kube-controller-manager\\app\\options\\options.go\n\ts, err := options.NewKubeControllerManagerOptions()\n\tif err != nil {\n\t\tglog.Fatalf(\"unable to initialize command options: %v\", err)\n\t}\n\n\tcmd := &cobra.Command{\n\t\tUse: \"kube-controller-manager\",\n\t\tLong: `The Kubernetes controller manager is a daemon that embeds\nthe core 
control loops shipped with Kubernetes. In applications of robotics and\nautomation, a control loop is a non-terminating loop that regulates the state of\nthe system. In Kubernetes, a controller is a control loop that watches the shared\nstate of the cluster through the apiserver and makes changes attempting to move the\ncurrent state towards the desired state. Examples of controllers that ship with\nKubernetes today are the replication controller, endpoints controller, namespace\ncontroller, and serviceaccounts controller.`,\n\t\tRun: func(cmd *cobra.Command, args []string) {\n\t\t    // 打印一些信息\n\t\t\tverflag.PrintAndExitIfRequested()\n\t\t\tutilflag.PrintFlags(cmd.Flags())\n           // 2. 实例化一个kubecontrollerconfig.Config\n\t\t\tc, err := s.Config(KnownControllers(), ControllersDisabledByDefault.List())\n\t\t\tif err != nil {\n\t\t\t\tfmt.Fprintf(os.Stderr, \"%v\\n\", err)\n\t\t\t\tos.Exit(1)\n\t\t\t}\n            // 最关键的Run,这里是 neverStop\n\t\t\tif err := Run(c.Complete(), wait.NeverStop); err != nil {\n\t\t\t\tfmt.Fprintf(os.Stderr, \"%v\\n\", err)\n\t\t\t\tos.Exit(1)\n\t\t\t}\n\t\t},\n\t}\n   \n\tfs := cmd.Flags()\n     // 定义cobra的flags，这里就是定义参数的名称，默认值啥的。例如 --url --port等\n\tnamedFlagSets := s.Flags(KnownControllers(), ControllersDisabledByDefault.List())\n\tfor _, f := range namedFlagSets.FlagSets {\n\t\tfs.AddFlagSet(f)\n\t}\n\t\n\t//4.设置 help, usage函数\n\tusageFmt := \"Usage:\\n  %s\\n\"\n\tcols, _, _ := apiserverflag.TerminalSize(cmd.OutOrStdout())\n\tcmd.SetUsageFunc(func(cmd *cobra.Command) error {\n\t\tfmt.Fprintf(cmd.OutOrStderr(), usageFmt, cmd.UseLine())\n\t\tapiserverflag.PrintSections(cmd.OutOrStderr(), namedFlagSets, cols)\n\t\treturn nil\n\t})\n\tcmd.SetHelpFunc(func(cmd *cobra.Command, args []string) {\n\t\tfmt.Fprintf(cmd.OutOrStdout(), \"%s\\n\\n\"+usageFmt, cmd.Long, cmd.UseLine())\n\t\tapiserverflag.PrintSections(cmd.OutOrStdout(), namedFlagSets, cols)\n\t})\n\n\treturn cmd\n}\n```\n\n<br>\n\n这个就是 s.flags\n\nnamedFlagSets := 
s.Flags(KnownControllers(), ControllersDisabledByDefault.List())\n\n```\n// Flags returns flags for a specific APIServer by section name\n// 依次调用其他controller-manager的flags。\nfunc (s *KubeControllerManagerOptions) Flags(allControllers []string, disabledByDefaultControllers []string) apiserverflag.NamedFlagSets {\n\tfss := apiserverflag.NamedFlagSets{}\n\ts.Generic.AddFlags(&fss, allControllers, disabledByDefaultControllers)\n\ts.KubeCloudShared.AddFlags(fss.FlagSet(\"generic\"))\n\ts.ServiceController.AddFlags(fss.FlagSet(\"service controller\"))\n\n\ts.SecureServing.AddFlags(fss.FlagSet(\"secure serving\"))\n\ts.InsecureServing.AddUnqualifiedFlags(fss.FlagSet(\"insecure serving\"))\n\ts.Authentication.AddFlags(fss.FlagSet(\"authentication\"))\n\ts.Authorization.AddFlags(fss.FlagSet(\"authorization\"))\n\n\ts.AttachDetachController.AddFlags(fss.FlagSet(\"attachdetach controller\"))\n\ts.CSRSigningController.AddFlags(fss.FlagSet(\"csrsigning controller\"))\n\ts.DeploymentController.AddFlags(fss.FlagSet(\"deployment controller\"))\n\ts.DaemonSetController.AddFlags(fss.FlagSet(\"daemonset controller\"))\n\ts.DeprecatedFlags.AddFlags(fss.FlagSet(\"deprecated\"))\n\ts.EndpointController.AddFlags(fss.FlagSet(\"endpoint controller\"))\n\ts.GarbageCollectorController.AddFlags(fss.FlagSet(\"garbagecollector controller\"))\n\ts.HPAController.AddFlags(fss.FlagSet(\"horizontalpodautoscaling controller\"))\n\ts.JobController.AddFlags(fss.FlagSet(\"job controller\"))\n\ts.NamespaceController.AddFlags(fss.FlagSet(\"namespace controller\"))\n\ts.NodeIPAMController.AddFlags(fss.FlagSet(\"nodeipam controller\"))\n\ts.NodeLifecycleController.AddFlags(fss.FlagSet(\"nodelifecycle controller\"))\n\ts.PersistentVolumeBinderController.AddFlags(fss.FlagSet(\"persistentvolume-binder controller\"))\n\ts.PodGCController.AddFlags(fss.FlagSet(\"podgc controller\"))\n\ts.ReplicaSetController.AddFlags(fss.FlagSet(\"replicaset 
controller\"))\n\ts.ReplicationController.AddFlags(fss.FlagSet(\"replicationcontroller\"))\n\ts.ResourceQuotaController.AddFlags(fss.FlagSet(\"resourcequota controller\"))\n\ts.SAController.AddFlags(fss.FlagSet(\"serviceaccount controller\"))\n\ts.TTLAfterFinishedController.AddFlags(fss.FlagSet(\"ttl-after-finished controller\"))\n\n\tfs := fss.FlagSet(\"misc\")\n\tfs.StringVar(&s.Master, \"master\", s.Master, \"The address of the Kubernetes API server (overrides any value in kubeconfig).\")\n\tfs.StringVar(&s.Kubeconfig, \"kubeconfig\", s.Kubeconfig, \"Path to kubeconfig file with authorization and master location information.\")\n\tvar dummy string\n\tfs.MarkDeprecated(\"insecure-experimental-approve-all-kubelet-csrs-for-group\", \"This flag does nothing.\")\n\tfs.StringVar(&dummy, \"insecure-experimental-approve-all-kubelet-csrs-for-group\", \"\", \"This flag does nothing.\")\n\tutilfeature.DefaultFeatureGate.AddFlag(fss.FlagSet(\"generic\"))\n\n\treturn fss\n}\n```\n\n\n\n```\n// AddFlags adds flags related to DeploymentController for controller manager to the specified FlagSet.\nfunc (o *DeploymentControllerOptions) AddFlags(fs *pflag.FlagSet) {\n\tif o == nil {\n\t\treturn\n\t}\n\n\tfs.Int32Var(&o.ConcurrentDeploymentSyncs, \"concurrent-deployment-syncs\", o.ConcurrentDeploymentSyncs, \"The number of deployment objects that are allowed to sync concurrently. 
Larger number = more responsive deployments, but more CPU (and network) load\")\n\tfs.DurationVar(&o.DeploymentControllerSyncPeriod.Duration, \"deployment-controller-sync-period\", o.DeploymentControllerSyncPeriod.Duration, \"Period for syncing the deployments.\")\n}\n```\n\n比如，以DeploymentControllerOptions.AddFlags为例，这里就是定义了concurrent-deployment-syncs，deployment-controller-sync-period这两个参数，并且赋了默认值。\n\n参考附录，可以加深理解。\n\n<br>\n\n#### 1.1 NewKubeControllerManagerOptions\n\n看起来这里是通过获取默认的参数配置，然后赋值给 KubeControllerManagerOptions\n\n```\n// NewKubeControllerManagerOptions creates a new KubeControllerManagerOptions with a default config.\nfunc NewKubeControllerManagerOptions() (*KubeControllerManagerOptions, error) {\n   componentConfig, err := NewDefaultComponentConfig(ports.InsecureKubeControllerManagerPort)\n   if err != nil {\n      return nil, err\n   }\n\n   s := KubeControllerManagerOptions{\n      Generic:         cmoptions.NewGenericControllerManagerConfigurationOptions(componentConfig.Generic),\n      KubeCloudShared: cmoptions.NewKubeCloudSharedOptions(componentConfig.KubeCloudShared),\n      AttachDetachController: &AttachDetachControllerOptions{\n         ReconcilerSyncLoopPeriod: componentConfig.AttachDetachController.ReconcilerSyncLoopPeriod,\n      },\n      CSRSigningController: &CSRSigningControllerOptions{\n         ClusterSigningCertFile: componentConfig.CSRSigningController.ClusterSigningCertFile,\n         ClusterSigningKeyFile:  componentConfig.CSRSigningController.ClusterSigningKeyFile,\n         ClusterSigningDuration: componentConfig.CSRSigningController.ClusterSigningDuration,\n      },\n      DaemonSetController: &DaemonSetControllerOptions{\n         ConcurrentDaemonSetSyncs: componentConfig.DaemonSetController.ConcurrentDaemonSetSyncs,\n      },\n      DeploymentController: &DeploymentControllerOptions{\n         ConcurrentDeploymentSyncs:      componentConfig.DeploymentController.ConcurrentDeploymentSyncs,\n         
DeploymentControllerSyncPeriod: componentConfig.DeploymentController.DeploymentControllerSyncPeriod,\n      },\n      DeprecatedFlags: &DeprecatedControllerOptions{\n         RegisterRetryCount: componentConfig.DeprecatedController.RegisterRetryCount,\n      },\n      EndpointController: &EndpointControllerOptions{\n         ConcurrentEndpointSyncs: componentConfig.EndpointController.ConcurrentEndpointSyncs,\n      },\n      GarbageCollectorController: &GarbageCollectorControllerOptions{\n         ConcurrentGCSyncs:      componentConfig.GarbageCollectorController.ConcurrentGCSyncs,\n         EnableGarbageCollector: componentConfig.GarbageCollectorController.EnableGarbageCollector,\n      },\n      HPAController: &HPAControllerOptions{\n         HorizontalPodAutoscalerSyncPeriod:                   componentConfig.HPAController.HorizontalPodAutoscalerSyncPeriod,\n         HorizontalPodAutoscalerUpscaleForbiddenWindow:       componentConfig.HPAController.HorizontalPodAutoscalerUpscaleForbiddenWindow,\n         HorizontalPodAutoscalerDownscaleForbiddenWindow:     componentConfig.HPAController.HorizontalPodAutoscalerDownscaleForbiddenWindow,\n         HorizontalPodAutoscalerDownscaleStabilizationWindow: componentConfig.HPAController.HorizontalPodAutoscalerDownscaleStabilizationWindow,\n         HorizontalPodAutoscalerCPUInitializationPeriod:      componentConfig.HPAController.HorizontalPodAutoscalerCPUInitializationPeriod,\n         HorizontalPodAutoscalerInitialReadinessDelay:        componentConfig.HPAController.HorizontalPodAutoscalerInitialReadinessDelay,\n         HorizontalPodAutoscalerTolerance:                    componentConfig.HPAController.HorizontalPodAutoscalerTolerance,\n         HorizontalPodAutoscalerUseRESTClients:               componentConfig.HPAController.HorizontalPodAutoscalerUseRESTClients,\n      },\n      JobController: &JobControllerOptions{\n         ConcurrentJobSyncs: componentConfig.JobController.ConcurrentJobSyncs,\n      },\n      
NamespaceController: &NamespaceControllerOptions{\n         NamespaceSyncPeriod:      componentConfig.NamespaceController.NamespaceSyncPeriod,\n         ConcurrentNamespaceSyncs: componentConfig.NamespaceController.ConcurrentNamespaceSyncs,\n      },\n      NodeIPAMController: &NodeIPAMControllerOptions{\n         NodeCIDRMaskSize: componentConfig.NodeIPAMController.NodeCIDRMaskSize,\n      },\n      NodeLifecycleController: &NodeLifecycleControllerOptions{\n         EnableTaintManager:     componentConfig.NodeLifecycleController.EnableTaintManager,\n         NodeMonitorGracePeriod: componentConfig.NodeLifecycleController.NodeMonitorGracePeriod,\n         NodeStartupGracePeriod: componentConfig.NodeLifecycleController.NodeStartupGracePeriod,\n         PodEvictionTimeout:     componentConfig.NodeLifecycleController.PodEvictionTimeout,\n      },\n      PersistentVolumeBinderController: &PersistentVolumeBinderControllerOptions{\n         PVClaimBinderSyncPeriod: componentConfig.PersistentVolumeBinderController.PVClaimBinderSyncPeriod,\n         VolumeConfiguration:     componentConfig.PersistentVolumeBinderController.VolumeConfiguration,\n      },\n      PodGCController: &PodGCControllerOptions{\n         TerminatedPodGCThreshold: componentConfig.PodGCController.TerminatedPodGCThreshold,\n      },\n      ReplicaSetController: &ReplicaSetControllerOptions{\n         ConcurrentRSSyncs: componentConfig.ReplicaSetController.ConcurrentRSSyncs,\n      },\n      ReplicationController: &ReplicationControllerOptions{\n         ConcurrentRCSyncs: componentConfig.ReplicationController.ConcurrentRCSyncs,\n      },\n      ResourceQuotaController: &ResourceQuotaControllerOptions{\n         ResourceQuotaSyncPeriod:      componentConfig.ResourceQuotaController.ResourceQuotaSyncPeriod,\n         ConcurrentResourceQuotaSyncs: componentConfig.ResourceQuotaController.ConcurrentResourceQuotaSyncs,\n      },\n      SAController: &SAControllerOptions{\n         ConcurrentSATokenSyncs: 
componentConfig.SAController.ConcurrentSATokenSyncs,\n      },\n      ServiceController: &cmoptions.ServiceControllerOptions{\n         ConcurrentServiceSyncs: componentConfig.ServiceController.ConcurrentServiceSyncs,\n      },\n      TTLAfterFinishedController: &TTLAfterFinishedControllerOptions{\n         ConcurrentTTLSyncs: componentConfig.TTLAfterFinishedController.ConcurrentTTLSyncs,\n      },\n      SecureServing: apiserveroptions.NewSecureServingOptions().WithLoopback(),\n      InsecureServing: (&apiserveroptions.DeprecatedInsecureServingOptions{\n         BindAddress: net.ParseIP(componentConfig.Generic.Address),\n         BindPort:    int(componentConfig.Generic.Port),\n         BindNetwork: \"tcp\",\n      }).WithLoopback(),\n      Authentication: apiserveroptions.NewDelegatingAuthenticationOptions(),\n      Authorization:  apiserveroptions.NewDelegatingAuthorizationOptions(),\n   }\n\n   s.Authentication.RemoteKubeConfigFileOptional = true\n   s.Authorization.RemoteKubeConfigFileOptional = true\n   s.Authorization.AlwaysAllowPaths = []string{\"/healthz\"}\n\n   s.SecureServing.ServerCert.CertDirectory = \"/var/run/kubernetes\"\n   s.SecureServing.ServerCert.PairName = \"kube-controller-manager\"\n   s.SecureServing.BindPort = ports.KubeControllerManagerPort\n\n   gcIgnoredResources := make([]kubectrlmgrconfig.GroupResource, 0, len(garbagecollector.DefaultIgnoredResources()))\n   for r := range garbagecollector.DefaultIgnoredResources() {\n      gcIgnoredResources = append(gcIgnoredResources, kubectrlmgrconfig.GroupResource{Group: r.Group, Resource: r.Resource})\n   }\n\n   s.GarbageCollectorController.GCIgnoredResources = gcIgnoredResources\n\n   return &s, nil\n}\n```\n\n\n\n可以看出来，接下来的关键就是：\n\n（1）Config函数\n\n（2）Run函数\n\n<br>\n\n#### 1.2 s.Config 实例化一个kubecontrollerconfig.Config\n\n这个函数的参数是：allControllers []string, disabledByDefaultControllers []string\n\n核心就是构造一个kubecontrollerconfig.Config\n\n```\n// Config return a controller manager config objective\nfunc 
(s KubeControllerManagerOptions) Config(allControllers []string, disabledByDefaultControllers []string) (*kubecontrollerconfig.Config, error) {\n\tif err := s.Validate(allControllers, disabledByDefaultControllers); err != nil {\n\t\treturn nil, err\n\t}\n\n\tif err := s.SecureServing.MaybeDefaultWithSelfSignedCerts(\"localhost\", nil, []net.IP{net.ParseIP(\"127.0.0.1\")}); err != nil {\n\t\treturn nil, fmt.Errorf(\"error creating self-signed certificates: %v\", err)\n\t}\n\n\tkubeconfig, err := clientcmd.BuildConfigFromFlags(s.Master, s.Kubeconfig)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tkubeconfig.ContentConfig.ContentType = s.Generic.ClientConnection.ContentType\n\tkubeconfig.QPS = s.Generic.ClientConnection.QPS\n\tkubeconfig.Burst = int(s.Generic.ClientConnection.Burst)\n\n\tclient, err := clientset.NewForConfig(restclient.AddUserAgent(kubeconfig, KubeControllerManagerUserAgent))\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\t// shallow copy, do not modify the kubeconfig.Timeout.\n\tconfig := *kubeconfig\n\tconfig.Timeout = s.Generic.LeaderElection.RenewDeadline.Duration\n\tleaderElectionClient := clientset.NewForConfigOrDie(restclient.AddUserAgent(&config, \"leader-election\"))\n\n\teventRecorder := createRecorder(client, KubeControllerManagerUserAgent)\n    \n    // 核心就是定义好这样一个结构体\n\tc := &kubecontrollerconfig.Config{\n\t\tClient:               client,                  //用于api-server通信\n\t\tKubeconfig:           kubeconfig,              //kube-config\n\t\tEventRecorder:        eventRecorder,           //event上报\n\t\tLeaderElectionClient: leaderElectionClient,    //选举的客户端\n\t}\n\tif err := s.ApplyTo(c); err != nil {\n\t\treturn nil, err\n\t}\n\n\treturn c, nil\n}\n```\n\n<br>\n\n##### 1.2.1 s.applyTo\n\n```\n// ApplyTo fills up controller manager config with options.\nfunc (s *KubeControllerManagerOptions) ApplyTo(c *kubecontrollerconfig.Config) error {\n   if err := s.Generic.ApplyTo(&c.ComponentConfig.Generic); err != nil {\n      return err\n   
}\n   if err := s.KubeCloudShared.ApplyTo(&c.ComponentConfig.KubeCloudShared); err != nil {\n      return err\n   }\n   if err := s.AttachDetachController.ApplyTo(&c.ComponentConfig.AttachDetachController); err != nil {\n      return err\n   }\n   if err := s.CSRSigningController.ApplyTo(&c.ComponentConfig.CSRSigningController); err != nil {\n      return err\n   }\n   if err := s.DaemonSetController.ApplyTo(&c.ComponentConfig.DaemonSetController); err != nil {\n      return err\n   }\n   if err := s.DeploymentController.ApplyTo(&c.ComponentConfig.DeploymentController); err != nil {\n      return err\n   }\n   if err := s.DeprecatedFlags.ApplyTo(&c.ComponentConfig.DeprecatedController); err != nil {\n      return err\n   }\n   if err := s.EndpointController.ApplyTo(&c.ComponentConfig.EndpointController); err != nil {\n      return err\n   }\n   if err := s.GarbageCollectorController.ApplyTo(&c.ComponentConfig.GarbageCollectorController); err != nil {\n      return err\n   }\n   if err := s.HPAController.ApplyTo(&c.ComponentConfig.HPAController); err != nil {\n      return err\n   }\n   if err := s.JobController.ApplyTo(&c.ComponentConfig.JobController); err != nil {\n      return err\n   }\n   if err := s.NamespaceController.ApplyTo(&c.ComponentConfig.NamespaceController); err != nil {\n      return err\n   }\n   if err := s.NodeIPAMController.ApplyTo(&c.ComponentConfig.NodeIPAMController); err != nil {\n      return err\n   }\n   if err := s.NodeLifecycleController.ApplyTo(&c.ComponentConfig.NodeLifecycleController); err != nil {\n      return err\n   }\n   if err := s.PersistentVolumeBinderController.ApplyTo(&c.ComponentConfig.PersistentVolumeBinderController); err != nil {\n      return err\n   }\n   if err := s.PodGCController.ApplyTo(&c.ComponentConfig.PodGCController); err != nil {\n      return err\n   }\n   if err := s.ReplicaSetController.ApplyTo(&c.ComponentConfig.ReplicaSetController); err != nil {\n      return err\n   }\n   if err := 
s.ReplicationController.ApplyTo(&c.ComponentConfig.ReplicationController); err != nil {\n      return err\n   }\n   if err := s.ResourceQuotaController.ApplyTo(&c.ComponentConfig.ResourceQuotaController); err != nil {\n      return err\n   }\n   if err := s.SAController.ApplyTo(&c.ComponentConfig.SAController); err != nil {\n      return err\n   }\n   if err := s.ServiceController.ApplyTo(&c.ComponentConfig.ServiceController); err != nil {\n      return err\n   }\n   if err := s.TTLAfterFinishedController.ApplyTo(&c.ComponentConfig.TTLAfterFinishedController); err != nil {\n      return err\n   }\n   if err := s.InsecureServing.ApplyTo(&c.InsecureServing, &c.LoopbackClientConfig); err != nil {\n      return err\n   }\n   if err := s.SecureServing.ApplyTo(&c.SecureServing, &c.LoopbackClientConfig); err != nil {\n      return err\n   }\n   if s.SecureServing.BindPort != 0 || s.SecureServing.Listener != nil {\n      if err := s.Authentication.ApplyTo(&c.Authentication, c.SecureServing, nil); err != nil {\n         return err\n      }\n      if err := s.Authorization.ApplyTo(&c.Authorization); err != nil {\n         return err\n      }\n   }\n\n   // sync back to component config\n   // TODO: find more elegant way than syncing back the values.\n   c.ComponentConfig.Generic.Port = int32(s.InsecureServing.BindPort)\n   c.ComponentConfig.Generic.Address = s.InsecureServing.BindAddress.String()\n\n   return nil\n}\n```\n\napplyto 函数的逻辑就是根据KubeControllerManagerOptions，赋值给c *kubecontrollerconfig.Config。\n\n这里随便找一个applyto具体实现看看就知道了\n\n```\n// ApplyTo fills up AttachDetachController config with options.\nfunc (o *AttachDetachControllerOptions) ApplyTo(cfg *kubectrlmgrconfig.AttachDetachControllerConfiguration) error {\n   if o == nil {\n      return nil\n   }\n\n   cfg.DisableAttachDetachReconcilerSync = o.DisableAttachDetachReconcilerSync\n   cfg.ReconcilerSyncLoopPeriod = o.ReconcilerSyncLoopPeriod\n\n   return nil\n}\n```\n\n##### 1.2.2 
结构体定义\n\ncmd\\kube-controller-manager\\app\\config\\config.go\n\nApplyTo函数的最终目的就是实例化这样一个结构体。\n\n```\n// kubecontrollerconfig.Config\n// Config is the main context object for the controller manager.\ntype Config struct {\n\tComponentConfig kubectrlmgrconfig.KubeControllerManagerConfiguration    //这个是各种manager的config，如下\n\n\tSecureServing *apiserver.SecureServingInfo\n\t// LoopbackClientConfig is a config for a privileged loopback connection\n\tLoopbackClientConfig *restclient.Config\n\n\t// TODO: remove deprecated insecure serving\n\tInsecureServing *apiserver.DeprecatedInsecureServingInfo\n\tAuthentication  apiserver.AuthenticationInfo\n\tAuthorization   apiserver.AuthorizationInfo\n\n\t// the general kube client\n\tClient *clientset.Clientset\n\n\t// the client only used for leader election\n\tLeaderElectionClient *clientset.Clientset\n\n\t// the rest config for the master\n\tKubeconfig *restclient.Config\n\n\t// the event sink\n\tEventRecorder record.EventRecorder\n}\n```\n\npkg\\controller\\apis\\config\\types.go\n\n```\n// KubeControllerManagerConfiguration contains elements describing kube-controller manager.\ntype KubeControllerManagerConfiguration struct {\n\tmetav1.TypeMeta\n\n\t// Generic holds configuration for a generic controller-manager\n\tGeneric GenericControllerManagerConfiguration\n\t// KubeCloudSharedConfiguration holds configuration for shared related features\n\t// both in cloud controller manager and kube-controller manager.\n\tKubeCloudShared KubeCloudSharedConfiguration\n\n\t// AttachDetachControllerConfiguration holds configuration for\n\t// AttachDetachController related features.\n\tAttachDetachController AttachDetachControllerConfiguration\n\t// CSRSigningControllerConfiguration holds configuration for\n\t// CSRSigningController related features.\n\tCSRSigningController CSRSigningControllerConfiguration\n\t// DaemonSetControllerConfiguration holds configuration for DaemonSetController\n\t// related features.\n\tDaemonSetController 
DaemonSetControllerConfiguration\n\t// DeploymentControllerConfiguration holds configuration for\n\t// DeploymentController related features.\n\tDeploymentController DeploymentControllerConfiguration\n\t// DeprecatedControllerConfiguration holds configuration for some deprecated\n\t// features.\n\tDeprecatedController DeprecatedControllerConfiguration\n\t// EndpointControllerConfiguration holds configuration for EndpointController\n\t// related features.\n\tEndpointController EndpointControllerConfiguration\n\t// GarbageCollectorControllerConfiguration holds configuration for\n\t// GarbageCollectorController related features.\n\tGarbageCollectorController GarbageCollectorControllerConfiguration\n\t// HPAControllerConfiguration holds configuration for HPAController related features.\n\tHPAController HPAControllerConfiguration\n\t// JobControllerConfiguration holds configuration for JobController related features.\n\tJobController JobControllerConfiguration\n\t// NamespaceControllerConfiguration holds configuration for NamespaceController\n\t// related features.\n\tNamespaceController NamespaceControllerConfiguration\n\t// NodeIPAMControllerConfiguration holds configuration for NodeIPAMController\n\t// related features.\n\tNodeIPAMController NodeIPAMControllerConfiguration\n\t// NodeLifecycleControllerConfiguration holds configuration for\n\t// NodeLifecycleController related features.\n\tNodeLifecycleController NodeLifecycleControllerConfiguration\n\t// PersistentVolumeBinderControllerConfiguration holds configuration for\n\t// PersistentVolumeBinderController related features.\n\tPersistentVolumeBinderController PersistentVolumeBinderControllerConfiguration\n\t// PodGCControllerConfiguration holds configuration for PodGCController\n\t// related features.\n\tPodGCController PodGCControllerConfiguration\n\t// ReplicaSetControllerConfiguration holds configuration for ReplicaSet related features.\n\tReplicaSetController ReplicaSetControllerConfiguration\n\t// 
ReplicationControllerConfiguration holds configuration for\n\t// ReplicationController related features.\n\tReplicationController ReplicationControllerConfiguration\n\t// ResourceQuotaControllerConfiguration holds configuration for\n\t// ResourceQuotaController related features.\n\tResourceQuotaController ResourceQuotaControllerConfiguration\n\t// SAControllerConfiguration holds configuration for ServiceAccountController\n\t// related features.\n\tSAController SAControllerConfiguration\n\t// ServiceControllerConfiguration holds configuration for ServiceController\n\t// related features.\n\tServiceController ServiceControllerConfiguration\n\t// TTLAfterFinishedControllerConfiguration holds configuration for\n\t// TTLAfterFinishedController related features.\n\tTTLAfterFinishedController TTLAfterFinishedControllerConfiguration\n}\n```\n\n<br>\n\n#### 1.3 Run\n\n所以，经过 1.2 的Config函数，c 就补全了所有的config。\n\n```\n// Run runs the KubeControllerManagerOptions.  This should never exit.\nfunc Run(c *config.CompletedConfig, stopCh <-chan struct{}) error {\n\t// To help debugging, immediately log version\n\tglog.Infof(\"Version: %+v\", version.Get())\n\n\tif cfgz, err := configz.New(\"componentconfig\"); err == nil {\n\t\tcfgz.Set(c.ComponentConfig)\n\t} else {\n\t\tglog.Errorf(\"unable to register configz: %c\", err)\n\t}\n    \n    // 1.开启http server。默认暴露的端口号:10252。用于controller-manager服务性能检测(如:/debug/profile)及暴露服务相关的metrics供prometheus用于监控。\n\t// Start the controller manager HTTP server\n\t// unsecuredMux is the handler for these controller *after* authn/authz filters have been applied\n\tvar unsecuredMux *mux.PathRecorderMux\n\tif c.SecureServing != nil {\n\t\tunsecuredMux = genericcontrollermanager.NewBaseHandler(&c.ComponentConfig.Generic.Debugging)\n\t\thandler := genericcontrollermanager.BuildHandlerChain(unsecuredMux, &c.Authorization, &c.Authentication)\n\t\tif err := c.SecureServing.Serve(handler, 0, stopCh); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\tif c.InsecureServing 
!= nil {\n\t\tunsecuredMux = genericcontrollermanager.NewBaseHandler(&c.ComponentConfig.Generic.Debugging)\n\t\tinsecureSuperuserAuthn := server.AuthenticationInfo{Authenticator: &server.InsecureSuperuser{}}\n\t\thandler := genericcontrollermanager.BuildHandlerChain(unsecuredMux, nil, &insecureSuperuserAuthn)\n\t\tif err := c.InsecureServing.Serve(handler, 0, stopCh); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n    \n    \n    // 2. 定义好run函数\n\trun := func(ctx context.Context) {\n\t\trootClientBuilder := controller.SimpleControllerClientBuilder{\n\t\t\tClientConfig: c.Kubeconfig,\n\t\t}\n\t\tvar clientBuilder controller.ControllerClientBuilder\n\t\tif c.ComponentConfig.KubeCloudShared.UseServiceAccountCredentials {\n\t\t\tif len(c.ComponentConfig.SAController.ServiceAccountKeyFile) == 0 {\n\t\t\t\t// It'c possible another controller process is creating the tokens for us.\n\t\t\t\t// If one isn't, we'll timeout and exit when our client builder is unable to create the tokens.\n\t\t\t\tglog.Warningf(\"--use-service-account-credentials was specified without providing a --service-account-private-key-file\")\n\t\t\t}\n\t\t\tclientBuilder = controller.SAControllerClientBuilder{\n\t\t\t\tClientConfig:         restclient.AnonymousClientConfig(c.Kubeconfig),\n\t\t\t\tCoreClient:           c.Client.CoreV1(),\n\t\t\t\tAuthenticationClient: c.Client.AuthenticationV1(),\n\t\t\t\tNamespace:            \"kube-system\",\n\t\t\t}\n\t\t} else {\n\t\t\tclientBuilder = rootClientBuilder\n\t\t}\n\t\tcontrollerContext, err := CreateControllerContext(c, rootClientBuilder, clientBuilder, ctx.Done())\n\t\tif err != nil {\n\t\t\tglog.Fatalf(\"error building controller context: %v\", err)\n\t\t}\n\t\tsaTokenControllerInitFunc := serviceAccountTokenControllerStarter{rootClientBuilder: rootClientBuilder}.startServiceAccountTokenController\n\n\t\tif err := StartControllers(controllerContext, saTokenControllerInitFunc, NewControllerInitializers(controllerContext.LoopMode), unsecuredMux); err != 
nil {\n\t\t\tglog.Fatalf(\"error starting controllers: %v\", err)\n\t\t}\n\n\t\tcontrollerContext.InformerFactory.Start(controllerContext.Stop)\n\t\tclose(controllerContext.InformersStarted)\n\n\t\tselect {}\n\t}\n    \n    // 3. 如果没有多个就直接run\n\tif !c.ComponentConfig.Generic.LeaderElection.LeaderElect {\n\t\trun(context.TODO())\n\t\tpanic(\"unreachable\")\n\t}\n\n\tid, err := os.Hostname()\n\tif err != nil {\n\t\treturn err\n\t}\n\n\t// add a uniquifier so that two processes on the same host don't accidentally both become active\n\tid = id + \"_\" + string(uuid.NewUUID())\n\trl, err := resourcelock.New(c.ComponentConfig.Generic.LeaderElection.ResourceLock,\n\t\t\"kube-system\",\n\t\t\"kube-controller-manager\",\n\t\tc.LeaderElectionClient.CoreV1(),\n\t\tresourcelock.ResourceLockConfig{\n\t\t\tIdentity:      id,\n\t\t\tEventRecorder: c.EventRecorder,\n\t\t})\n\tif err != nil {\n\t\tglog.Fatalf(\"error creating lock: %v\", err)\n\t}\n\t\t\n\t// 4.设置了选举\n\tleaderelection.RunOrDie(context.TODO(), leaderelection.LeaderElectionConfig{\n\t\tLock:          rl,\n\t\tLeaseDuration: c.ComponentConfig.Generic.LeaderElection.LeaseDuration.Duration,\n\t\tRenewDeadline: c.ComponentConfig.Generic.LeaderElection.RenewDeadline.Duration,\n\t\tRetryPeriod:   c.ComponentConfig.Generic.LeaderElection.RetryPeriod.Duration,\n\t\tCallbacks: leaderelection.LeaderCallbacks{\n\t\t\tOnStartedLeading: run,               //leader 运行run函数，这个就是第二步定义的函数\n\t\t\tOnStoppedLeading: func() {           // 非leader就打印这个日志。\n\t\t\t\tglog.Fatalf(\"leaderelection lost\")\n\t\t\t},\n\t\t},\n\t})\n\tpanic(\"unreachable\")\n}\n```\n\n<br>\n\n#### 1.4 run函数\n\n这里就是初始化clientBuilder,然后就StartControllers。\n\n```\nrun := func(ctx context.Context) {\n\t\trootClientBuilder := controller.SimpleControllerClientBuilder{\n\t\t\tClientConfig: c.Kubeconfig,\n\t\t}\n\t\tvar clientBuilder controller.ControllerClientBuilder\n\t\tif c.ComponentConfig.KubeCloudShared.UseServiceAccountCredentials {\n\t\t\tif 
len(c.ComponentConfig.SAController.ServiceAccountKeyFile) == 0 {\n\t\t\t\t// It'c possible another controller process is creating the tokens for us.\n\t\t\t\t// If one isn't, we'll timeout and exit when our client builder is unable to create the tokens.\n\t\t\t\tglog.Warningf(\"--use-service-account-credentials was specified without providing a --service-account-private-key-file\")\n\t\t\t}\n\t\t\tclientBuilder = controller.SAControllerClientBuilder{\n\t\t\t\tClientConfig:         restclient.AnonymousClientConfig(c.Kubeconfig),\n\t\t\t\tCoreClient:           c.Client.CoreV1(),\n\t\t\t\tAuthenticationClient: c.Client.AuthenticationV1(),\n\t\t\t\tNamespace:            \"kube-system\",\n\t\t\t}\n\t\t} else {\n\t\t\tclientBuilder = rootClientBuilder\n\t\t}\n\t\tcontrollerContext, err := CreateControllerContext(c, rootClientBuilder, clientBuilder, ctx.Done())\n\t\tif err != nil {\n\t\t\tglog.Fatalf(\"error building controller context: %v\", err)\n\t\t}\n\t\tsaTokenControllerInitFunc := serviceAccountTokenControllerStarter{rootClientBuilder: rootClientBuilder}.startServiceAccountTokenController\n\n\t\tif err := StartControllers(controllerContext, saTokenControllerInitFunc, NewControllerInitializers(controllerContext.LoopMode), unsecuredMux); err != nil {\n\t\t\tglog.Fatalf(\"error starting controllers: %v\", err)\n\t\t}\n\n\t\tcontrollerContext.InformerFactory.Start(controllerContext.Stop)\n\t\tclose(controllerContext.InformersStarted)\n\n\t\tselect {}\n\t}\n```\n\nstartController这里有一个参数是函数NewControllerInitializers，从这里可以看到有这么多controller\n\n```\n// NewControllerInitializers is a public map of named controller groups (you can start more than one in an init func)\n// paired to their InitFunc.  
This allows for structured downstream composition and subdivision.\nfunc NewControllerInitializers(loopMode ControllerLoopMode) map[string]InitFunc {\n   controllers := map[string]InitFunc{}\n   controllers[\"endpoint\"] = startEndpointController\n   controllers[\"replicationcontroller\"] = startReplicationController\n   controllers[\"podgc\"] = startPodGCController\n   controllers[\"resourcequota\"] = startResourceQuotaController\n   controllers[\"namespace\"] = startNamespaceController\n   controllers[\"serviceaccount\"] = startServiceAccountController\n   controllers[\"garbagecollector\"] = startGarbageCollectorController\n   controllers[\"daemonset\"] = startDaemonSetController\n   controllers[\"job\"] = startJobController\n   controllers[\"deployment\"] = startDeploymentController\n   controllers[\"replicaset\"] = startReplicaSetController\n   controllers[\"horizontalpodautoscaling\"] = startHPAController\n   controllers[\"disruption\"] = startDisruptionController\n   controllers[\"statefulset\"] = startStatefulSetController\n   controllers[\"cronjob\"] = startCronJobController\n   controllers[\"csrsigning\"] = startCSRSigningController\n   controllers[\"csrapproving\"] = startCSRApprovingController\n   controllers[\"csrcleaner\"] = startCSRCleanerController\n   controllers[\"ttl\"] = startTTLController\n   controllers[\"bootstrapsigner\"] = startBootstrapSignerController\n   controllers[\"tokencleaner\"] = startTokenCleanerController\n   controllers[\"nodeipam\"] = startNodeIpamController\n   if loopMode == IncludeCloudLoops {\n      controllers[\"service\"] = startServiceController\n      controllers[\"route\"] = startRouteController\n      // TODO: volume controller into the IncludeCloudLoops only set.\n      // TODO: Separate cluster in cloud check from node lifecycle controller.\n   }\n   controllers[\"nodelifecycle\"] = startNodeLifecycleController\n   controllers[\"persistentvolume-binder\"] = startPersistentVolumeBinderController\n   
controllers[\"attachdetach\"] = startAttachDetachController\n   controllers[\"persistentvolume-expander\"] = startVolumeExpandController\n   controllers[\"clusterrole-aggregation\"] = startClusterRoleAggregrationController\n   controllers[\"pvc-protection\"] = startPVCProtectionController\n   controllers[\"pv-protection\"] = startPVProtectionController\n   controllers[\"ttl-after-finished\"] = startTTLAfterFinishedController\n\n   return controllers\n}\n```\n\n<br>\n\n#### 1.5 StartControllers\n\n```\nfunc StartControllers(ctx ControllerContext, startSATokenController InitFunc, controllers map[string]InitFunc, unsecuredMux *mux.PathRecorderMux) error {\n\t// Always start the SA token controller first using a full-power client, since it needs to mint tokens for the rest\n\t// If this fails, just return here and fail since other controllers won't be able to get credentials.\n\tif _, _, err := startSATokenController(ctx); err != nil {\n\t\treturn err\n\t}\n\n\t// Initialize the cloud provider with a reference to the clientBuilder only after token controller\n\t// has started in case the cloud provider uses the client builder.\n\tif ctx.Cloud != nil {\n\t\tctx.Cloud.Initialize(ctx.ClientBuilder)\n\t}\n\t\n\t// 依次启动controller，这里为啥不用协程呢？\n\tfor controllerName, initFn := range controllers {\n\t\tif !ctx.IsControllerEnabled(controllerName) {\n\t\t\tglog.Warningf(\"%q is disabled\", controllerName)\n\t\t\tcontinue\n\t\t}\n\n\t\ttime.Sleep(wait.Jitter(ctx.ComponentConfig.Generic.ControllerStartInterval.Duration, ControllerStartJitter))\n\n\t\tglog.V(1).Infof(\"Starting %q\", controllerName)\n\t\t// 注意这里的 initFn就是NewControllerInitializers 中指定了。\n\t\tdebugHandler, started, err := initFn(ctx)\n\t\tif err != nil {\n\t\t\tglog.Errorf(\"Error starting %q\", controllerName)\n\t\t\treturn err\n\t\t}\n\t\tif !started {\n\t\t\tglog.Warningf(\"Skipping %q\", controllerName)\n\t\t\tcontinue\n\t\t}\n\t\tif debugHandler != nil && unsecuredMux != nil {\n\t\t\tbasePath := 
\"/debug/controllers/\" + controllerName\n\t\t\tunsecuredMux.UnlistedHandle(basePath, http.StripPrefix(basePath, debugHandler))\n\t\t\tunsecuredMux.UnlistedHandlePrefix(basePath+\"/\", http.StripPrefix(basePath, debugHandler))\n\t\t}\n\t\tglog.Infof(\"Started %q\", controllerName)\n\t}\n\n\treturn nil\n}\n```\n\n<br>\n\n#### 1.6 总结\n\n##### 1.6.1 整体流程\n\n(1) NewControllerManagerCommand中调用 NewKubeControllerManagerOptions 创建options对象，名为s。同时通过下面的代码注册flag，将命令行参数赋值给s。\n\n```\nfs := cmd.Flags()\n// 定义cobra的flags，这里就是定义参数的名称，默认值啥的。例如 --url --port等\nnamedFlagSets := s.Flags(KnownControllers(), ControllersDisabledByDefault.List())\nfor _, f := range namedFlagSets.FlagSets {\n   fs.AddFlagSet(f)\n}\n```\n\n（2）然后通过 s.Config 实例化一个kubecontrollerconfig.Config，名为c\n\n（3）通过ApplyTo，将s的值赋给c。这样c对应的每个controller都有自己的config\n\n（4）然后就开始Run逻辑。\n\n```\nRun逻辑：\n(1) 首先初始化clientBuilder\n(2) 然后定义好真正运行的run函数。run函数依次运行所有controller的init函数。这样每个controller的起点就是这个init函数。\n(3) 调用选举函数，leader运行run函数；失去leader锁时打印日志并退出。\n```\n\n<br>\n\n##### 1.6.2 一些思考\n\n（1）为啥参数赋值的时候，又要config，又要options，弄来弄去，直接像附录那样赋值不香吗？\n\n这里的一个思考是：kube-controller-manager 采用了机制和策略分离的原则。options主要面向cmd.Flag，用于接收用户启动kcm时传入的参数；config面向kcm自身，具体来说是为了更方便kcm中各个控制器的启动，每个controller有自己的config。\n\n这样的好处是：option和config参数分离。option 通过AddFlags赋值，而config 则通过ApplyTo赋值。\n\n### 2. 
附录\n\n#### 2.1 cobra实践\n\n```\npackage main\n\nimport (\n\t\"fmt\"\n\t\"github.com/spf13/cobra\"\n\n\t\"flag\"\n)\n\ntype Config struct {\n\turl string\n}\n\n\nfunc main() {\n\tvar config = &Config{}\n\n\tvar rootCmd = &cobra.Command{\n\t\tUse: \"test cobra\",\n\t\tRun: func(cmd *cobra.Command, args []string) {\n\t\t\tfmt.Println(config.url)\n\t\t},\n\t}\n\n\trootCmd.PersistentFlags().AddGoFlagSet(flag.CommandLine)\n\trootCmd.Flags().StringVarP(&config.url, \"arg-url\", \"\", \"www.baidu.com\", \"the url is used for connect baidu\")\n\n\trootCmd.Execute()\n}\n\nE:\\goWork\\src\\practice>cobra.exe --arg-url aaa\naaa\n\nE:\\goWork\\src\\practice>cobra.exe\nwww.baidu.com\n```\n\n<br>\n\n#### 2.2 k8s中的选举机制\n\nk8s中的选举机制在client-go包中实现。具体的做法是：多个客户端一起竞争创建（更新）同一个资源，哪一个客户端抢到锁，哪一个就是leader。\n\n选择configmap、ep作为锁资源的原因在于它们被list-watch得比较少；后期由于svc、ingress等的发展，ep被watch得越来越多，所以现在主要是用configmap来做。\n\n比如下面这个例子：当前kcm的锁就在k8s-master这个节点上。\n\n```\nroot@k8s-master:~# kubectl get ep -n kube-system kube-controller-manager -o yaml\napiVersion: v1\nkind: Endpoints\nmetadata:\n  annotations:\n    control-plane.alpha.kubernetes.io/leader: '{\"holderIdentity\":\"k8s-master_904e0225-871d-4ff2-becc-62b58a20e3c7\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2021-07-16T21:03:58Z\",\"renewTime\":\"2021-07-17T10:35:50Z\",\"leaderTransitions\":24}'\n  creationTimestamp: \"2021-06-05T12:50:04Z\"\n  name: kube-controller-manager\n  namespace: kube-system\n  resourceVersion: \"8831710\"\n  selfLink: /api/v1/namespaces/kube-system/endpoints/kube-controller-manager\n  uid: 5d530096-9b10-45bb-a11e-43f1f8733fa5\n```\n\n"
  },
  {
    "path": "k8s/kcm/1-rs controller-manager源码分析.md",
    "content": "Table of Contents\n=================\n\n  * [1. startReplicaSetController](#1-startreplicasetcontroller)\n     * [1.1 rs中的expectations机制](#11-rs中的expectations机制)\n  * [2. Pod，rs变化时对应的处理逻辑](#2-podrs变化时对应的处理逻辑)\n     * [2.1 addPod](#21-addpod)\n     * [2.2 updatePod](#22-updatepod)\n     * [2.3 deletePod](#23-deletepod)\n     * [2.4 addRS](#24-addrs)\n     * [2.5 updateRS](#25-updaters)\n     * [2.6 deleteRS](#26-deleters)\n  * [3. rs的处理逻辑](#3-rs的处理逻辑)\n     * [3.1 过滤pod](#31-过滤pod)\n     * [3.2 manageReplicas](#32-managereplicas)\n        * [3.2.1 创建pod](#321-创建pod)\n        * [3.2.2 删除pod](#322-删除pod)\n     * [3.3 calculateStatus](#33-calculatestatus)\n  * [4 总结](#4-总结)\n\n### 1. startReplicaSetController\n\n和deployController一样，kcm中定义了startReplicaSetController，startReplicaSetController和所有的控制器一样，先New一个对象，然后调用run函数。\n\n这里可以看出来，rs控制器监听rs, 和pod的变化。\n\n```\nfunc startReplicaSetController(ctx ControllerContext) (http.Handler, bool, error) {\n\tif !ctx.AvailableResources[schema.GroupVersionResource{Group: \"apps\", Version: \"v1\", Resource: \"replicasets\"}] {\n\t\treturn nil, false, nil\n\t}\n\tgo replicaset.NewReplicaSetController(\n\t\tctx.InformerFactory.Apps().V1().ReplicaSets(),\n\t\tctx.InformerFactory.Core().V1().Pods(),\n\t\tctx.ClientBuilder.ClientOrDie(\"replicaset-controller\"),\n\t\treplicaset.BurstReplicas,\n\t).Run(int(ctx.ComponentConfig.ReplicaSetController.ConcurrentRSSyncs), ctx.Stop)\n\treturn nil, true, nil\n}\n```\n\n<br>\n\n先NewReplicaSetController，再run\n\n```\n// NewReplicaSetController configures a replica set controller with the specified event recorder\nfunc NewReplicaSetController(rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, kubeClient clientset.Interface, burstReplicas int) *ReplicaSetController {\n   // event上传\n   eventBroadcaster := record.NewBroadcaster()\n   eventBroadcaster.StartLogging(glog.Infof)\n   eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: 
kubeClient.CoreV1().Events(\"\")})\n   return NewBaseController(rsInformer, podInformer, kubeClient, burstReplicas,\n      apps.SchemeGroupVersion.WithKind(\"ReplicaSet\"),\n      \"replicaset_controller\",\n      \"replicaset\",\n      controller.RealPodControl{\n         KubeClient: kubeClient,\n         Recorder:   eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: \"replicaset-controller\"}),\n      },\n   )\n}\n```\n\n```\n// NewBaseController is the implementation of NewReplicaSetController with additional injected\n// parameters so that it can also serve as the implementation of NewReplicationController.\nfunc NewBaseController(rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, kubeClient clientset.Interface, burstReplicas int,\n   gvk schema.GroupVersionKind, metricOwnerName, queueName string, podControl controller.PodControlInterface) *ReplicaSetController {\n   if kubeClient != nil && kubeClient.CoreV1().RESTClient().GetRateLimiter() != nil {\n      metrics.RegisterMetricAndTrackRateLimiterUsage(metricOwnerName, kubeClient.CoreV1().RESTClient().GetRateLimiter())\n   }\n\n   rsc := &ReplicaSetController{\n      GroupVersionKind: gvk,\n      kubeClient:       kubeClient,\n      podControl:       podControl,\n      burstReplicas:    burstReplicas,\n      expectations:     controller.NewUIDTrackingControllerExpectations(controller.NewControllerExpectations()),\n      queue:            workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), queueName),\n   }\n\n   rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{\n      AddFunc:    rsc.enqueueReplicaSet,\n      UpdateFunc: rsc.updateRS,\n      // This will enter the sync loop and no-op, because the replica set has been deleted from the store.\n      // Note that deleting a replica set immediately after scaling it to 0 will not work. 
The recommended\n      // way of achieving this is by performing a `stop` operation on the replica set.\n      DeleteFunc: rsc.enqueueReplicaSet,\n   })\n   rsc.rsLister = rsInformer.Lister()\n   rsc.rsListerSynced = rsInformer.Informer().HasSynced\n\n   podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{\n      AddFunc: rsc.addPod,\n      // This invokes the ReplicaSet for every pod change, eg: host assignment. Though this might seem like\n      // overkill the most frequent pod update is status, and the associated ReplicaSet will only list from\n      // local storage, so it should be ok.\n      UpdateFunc: rsc.updatePod,\n      DeleteFunc: rsc.deletePod,\n   })\n   rsc.podLister = podInformer.Lister()\n   rsc.podListerSynced = podInformer.Informer().HasSynced\n\n   rsc.syncHandler = rsc.syncReplicaSet\n\n   return rsc\n}\n```\n\nNote one thing here: the syncHandler function is syncReplicaSet.\n\n<br>\n\n#### 1.1 The expectations mechanism in rs\n\nBefore looking at how the rs controller handles rs and pod changes, let's first introduce the expectations mechanism, because the handlers (addPod, addRS, deleteRS, and so on) rely on expectations throughout.\n\nexpectations can be thought of as a map. Conceptually, each entry has four key fields:\n\nKey: composed of the rs's namespace and the rs's name\n\nAdd: how many more pods this rs still expects to create\n\nDel: how many more pods this rs still expects to delete\n\nTime: when the expectations were last set; used for the expiry (TTL) check\n\n| Key         | Add  | Del  | Time                |\n| ----------- | ---- | ---- | ------------------- |\n| Default/zx1 | 0    | 0    | 2021.07.04 16:00:00 |\n| zx/zx1      | 1    | 0    | 2021.07.04 16:00:00 |\n\n<br>\n\n**GetExpectations**:  input key, returns that key's entry (add, del, timestamp);\n\n**SatisfiedExpectations**: input key, returns bool; decides whether an rs has met its expectations. Met means add<=0 && del<=0, or the entry has outlived the sync TTL; every other case is unmet.\n\n**DeleteExpectations**: input key, no output; removes the key from the map (cache)\n\n**SetExpectations**: input (key, add, del); writes a new row into the map. **This refreshes the timestamp, setting time to time.Now**\n\n**ExpectCreations**:  input (key, adds); overwrites the row with del=0 and add set to the argument.  **This refreshes the timestamp, setting time to time.Now**\n\n**ExpectDeletions**: input (key, dels); overwrites the row with add=0 and del set to the argument.   **This refreshes the timestamp, setting time to time.Now**\n\n**CreationObserved**: input (key); the corresponding row's add-1\n\n**DeletionObserved**: input (key); the corresponding row's 
del-1\n\n**RaiseExpectations**:  input (key, add, del); the corresponding row's Add+add, Del+del\n\n**LowerExpectations**: input (key, add, del); the corresponding row's Add-add, Del-del\n\n```\n// A TTLCache of pod creates/deletes each rc expects to see.\nexpectations *controller.UIDTrackingControllerExpectations\n\n\ntype UIDTrackingControllerExpectations struct {\n\tControllerExpectationsInterface\n\n  // a plain mutex (sync/mutex.go) guarding the uid store\n  uidStoreLock sync.Mutex\n\n  // cache\n\tuidStore cache.Store\n}\n\ntype ControllerExpectationsInterface interface {\n\tGetExpectations(controllerKey string) (*ControlleeExpectations, bool, error)\n\tSatisfiedExpectations(controllerKey string) bool\n\tDeleteExpectations(controllerKey string)\n\tSetExpectations(controllerKey string, add, del int) error\n\tExpectCreations(controllerKey string, adds int) error\n\tExpectDeletions(controllerKey string, dels int) error\n\tCreationObserved(controllerKey string)\n\tDeletionObserved(controllerKey string)\n\tRaiseExpectations(controllerKey string, add, del int)\n\tLowerExpectations(controllerKey string, add, del int)\n}\n\n\n// ControlleeExpectations track controllee creates/deletes.\ntype ControlleeExpectations struct {\n\t// Important: Since these two int64 fields are using sync/atomic, they have to be at the top of the struct due to a bug on 32-bit platforms\n\t// See: https://golang.org/pkg/sync/atomic/ for more information\n\tadd       int64\n\tdel       int64\n\tkey       string\n\ttimestamp time.Time\n}\n\n\n// SatisfiedExpectations returns true if the required adds/dels for the given controller have been observed.\n// Add/del counts are established by the controller at sync time, and updated as controllees are observed by the controller\n// manager.\nfunc (r *ControllerExpectations) SatisfiedExpectations(controllerKey string) bool {\n\tif exp, exists, err := r.GetExpectations(controllerKey); exists {\n\t  // Fulfilled means add<=0 && del<=0\n\t\tif exp.Fulfilled() {\n\t\t\tklog.V(4).Infof(\"Controller expectations fulfilled 
%#v\", exp)\n\t\t\treturn true\n\t\t} else if exp.isExpired() {\n\t\t\tklog.V(4).Infof(\"Controller expectations expired %#v\", exp)\n\t\t\treturn true\n\t\t} else {\n\t\t\tklog.V(4).Infof(\"Controller still waiting on expectations %#v\", exp)\n\t\t\treturn false\n\t\t}\n\t} else if err != nil {\n\t\tklog.V(2).Infof(\"Error encountered while checking expectations %#v, forcing sync\", err)\n\t} else {\n\t\t// When a new controller is created, it doesn't have expectations.\n\t\t// When it doesn't see expected watch events for > TTL, the expectations expire.\n\t\t//\t- In this case it wakes up, creates/deletes controllees, and sets expectations again.\n\t\t// When it has satisfied expectations and no controllees need to be created/destroyed > TTL, the expectations expire.\n\t\t//\t- In this case it continues without setting expectations till it needs to create/delete controllees.\n\t\tklog.V(4).Infof(\"Controller %v either never recorded expectations, or the ttl expired.\", controllerKey)\n\t}\n\t// Trigger a sync if we either encountered and error (which shouldn't happen since we're\n\t// getting from local store) or this controller hasn't established expectations.\n\treturn true\n}\n\n// Fulfilled就是 add<=0并且del<=0\n// Fulfilled returns true if this expectation has been fulfilled.\nfunc (e *ControlleeExpectations) Fulfilled() bool {\n\t// TODO: think about why this line being atomic doesn't matter\n\treturn atomic.LoadInt64(&e.add) <= 0 && atomic.LoadInt64(&e.del) <= 0\n}\n\n// 判断是否超过同步周期，同步周期是5分钟\nfunc (exp *ControlleeExpectations) isExpired() bool {\n\treturn clock.RealClock{}.Since(exp.timestamp) > ExpectationsTimeout\n}\n\n// 这个会覆盖之前的行，并且del=0\nfunc (r *ControllerExpectations) ExpectCreations(controllerKey string, adds int) error {\n\treturn r.SetExpectations(controllerKey, adds, 0)\n}\n```\n\n<br>\n\n**总结：**\n\n（1）expectations就是通过一个类似map结构的对象，来表示所有rs期望pod和当前现状的差距\n\n（2）\n\n### 2. 
Handling Pod and rs change events\n\n#### 2.1 addPod\n\n(1) If the pod has a DeletionTimestamp, it is about to be deleted. Call deletePod, which decrements the owning rs's Del (DeletionObserved) and enqueues the rs.\n\n(2) If the pod has an OwnerReference, check whether the owner is an rs. If it is not an rs, or the referenced rs no longer exists, return immediately. Otherwise decrement the rs's Add (CreationObserved: an expected creation has now been seen) and enqueue the rs, because the pod count changed and the rs must be re-synced.\n\n(3) Otherwise (the pod has no OwnerReference) it is an orphan, so look for ReplicaSets that could match it; if any exist they may need updating too. Matching logic: the pod's namespace equals the rs's namespace, and the pod's labels match the rs's selector. Enqueue every matching rs.\n\n```\n// When a pod is created, enqueue the replica set that manages it and update its expectations.\nfunc (rsc *ReplicaSetController) addPod(obj interface{}) {\n\tpod := obj.(*v1.Pod)\n    \n    // 1. A DeletionTimestamp means this pod is about to be deleted.\n    // 2. deletePod decrements the owning rs's Del and enqueues the rs.\n\tif pod.DeletionTimestamp != nil {\n\t\t// on a restart of the controller manager, it's possible a new pod shows up in a state that\n\t\t// is already pending deletion. Prevent the pod from being a creation observation.\n\t\trsc.deletePod(pod)\n\t\treturn\n\t}\n     \n    // 2. If the pod has an OwnerReference, check whether the owner is an rs.\n    // If not, or the referenced rs no longer exists, return. Otherwise decrement the rs's Add and enqueue the rs, because the pod count changed and the rs must be re-synced.\n\t// If it has a ControllerRef, that's all that matters.\n\tif controllerRef := metav1.GetControllerOf(pod); controllerRef != nil {\n\t\trs := rsc.resolveControllerRef(pod.Namespace, controllerRef)\n\t\tif rs == nil {\n\t\t\treturn\n\t\t}\n\t\trsKey, err := controller.KeyFunc(rs)\n\t\tif err != nil {\n\t\t\treturn\n\t\t}\n\t\tglog.V(4).Infof(\"Pod %s created: %#v.\", pod.Name, pod)\n\t\t// the rs's add-1 (one expected creation observed)\n\t\trsc.expectations.CreationObserved(rsKey)\n\t\t\n\t\trsc.enqueueReplicaSet(rs)\n\t\treturn\n\t}\n    \n    \n    \n  // 3. Otherwise (no OwnerReference) the pod is an orphan; look for ReplicaSets that can match it and may need updating.\n  // Matching logic: the pod's namespace equals the rs's namespace, and the pod's labels match the rs's selector.\n  // Enqueue every matching rs.\n\t// Otherwise, it's an orphan. 
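\n```\n\nThe orphan-matching rule just described (same namespace, selector matches the pod's labels) can be sketched in isolation. The types and the equality-only selector below are simplified stand-ins of my own, not the controller's real code:\n\n```\npackage main\n\nimport \"fmt\"\n\n// simplified stand-ins for the real API types (illustrative only)\ntype pod struct {\n\tns     string\n\tlabels map[string]string\n}\n\ntype replicaSet struct {\n\tname     string\n\tns       string\n\tselector map[string]string // simplified equality-based selector\n}\n\n// matches reports whether rs could adopt p: same namespace, and every\n// selector key/value is present in the pod's labels.\nfunc matches(rs replicaSet, p pod) bool {\n\tif rs.ns != p.ns {\n\t\treturn false\n\t}\n\tfor k, v := range rs.selector {\n\t\tif p.labels[k] != v {\n\t\t\treturn false\n\t\t}\n\t}\n\treturn true\n}\n\nfunc main() {\n\tp := pod{ns: \"default\", labels: map[string]string{\"app\": \"nginx\", \"tier\": \"web\"}}\n\trss := []replicaSet{\n\t\t{name: \"nginx-rs\", ns: \"default\", selector: map[string]string{\"app\": \"nginx\"}},\n\t\t{name: \"wrong-ns\", ns: \"zx\", selector: map[string]string{\"app\": \"nginx\"}},\n\t}\n\tfor _, rs := range rss {\n\t\tfmt.Println(rs.name, matches(rs, p))\n\t}\n}\n```\n\nThe real getPodReplicaSets additionally supports set-based selector requirements and reads the candidate ReplicaSets from the lister cache.\n\n```\n\t// Otherwise, it's an orphan. 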
Get a list of all matching ReplicaSets and sync\n\t// them to see if anyone wants to adopt it.\n\t// DO NOT observe creation because no controller should be waiting for an\n\t// orphan.\n\trss := rsc.getPodReplicaSets(pod)\n\tif len(rss) == 0 {\n\t\treturn\n\t}\n\tglog.V(4).Infof(\"Orphan Pod %s created: %#v.\", pod.Name, pod)\n\tfor _, rs := range rss {\n\t\trsc.enqueueReplicaSet(rs)\n\t}\n}\n```\n\n\n\n```\n// When a pod is deleted, enqueue the replica set that manages the pod and update its expectations.\n// obj could be an *v1.Pod, or a DeletionFinalStateUnknown marker item.\nfunc (rsc *ReplicaSetController) deletePod(obj interface{}) {\n\tpod, ok := obj.(*v1.Pod)\n\n\t// When a delete is dropped, the relist will notice a pod in the store not\n\t// in the list, leading to the insertion of a tombstone object which contains\n\t// the deleted key/value. Note that this value might be stale. If the pod\n\t// changed labels the new ReplicaSet will not be woken up till the periodic resync.\n\tif !ok {\n\t\ttombstone, ok := obj.(cache.DeletedFinalStateUnknown)\n\t\tif !ok {\n\t\t\tutilruntime.HandleError(fmt.Errorf(\"couldn't get object from tombstone %+v\", obj))\n\t\t\treturn\n\t\t}\n\t\tpod, ok = tombstone.Obj.(*v1.Pod)\n\t\tif !ok {\n\t\t\tutilruntime.HandleError(fmt.Errorf(\"tombstone contained object that is not a pod %#v\", obj))\n\t\t\treturn\n\t\t}\n\t}\n\n\tcontrollerRef := metav1.GetControllerOf(pod)\n\tif controllerRef == nil {\n\t\t// No controller should care about orphans being deleted.\n\t\treturn\n\t}\n\trs := rsc.resolveControllerRef(pod.Namespace, controllerRef)\n\tif rs == nil {\n\t\treturn\n\t}\n\t// here KeyFunc yields ns/rsName\n\trsKey, err := controller.KeyFunc(rs)\n\tif err != nil {\n\t\tutilruntime.HandleError(fmt.Errorf(\"couldn't get key for object %#v: %v\", rs, err))\n\t\treturn\n\t}\n\tklog.V(4).Infof(\"Pod %s/%s deleted through %v, timestamp %+v: %#v.\", pod.Namespace, pod.Name, utilruntime.GetCaller(), pod.DeletionTimestamp, pod)\n\t// call 
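expectations.DeletionObserved\n```\n\nSince this bookkeeping is pure counter arithmetic, a stripped-down model of one expectations entry is easy to play with. This sketch of mine mirrors ControlleeExpectations' add/del/timestamp semantics (atomic counters plus a 5-minute TTL); it is an illustration, not the library code:\n\n```\npackage main\n\nimport (\n\t\"fmt\"\n\t\"sync/atomic\"\n\t\"time\"\n)\n\nconst expectationsTimeout = 5 * time.Minute\n\n// entry models one row of the expectations map: pending creations,\n// pending deletions, and when the expectations were last set.\ntype entry struct {\n\tadd, del  int64\n\ttimestamp time.Time\n}\n\nfunc (e *entry) creationObserved() { atomic.AddInt64(&e.add, -1) }\nfunc (e *entry) deletionObserved() { atomic.AddInt64(&e.del, -1) }\n\n// fulfilled mirrors Fulfilled: nothing more is expected.\nfunc (e *entry) fulfilled() bool {\n\treturn atomic.LoadInt64(&e.add) <= 0 && atomic.LoadInt64(&e.del) <= 0\n}\n\n// satisfied mirrors SatisfiedExpectations for an existing entry:\n// fulfilled, or the entry has outlived its TTL.\nfunc (e *entry) satisfied() bool {\n\treturn e.fulfilled() || time.Since(e.timestamp) > expectationsTimeout\n}\n\nfunc main() {\n\t// ExpectDeletions(key, 2): two deletions are expected.\n\te := &entry{del: 2, timestamp: time.Now()}\n\tfmt.Println(e.satisfied()) // false: still waiting on 2 deletions\n\te.deletionObserved()\n\te.deletionObserved()\n\tfmt.Println(e.satisfied()) // true: del reached 0\n}\n```\n\nOnce satisfied flips to true, the sync loop in section 3 will actually run manageReplicas for the rs again.\n\n```\n\t// call 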
expectations.DeletionObserved, then enqueue\n\trsc.expectations.DeletionObserved(rsKey, controller.PodKey(pod))\n\trsc.queue.Add(rsKey)\n}\n```\n\n<br>\n\n#### 2.2 updatePod\n\n(1) Use ResourceVersion to check whether the pod really changed\n\n(2) Check whether the pod's DeletionTimestamp is set; if it is, the pod is being deleted, so the owning rs's Del is decremented (via deletePod)\n\n(3) If the pod's ownerRef changed, first enqueue the old rs, which definitely needs a re-sync\n\n(4) If the pod's current ownerRef is an rs, enqueue that rs. In addition, if the pod just transitioned to Ready and the rs sets MinReadySeconds, enqueue the rs again after that delay, because the pod becoming Available then may change the rs status.\n\n(5) Just like addPod: if the pod has no OwnerReference it is an orphan, so look for ReplicaSets that could match it; if any exist they may need updating too. Matching logic: the pod's namespace equals the rs's namespace, and the pod's labels match the rs's selector. Enqueue every matching rs.\n\n```\n// When a pod is updated, figure out what replica set/s manage it and wake them\n// up. If the labels of the pod have changed we need to awaken both the old\n// and new replica set. old and cur must be *v1.Pod types.\nfunc (rsc *ReplicaSetController) updatePod(old, cur interface{}) {\n\tcurPod := cur.(*v1.Pod)\n\toldPod := old.(*v1.Pod)\n\t// 1. Check whether the two are really different. Why is comparing ResourceVersion enough? See https://fankangbest.github.io/2018/01/16/Kubernetes-resourceVersion%E6%9C%BA%E5%88%B6%E5%88%86%E6%9E%90/\n\tif curPod.ResourceVersion == oldPod.ResourceVersion {\n\t\t// Periodic resync will send update events for all known pods.\n\t\t// Two different versions of the same pod will always have different RVs.\n\t\treturn\n\t}\n\t\n\tlabelChanged := !reflect.DeepEqual(curPod.Labels, oldPod.Labels)\n\t// 2. Check whether the pod is being deleted; deletion takes two steps: (1) update DeletionTimestamp, (2) actually delete\n\tif curPod.DeletionTimestamp != nil {\n\t\t// when a pod is deleted gracefully it's deletion timestamp is first modified to reflect a grace period,\n\t\t// and after such time has passed, the kubelet actually deletes it from the store. We receive an update\n\t\t// for modification of the deletion timestamp and expect an rs to create more replicas asap, not wait\n\t\t// until the kubelet actually deletes the pod. 
This is different from the Phase of a pod changing, because\n\t\t// an rs never initiates a phase change, and so is never asleep waiting for the same.\n\t\t// remove the pod from the owning rs's expectations (deletePod)\n\t\trsc.deletePod(curPod)\n\t\tif labelChanged {\n\t\t\t// we don't need to check the oldPod.DeletionTimestamp because DeletionTimestamp cannot be unset.\n\t\t\trsc.deletePod(oldPod)\n\t\t}\n\t\treturn\n\t}\n\n\tcurControllerRef := metav1.GetControllerOf(curPod)\n\toldControllerRef := metav1.GetControllerOf(oldPod)\n\tcontrollerRefChanged := !reflect.DeepEqual(curControllerRef, oldControllerRef)\n\t// 3. old rs -> new rs: put the old rs on the queue first.\n\tif controllerRefChanged && oldControllerRef != nil {\n\t\t// The ControllerRef was changed. Sync the old controller, if any.\n\t\tif rs := rsc.resolveControllerRef(oldPod.Namespace, oldControllerRef); rs != nil {\n\t\t\trsc.enqueueReplicaSet(rs)\n\t\t}\n\t}\n    \n    // 4. If the pod currently has an ownerRef,\n\t// If it has a ControllerRef, that's all that matters.\n\tif curControllerRef != nil {\n\t    // 4.1 the current ownerRef is not an rs (or the rs is gone): do nothing.\n\t\trs := rsc.resolveControllerRef(curPod.Namespace, curControllerRef)\n\t\tif rs == nil {\n\t\t\treturn\n\t\t}\n\t\tglog.V(4).Infof(\"Pod %s updated, objectMeta %+v -> %+v.\", curPod.Name, oldPod.ObjectMeta, curPod.ObjectMeta)\n\t\trsc.enqueueReplicaSet(rs)\n\t\t// TODO: MinReadySeconds in the Pod will generate an Available condition to be added in\n\t\t// the Pod status which in turn will trigger a requeue of the owning replica set thus\n\t\t// having its status updated with the newly available replica. 
For now, we can fake the\n\t\t// update by resyncing the controller MinReadySeconds after the it is requeued because\n\t\t// a Pod transitioned to Ready.\n\t\t// Note that this still suffers from #29229, we are just moving the problem one level\n\t\t// \"closer\" to kubelet (from the deployment to the replica set controller).\n\t\t\n\t\t// 4.2 If oldPod was not ready but curPod is ready, and MinReadySeconds is set, enqueue the rs again after the delay.\n\t\tif !podutil.IsPodReady(oldPod) && podutil.IsPodReady(curPod) && rs.Spec.MinReadySeconds > 0 {\n\t\t\tglog.V(2).Infof(\"%v %q will be enqueued after %ds for availability check\", rsc.Kind, rs.Name, rs.Spec.MinReadySeconds)\n\t\t\t// Add a second to avoid milliseconds skew in AddAfter.\n\t\t\t// See https://github.com/kubernetes/kubernetes/issues/39785#issuecomment-279959133 for more info.\n\t\t\trsc.enqueueReplicaSetAfter(rs, (time.Duration(rs.Spec.MinReadySeconds)*time.Second)+time.Second)\n\t\t}\n\t\treturn\n\t}\n    \n    // 5. Same as addPod: handle the orphan-pod case.\n\t// Otherwise, it's an orphan. 
If anything changed, sync matching controllers\n\t// to see if anyone wants to adopt it now.\n\tif labelChanged || controllerRefChanged {\n\t\trss := rsc.getPodReplicaSets(curPod)\n\t\tif len(rss) == 0 {\n\t\t\treturn\n\t\t}\n\t\tglog.V(4).Infof(\"Orphan Pod %s updated, objectMeta %+v -> %+v.\", curPod.Name, oldPod.ObjectMeta, curPod.ObjectMeta)\n\t\tfor _, rs := range rss {\n\t\t\trsc.enqueueReplicaSet(rs)\n\t\t}\n\t}\n}\n```\n\nspec.minReadySeconds: a newly created Pod must stay Ready for at least `spec.minReadySeconds` before it is considered Available.\n\n<br>\n\n#### 2.3 deletePod\n\ndeletePod is straightforward:\n\n(1) If the object is a tombstone, recover the pod from it (and bail out if that fails).\n\n(2) Find the rs that owns the pod, record the observed deletion against that rs (DeletionObserved), then enqueue the rs.\n\n```\n// When a pod is deleted, enqueue the replica set that manages the pod and update its expectations.\n// obj could be an *v1.Pod, or a DeletionFinalStateUnknown marker item.\nfunc (rsc *ReplicaSetController) deletePod(obj interface{}) {\n\tpod, ok := obj.(*v1.Pod)\n\n\t// When a delete is dropped, the relist will notice a pod in the store not\n\t// in the list, leading to the insertion of a tombstone object which contains\n\t// the deleted key/value. Note that this value might be stale. 
If the pod\n\t// changed labels the new ReplicaSet will not be woken up till the periodic resync.\n\t// Tombstone: a DeletedFinalStateUnknown is inserted by the informer cache when the delete event itself was missed and the object's disappearance was only noticed during a relist. Background reading: https://draveness.me/etcd-introduction/\n\tif !ok {\n\t\ttombstone, ok := obj.(cache.DeletedFinalStateUnknown)\n\t\tif !ok {\n\t\t\tutilruntime.HandleError(fmt.Errorf(\"couldn't get object from tombstone %+v\", obj))\n\t\t\treturn\n\t\t}\n\t\tpod, ok = tombstone.Obj.(*v1.Pod)\n\t\tif !ok {\n\t\t\tutilruntime.HandleError(fmt.Errorf(\"tombstone contained object that is not a pod %#v\", obj))\n\t\t\treturn\n\t\t}\n\t}\n\n\tcontrollerRef := metav1.GetControllerOf(pod)\n\tif controllerRef == nil {\n\t\t// No controller should care about orphans being deleted.\n\t\treturn\n\t}\n\trs := rsc.resolveControllerRef(pod.Namespace, controllerRef)\n\tif rs == nil {\n\t\treturn\n\t}\n\trsKey, err := controller.KeyFunc(rs)\n\tif err != nil {\n\t\treturn\n\t}\n\tglog.V(4).Infof(\"Pod %s/%s deleted through %v, timestamp %+v: %#v.\", pod.Namespace, pod.Name, utilruntime.GetCaller(), pod.DeletionTimestamp, pod)\n\trsc.expectations.DeletionObserved(rsKey, controller.PodKey(pod))\n\trsc.enqueueReplicaSet(rs)\n}\n\n// this first checks that the pod's key is actually being tracked for the rs\n// DeletionObserved records the given deleteKey as a deletion, for the given rc.\nfunc (u *UIDTrackingControllerExpectations) DeletionObserved(rcKey, deleteKey string) {\n\tu.uidStoreLock.Lock()\n\tdefer u.uidStoreLock.Unlock()\n\n\tuids := u.GetUIDs(rcKey)\n\tif uids != nil && uids.Has(deleteKey) {\n\t\tklog.V(4).Infof(\"Controller %v received delete for pod %v\", rcKey, deleteKey)\n\t\tu.ControllerExpectationsInterface.DeletionObserved(rcKey)\n\t\tuids.Delete(deleteKey)\n\t}\n}\n```\n\n<br>\n\nAs shown above, Pod add, update, and delete all put the owning rs back on the queue.\n\n#### 2.4 addRS\n\nSimply enqueue.\n\n```\nfunc (rsc *ReplicaSetController) addRS(obj interface{}) {\n   rs := obj.(*apps.ReplicaSet)\n   klog.V(4).Infof(\"Adding %s %s/%s\", rsc.Kind, rs.Namespace, rs.Name)\n   rsc.enqueueRS(rs)\n}\n```\n\n#### 2.5 updateRS\n\nIt logs when the desired replica count changed, then enqueues the rs unconditionally; the comment in the code explains why syncing on every trigger is the safer choice.\n\n```\n// 
callback when RS is updated\nfunc (rsc *ReplicaSetController) updateRS(old, cur interface{}) {\n   oldRS := old.(*apps.ReplicaSet)\n   curRS := cur.(*apps.ReplicaSet)\n\n   // You might imagine that we only really need to enqueue the\n   // replica set when Spec changes, but it is safer to sync any\n   // time this function is triggered. That way a full informer\n   // resync can requeue any replica set that don't yet have pods\n   // but whose last attempts at creating a pod have failed (since\n   // we don't block on creation of pods) instead of those\n   // replica sets stalling indefinitely. Enqueueing every time\n   // does result in some spurious syncs (like when Status.Replica\n   // is updated and the watch notification from it retriggers\n   // this function), but in general extra resyncs shouldn't be\n   // that bad as ReplicaSets that haven't met expectations yet won't\n   // sync, and all the listing is done using local stores.\n   if *(oldRS.Spec.Replicas) != *(curRS.Spec.Replicas) {\n      glog.V(4).Infof(\"%v %v updated. 
Desired pod count change: %d->%d\", rsc.Kind, curRS.Name, *(oldRS.Spec.Replicas), *(curRS.Spec.Replicas))\n   }\n   rsc.enqueueReplicaSet(cur)\n}\n```\n\n<br>\n\n#### 2.6 deleteRS\n\nFirst handle the tombstone case, then delete the rs's row from the expectations map, then enqueue the key.\n\nIn my view, the reason every delete handler checks for a tombstone is this:\n\nDeleting a k8s object takes two steps: (1) set deletionTimestamp, which arrives as an update event; (2) remove the object, which arrives as a delete event.\n\nIf the delete event itself is dropped, the informer only notices the disappearance during a relist and delivers the object wrapped in a DeletedFinalStateUnknown tombstone, so the handler has to unwrap it here.\n\n```\nfunc (rsc *ReplicaSetController) deleteRS(obj interface{}) {\n\trs, ok := obj.(*apps.ReplicaSet)\n\tif !ok {\n\t\ttombstone, ok := obj.(cache.DeletedFinalStateUnknown)\n\t\tif !ok {\n\t\t\tutilruntime.HandleError(fmt.Errorf(\"couldn't get object from tombstone %#v\", obj))\n\t\t\treturn\n\t\t}\n\t\trs, ok = tombstone.Obj.(*apps.ReplicaSet)\n\t\tif !ok {\n\t\t\tutilruntime.HandleError(fmt.Errorf(\"tombstone contained object that is not a ReplicaSet %#v\", obj))\n\t\t\treturn\n\t\t}\n\t}\n\n\tkey, err := controller.KeyFunc(rs)\n\tif err != nil {\n\t\tutilruntime.HandleError(fmt.Errorf(\"couldn't get key for object %#v: %v\", rs, err))\n\t\treturn\n\t}\n\n\tklog.V(4).Infof(\"Deleting %s %q\", rsc.Kind, key)\n\n\t// Delete expectations for the ReplicaSet so if we create a new one with the same name it starts clean\n\trsc.expectations.DeleteExpectations(key)\n\n\trsc.queue.Add(key)\n}\n```\n\n### 3. 
The rs sync logic\n\nNext, let's look at how the rs controller processes the objects on its queue.\n\n```\n// Run begins watching and syncing.\nfunc (rsc *ReplicaSetController) Run(workers int, stopCh <-chan struct{}) {\n\tdefer utilruntime.HandleCrash()\n\tdefer rsc.queue.ShutDown()\n\n\tcontrollerName := strings.ToLower(rsc.Kind)\n\tglog.Infof(\"Starting %v controller\", controllerName)\n\tdefer glog.Infof(\"Shutting down %v controller\", controllerName)\n\n\tif !controller.WaitForCacheSync(rsc.Kind, stopCh, rsc.podListerSynced, rsc.rsListerSynced) {\n\t\treturn\n\t}\n\n\tfor i := 0; i < workers; i++ {\n\t\tgo wait.Until(rsc.worker, time.Second, stopCh)\n\t}\n\n\t<-stopCh\n}\n```\n\nThe usual pattern: everything ends up in syncHandler. When NewBaseController was initialized it set\n\n`rsc.syncHandler = rsc.syncReplicaSet`\n\nso syncReplicaSet is what processes the queue items one by one.\n\n```\n// worker runs a worker thread that just dequeues items, processes them, and marks them done.\n// It enforces that the syncHandler is never invoked concurrently with the same key.\nfunc (rsc *ReplicaSetController) worker() {\n\tfor rsc.processNextWorkItem() {\n\t}\n}\n\nfunc (rsc *ReplicaSetController) processNextWorkItem() bool {\n\tkey, quit := rsc.queue.Get()\n\tif quit {\n\t\treturn false\n\t}\n\tdefer rsc.queue.Done(key)\n\n\terr := rsc.syncHandler(key.(string))\n\tif err == nil {\n\t\trsc.queue.Forget(key)\n\t\treturn true\n\t}\n\n\tutilruntime.HandleError(fmt.Errorf(\"Sync %q failed with %v\", key, err))\n\trsc.queue.AddRateLimited(key)\n\n\treturn true\n}\n```\n\n<br>\n\n**syncReplicaSet**\n\n(1) Compute rsNeedsSync: a sync is needed when add<=0 && del<=0 or when the expectations have expired (SatisfiedExpectations)\n\n(2) Get all pods belonging to this rs\n\n(3) If a sync is needed and the rs is not being deleted, call manageReplicas to create/delete pods\n\n(4) Calculate the rs's current status\n\n(5) Update the rs's status\n\n(6) Decide whether the rs should be re-enqueued after a delay\n\n```\n// syncReplicaSet will sync the ReplicaSet with the given key if it has had its expectations fulfilled,\n// meaning it did not expect to see any more of its pods created or deleted. 
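\n```\n\nBefore diving into syncReplicaSet itself, the worker/processNextWorkItem pattern above can be reduced to a channel-based toy. This is my simplification; the real workqueue additionally does rate limiting, deduplication, and per-key serialization:\n\n```\npackage main\n\nimport \"fmt\"\n\n// process drains the queue, calling syncHandler on each key and\n// re-queueing keys whose sync failed, like processNextWorkItem.\nfunc process(queue chan string, syncHandler func(string) error) []string {\n\tvar done []string\n\tfor {\n\t\tselect {\n\t\tcase key := <-queue:\n\t\t\tif err := syncHandler(key); err != nil {\n\t\t\t\tqueue <- key // AddRateLimited in the real code\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tdone = append(done, key)\n\t\tdefault:\n\t\t\treturn done\n\t\t}\n\t}\n}\n\nfunc main() {\n\tqueue := make(chan string, 8)\n\tqueue <- \"default/zx1\"\n\tqueue <- \"zx/zx1\"\n\tfailedOnce := map[string]bool{}\n\tsynced := process(queue, func(key string) error {\n\t\t// fail each key once to show the retry path\n\t\tif !failedOnce[key] {\n\t\t\tfailedOnce[key] = true\n\t\t\treturn fmt.Errorf(\"transient error for %s\", key)\n\t\t}\n\t\treturn nil\n\t})\n\tfmt.Println(synced)\n}\n```\n\nNote how a failed key goes back on the queue rather than being dropped; the real controller relies on the rate limiter to space out those retries.\n\n```\n// syncReplicaSet will sync the ReplicaSet with the given key if it has had its expectations fulfilled,\n// meaning it did not expect to see any more of its pods created or deleted. 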
This function is not meant to be\n// invoked concurrently with the same key.\nfunc (rsc *ReplicaSetController) syncReplicaSet(key string) error {\n\n\tstartTime := time.Now()\n\tdefer func() {\n\t\tglog.V(4).Infof(\"Finished syncing %v %q (%v)\", rsc.Kind, key, time.Since(startTime))\n\t}()\n\n\tnamespace, name, err := cache.SplitMetaNamespaceKey(key)\n\tif err != nil {\n\t\treturn err\n\t}\n\trs, err := rsc.rsLister.ReplicaSets(namespace).Get(name)\n\tif errors.IsNotFound(err) {\n\t\tglog.V(4).Infof(\"%v %v has been deleted\", rsc.Kind, key)\n\t\trsc.expectations.DeleteExpectations(key)\n\t\treturn nil\n\t}\n\tif err != nil {\n\t\treturn err\n\t}\n\n    // 1. Decide whether a sync is needed; this calls SatisfiedExpectations\n\trsNeedsSync := rsc.expectations.SatisfiedExpectations(key)\n\tselector, err := metav1.LabelSelectorAsSelector(rs.Spec.Selector)\n\tif err != nil {\n\t\tutilruntime.HandleError(fmt.Errorf(\"Error converting pod selector to selector: %v\", err))\n\t\treturn nil\n\t}\n\t\n\t// 2. Get all pods in the namespace\n\t// list all pods to include the pods that don't match the rs`s selector\n\t// anymore but has the stale controller ref.\n\t// TODO: Do the List and Filter in a single pass, or use an index.\n\tallPods, err := rsc.podLister.Pods(rs.Namespace).List(labels.Everything())\n\tif err != nil {\n\t\treturn err\n\t}\n\t// 2.1 Filter out inactive pods\n\t// Ignore inactive pods.\n\tvar filteredPods []*v1.Pod\n\tfor _, pod := range allPods {\n\t\tif controller.IsPodActive(pod) {\n\t\t\tfilteredPods = append(filteredPods, pod)\n\t\t}\n\t}\n\n\t// NOTE: filteredPods are pointing to objects from cache - if you need to\n\t// modify them, you need to copy it first.\n\t// 2.2 Reconcile ownership to get the pods that truly belong to this rs\n\tfilteredPods, err = rsc.claimPods(rs, selector, filteredPods)\n\tif err != nil {\n\t\treturn err\n\t}\n\t\n\t// 3. 
If a sync is needed and the rs is not being deleted, call manageReplicas to create/delete pods\n\tvar manageReplicasErr error\n\tif rsNeedsSync && rs.DeletionTimestamp == nil {\n\t\tmanageReplicasErr = rsc.manageReplicas(filteredPods, rs)\n\t}\n\t\n\t// 4. Calculate the rs's current status\n\trs = rs.DeepCopy()\n\tnewStatus := calculateStatus(rs, filteredPods, manageReplicasErr)\n\n\t// Always updates status as pods come up or die.\n\t// 5. Update the status\n\tupdatedRS, err := updateReplicaSetStatus(rsc.kubeClient.AppsV1().ReplicaSets(rs.Namespace), rs, newStatus)\n\tif err != nil {\n\t\t// Multiple things could lead to this update failing. Requeuing the replica set ensures\n\t\t// Returning an error causes a requeue without forcing a hotloop\n\t\treturn err\n\t}\n\t\n\t// 6. Decide whether the rs should be re-enqueued after a delay. The criterion is simple: ReadyReplicas is already satisfied but AvailableReplicas is not yet, so some pods must still be waiting out MinReadySeconds\n\t// Resync the ReplicaSet after MinReadySeconds as a last line of defense to guard against clock-skew.\n\tif manageReplicasErr == nil && updatedRS.Spec.MinReadySeconds > 0 &&\n\t\tupdatedRS.Status.ReadyReplicas == *(updatedRS.Spec.Replicas) &&\n\t\tupdatedRS.Status.AvailableReplicas != *(updatedRS.Spec.Replicas) {\n\t\trsc.enqueueReplicaSetAfter(updatedRS, time.Duration(updatedRS.Spec.MinReadySeconds)*time.Second)\n\t}\n\treturn manageReplicasErr\n}\n```\n\n<br>\n\n#### 3.1 Filtering pods\n\n(1) Filter out inactive pods: pods whose phase is PodSucceeded or PodFailed, or whose DeletionTimestamp is non-nil\n\n(2) Reconcile ownership to get the list of pods that truly belong to this rs\n\n<br>\n\nadopt binds the rs to a pod whose labels now match the selector (they previously did not)\n\nrelease drops the existing binding when the labels matched before but no longer do\n\n```\nfunc IsPodActive(p *v1.Pod) bool {\n\treturn v1.PodSucceeded != p.Status.Phase &&\n\t\tv1.PodFailed != p.Status.Phase &&\n\t\tp.DeletionTimestamp == nil\n}\n\n\nfunc (rsc *ReplicaSetController) claimPods(rs *apps.ReplicaSet, selector labels.Selector, filteredPods []*v1.Pod) ([]*v1.Pod, error) {\n\t// If any adoptions are attempted, we should first recheck for deletion with\n\t// an uncached quorum read sometime after listing Pods (see #42639).\n\tcanAdoptFunc := 
controller.RecheckDeletionTimestamp(func() (metav1.Object, error) {\n\t\tfresh, err := rsc.kubeClient.AppsV1().ReplicaSets(rs.Namespace).Get(rs.Name, metav1.GetOptions{})\n\t\tif err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t\tif fresh.UID != rs.UID {\n\t\t\treturn nil, fmt.Errorf(\"original %v %v/%v is gone: got uid %v, wanted %v\", rsc.Kind, rs.Namespace, rs.Name, fresh.UID, rs.UID)\n\t\t}\n\t\treturn fresh, nil\n\t})\n\tcm := controller.NewPodControllerRefManager(rsc.podControl, rs, selector, rsc.GroupVersionKind, canAdoptFunc)\n\treturn cm.ClaimPods(filteredPods)\n}\n```\n\n```\n// ClaimPods tries to take ownership of a list of Pods.\n//\n// It will reconcile the following:\n//   * Adopt orphans if the selector matches.\n//   * Release owned objects if the selector no longer matches.\n//\n// Optional: If one or more filters are specified, a Pod will only be claimed if\n// all filters return true.\n//\n// A non-nil error is returned if some form of reconciliation was attempted and\n// failed. Usually, controllers should try again later in case reconciliation\n// is still needed.\n//\n// If the error is nil, either the reconciliation succeeded, or no\n// reconciliation was necessary. 
The list of Pods that you now own is returned.\nfunc (m *PodControllerRefManager) ClaimPods(pods []*v1.Pod, filters ...func(*v1.Pod) bool) ([]*v1.Pod, error) {\n   var claimed []*v1.Pod\n   var errlist []error\n\n   match := func(obj metav1.Object) bool {\n      pod := obj.(*v1.Pod)\n      // Check selector first so filters only run on potentially matching Pods.\n      if !m.Selector.Matches(labels.Set(pod.Labels)) {\n         return false\n      }\n      for _, filter := range filters {\n         if !filter(pod) {\n            return false\n         }\n      }\n      return true\n   }\n   adopt := func(obj metav1.Object) error {\n      return m.AdoptPod(obj.(*v1.Pod))\n   }\n   release := func(obj metav1.Object) error {\n      return m.ReleasePod(obj.(*v1.Pod))\n   }\n\n   for _, pod := range pods {\n      ok, err := m.ClaimObject(pod, match, adopt, release)\n      if err != nil {\n         errlist = append(errlist, err)\n         continue\n      }\n      if ok {\n         claimed = append(claimed, pod)\n      }\n   }\n   return claimed, utilerrors.NewAggregate(errlist)\n}\n```\n\n<br>\n\n#### 3.2 manageReplicas\n\n(1) Compute the gap between the current pod count and the desired pod count\n\n(2) Create or delete pods accordingly\n\n```\nfunc (rsc *ReplicaSetController) manageReplicas(......) 
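error\n```\n\nThe arithmetic of step 1 is worth isolating: diff = len(filteredPods) - replicas, negated when pods are missing and clamped to burstReplicas (500 per sync round). A standalone sketch, with function and variable names of my own choosing:\n\n```\npackage main\n\nimport \"fmt\"\n\nconst burstReplicas = 500 // an rs creates/deletes at most this many pods per sync\n\n// creationsNeeded mirrors the diff<0 branch of manageReplicas:\n// how many pods to create this round, clamped to burstReplicas.\nfunc creationsNeeded(current, desired int) int {\n\tdiff := current - desired\n\tif diff >= 0 {\n\t\treturn 0\n\t}\n\tdiff *= -1\n\tif diff > burstReplicas {\n\t\tdiff = burstReplicas\n\t}\n\treturn diff\n}\n\nfunc main() {\n\tfmt.Println(creationsNeeded(2, 10))   // 8 pods to create\n\tfmt.Println(creationsNeeded(0, 2000)) // clamped to 500\n\tfmt.Println(creationsNeeded(5, 3))    // 0: handled by the deletion branch\n}\n```\n\n```\nfunc (rsc *ReplicaSetController) manageReplicas(......) 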
error {\n    // 1. Compute the gap between the current and desired pod counts\n    diff := len(filteredPods) - int(*(rs.Spec.Replicas))\n    rsKey, err := controller.KeyFunc(rs)\n    if err != nil {\n        ......\n    }\n    // 2. diff<0 means pods must be created\n    if diff < 0 {\n        diff *= -1\n        // 2.1 at most 500 pods (burstReplicas) are created in one round\n        if diff > rsc.burstReplicas {\n            diff = rsc.burstReplicas\n        }\n        // 2.2 overwrite the expectations row: only diff creations are now expected\n        rsc.expectations.ExpectCreations(rsKey, diff)\n        // 2.3 create the pods via slowStartBatch\n        successfulCreations, err := slowStartBatch(diff, controller.SlowStartInitialBatchSize, func() error {\n            err := rsc.podControl.CreatePodsWithControllerRef(rs.Namespace, &rs.Spec.Template, rs, metav1.NewControllerRef(rs, rsc.GroupVersionKind))\n            if err != nil && errors.IsTimeout(err) {\n                return nil\n            }\n            return err\n        })\n        // 2.4 adjust the expectations for the creations that were skipped\n        if skippedPods := diff - successfulCreations; skippedPods > 0 {\n            for i := 0; i < skippedPods; i++ {\n                rsc.expectations.CreationObserved(rsKey)\n            }\n        }\n        return err\n    } else if diff > 0 {\n        // 3. 
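deletion branch\n```\n\nThe deletion branch below fans out one goroutine per pod and collects at most one error via a buffered channel. The same pattern in isolation, with a toy deleteFn of my own in place of podControl:\n\n```\npackage main\n\nimport (\n\t\"fmt\"\n\t\"sync\"\n)\n\n// deleteAll mirrors manageReplicas' delete loop: run deleteFn for every\n// name concurrently, gather errors on a buffered channel, and report\n// the first one (if any) after all goroutines finish.\nfunc deleteAll(names []string, deleteFn func(string) error) error {\n\terrCh := make(chan error, len(names))\n\tvar wg sync.WaitGroup\n\twg.Add(len(names))\n\tfor _, name := range names {\n\t\tgo func(target string) {\n\t\t\tdefer wg.Done()\n\t\t\tif err := deleteFn(target); err != nil {\n\t\t\t\terrCh <- err\n\t\t\t}\n\t\t}(name)\n\t}\n\twg.Wait()\n\tselect {\n\tcase err := <-errCh:\n\t\treturn err\n\tdefault:\n\t\treturn nil\n\t}\n}\n\nfunc main() {\n\tpods := []string{\"zx1-abc\", \"zx1-def\", \"zx1-ghi\"}\n\terr := deleteAll(pods, func(name string) error {\n\t\tif name == \"zx1-def\" {\n\t\t\treturn fmt.Errorf(\"failed to delete %s\", name)\n\t\t}\n\t\treturn nil\n\t})\n\tfmt.Println(err)\n}\n```\n\nBuffering the channel to the number of goroutines is what lets every failing goroutine send without blocking even though only one error is ever read.\n\n```\n    } else if diff > 0 {\n        // 3. 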
If pods must be deleted, likewise at most 500 can be deleted in one round\n        if diff > rsc.burstReplicas {\n            diff = rsc.burstReplicas\n        }\n        // 3.1 pick the list of pods to delete; this is priority-ordered\n        podsToDelete := getPodsToDelete(filteredPods, diff)\n        // 3.2 overwrite the expectations row\n        rsc.expectations.ExpectDeletions(rsKey, getPodKeys(podsToDelete))\n        // 3.3 delete concurrently\n        errCh := make(chan error, diff)\n        var wg sync.WaitGroup\n        wg.Add(diff)\n        for _, pod := range podsToDelete {\n            go func(targetPod *v1.Pod) {\n                defer wg.Done()\n                if err := rsc.podControl.DeletePod(rs.Namespace, targetPod.Name, rs); err != nil {\n                    podKey := controller.PodKey(targetPod)\n                    rsc.expectations.DeletionObserved(rsKey, podKey)\n                    errCh <- err\n                }\n            }(pod)\n        }\n        wg.Wait()\n        select {\n        case err := <-errCh:\n            if err != nil {\n                return err\n            }\n        default:\n        }\n    }\n    return nil\n}\n```\n\n<br>\n\n##### 3.2.1 Creating pods\n\n`slowStartBatch` creates pods in batches of 1, 2, 4, 8, ..., growing by a factor of 2. If any batch fails, it returns immediately with the number of pods created so far.\n\n```\nfunc slowStartBatch(count int, initialBatchSize int, fn func() error) (int, error) {\n    remaining := count\n    successes := 0\n    for batchSize := integer.IntMin(remaining, initialBatchSize); batchSize > 0; batchSize = integer.IntMin(2*batchSize, remaining) {\n        errCh := make(chan error, batchSize)\n        var wg sync.WaitGroup\n        wg.Add(batchSize)\n        for i := 0; i < batchSize; i++ {\n            go func() {\n                defer wg.Done()\n                if err := fn(); err != nil {\n                    errCh <- err\n                }\n            }()\n        }\n        wg.Wait()\n        curSuccesses := batchSize - len(errCh)\n        successes += curSuccesses\n        if len(errCh) > 0 {\n            return successes, <-errCh\n        }\n        remaining -= batchSize\n    }\n    return 
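successes, nil\n}\n```\n\nTo see the 1, 2, 4, 8 rhythm concretely, here is a standalone run of the same slow-start loop bounds, reimplemented by me without the controller types; `batchSizes` simply records each batch size:\n\n```\npackage main\n\nimport \"fmt\"\n\nfunc min(a, b int) int {\n\tif a < b {\n\t\treturn a\n\t}\n\treturn b\n}\n\n// batchSizes reproduces slowStartBatch's loop bounds: batches double\n// from initialBatchSize but never exceed what remains.\nfunc batchSizes(count, initialBatchSize int) []int {\n\tvar batches []int\n\tremaining := count\n\tfor batchSize := min(remaining, initialBatchSize); batchSize > 0; batchSize = min(2*batchSize, remaining) {\n\t\tbatches = append(batches, batchSize)\n\t\tremaining -= batchSize\n\t}\n\treturn batches\n}\n\nfunc main() {\n\t// creating 20 pods with SlowStartInitialBatchSize = 1\n\tfmt.Println(batchSizes(20, 1))\n}\n```\n\nFor 20 pods this yields batches of 1, 2, 4, 8 and a final batch of 5, so one bad pod template costs at most one small batch of failed API calls before the function bails out.\n\n```\n    return 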
successes, nil\n}\n```\n\n<br>\n\n##### 3.2.2 Deleting pods\n\nPods are ranked for deletion and removed from the highest rank down; the higher a pod ranks, the more it deserves to be deleted. Rank is decided by the following rules, in order:\n\n(1) pods not yet bound to a node rank above bound ones\n\n(2) pods in PodPending rank above PodUnknown, which ranks above PodRunning\n\n(3) not-ready pods rank above ready ones\n\n(4) among ready pods, the one that became ready more recently ranks higher\n\n(5) pods whose containers restarted more times rank higher\n\n(6) more recently created pods rank higher\n\n```\nfunc getPodsToDelete(filteredPods []*v1.Pod, diff int) []*v1.Pod {\n    if diff < len(filteredPods) {\n        sort.Sort(controller.ActivePods(filteredPods))\n    }\n    return filteredPods[:diff]\n}\n```\n\n```\ntype ActivePods []*v1.Pod\nfunc (s ActivePods) Len() int      { return len(s) }\nfunc (s ActivePods) Swap(i, j int) { s[i], s[j] = s[j], s[i] }\nfunc (s ActivePods) Less(i, j int) bool {\n    // 1. unassigned pods sort before assigned ones\n    if s[i].Spec.NodeName != s[j].Spec.NodeName && (len(s[i].Spec.NodeName) == 0 || len(s[j].Spec.NodeName) == 0) {\n        return len(s[i].Spec.NodeName) == 0\n    }\n    // 2. PodPending sorts before PodUnknown, which sorts before PodRunning\n    m := map[v1.PodPhase]int{v1.PodPending: 0, v1.PodUnknown: 1, v1.PodRunning: 2}\n    if m[s[i].Status.Phase] != m[s[j].Status.Phase] {\n        return m[s[i].Status.Phase] < m[s[j].Status.Phase]\n    }\n    // 3. not-ready sorts before ready\n    if podutil.IsPodReady(s[i]) != podutil.IsPodReady(s[j]) {\n        return !podutil.IsPodReady(s[i])\n    }\n    // 4. pods that became ready more recently sort first\n    if podutil.IsPodReady(s[i]) && podutil.IsPodReady(s[j]) && !podReadyTime(s[i]).Equal(podReadyTime(s[j])) {\n        return afterOrZero(podReadyTime(s[i]), podReadyTime(s[j]))\n    }\n    // 5. pods with more container restarts sort first\n    if maxContainerRestarts(s[i]) != maxContainerRestarts(s[j]) {\n        return maxContainerRestarts(s[i]) > maxContainerRestarts(s[j])\n    }\n    // 6. 
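more recently created pods sort first\n```\n\nThe same \"rank, then slice\" idea can be shown with a reduced two-rule ranking (unassigned first, then phase), using sort.Slice from the standard library. The phase strings and pod struct are simplifications of mine, not the real API types:\n\n```\npackage main\n\nimport (\n\t\"fmt\"\n\t\"sort\"\n)\n\ntype pod struct {\n\tname     string\n\tnodeName string // empty = not yet scheduled\n\tphase    string // \"Pending\" < \"Unknown\" < \"Running\" in deletion rank\n}\n\nvar phaseRank = map[string]int{\"Pending\": 0, \"Unknown\": 1, \"Running\": 2}\n\n// podsToDelete sorts the most-deletable pods first, then slices off diff,\n// mirroring getPodsToDelete.\nfunc podsToDelete(pods []pod, diff int) []pod {\n\tsort.Slice(pods, func(i, j int) bool {\n\t\t// rule 1: unassigned pods are deleted first\n\t\tif (pods[i].nodeName == \"\") != (pods[j].nodeName == \"\") {\n\t\t\treturn pods[i].nodeName == \"\"\n\t\t}\n\t\t// rule 2: Pending before Unknown before Running\n\t\treturn phaseRank[pods[i].phase] < phaseRank[pods[j].phase]\n\t})\n\treturn pods[:diff]\n}\n\nfunc main() {\n\tpods := []pod{\n\t\t{name: \"a\", nodeName: \"node1\", phase: \"Running\"},\n\t\t{name: \"b\", nodeName: \"\", phase: \"Pending\"},\n\t\t{name: \"c\", nodeName: \"node2\", phase: \"Unknown\"},\n\t}\n\tfor _, p := range podsToDelete(pods, 2) {\n\t\tfmt.Println(p.name)\n\t}\n}\n```\n\nThe real Less chains all six rules, each one acting only as a tie-breaker for the rules above it.\n\n```\n    // 6. 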
pod 创建时间越晚（越新），优先级越高\n    if !s[i].CreationTimestamp.Equal(&s[j].CreationTimestamp) {\n        return afterOrZero(&s[i].CreationTimestamp, &s[j].CreationTimestamp)\n    }\n    return false\n}\n```\n\n<br>\n\n#### 3.3 calculateStatus\n\ncalculateStatus 会通过当前 pod 的状态计算出 rs 中 status 字段的值，status 各字段含义如下：\n\n* replicas：实际的 pod 副本数\n* availableReplicas：当前可用的 pod 副本数量（有的副本可能还未准备好，或处于初始化状态）\n* readyReplicas：处于 ready 状态的 pod 副本数量\n* fullyLabeledReplicas：标签与该 ReplicaSet 的 pod template 标签完全匹配的副本数量，是另一个维度的统计\n\n```\n任意一个 rs 的 status 示例：\nstatus:\n  availableReplicas: 1\n  fullyLabeledReplicas: 1\n  observedGeneration: 1\n  readyReplicas: 1\n  replicas: 1\n```\n\n\n\n```\nfunc calculateStatus(rs *apps.ReplicaSet, filteredPods []*v1.Pod, manageReplicasErr error) apps.ReplicaSetStatus {\n\tnewStatus := rs.Status\n\t// Count the number of pods that have labels matching the labels of the pod\n\t// template of the replica set, the matching pods may have more\n\t// labels than are in the template. Because the label of podTemplateSpec is\n\t// a superset of the selector of the replica set, so the possible\n\t// matching pods must be part of the filteredPods.\n\tfullyLabeledReplicasCount := 0\n\treadyReplicasCount := 0\n\tavailableReplicasCount := 0\n\ttemplateLabel := labels.Set(rs.Spec.Template.Labels).AsSelectorPreValidated()\n\tfor _, pod := range filteredPods {\n\t\tif templateLabel.Matches(labels.Set(pod.Labels)) {\n\t\t\tfullyLabeledReplicasCount++\n\t\t}\n\t\tif podutil.IsPodReady(pod) {\n\t\t\treadyReplicasCount++\n\t\t\tif podutil.IsPodAvailable(pod, rs.Spec.MinReadySeconds, metav1.Now()) {\n\t\t\t\tavailableReplicasCount++\n\t\t\t}\n\t\t}\n\t}\n\n\tfailureCond := GetCondition(rs.Status, apps.ReplicaSetReplicaFailure)\n\tif manageReplicasErr != nil && failureCond == nil {\n\t\tvar reason string\n\t\tif diff := len(filteredPods) - int(*(rs.Spec.Replicas)); diff < 0 {\n\t\t\treason = \"FailedCreate\"\n\t\t} else if diff > 0 {\n\t\t\treason = \"FailedDelete\"\n\t\t}\n\t\tcond := 
NewReplicaSetCondition(apps.ReplicaSetReplicaFailure, v1.ConditionTrue, reason, manageReplicasErr.Error())\n\t\tSetCondition(&newStatus, cond)\n\t} else if manageReplicasErr == nil && failureCond != nil {\n\t\tRemoveCondition(&newStatus, apps.ReplicaSetReplicaFailure)\n\t}\n\n\tnewStatus.Replicas = int32(len(filteredPods))\n\tnewStatus.FullyLabeledReplicas = int32(fullyLabeledReplicasCount)\n\tnewStatus.ReadyReplicas = int32(readyReplicasCount)\n\tnewStatus.AvailableReplicas = int32(availableReplicasCount)\n\treturn newStatus\n}\n```\n\n<br>\n\n### 4 总结\n\n（1）expectations确实是一个很巧妙的方法，这种思想可以借鉴\n\n（2）rs根本不感知deploy的存在"
  },
  {
    "path": "k8s/kcm/10-kcm-NodeLifecycleController源码分析.md",
    "content": "* [1\\. startNodeLifecycleController](#1-startnodelifecyclecontroller)\n* [2\\. NewNodeLifecycleController](#2-newnodelifecyclecontroller)\n  * [2\\.1 NodeLifecycleController结构体介绍](#21-nodelifecyclecontroller结构体介绍)\n  * [2\\.2 NewNodeLifecycleController](#22-newnodelifecyclecontroller)\n* [3\\. NodeLifecycleController\\.run](#3-nodelifecyclecontrollerrun)\n  * [3\\.1 nc\\.taintManager\\.Run](#31-nctaintmanagerrun)\n    * [3\\.1\\.1 worker处理](#311-worker处理)\n    * [3\\.1\\.2 handleNodeUpdate](#312-handlenodeupdate)\n      * [3\\.1\\.2\\.1 processPodOnNode](#3121-processpodonnode)\n    * [3\\.1\\.3 handlePodUpdate](#313-handlepodupdate)\n    * [3\\.1\\.3 nc\\.taintManager\\.Run总结](#313-nctaintmanagerrun总结)\n  * [3\\.2 doNodeProcessingPassWorker](#32-donodeprocessingpassworker)\n    * [3\\.2\\.1 doNoScheduleTaintingPass](#321-donoscheduletaintingpass)\n    * [3\\.2\\.2 reconcileNodeLabels](#322-reconcilenodelabels)\n  * [3\\.3 doPodProcessingWorker](#33-dopodprocessingworker)\n    * [3\\.3\\.1 processNoTaintBaseEviction](#331-processnotaintbaseeviction)\n  * [3\\.4 doEvictionPass(if useTaintBasedEvictions==false)](#34-doevictionpassif-usetaintbasedevictionsfalse)\n  * [3\\.5 doNoExecuteTaintingPass(if useTaintBasedEvictions==true)](#35-donoexecutetaintingpassif-usetaintbasedevictionstrue)\n  * [3\\.6 monitorNodeHealth](#36-monitornodehealth)\n    * [3\\.6\\.1  node分类并初始化](#361--node分类并初始化)\n    * [3\\.6\\.2 处理node status](#362-处理node-status)\n      * [3\\.6\\.3 集群健康状态处理](#363-集群健康状态处理)\n* [4 总结](#4-总结)\n\n代码版本：1.17.4\n\n### 1. 
startNodeLifecycleController\n\n可以看到 startNodeLifecycleController 就是分为 2 个步骤：\n\n* NewNodeLifecycleController：构造 NodeLifecycleController\n* NodeLifecycleController.Run：启动 controller\n\n```\nfunc startNodeLifecycleController(ctx ControllerContext) (http.Handler, bool, error) {\n\tlifecycleController, err := lifecyclecontroller.NewNodeLifecycleController(\n\t\tctx.InformerFactory.Coordination().V1().Leases(),\n\t\tctx.InformerFactory.Core().V1().Pods(),\n\t\tctx.InformerFactory.Core().V1().Nodes(),\n\t\tctx.InformerFactory.Apps().V1().DaemonSets(),\n\t\t// node lifecycle controller uses existing cluster role from node-controller\n\t\tctx.ClientBuilder.ClientOrDie(\"node-controller\"),\n\t\t\n\t\t// 就是node-monitor-period参数\n\t\tctx.ComponentConfig.KubeCloudShared.NodeMonitorPeriod.Duration,   \n\t\t\n\t\t// 就是node-startup-grace-period参数\n\t\tctx.ComponentConfig.NodeLifecycleController.NodeStartupGracePeriod.Duration,\n\t\t\n\t  // 就是node-monitor-grace-period参数\n\t\tctx.ComponentConfig.NodeLifecycleController.NodeMonitorGracePeriod.Duration,\n\t\t\n\t\t// 就是pod-eviction-timeout参数\n\t\tctx.ComponentConfig.NodeLifecycleController.PodEvictionTimeout.Duration,\n\t\t\n\t\t// 就是node-eviction-rate参数\n\t\tctx.ComponentConfig.NodeLifecycleController.NodeEvictionRate,\n\t\t\n\t\t// 就是secondary-node-eviction-rate参数\n\t\tctx.ComponentConfig.NodeLifecycleController.SecondaryNodeEvictionRate,\n\t\t\n\t\t// 就是large-cluster-size-threshold参数\n\t\tctx.ComponentConfig.NodeLifecycleController.LargeClusterSizeThreshold,\n\t\t\n\t\t// 就是unhealthy-zone-threshold参数\n\t\tctx.ComponentConfig.NodeLifecycleController.UnhealthyZoneThreshold,\n\t\t\n\t\t// 就是enable-taint-manager参数  （默认打开的）\n\t\tctx.ComponentConfig.NodeLifecycleController.EnableTaintManager,\n\t\t\n\t\t// 就是这个是否打开--feature-gates=TaintBasedEvictions=true （默认打开的）\n\t\tutilfeature.DefaultFeatureGate.Enabled(features.TaintBasedEvictions),\n\t)\n\tif err != nil {\n\t\treturn nil, true, err\n\t}\n\tgo lifecycleController.Run(ctx.Stop)\n\treturn nil, true, nil\n}\n\n```\n\n具体参数介绍：\n\n* 
`enable-taint-manager`：默认 true。表示启用 taint manager，由它处理 NoExecute 污点并驱逐 pod\n* `large-cluster-size-threshold`：默认 50。基于这个阈值判断所在集群是否为大规模集群；当集群规模小于等于这个值时，会将 --secondary-node-eviction-rate 参数强制置为 0\n* `secondary-node-eviction-rate`：默认 0.01。二级驱逐速率：当 zone unhealthy（集群中宕机节点过多）时，驱逐速率降为每秒 0.01 个 node\n* `node-eviction-rate`：默认 0.1。驱逐速率，即驱逐 node 的速率，由令牌桶流控算法实现，即每秒驱逐 0.1 个节点。注意这里不是驱逐 pod 的速率，而是驱逐节点的速率，相当于每隔 10s 清空一个节点\n* `node-monitor-grace-period`：默认 40s。node 多久没有响应就认为 node unhealthy\n* `node-startup-grace-period`：默认 1 分钟。允许刚启动的 node 多久未响应而不被认为 unhealthy\n* `pod-eviction-timeout`：默认 5min。node unhealthy 后多久开始删除其上的 pod（只在 taint manager 未启用时生效）\n* `unhealthy-zone-threshold`：默认 55%。unhealthy node 达到多大比例时认为整个 zone unhealthy\n\n<br>\n\n### 2. NewNodeLifecycleController\n\n#### 2.1 NodeLifecycleController结构体介绍\n\n```\n// Controller is the controller that manages node's life cycle.\ntype Controller struct {\n  // taintManager监听节点的Taint/Toleration变化，用于驱逐pod\n\ttaintManager *scheduler.NoExecuteTaintManager\n  \n  // 监听pod\n\tpodLister         corelisters.PodLister\n\tpodInformerSynced cache.InformerSynced\n\tkubeClient        clientset.Interface\n\n\t// This timestamp is to be used instead of LastProbeTime stored in Condition. 
We do this\n\t// to avoid the problem with time skew across the cluster.\n\tnow func() metav1.Time\n\t\n\t// 根据集群是否为大集群返回驱逐速率：大集群返回secondary-node-eviction-rate参数值，否则返回0\n\tenterPartialDisruptionFunc func(nodeNum int) float32\n\t\n\t// 返回evictionLimiterQPS参数\n\tenterFullDisruptionFunc    func(nodeNum int) float32\n\t\n\t// 返回集群中notReady的node数量，以及用于判断zone是否健康的ZoneState。利用了unhealthyZoneThreshold参数\n\tcomputeZoneStateFunc       func(nodeConditions []*v1.NodeCondition) (int, ZoneState)\n\t\n\t// node map\n\tknownNodeSet map[string]*v1.Node\n\t\n\t// node健康信息map表\n\t// per Node map storing last observed health together with a local time when it was observed.\n\tnodeHealthMap *nodeHealthMap\n\t\n\t\n\t// evictorLock protects zonePodEvictor and zoneNoExecuteTainter.\n\t// TODO(#83954): API calls shouldn't be executed under the lock.\n\tevictorLock     sync.Mutex\n\t\n\t// 存放node上pod是否已经执行驱逐的状态，从这里读取node eviction的状态（toBeEvicted、evicted等）\n\tnodeEvictionMap *nodeEvictionMap\n\t// workers that evicts pods from unresponsive nodes.\n\t\n\t// zone的需要pod evictor的node列表\n\tzonePodEvictor map[string]*scheduler.RateLimitedTimedQueue\n\t\n\t// 存放需要更新taint的unready node列表--令牌桶队列\n\t// workers that are responsible for tainting nodes.\n\tzoneNoExecuteTainter map[string]*scheduler.RateLimitedTimedQueue\n\t\n\t// 重试列表\n\tnodesToRetry sync.Map\n\t\n\t// 存放每个zone的健康状态,有stateFullDisruption、statePartialDisruption、stateNormal、stateInitial\n\tzoneStates map[string]ZoneState\n\t\n\t// 监听ds相关\n\tdaemonSetStore          appsv1listers.DaemonSetLister\n\tdaemonSetInformerSynced cache.InformerSynced\n\t\n\t// 监听lease、node相关\n\tleaseLister         coordlisters.LeaseLister\n\tleaseInformerSynced cache.InformerSynced\n\tnodeLister          corelisters.NodeLister\n\tnodeInformerSynced  cache.InformerSynced\n  \n\tgetPodsAssignedToNode func(nodeName string) ([]*v1.Pod, error)\n\n\trecorder record.EventRecorder\n\t\n\t// 之前提到的那组参数\n\t// Value controlling Controller monitoring 
period, i.e. how often does Controller\n\t// check node health signal posted from kubelet. This value should be lower than\n\t// nodeMonitorGracePeriod.\n\t// TODO: Change node health monitor to watch based.\n\tnodeMonitorPeriod time.Duration\n\t\n\t// When node is just created, e.g. cluster bootstrap or node creation, we give\n\t// a longer grace period.\n\tnodeStartupGracePeriod time.Duration\n\n\t// Controller will not proactively sync node health, but will monitor node\n\t// health signal updated from kubelet. There are 2 kinds of node healthiness\n\t// signals: NodeStatus and NodeLease. NodeLease signal is generated only when\n\t// NodeLease feature is enabled. If it doesn't receive update for this amount\n\t// of time, it will start posting \"NodeReady==ConditionUnknown\". The amount of\n\t// time before which Controller start evicting pods is controlled via flag\n\t// 'pod-eviction-timeout'.\n\t// Note: be cautious when changing the constant, it must work with\n\t// nodeStatusUpdateFrequency in kubelet and renewInterval in NodeLease\n\t// controller. The node health signal update frequency is the minimal of the\n\t// two.\n\t// There are several constraints:\n\t// 1. nodeMonitorGracePeriod must be N times more than  the node health signal\n\t//    update frequency, where N means number of retries allowed for kubelet to\n\t//    post node status/lease. It is pointless to make nodeMonitorGracePeriod\n\t//    be less than the node health signal update frequency, since there will\n\t//    only be fresh values from Kubelet at an interval of node health signal\n\t//    update frequency. The constant must be less than podEvictionTimeout.\n\t// 2. 
nodeMonitorGracePeriod can't be too large for user experience - larger\n\t//    value takes longer for user to see up-to-date node health.\n\tnodeMonitorGracePeriod time.Duration\n\n\tpodEvictionTimeout          time.Duration\n\tevictionLimiterQPS          float32\n\tsecondaryEvictionLimiterQPS float32\n\tlargeClusterThreshold       int32\n\tunhealthyZoneThreshold      float32\n\n\t// if set to true Controller will start TaintManager that will evict Pods from\n\t// tainted nodes, if they're not tolerated.\n\trunTaintManager bool\n\n\t// if set to true Controller will taint Nodes with 'TaintNodeNotReady' and 'TaintNodeUnreachable'\n\t// taints instead of evicting Pods itself.\n\tuseTaintBasedEvictions bool\n  \n  // pod, node队列\n\tnodeUpdateQueue workqueue.Interface\n\tpodUpdateQueue  workqueue.RateLimitingInterface\n}\n```\n\n<br>\n\n#### 2.2 NewNodeLifecycleController\n\n核心逻辑如下：\n\n（1）根据参数初始化Controller\n\n（2）定义了pod的监听处理逻辑。都是先nc.podUpdated，如果enable-taint-manager=true,还会经过nc.taintManager.PodUpdated函数处理\n\n（3）实现找出所有node上pod的函数\n\n（4）如果enable-taint-manager=true，node有变化都需要经过 nc.taintManager.NodeUpdated函数\n\n（5）实现node的监听处理，这里不管开没开taint-manager，都是要监听\n\n（6）实现node, ds, lease的list，用于获取对象\n\n```\n// NewNodeLifecycleController returns a new taint controller.\nfunc NewNodeLifecycleController(\n\tleaseInformer coordinformers.LeaseInformer,\n\tpodInformer coreinformers.PodInformer,\n\tnodeInformer coreinformers.NodeInformer,\n\tdaemonSetInformer appsv1informers.DaemonSetInformer,\n\tkubeClient clientset.Interface,\n\tnodeMonitorPeriod time.Duration,\n\tnodeStartupGracePeriod time.Duration,\n\tnodeMonitorGracePeriod time.Duration,\n\tpodEvictionTimeout time.Duration,\n\tevictionLimiterQPS float32,\n\tsecondaryEvictionLimiterQPS float32,\n\tlargeClusterThreshold int32,\n\tunhealthyZoneThreshold float32,\n\trunTaintManager bool,\n\tuseTaintBasedEvictions bool,\n) (*Controller, error) {\n\n  // 1.根据参数初始化Controller\n\tnc := &Controller{\n\t  省略代码\n\t\t....\n\t}\n\t\n\tif 
useTaintBasedEvictions {\n\t\tklog.Infof(\"Controller is using taint based evictions.\")\n\t}\n\tnc.enterPartialDisruptionFunc = nc.ReducedQPSFunc\n\tnc.enterFullDisruptionFunc = nc.HealthyQPSFunc\n\tnc.computeZoneStateFunc = nc.ComputeZoneState\n\t\n\t// 2.定义了pod的监听处理逻辑。都是先nc.podUpdated，如果enable-taint-manager=true,还会经过nc.taintManager.PodUpdated\n\tpodInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{\n\t\t。。。\n\t\t省略代码\n\t})\n\t\n\t// 3.实现找出所有node上pod的函数\n\tnc.podInformerSynced = podInformer.Informer().HasSynced\n\tpodInformer.Informer().AddIndexers(cache.Indexers{\n\t\tnodeNameKeyIndex: func(obj interface{}) ([]string, error) {\n\t\t\tpod, ok := obj.(*v1.Pod)\n\t\t\tif !ok {\n\t\t\t\treturn []string{}, nil\n\t\t\t}\n\t\t\tif len(pod.Spec.NodeName) == 0 {\n\t\t\t\treturn []string{}, nil\n\t\t\t}\n\t\t\treturn []string{pod.Spec.NodeName}, nil\n\t\t},\n\t})\n\n\tpodIndexer := podInformer.Informer().GetIndexer()\n\tnc.getPodsAssignedToNode = func(nodeName string) ([]*v1.Pod, error) {\n\t\tobjs, err := podIndexer.ByIndex(nodeNameKeyIndex, nodeName)\n\t\tif err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t\tpods := make([]*v1.Pod, 0, len(objs))\n\t\tfor _, obj := range objs {\n\t\t\tpod, ok := obj.(*v1.Pod)\n\t\t\tif !ok {\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tpods = append(pods, pod)\n\t\t}\n\t\treturn pods, nil\n\t}\n\tnc.podLister = podInformer.Lister()\n\t\n\t// 4.如果enable-taint-manager=true，node有变化都需要经过 nc.taintManager.NodeUpdated函数\n\tif nc.runTaintManager {\n\t\tpodGetter := func(name, namespace string) (*v1.Pod, error) { return nc.podLister.Pods(namespace).Get(name) }\n\t\tnodeLister := nodeInformer.Lister()\n\t\tnodeGetter := func(name string) (*v1.Node, error) { return nodeLister.Get(name) }\n\t\tnc.taintManager = scheduler.NewNoExecuteTaintManager(kubeClient, podGetter, nodeGetter, nc.getPodsAssignedToNode)\n\t\tnodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{\n\t\t\tAddFunc: nodeutil.CreateAddNodeHandler(func(node 
*v1.Node) error {\n\t\t\t\tnc.taintManager.NodeUpdated(nil, node)\n\t\t\t\treturn nil\n\t\t\t}),\n\t\t\tUpdateFunc: nodeutil.CreateUpdateNodeHandler(func(oldNode, newNode *v1.Node) error {\n\t\t\t\tnc.taintManager.NodeUpdated(oldNode, newNode)\n\t\t\t\treturn nil\n\t\t\t}),\n\t\t\tDeleteFunc: nodeutil.CreateDeleteNodeHandler(func(node *v1.Node) error {\n\t\t\t\tnc.taintManager.NodeUpdated(node, nil)\n\t\t\t\treturn nil\n\t\t\t}),\n\t\t})\n\t}\n\t\n\t// 5. 实现node的监听处理，这里不管开没开taint-manager，都是要监听\n\tklog.Infof(\"Controller will reconcile labels.\")\n\tnodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{\n\t\tAddFunc: nodeutil.CreateAddNodeHandler(func(node *v1.Node) error {\n\t\t\tnc.nodeUpdateQueue.Add(node.Name)\n\t\t\tnc.nodeEvictionMap.registerNode(node.Name)\n\t\t\treturn nil\n\t\t}),\n\t\tUpdateFunc: nodeutil.CreateUpdateNodeHandler(func(_, newNode *v1.Node) error {\n\t\t\tnc.nodeUpdateQueue.Add(newNode.Name)\n\t\t\treturn nil\n\t\t}),\n\t\tDeleteFunc: nodeutil.CreateDeleteNodeHandler(func(node *v1.Node) error {\n\t\t\tnc.nodesToRetry.Delete(node.Name)\n\t\t\tnc.nodeEvictionMap.unregisterNode(node.Name)\n\t\t\treturn nil\n\t\t}),\n\t})\n\t\n\t// 6. 实现node, ds, lease的list，用于获取对象\n\tnc.leaseLister = leaseInformer.Lister()\n\tnc.leaseInformerSynced = leaseInformer.Informer().HasSynced\n\n\tnc.nodeLister = nodeInformer.Lister()\n\tnc.nodeInformerSynced = nodeInformer.Informer().HasSynced\n\n\tnc.daemonSetStore = daemonSetInformer.Lister()\n\tnc.daemonSetInformerSynced = daemonSetInformer.Informer().HasSynced\n\n\treturn nc, nil\n}\n```\n\n### 3. 
NodeLifecycleController.run\n\n逻辑如下：\n\n（1）等待leaseInformer、nodeInformer、podInformerSynced、daemonSetInformerSynced同步完成。\n\n（2）如果enable-taint-manager=true,开启nc.taintManager.Run\n\n（3）执行doNodeProcessingPassWorker，这个是处理nodeUpdateQueue队列的node\n\n（4）doPodProcessingWorker，这个是处理podUpdateQueue队列的pod\n\n（5）如果开启了feature-gates=TaintBasedEvictions=true，执行doNoExecuteTaintingPass函数。否则执行doEvictionPass函数\n\n（6）一直监听node状态是否健康\n\n```\n// Run starts an asynchronous loop that monitors the status of cluster nodes.\nfunc (nc *Controller) Run(stopCh <-chan struct{}) {\n\tdefer utilruntime.HandleCrash()\n\n\tklog.Infof(\"Starting node controller\")\n\tdefer klog.Infof(\"Shutting down node controller\")\n\t\n\t// 1.等待leaseInformer、nodeInformer、podInformerSynced、daemonSetInformerSynced同步完成。\n\tif !cache.WaitForNamedCacheSync(\"taint\", stopCh, nc.leaseInformerSynced, nc.nodeInformerSynced, nc.podInformerSynced, nc.daemonSetInformerSynced) {\n\t\treturn\n\t}\n\t\n\t// 2.如果enable-taint-manager=true,开启nc.taintManager.Run\n\tif nc.runTaintManager {\n\t\tgo nc.taintManager.Run(stopCh)\n\t}\n\t\n\t// Close node update queue to cleanup go routine.\n\tdefer nc.nodeUpdateQueue.ShutDown()\n\tdefer nc.podUpdateQueue.ShutDown()\n\t\n\t// 3.执行doNodeProcessingPassWorker，这个是处理nodeUpdateQueue队列的node\n\t// Start workers to reconcile labels and/or update NoSchedule taint for nodes.\n\tfor i := 0; i < scheduler.UpdateWorkerSize; i++ {\n\t\t// Thanks to \"workqueue\", each worker just need to get item from queue, because\n\t\t// the item is flagged when got from queue: if new event come, the new item will\n\t\t// be re-queued until \"Done\", so no more than one worker handle the same item and\n\t\t// no event missed.\n\t\tgo wait.Until(nc.doNodeProcessingPassWorker, time.Second, stopCh)\n\t}\n\t\n// 4.doPodProcessingWorker，这个是处理podUpdateQueue队列的pod\n\tfor i := 0; i < podUpdateWorkerSize; i++ {\n\t\tgo wait.Until(nc.doPodProcessingWorker, time.Second, stopCh)\n\t}\n\t\n\t// 5. 
如果开启了feature-gates=TaintBasedEvictions=true，执行doNoExecuteTaintingPass函数。否则执行doEvictionPass函数\n\tif nc.useTaintBasedEvictions {\n\t\t// Handling taint based evictions. Because we don't want a dedicated logic in TaintManager for NC-originated\n\t\t// taints and we normally don't rate limit evictions caused by taints, we need to rate limit adding taints.\n\t\tgo wait.Until(nc.doNoExecuteTaintingPass, scheduler.NodeEvictionPeriod, stopCh)\n\t} else {\n\t\t// Managing eviction of nodes:\n\t\t// When we delete pods off a node, if the node was not empty at the time we then\n\t\t// queue an eviction watcher. If we hit an error, retry deletion.\n\t\tgo wait.Until(nc.doEvictionPass, scheduler.NodeEvictionPeriod, stopCh)\n\t}\n\t\n\t\n\t// 6.一直监听node状态是否健康\n\t// Incorporate the results of node health signal pushed from kubelet to master.\n\tgo wait.Until(func() {\n\t\tif err := nc.monitorNodeHealth(); err != nil {\n\t\t\tklog.Errorf(\"Error monitoring node health: %v\", err)\n\t\t}\n\t}, nc.nodeMonitorPeriod, stopCh)\n\n\t<-stopCh\n}\n```\n\n#### 3.1 nc.taintManager.Run\n\n在NewNodeLifecycleController的时候就通过NewNoExecuteTaintManager完成了taint manager的初始化。\n\ntaint manager由pod和node事件触发执行，检查node（或pod所绑定的node）上是否有NoExecute taint；如果有，则对该node上所有的pod或这个pod执行删除。\n\n具体逻辑为：如果启用了taint manager就会调用NewNoExecuteTaintManager对taint manager进行初始化。可以看出来这里就是初始化了nodeUpdateQueue、podUpdateQueue队列以及事件上报。\n\n核心数据结构：\n\n* nodeUpdateQueue：在初始化taint manager时创建，node变化会扔进这个队列\n* podUpdateQueue：在初始化taint manager时创建，pod变化会扔进这个队列\n\n* taintedNodes：存放node上所有的NoExecute taint，handlePodUpdate会从taintedNodes查询node的NoExecute taint\n* taintEvictionQueue：一个TimedWorkerQueue，即定时自动执行的队列。因为有的pod设置了污点容忍时间，所以需要一个定时队列来延时删除\n\n```\n// NewNoExecuteTaintManager creates a new NoExecuteTaintManager that will use passed clientset to\n// communicate with the API server.\nfunc NewNoExecuteTaintManager(c clientset.Interface, getPod GetPodFunc, getNode GetNodeFunc, getPodsAssignedToNode GetPodsByNodeNameFunc) 
*NoExecuteTaintManager {\n\teventBroadcaster := record.NewBroadcaster()\n\trecorder := eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: \"taint-controller\"})\n\teventBroadcaster.StartLogging(klog.Infof)\n\tif c != nil {\n\t\tklog.V(0).Infof(\"Sending events to api server.\")\n\t\teventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: c.CoreV1().Events(\"\")})\n\t} else {\n\t\tklog.Fatalf(\"kubeClient is nil when starting NodeController\")\n\t}\n\n\ttm := &NoExecuteTaintManager{\n\t\tclient:                c,\n\t\trecorder:              recorder,\n\t\tgetPod:                getPod,\n\t\tgetNode:               getNode,\n\t\tgetPodsAssignedToNode: getPodsAssignedToNode,\n\t\ttaintedNodes:          make(map[string][]v1.Taint),\n\n\t\tnodeUpdateQueue: workqueue.NewNamed(\"noexec_taint_node\"),\n\t\tpodUpdateQueue:  workqueue.NewNamed(\"noexec_taint_pod\"),\n\t}\n\ttm.taintEvictionQueue = CreateWorkerQueue(deletePodHandler(c, tm.emitPodDeletionEvent))\n\n\treturn tm\n}\n```\n\nRun函数逻辑如下：\n\n这里的核心其实就是从nodeUpdateQueue、podUpdateQueue取出元素，然后交给worker处理，和一般的controller思想是一样的。\n\n**注意**：这里用了负载均衡的思想。因为worker数量是UpdateWorkerSize个，所以这里定义了UpdateWorkerSize个channel，并开启UpdateWorkerSize个协程分别处理对应的channel。通过对nodeName哈希取模的方式，使得每个channel的元素数量尽可能均衡。\n\n```\n// Run starts NoExecuteTaintManager which will run in loop until `stopCh` is closed.\nfunc (tc *NoExecuteTaintManager) Run(stopCh <-chan struct{}) {\n\tklog.V(0).Infof(\"Starting NoExecuteTaintManager\")\n\n\tfor i := 0; i < UpdateWorkerSize; i++ {\n\t\ttc.nodeUpdateChannels = append(tc.nodeUpdateChannels, make(chan nodeUpdateItem, NodeUpdateChannelSize))\n\t\ttc.podUpdateChannels = append(tc.podUpdateChannels, make(chan podUpdateItem, podUpdateChannelSize))\n\t}\n\n\t// Functions that are responsible for taking work items out of the workqueues and putting them\n\t// into channels.\n\tgo func(stopCh <-chan struct{}) {\n\t\tfor {\n\t\t\titem, shutdown := tc.nodeUpdateQueue.Get()\n\t\t\tif shutdown 
{\n\t\t\t\tbreak\n\t\t\t}\n\t\t\tnodeUpdate := item.(nodeUpdateItem)\n\t\t\thash := hash(nodeUpdate.nodeName, UpdateWorkerSize)\n\t\t\tselect {\n\t\t\tcase <-stopCh:\n\t\t\t\ttc.nodeUpdateQueue.Done(item)\n\t\t\t\treturn\n\t\t\tcase tc.nodeUpdateChannels[hash] <- nodeUpdate:\n\t\t\t\t// tc.nodeUpdateQueue.Done is called by the nodeUpdateChannels worker\n\t\t\t}\n\t\t}\n\t}(stopCh)\n\n\tgo func(stopCh <-chan struct{}) {\n\t\tfor {\n\t\t\titem, shutdown := tc.podUpdateQueue.Get()\n\t\t\tif shutdown {\n\t\t\t\tbreak\n\t\t\t}\n\t\t\t// The fact that pods are processed by the same worker as nodes is used to avoid races\n\t\t\t// between node worker setting tc.taintedNodes and pod worker reading this to decide\n\t\t\t// whether to delete pod.\n\t\t\t// It's possible that even without this assumption this code is still correct.\n\t\t\tpodUpdate := item.(podUpdateItem)\n\t\t\thash := hash(podUpdate.nodeName, UpdateWorkerSize)\n\t\t\tselect {\n\t\t\tcase <-stopCh:\n\t\t\t\ttc.podUpdateQueue.Done(item)\n\t\t\t\treturn\n\t\t\tcase tc.podUpdateChannels[hash] <- podUpdate:\n\t\t\t\t// tc.podUpdateQueue.Done is called by the podUpdateChannels worker\n\t\t\t}\n\t\t}\n\t}(stopCh)\n\n\twg := sync.WaitGroup{}\n\twg.Add(UpdateWorkerSize)\n\tfor i := 0; i < UpdateWorkerSize; i++ {\n\t\tgo tc.worker(i, wg.Done, stopCh)\n\t}\n\twg.Wait()\n}\n```\n\n<br>\n\n##### 3.1.1 worker处理\n\nworker的处理逻辑其实很简单：每个worker协程从对应的channel取出一个nodeUpdate/podUpdate事件进行处理，分别对应handleNodeUpdate函数和handlePodUpdate函数。\n\n**但是**：需要注意的是，worker会优先处理nodeUpdate事件（很好理解，因为处理node事件会驱逐整个节点上的pod，其中可能已经包含了待处理podUpdate事件对应的pod）。\n\n```\nfunc (tc *NoExecuteTaintManager) worker(worker int, done func(), stopCh <-chan struct{}) {\n\tdefer done()\n\n\t// When processing events we want to prioritize Node updates over Pod updates,\n\t// as NodeUpdates that interest NoExecuteTaintManager should be handled as soon as possible -\n\t// we don't want user (or system) to wait until PodUpdate queue is drained before it can\n\t// start evicting Pods 
from tainted Nodes.\n\tfor {\n\t\tselect {\n\t\tcase <-stopCh:\n\t\t\treturn\n\t\tcase nodeUpdate := <-tc.nodeUpdateChannels[worker]:\n\t\t\ttc.handleNodeUpdate(nodeUpdate)\n\t\t\ttc.nodeUpdateQueue.Done(nodeUpdate)\n\t\tcase podUpdate := <-tc.podUpdateChannels[worker]:\n\t\t\t// If we found a Pod update we need to empty Node queue first.\n\t\tpriority:\n\t\t\tfor {\n\t\t\t\tselect {\n\t\t\t\tcase nodeUpdate := <-tc.nodeUpdateChannels[worker]:\n\t\t\t\t\ttc.handleNodeUpdate(nodeUpdate)\n\t\t\t\t\ttc.nodeUpdateQueue.Done(nodeUpdate)\n\t\t\t\tdefault:\n\t\t\t\t\tbreak priority\n\t\t\t\t}\n\t\t\t}\n\t\t\t// After Node queue is emptied we process podUpdate.\n\t\t\ttc.handlePodUpdate(podUpdate)\n\t\t\ttc.podUpdateQueue.Done(podUpdate)\n\t\t}\n\t}\n}\n```\n\n##### 3.1.2 handleNodeUpdate\n\n核心逻辑：\n\n（1）先得到该node上所有的taint\n\n（2）得到这个node上所有的pod\n\n（3）for循环执行processPodOnNode来一个个的处理pod\n\n```\nfunc (tc *NoExecuteTaintManager) handleNodeUpdate(nodeUpdate nodeUpdateItem) {\n\tnode, err := tc.getNode(nodeUpdate.nodeName)\n\tif err != nil {\n\t\tif apierrors.IsNotFound(err) {\n\t\t\t// Delete\n\t\t\tklog.V(4).Infof(\"Noticed node deletion: %#v\", nodeUpdate.nodeName)\n\t\t\ttc.taintedNodesLock.Lock()\n\t\t\tdefer tc.taintedNodesLock.Unlock()\n\t\t\tdelete(tc.taintedNodes, nodeUpdate.nodeName)\n\t\t\treturn\n\t\t}\n\t\tutilruntime.HandleError(fmt.Errorf(\"cannot get node %s: %v\", nodeUpdate.nodeName, err))\n\t\treturn\n\t}\n\t\n\t// 1.先得到该node上所有的taint\n\t// Create or Update\n\tklog.V(4).Infof(\"Noticed node update: %#v\", nodeUpdate)\n\ttaints := getNoExecuteTaints(node.Spec.Taints)\n\tfunc() {\n\t\ttc.taintedNodesLock.Lock()\n\t\tdefer tc.taintedNodesLock.Unlock()\n\t\tklog.V(4).Infof(\"Updating known taints on node %v: %v\", node.Name, taints)\n\t\tif len(taints) == 0 {\n\t\t\tdelete(tc.taintedNodes, node.Name)\n\t\t} else {\n\t\t\ttc.taintedNodes[node.Name] = taints\n\t\t}\n\t}()\n\t\n\t// 2. 
得到这个node上所有的pod\n\t// This is critical that we update tc.taintedNodes before we call getPodsAssignedToNode:\n\t// getPodsAssignedToNode can be delayed as long as all future updates to pods will call\n\t// tc.PodUpdated which will use tc.taintedNodes to potentially delete delayed pods.\n\tpods, err := tc.getPodsAssignedToNode(node.Name)\n\tif err != nil {\n\t\tklog.Errorf(err.Error())\n\t\treturn\n\t}\n\tif len(pods) == 0 {\n\t\treturn\n\t}\n\t// Short circuit, to make this controller a bit faster.\n\tif len(taints) == 0 {\n\t\tklog.V(4).Infof(\"All taints were removed from the Node %v. Cancelling all evictions...\", node.Name)\n\t\tfor i := range pods {\n\t\t\ttc.cancelWorkWithEvent(types.NamespacedName{Namespace: pods[i].Namespace, Name: pods[i].Name})\n\t\t}\n\t\treturn\n\t}\n\t\n\t// 3. for循环执行processPodOnNode来一个个的处理pod\n\tnow := time.Now()\n\tfor _, pod := range pods {\n\t\tpodNamespacedName := types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name}\n\t\ttc.processPodOnNode(podNamespacedName, node.Name, pod.Spec.Tolerations, taints, now)\n\t}\n}\n```\n\n###### 3.1.2.1 processPodOnNode\n\n核心逻辑如下：\n\n（1）如果node没有taint了，那就取消对该pod的处理（可能在定时队列中挂着）\n\n（2）通过pod的Tolerations和node的taints进行对比，看该pod有没有完全容忍\n\n（3）如果没有完全容忍，那就先取消对该pod的处理（防止pod已经在队列中导致无法再添加进队列），然后再通过AddWork重新挂进去。注意这里设置的时间都是time.Now()，意思就是马上删除\n\n（4）如果完全容忍，找出最短能够容忍的时间（见getMinTolerationTime函数）：如果没有设置容忍时间或者容忍时间为负数，都赋值为0，表示马上删除；如果是最大值math.MaxInt64，表示一直容忍，永远不删除；否则取设置的最小容忍时间\n\n（5）接下来就是根据最小容忍时间设置多久之后触发删除pod，但设置之前还要和已有的触发时间比较一下\n\n* 如果之前就有在等着到时间删除的，并且这次的触发删除时间在那之前，则忽略这次的。举例：podA原计划11点删除，这次更新算出应该10:50删除，那么这次就忽略，还是以上次为准\n* 否则取消原有的，再按这次的删除时间重新设置\n\n```\nfunc (tc *NoExecuteTaintManager) processPodOnNode(\n\tpodNamespacedName types.NamespacedName,\n\tnodeName string,\n\ttolerations []v1.Toleration,\n\ttaints []v1.Taint,\n\tnow time.Time,\n) {\n  // 1. 
如果node没有taint了，那就取消对该pod的处理（可能在定时队列中挂着）\n\tif len(taints) == 0 {\n\t\ttc.cancelWorkWithEvent(podNamespacedName)\n\t}\n\t\n\t// 2.通过pod的Tolerations和node的taints进行对比，看该pod有没有完全容忍。\n\tallTolerated, usedTolerations := v1helper.GetMatchingTolerations(taints, tolerations)\n  // 3.如果没有完全容忍，那就先取消对该pod的处理（防止pod已经在队列中导致无法再添加进队列），然后再通过AddWork重新挂进去。注意这里设置的时间都是time.Now()，意思就是马上删除\n  if !allTolerated {\n\t\tklog.V(2).Infof(\"Not all taints are tolerated after update for Pod %v on %v\", podNamespacedName.String(), nodeName)\n\t\t// We're canceling scheduled work (if any), as we're going to delete the Pod right away.\n\t\ttc.cancelWorkWithEvent(podNamespacedName)\n\t\ttc.taintEvictionQueue.AddWork(NewWorkArgs(podNamespacedName.Name, podNamespacedName.Namespace), time.Now(), time.Now())\n\t\treturn\n\t}\n\t\n\t// 4.如果完全容忍，找出最短能够容忍的时间（见getMinTolerationTime函数）：如果没有设置容忍时间或者容忍时间为负数，都赋值为0，表示马上删除；如果是最大值math.MaxInt64，表示一直容忍，永远不删除；否则取设置的最小容忍时间\n\tminTolerationTime := getMinTolerationTime(usedTolerations)\n\t// getMinTolerationTime returns negative value to denote infinite toleration.\n\tif minTolerationTime < 0 {\n\t\tklog.V(4).Infof(\"New tolerations for %v tolerate forever. Scheduled deletion won't be cancelled if already scheduled.\", podNamespacedName.String())\n\t\treturn\n\t}\n\t\n\t// 5. 
接下来就是根据最小容忍时间设置多久之后触发删除pod\n\tstartTime := now\n\ttriggerTime := startTime.Add(minTolerationTime)\n\tscheduledEviction := tc.taintEvictionQueue.GetWorkerUnsafe(podNamespacedName.String())\n\tif scheduledEviction != nil {\n\t\tstartTime = scheduledEviction.CreatedAt\n\t\t// 5.1 如果之前就有在等着到时间删除的，并且这次的触发删除时间在那之前，则忽略这次的。举例：podA原计划11点删除，这次更新算出应该10:50删除，那么这次就忽略，还是以上次为准\n\t\tif startTime.Add(minTolerationTime).Before(triggerTime) {\n\t\t\treturn\n\t\t}\n\t\t// 5.2 否则取消原有的，再按这次的删除时间重新设置\n\t\ttc.cancelWorkWithEvent(podNamespacedName)\n\t}\n\ttc.taintEvictionQueue.AddWork(NewWorkArgs(podNamespacedName.Name, podNamespacedName.Namespace), startTime, triggerTime)\n}\n```\n\n##### 3.1.3 handlePodUpdate\n\nhandlePodUpdate是handleNodeUpdate的子集，核心逻辑就是processPodOnNode。上面已经分析过，这里不再赘述。\n\n```\nfunc (tc *NoExecuteTaintManager) handlePodUpdate(podUpdate podUpdateItem) {\n\tpod, err := tc.getPod(podUpdate.podName, podUpdate.podNamespace)\n\tif err != nil {\n\t\tif apierrors.IsNotFound(err) {\n\t\t\t// Delete\n\t\t\tpodNamespacedName := types.NamespacedName{Namespace: podUpdate.podNamespace, Name: podUpdate.podName}\n\t\t\tklog.V(4).Infof(\"Noticed pod deletion: %#v\", podNamespacedName)\n\t\t\ttc.cancelWorkWithEvent(podNamespacedName)\n\t\t\treturn\n\t\t}\n\t\tutilruntime.HandleError(fmt.Errorf(\"could not get pod %s/%s: %v\", podUpdate.podName, podUpdate.podNamespace, err))\n\t\treturn\n\t}\n\n\t// We key the workqueue and shard workers by nodeName. 
If we don't match the current state we should not be the one processing the current object.\n\tif pod.Spec.NodeName != podUpdate.nodeName {\n\t\treturn\n\t}\n\n\t// Create or Update\n\tpodNamespacedName := types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name}\n\tklog.V(4).Infof(\"Noticed pod update: %#v\", podNamespacedName)\n\tnodeName := pod.Spec.NodeName\n\tif nodeName == \"\" {\n\t\treturn\n\t}\n\ttaints, ok := func() ([]v1.Taint, bool) {\n\t\ttc.taintedNodesLock.Lock()\n\t\tdefer tc.taintedNodesLock.Unlock()\n\t\ttaints, ok := tc.taintedNodes[nodeName]\n\t\treturn taints, ok\n\t}()\n\t// It's possible that Node was deleted, or Taints were removed before, which triggered\n\t// eviction cancelling if it was needed.\n\tif !ok {\n\t\treturn\n\t}\n\ttc.processPodOnNode(podNamespacedName, nodeName, pod.Spec.Tolerations, taints, time.Now())\n}\n```\n\n##### 3.1.3 nc.taintManager.Run总结\n\n**可以看出来nc.taintManager对NoExecute污点是立即生效的：只要节点有污点，就开始驱逐，pod只能通过设置容忍时间来避免马上被驱逐**\n\n（1）监听pod, node的add/update事件\n\n（2）通过多个channel的方式，将pod/node事件按哈希打散到不同的channel，让n个worker负载均衡地处理\n\n（3）优先处理node事件，但实际node处理和pod处理是一样的：处理node时会对上面的pod一个一个判断是否需要驱逐。判断驱逐的核心逻辑是：\n\n* 如果node没有taint了，那就取消对该pod的处理（可能在定时队列中挂着）\n\n* 通过pod的Tolerations和node的taints进行对比，看该pod有没有完全容忍\n\n* 如果没有完全容忍，那就先取消对该pod的处理（防止pod已经在队列中导致无法再添加进队列），然后再通过AddWork重新挂进去。注意这里设置的时间都是time.Now()，意思就是马上删除\n\n* 如果完全容忍，找出最短能够容忍的时间（见getMinTolerationTime函数）：如果没有设置容忍时间或者容忍时间为负数，都赋值为0，表示马上删除；如果是最大值math.MaxInt64，表示一直容忍，永远不删除；否则取设置的最小容忍时间\n\n* 接下来就是根据最小容忍时间设置多久之后触发删除pod，但设置之前还要和已有的触发时间比较一下\n  * 如果之前就有在等着到时间删除的，并且这次的触发删除时间在那之前，则忽略这次的。举例：podA原计划11点删除，这次更新算出应该10:50删除，那么这次就忽略，还是以上次为准\n  * 否则取消原有的，再按这次的删除时间重新设置\n\n![image-20220811113528003](../images/taintManager.png)\n\n#### 3.2 doNodeProcessingPassWorker\n\n可以看出来doNodeProcessingPassWorker核心就是2件事：\n\n（1）给node添加NoSchedule taint\n\n（2）给node添加labels\n\n```\nfunc (nc *Controller) doNodeProcessingPassWorker() {\n   for {\n      obj, shutdown := nc.nodeUpdateQueue.Get()\n      // \"nodeUpdateQueue\" will be 
shutdown when \"stopCh\" closed;\n      // we do not need to re-check \"stopCh\" again.\n      if shutdown {\n         return\n      }\n      nodeName := obj.(string)\n      if err := nc.doNoScheduleTaintingPass(nodeName); err != nil {\n         klog.Errorf(\"Failed to taint NoSchedule on node <%s>, requeue it: %v\", nodeName, err)\n         // TODO(k82cn): Add nodeName back to the queue\n      }\n      // TODO: re-evaluate whether there are any labels that need to be\n      // reconcile in 1.19. Remove this function if it's no longer necessary.\n      if err := nc.reconcileNodeLabels(nodeName); err != nil {\n         klog.Errorf(\"Failed to reconcile labels for node <%s>, requeue it: %v\", nodeName, err)\n         // TODO(yujuhong): Add nodeName back to the queue\n      }\n      nc.nodeUpdateQueue.Done(nodeName)\n   }\n}\n```\n\n##### 3.2.1 doNoScheduleTaintingPass\n\n核心逻辑就是检查该 node 是否需要添加对应的NoSchedule\n\n 逻辑为：\n\n- 1、从 nodeLister 中获取该 node 对象；\n- 2、判断该 node 是否存在以下几种 Condition：(1) False 或 Unknown 状态的 NodeReady Condition；(2) MemoryPressureCondition；(3) DiskPressureCondition；(4) NetworkUnavailableCondition；(5) PIDPressureCondition；若任一一种存在会添加对应的 `NoSchedule` taint；\n- 3、判断 node 是否处于 `Unschedulable` 状态，若为 `Unschedulable` 也添加对应的 `NoSchedule` taint；\n- 4、对比 node 已有的 taints 以及需要添加的 taints，以需要添加的 taints 为准，调用 `nodeutil.SwapNodeControllerTaint` 为 node 添加不存在的 taints 并删除不需要的 taints；\n\n```\nfunc (nc *Controller) doNoScheduleTaintingPass(nodeName string) error {\n\tnode, err := nc.nodeLister.Get(nodeName)\n\tif err != nil {\n\t\t// If node not found, just ignore it.\n\t\tif apierrors.IsNotFound(err) {\n\t\t\treturn nil\n\t\t}\n\t\treturn err\n\t}\n\n\t// Map node's condition to Taints.\n\tvar taints []v1.Taint\n\tfor _, condition := range node.Status.Conditions {\n\t\tif taintMap, found := nodeConditionToTaintKeyStatusMap[condition.Type]; found {\n\t\t\tif taintKey, found := taintMap[condition.Status]; found {\n\t\t\t\ttaints = append(taints, v1.Taint{\n\t\t\t\t\tKey:    
taintKey,\n\t\t\t\t\tEffect: v1.TaintEffectNoSchedule,\n\t\t\t\t})\n\t\t\t}\n\t\t}\n\t}\n\tif node.Spec.Unschedulable {\n\t\t// If unschedulable, append related taint.\n\t\ttaints = append(taints, v1.Taint{\n\t\t\tKey:    v1.TaintNodeUnschedulable,\n\t\t\tEffect: v1.TaintEffectNoSchedule,\n\t\t})\n\t}\n\n\t// Get exist taints of node.\n\tnodeTaints := taintutils.TaintSetFilter(node.Spec.Taints, func(t *v1.Taint) bool {\n\t\t// only NoSchedule taints are candidates to be compared with \"taints\" later\n\t\tif t.Effect != v1.TaintEffectNoSchedule {\n\t\t\treturn false\n\t\t}\n\t\t// Find unschedulable taint of node.\n\t\tif t.Key == v1.TaintNodeUnschedulable {\n\t\t\treturn true\n\t\t}\n\t\t// Find node condition taints of node.\n\t\t_, found := taintKeyToNodeConditionMap[t.Key]\n\t\treturn found\n\t})\n\ttaintsToAdd, taintsToDel := taintutils.TaintSetDiff(taints, nodeTaints)\n\t// If nothing to add not delete, return true directly.\n\tif len(taintsToAdd) == 0 && len(taintsToDel) == 0 {\n\t\treturn nil\n\t}\n\tif !nodeutil.SwapNodeControllerTaint(nc.kubeClient, taintsToAdd, taintsToDel, node) {\n\t\treturn fmt.Errorf(\"failed to swap taints of node %+v\", node)\n\t}\n\treturn nil\n}\n\n\nnodeConditionToTaintKeyStatusMap = map[v1.NodeConditionType]map[v1.ConditionStatus]string{\n\t\tv1.NodeReady: {\n\t\t\tv1.ConditionFalse:   v1.TaintNodeNotReady,\n\t\t\tv1.ConditionUnknown: v1.TaintNodeUnreachable,\n\t\t},\n\t\tv1.NodeMemoryPressure: {\n\t\t\tv1.ConditionTrue: v1.TaintNodeMemoryPressure,\n\t\t},\n\t\tv1.NodeDiskPressure: {\n\t\t\tv1.ConditionTrue: v1.TaintNodeDiskPressure,\n\t\t},\n\t\tv1.NodeNetworkUnavailable: {\n\t\t\tv1.ConditionTrue: v1.TaintNodeNetworkUnavailable,\n\t\t},\n\t\tv1.NodePIDPressure: {\n\t\t\tv1.ConditionTrue: v1.TaintNodePIDPressure,\n\t\t},\n\t}\n```\n\n##### 3.2.2 reconcileNodeLabels\n\nreconcileNodeLabels就是及时给node更新：\n\n```\n beta.kubernetes.io/arch: amd64\n    beta.kubernetes.io/os: linux\n    kubernetes.io/arch: amd64\n    kubernetes.io/os: 
linux\n```\n\n<br>\n\n```\n// reconcileNodeLabels reconciles node labels.\nfunc (nc *Controller) reconcileNodeLabels(nodeName string) error {\n\tnode, err := nc.nodeLister.Get(nodeName)\n\tif err != nil {\n\t\t// If node not found, just ignore it.\n\t\tif apierrors.IsNotFound(err) {\n\t\t\treturn nil\n\t\t}\n\t\treturn err\n\t}\n\n\tif node.Labels == nil {\n\t\t// Nothing to reconcile.\n\t\treturn nil\n\t}\n\n\tlabelsToUpdate := map[string]string{}\n\tfor _, r := range labelReconcileInfo {\n\t\tprimaryValue, primaryExists := node.Labels[r.primaryKey]\n\t\tsecondaryValue, secondaryExists := node.Labels[r.secondaryKey]\n\n\t\tif !primaryExists {\n\t\t\t// The primary label key does not exist. This should not happen\n\t\t\t// within our supported version skew range, when no external\n\t\t\t// components/factors modifying the node object. Ignore this case.\n\t\t\tcontinue\n\t\t}\n\t\tif secondaryExists && primaryValue != secondaryValue {\n\t\t\t// Secondary label exists, but not consistent with the primary\n\t\t\t// label. 
Need to reconcile.\n\t\t\tlabelsToUpdate[r.secondaryKey] = primaryValue\n\n\t\t} else if !secondaryExists && r.ensureSecondaryExists {\n\t\t\t// Apply secondary label based on primary label.\n\t\t\tlabelsToUpdate[r.secondaryKey] = primaryValue\n\t\t}\n\t}\n\n\tif len(labelsToUpdate) == 0 {\n\t\treturn nil\n\t}\n\tif !nodeutil.AddOrUpdateLabelsOnNode(nc.kubeClient, labelsToUpdate, node) {\n\t\treturn fmt.Errorf(\"failed update labels for node %+v\", node)\n\t}\n\treturn nil\n}\n```\n\n#### 3.3 doPodProcessingWorker\n\ndoPodProcessingWorker从podUpdateQueue读取一个pod，执行processPod。（注意这里的podUpdateQueue和taintManager的podUpdateQueue不是一个队列，只是同名而已）\n\nprocessPod核心逻辑如下：\n\n（1）判断NodeCondition是否notReady\n\n（2）如果feature-gates=TaintBasedEvictions=false，则执行processNoTaintBaseEviction\n\n（3）最终都会判断node ReadyCondition是否不为true，如果不为true, 执行MarkPodsNotReady–如果pod的ready condition不为false，将pod的ready condition设置为false，并更新LastTransitionTimestamp；否则不更新pod\n\n```\nfunc (nc *Controller) doPodProcessingWorker() {\n\tfor {\n\t\tobj, shutdown := nc.podUpdateQueue.Get()\n\t\t// \"podUpdateQueue\" will be shutdown when \"stopCh\" closed;\n\t\t// we do not need to re-check \"stopCh\" again.\n\t\tif shutdown {\n\t\t\treturn\n\t\t}\n\n\t\tpodItem := obj.(podUpdateItem)\n\t\tnc.processPod(podItem)\n\t}\n}\n\n\n// processPod is processing events of assigning pods to nodes. In particular:\n// 1. for NodeReady=true node, taint eviction for this pod will be cancelled\n// 2. for NodeReady=false or unknown node, taint eviction of pod will happen and pod will be marked as not ready\n// 3. 
if node doesn't exist in cache, it will be skipped and handled later by doEvictionPass\nfunc (nc *Controller) processPod(podItem podUpdateItem) {\n\tdefer nc.podUpdateQueue.Done(podItem)\n\tpod, err := nc.podLister.Pods(podItem.namespace).Get(podItem.name)\n\tif err != nil {\n\t\tif apierrors.IsNotFound(err) {\n\t\t\t// If the pod was deleted, there is no need to requeue.\n\t\t\treturn\n\t\t}\n\t\tklog.Warningf(\"Failed to read pod %v/%v: %v.\", podItem.namespace, podItem.name, err)\n\t\tnc.podUpdateQueue.AddRateLimited(podItem)\n\t\treturn\n\t}\n\n\tnodeName := pod.Spec.NodeName\n\n\tnodeHealth := nc.nodeHealthMap.getDeepCopy(nodeName)\n\tif nodeHealth == nil {\n\t\t// Node data is not gathered yet or node has beed removed in the meantime.\n\t\t// Pod will be handled by doEvictionPass method.\n\t\treturn\n\t}\n\n\tnode, err := nc.nodeLister.Get(nodeName)\n\tif err != nil {\n\t\tklog.Warningf(\"Failed to read node %v: %v.\", nodeName, err)\n\t\tnc.podUpdateQueue.AddRateLimited(podItem)\n\t\treturn\n\t}\n\t\n\t// 1. 
判断NodeCondition是否notReady\n\t_, currentReadyCondition := nodeutil.GetNodeCondition(nodeHealth.status, v1.NodeReady)\n\tif currentReadyCondition == nil {\n\t\t// Lack of NodeReady condition may only happen after node addition (or if it will be maliciously deleted).\n\t\t// In both cases, the pod will be handled correctly (evicted if needed) during processing\n\t\t// of the next node update event.\n\t\treturn\n\t}\n  \n  // 2.如果feature-gates=TaintBasedEvictions=false，则执行processNoTaintBaseEviction\n\tpods := []*v1.Pod{pod}\n\t// In taint-based eviction mode, only node updates are processed by NodeLifecycleController.\n\t// Pods are processed by TaintManager.\n\tif !nc.useTaintBasedEvictions {\n\t\tif err := nc.processNoTaintBaseEviction(node, currentReadyCondition, nc.nodeMonitorGracePeriod, pods); err != nil {\n\t\t\tklog.Warningf(\"Unable to process pod %+v eviction from node %v: %v.\", podItem, nodeName, err)\n\t\t\tnc.podUpdateQueue.AddRateLimited(podItem)\n\t\t\treturn\n\t\t}\n\t}\n\t\n\t// 3.最终都会判断node ReadyCondition是否不为true，如果不为true, 执行MarkPodsNotReady–如果pod的ready condition不为false， 将pod的ready condition设置为false，并更新LastTransitionTimestamp；否则不更新pod\n\tif currentReadyCondition.Status != v1.ConditionTrue {\n\t\tif err := nodeutil.MarkPodsNotReady(nc.kubeClient, pods, nodeName); err != nil {\n\t\t\tklog.Warningf(\"Unable to mark pod %+v NotReady on node %v: %v.\", podItem, nodeName, err)\n\t\t\tnc.podUpdateQueue.AddRateLimited(podItem)\n\t\t}\n\t}\n}\n```\n\n##### 3.3.1 
processNoTaintBaseEviction\n\n核心逻辑如下：\n\n（1）node最后发现ReadyCondition为false，如果nodeHealthMap里的readyTransitionTimestamp加上podEvictionTimeout的时间是过去的时间–ReadyCondition为false状态已经持续了至少podEvictionTimeout，执行evictPods。\n\n（2）node最后发现ReadyCondition为unknown，如果nodeHealthMap里的probeTimestamp加上podEvictionTimeout的时间是过去的时间–ReadyCondition为false状态已经持续了至少podEvictionTimeout，执行evictPods。\n\n（3）node最后发现ReadyCondition为true，则执行cancelPodEviction–在nodeEvictionMap设置status为unmarked，然后node从zonePodEvictor队列中移除。\n\n**evictPods并不会马上驱逐pod，他还是看node是否已经是驱逐状态。**\n\nevictPods先从nodeEvictionMap获取node驱逐的状态，如果是evicted说明node已经发生驱逐，则把node上的这个pod删除。否则设置状态为toBeEvicted，然后node加入**zonePodEvictor**队列等待执行驱逐pod\n\n```\nfunc (nc *Controller) processNoTaintBaseEviction(node *v1.Node, observedReadyCondition *v1.NodeCondition, gracePeriod time.Duration, pods []*v1.Pod) error {\n\tdecisionTimestamp := nc.now()\n\tnodeHealthData := nc.nodeHealthMap.getDeepCopy(node.Name)\n\tif nodeHealthData == nil {\n\t\treturn fmt.Errorf(\"health data doesn't exist for node %q\", node.Name)\n\t}\n\t// Check eviction timeout against decisionTimestamp\n\tswitch observedReadyCondition.Status {\n\tcase v1.ConditionFalse:\n\t\tif decisionTimestamp.After(nodeHealthData.readyTransitionTimestamp.Add(nc.podEvictionTimeout)) {\n\t\t\tenqueued, err := nc.evictPods(node, pods)\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tif enqueued {\n\t\t\t\tklog.V(2).Infof(\"Node is NotReady. Adding Pods on Node %s to eviction queue: %v is later than %v + %v\",\n\t\t\t\t\tnode.Name,\n\t\t\t\t\tdecisionTimestamp,\n\t\t\t\t\tnodeHealthData.readyTransitionTimestamp,\n\t\t\t\t\tnc.podEvictionTimeout,\n\t\t\t\t)\n\t\t\t}\n\t\t}\n\tcase v1.ConditionUnknown:\n\t\tif decisionTimestamp.After(nodeHealthData.probeTimestamp.Add(nc.podEvictionTimeout)) {\n\t\t\tenqueued, err := nc.evictPods(node, pods)\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tif enqueued {\n\t\t\t\tklog.V(2).Infof(\"Node is unresponsive. 
Adding Pods on Node %s to eviction queues: %v is later than %v + %v\",\n\t\t\t\t\tnode.Name,\n\t\t\t\t\tdecisionTimestamp,\n\t\t\t\t\tnodeHealthData.readyTransitionTimestamp,\n\t\t\t\t\tnc.podEvictionTimeout-gracePeriod,\n\t\t\t\t)\n\t\t\t}\n\t\t}\n\tcase v1.ConditionTrue:\n\t\tif nc.cancelPodEviction(node) {\n\t\t\tklog.V(2).Infof(\"Node %s is ready again, cancelled pod eviction\", node.Name)\n\t\t}\n\t}\n\treturn nil\n}\n\n\n// evictPods:\n// - adds node to evictor queue if the node is not marked as evicted.\n//   Returns false if the node name was already enqueued.\n// - deletes pods immediately if node is already marked as evicted.\n//   Returns false, because the node wasn't added to the queue.\nfunc (nc *Controller) evictPods(node *v1.Node, pods []*v1.Pod) (bool, error) {\n\tnc.evictorLock.Lock()\n\tdefer nc.evictorLock.Unlock()\n\tstatus, ok := nc.nodeEvictionMap.getStatus(node.Name)\n\tif ok && status == evicted {\n\t\t// Node eviction already happened for this node.\n\t\t// Handling immediate pod deletion.\n\t\t_, err := nodeutil.DeletePods(nc.kubeClient, pods, nc.recorder, node.Name, string(node.UID), nc.daemonSetStore)\n\t\tif err != nil {\n\t\t\treturn false, fmt.Errorf(\"unable to delete pods from node %q: %v\", node.Name, err)\n\t\t}\n\t\treturn false, nil\n\t}\n\tif !nc.nodeEvictionMap.setStatus(node.Name, toBeEvicted) {\n\t\tklog.V(2).Infof(\"node %v was unregistered in the meantime - skipping setting status\", node.Name)\n\t}\n\treturn nc.zonePodEvictor[utilnode.GetZoneKey(node)].Add(node.Name, string(node.UID)), nil\n}\n```\n\n#### 3.4 doEvictionPass(if useTaintBasedEvictions==false)\n\n**doEvictionPass是一个令牌桶限速队列(受参数evictionLimiterQPS影响，默认0.1也就是10s驱逐一个node)**，+加入这个队列的node都是 unready状态持续时间大于podEvictionTimeout。(这个就是processNoTaintBaseEviction将node加入了队列)\n\n- 遍历zonePodEvictor，获取一个zone里的node队列，从队列中获取一个node，执行下面步骤\n- 获取node的uid，从缓存中获取node上的所有pod\n- 执行DeletePods–删除daemonset之外的所有pod，保留daemonset的pod\n  1. 遍历所由的pod，检查pod绑定的node是否跟提供的一样，不一样则跳过这个pod\n  2. 
执行SetPodTerminationReason–设置pod Status.Reason为`NodeLost`，Status.Message为`\"Node %v which was running pod %v is unresponsive\"`，并更新pod。\n  3. 如果pod 设置了DeletionGracePeriodSeconds，说明pod已经被删除，则跳过这个pod\n  4. 判断pod是否为daemonset的pod，如果是则跳过这个pod\n  5. 删除这个pod\n- 在nodeEvictionMap设置node的状态为evicted\n\n#### 3.5 doNoExecuteTaintingPass(if useTaintBasedEvictions==true)\n\n启用taint manager 执行doNoExecuteTaintingPass–添加NoExecute的taint。这里不执行驱逐，驱逐单独在taint manager里处理。\n\ndoNoExecuteTaintingPass是一个令牌桶限速队列（也是受参数evictionLimiterQPS影响，默认0.1也就是10s处理一个node）\n\n- 遍历zoneNoExecuteTainter，获得一个zone的node队列，从队列中获取一个node，执行下面步骤\n- 从缓存中获取node\n- 如果node ready condition为false，移除“node.kubernetes.io/unreachable”的taint，添加“node.kubernetes.io/not-ready” 的taint，Effect为NoExecute。\n- 如果node ready condition为unknown，移除“node.kubernetes.io/not-ready” 的taint，添加“node.kubernetes.io/unreachable” 的taint，Effect为NoExecute。\n\n#### 3.6 monitorNodeHealth\n\n(3.6该章节摘自https://midbai.com/post/node-lifecycle-controller-manager/)\n\n无论是否启用了 `TaintBasedEvictions` 特性，需要打 taint 或者驱逐 pod 的 node 都会被放在 zoneNoExecuteTainter 或者 zonePodEvictor 队列中，而 `nc.monitorNodeHealth` 就是这两个队列中数据的生产者。`nc.monitorNodeHealth` 的主要功能是持续监控 node 的状态，当 node 处于异常状态时更新 node 的 taint 以及 node 上 pod 的状态或者直接驱逐 node 上的 pod，此外还会为集群下的所有 node 划分 zoneStates 并为每个 zoneStates 设置对应的驱逐速率。\n\n每隔nodeMonitorPeriod周期，执行一次monitorNodeHealth，维护node状态和zone的状态，更新未响应的node–设置node status为unknown和根据集群不同状态设置zone的速率。\n\n##### 3.6.1  node分类并初始化\n\n从缓存中获取所有node列表，借助两个字段knownNodeSet（用来存放已经发现的node集合）和zoneStates（用来存储已经发现zone的状态–状态有Initial、Normal、FullDisruption、PartialDisruption）来对node进行分类，分为新加的–add、删除的deleted、新的zone node–newZoneRepresentatives。\n\n对新发现的zone进行初始化–启用taint manager，设置执行node设置taint 队列zoneNoExecuteTainter（存放node为unready，需要添加taint）的速率为evictionLimiterQPS。未启用taint manager，设置安排node执行驱逐队列zonePodEvictor（存放zone里的需要执行pod evictor的node列表）的速率evictionLimiterQPS。同时在zoneStates里设置zone状态为stateInitial。\n\n对新发现的node，添加到knownNodeSet，同时在zoneStates里设置zone状态为stateInitial，如果node的所属的zone未初始化，则进行初始化。启用taint 
manager，标记node为健康的–移除node上unreachable和notready taint（如果存在），从zoneNoExecuteTainter（存放node为unready，需要添加taint）队列中移除（如果存在）。未启用taint manager，初始化nodeEvictionMap（存放node驱逐执行pod的进度）–设置node的状态为unmarked，从zonePodEvictor（存放zone的需要pod evictor的node列表）队列中移除。\n\n对删除的node，发送一个RemovingNode事件并从knownNodeSet里移除。\n\n##### 3.6.2 处理node status\n\n**超时时间**\n\n如果当前node的ready condition为空，说明node刚注册，所以它的超时时间为nodeStartupGracePeriod，否则它的超时时间为nodeMonitorGracePeriod。\n\n**心跳时间**\n\n最后的心跳时间（probeTimestamp和readyTransitionTimestamp），由下面规则从上往下执行。\n\n如果node刚注册，则nodeHealthMap保存的probeTimestamp和readyTransitionTimestamp都为node的创建时间。\n\n如果nodeHealthMap里没有该node数据，则probeTimestamp和readyTransitionTimestamp都为现在。\n\n如果nodeHealthMap里的 ready condition没有，而现在有ready condition，则probeTimestamp和readyTransitionTimestamp都为现在，status为现在的status。\n\n如果nodeHealthMap里的有ready condition，而现在的ready condition没有，说明发生了未知的异常情况（一般不会发生，只是预防性的代码），则probeTimestamp和readyTransitionTimestamp都为现在，status为现在的status。\n\n如果nodeHealthMap里有ready condition，而现在的ready condition也有，且保存的LastHeartbeatTime与现在不一样。probeTimestamp为现在、status为现在的status。 如果保存的LastTransitionTime与现在的不一样，说明node状态发生了变化，则设置nodeHealthMap的readyTransitionTimestamp为现在。\n\n如果现在的lease存在，且lease的RenewTime在nodeHealthMap保存的RenewTime之后，或者nodeHealthMap里不存在。则probeTimestamp为现在，保存现在lease到nodeHealthMap里。\n\n**尝试更新node状态**\n\n如果probeTimestamp加上超时时间，在现在之前–即status状态更新已经超时，则会更新update node。\n\n更新ready、memorypressure、diskpressure、pidpressure的condition为：\n\n相应condition不存在\n\n```\nv1.NodeCondition{\n\t\tType:               nodeConditionType,//上面的四种类型\n\t\tStatus:             v1.ConditionUnknown,// unknown\n\t\tReason:             \"NodeStatusNeverUpdated\",\n\t\tMessage:            \"Kubelet never posted node status.\",\n\t\tLastHeartbeatTime:  node.CreationTimestamp,//node创建时间\n\t\tLastTransitionTime: nowTimestamp, //现在时间\n}\n```\n\n相应的condition存在\n\n````\ncurrentCondition.Status = v1.ConditionUnknown \ncurrentCondition.Reason = \"NodeStatusUnknown\" \ncurrentCondition.Message = \"Kubelet stopped posting node 
status.\" \ncurrentCondition.LastTransitionTime = nowTimestamp\n````\n\n如果现在node与之前的node不一样的–发生了更新，则对node执行update。\n\nupdate成功，同时更新nodeHealthMap上的状态–readyTransitionTimestamp改为现在，status改为现在的node.status。\n\n**对unready node进行处理–驱逐pod**\n\nnode当前的ReadyCondition–执行尝试更新node状态之后的node的ReadyCondition\n\nnode最后发现ReadyCondition–执行尝试更新node状态之前node的ReadyCondition\n\n如果当前的ReadyCondition不为空，执行下面操作\n\n1. 从缓存中获取node上pod列表\n2. 如果启用taint manager，执行processTaintBaseEviction–根据node最后发现ReadyCondition 对node的taint进行操作\n   1. node最后发现ReadyCondition为false，如果已经有“node.kubernetes.io/unreachable”的taint，将该taint删除，添加“node.kubernetes.io/not-ready” 的taint。否则将node添加到zoneNoExecuteTainter队列中，等待添加taint。\n   2. node最后发现ReadyCondition为unknown，如果已经有“node.kubernetes.io/not-ready” 的taint，将该taint删除，添加“node.kubernetes.io/unreachable”的taint。否则将node添加到zoneNoExecuteTainter队列中，等待添加taint。\n   3. node最后发现ReadyCondition为true，移除“node.kubernetes.io/not-ready” 和“node.kubernetes.io/unreachable”的taint，如果存在的话，同时从zoneNoExecuteTainter队列中移除。\n3. 未启用taint manager，则执行processNoTaintBaseEviction\n   - node最后发现ReadyCondition为false，nodeHealthMap里的readyTransitionTimestamp加上podEvictionTimeout的时间是过去的时间–ReadyCondition为false状态已经持续了至少podEvictionTimeout，执行evictPods。\n   - node最后发现ReadyCondition为unknown，nodeHealthMap里的readyTransitionTimestamp加上podEvictionTimeout的时间是过去的时间–ReadyCondition为false状态已经持续了至少podEvictionTimeout，执行evictPods。\n   - node最后发现ReadyCondition为true，则执行cancelPodEviction–在nodeEvictionMap设置status为unmarked，然后node从zonePodEvictor队列中移除。\n   - evictPods–先从nodeEvictionMap获取node驱逐的状态，如果是evicted说明node已经发生驱逐，则把node上所有的pod删除。否则设置状态为toBeEvicted，然后node加入zonePodEvictor队列等待执行驱逐pod。\n\n**这里有个疑问**：\n\n为什么要用observedReadyCondition 而不用currentReadyCondition，observedReadyCondition和currentReadyCondition不一定一样？\n\n比如node挂了currentReadyCondition变为unknown，而observedReadyCondition为ready\n\n这样明显有问题，这一周期不会做驱逐或taint，下一周期observedReadyCondition和currentReadyCondition都为unknown 一定会驱逐pod或添加taint。\n\n可能考虑nodeMonitorPeriod都很短，不立马执行驱逐或taint没有什么大问题。\n\n###### 3.6.3 
集群健康状态处理\n\n每个zone有四种状态，stateInitial（刚加入的zone）、stateFullDisruption（全挂）、statePartialDisruption（挂的node比例超出了unhealthyZoneThreshold）、stateNormal（剩下的所有情况）\n\nallAreFullyDisrupted代表现在所有zone状态stateFullDisruption全挂\n\nallWasFullyDisrupted为true代表过去所有zone状态stateFullDisruption全挂\n\n集群状态有四种：\n\n- allAreFullyDisrupted为true allWasFullyDisrupted为true\n- allAreFullyDisrupted为true allWasFullyDisrupted为false\n- allAreFullyDisrupted为false allWasFullyDisrupted为true\n- allAreFullyDisrupted为false allWasFullyDisrupted为false\n\n**计算现在集群的状态**\n\n遍历现在所有的zone，每个zone遍历所有node的ready condition，计算出zone的状态。\n\n根据zone的状态设置allAreFullyDisrupted的值\n\n如果zone不在zoneStates，添加进zoneStates并设置状态为stateInitial\n\n**计算过去集群的状态**\n\n从zoneStates读取保存的zone列表，如果不在现在的zone列表里，则从zoneStates移除\n\n根据zoneStates里保存的zone状态设置allWasFullyDisrupted值\n\n**设置zone 每秒安排多少个node来执行taint或驱逐**\n\n当allAreFullyDisrupted为true allWasFullyDisrupted为false–之前zone未全挂，现在所有zone全挂。\n\n1. 遍历所有node，设置node为正常状态。\n   - 启用taint manager，执行markNodeAsReachable–移除“node.kubernetes.io/not-ready”和“node.kubernetes.io/unreachable”的taint，如果存在的话，同时从zoneNoExecuteTainter队列中移除\n   - 未启用taint manager，执行cancelPodEviction–在nodeEvictionMap设置status为unmarked，然后node从zonePodEvictor队列中移除\n2. 从zoneStates读取保存的zone列表，设置zone 每秒安排多少个node来执行taint或驱逐\n   - 启用taint manager，设置zoneNoExecuteTainter的速率为0\n   - 未启用taint manager， 设置zonePodEvictor的速率为0\n3. 设置所有zoneStates里的zone为stateFullDisruption\n\n当 allAreFullyDisrupted为false allWasFullyDisrupted为true–过去所有zone全挂，现在所有zone未全挂\n\n1. 遍历所有node更新nodeHealthMap里的probeTimestamp、readyTransitionTimestamp为现在的时间戳\n2. 
遍历zoneStates，重新评估zone的每秒安排多少个node来执行taint或驱逐\n   - 当zone的状态为stateNormal，如果启用taint manager，则zoneNoExecuteTainter速率设置为evictionLimiterQPS，否则，设置zonePodEvictor的速率为evictionLimiterQPS的速率\n   - 当zone状态为statePartialDisruption，如果启用taint manager，根据zone里的node数量，当node数量大于largeClusterThreshold，设置zoneNoExecuteTainter速率为secondaryEvictionLimiterQPS；小于等于largeClusterThreshold，设置zoneNoExecuteTainter速率为0。未启用taint manager，根据zone里的node数量，当node数量大于largeClusterThreshold，设置zonePodEvictor速率为secondaryEvictionLimiterQPS；小于等于largeClusterThreshold，设置zonePodEvictor速率为0。\n   - 当zone状态为stateFullDisruption，如果启用taint manager，则zoneNoExecuteTainter速率设置为evictionLimiterQPS，否则，设置zonePodEvictor的速率为evictionLimiterQPS的速率\n   - 这里不处理stateInitial状态的zone，因为下一周期，zone会变成非stateInitial，下面就是处理这个情况的\n\n除了上面两种情况，还有一个情况要进行处理，allAreFullyDisrupted为false allWasFullyDisrupted为false，就是没有发生集群所有zone全挂。这个时候zone有可能发生状态转换，所以需要重新评估zone的速率\n\n1. 遍历zoneStates，当保存的状态和新的状态不一致的时候–zone状态发生了变化，重新评估zone的速率\n   - 当zone的状态为stateNormal，如果启用taint manager，则zoneNoExecuteTainter速率设置为evictionLimiterQPS，否则，设置zonePodEvictor的速率为evictionLimiterQPS的速率\n   - 当zone状态为statePartialDisruption，如果启用taint manager，根据zone里的node数量，当node数量大于largeClusterThreshold，设置zoneNoExecuteTainter速率为secondaryEvictionLimiterQPS；小于等于largeClusterThreshold，设置zoneNoExecuteTainter速率为0。未启用taint manager，根据zone里的node数量，当node数量大于largeClusterThreshold，设置zonePodEvictor速率为secondaryEvictionLimiterQPS；小于等于largeClusterThreshold，设置zonePodEvictor速率为0。\n   - 当zone状态为stateFullDisruption，如果启用taint manager，则zoneNoExecuteTainter速率设置为evictionLimiterQPS，否则，设置zonePodEvictor的速率为evictionLimiterQPS的速率\n2. 
zoneStates里的状态更新为新的状态\n\n而allAreFullyDisrupted为true allWasFullyDisrupted为true，集群一直都是全挂状态，不需要处理，zone状态没有发生改变。\n\n<br>\n\n### 4 总结\n\nnodeLifecycleController核心逻辑如下：\n\n启动了以下协程：\n\n（1）monitorNodeHealth 更新node的状态，并且根据BaseTaint是否开启，将需要处理的node加入zoneNoExecuteTainter或者zonePodEvictor队列，实现按照速率驱逐\n\n（2）doNodeProcessingPassWorker 监听Node，根据node状态设置NoSchedule污点（这个只影响调度，和驱逐无关）\n\n（3）如果开启了BaseTaint, 那么就会执行doNoExecuteTaintingPass从zoneNoExecuteTainter取出node设置污点（这里可以控制设置污点的速率）\n\n同时如果开启了BaseTaint，taintManager就会run, 进行pod的驱逐\n\n（4）如果不开启BaseTaint, 那么就会启动doEvictionPass从zonePodEvictor取出node，进行pod驱逐\n\n（5）doPodProcessingWorker会监听pod，设置pod状态，如果没有开启BaseTaint，还会进行pod的驱逐\n\n\n\n![image-20220811161358504](../images/taintManager-2.png)\n\n<br>\n\n一般而言，kcm有2种设置：\n\npod-eviction-timeout：默认5分钟\n\nenable-taint-manager，TaintBasedEvictions：默认true\n\n（1）开启污点驱逐，或者使用默认值\n\n```\n--pod-eviction-timeout=5m --enable-taint-manager=true --feature-gates=TaintBasedEvictions=true\n```\n\n这个时候**pod-eviction-timeout是不起作用的**，只要node有NoExecute污点，Pod会马上被驱逐。（变更Kubelet的时候要小心这个坑）\n\n（2）不开启污点驱逐\n\n```\n--pod-eviction-timeout=5m --enable-taint-manager=false --feature-gates=TaintBasedEvictions=false\n```\n\n这个时候**pod-eviction-timeout是起作用的**，node notReady 5分钟后，pod才会被驱逐。\n"
  },
  {
    "path": "k8s/kcm/11.k8s node状态更新机制 .md",
    "content": "**注意**\n\n为了防止参考链接失效，本文摘抄自：https://www.qikqiak.com/post/kubelet-sync-node-status/\n\n\n\n当 Kubernetes 中 Node 节点出现状态异常的情况下，节点上的 Pod 会被重新调度到其他节点上去，但是有的时候我们会发现节点 Down 掉以后，Pod 并不会立即触发重新调度，这实际上就是和 Kubelet 的状态更新机制密切相关的，Kubernetes 提供了一些参数配置来控制触发重新调度的时间，下面我们来分析下 Kubelet 状态更新的基本流程。\n\n1. kubelet 自身会定期更新状态到 apiserver，通过参数`--node-status-update-frequency`指定上报频率，默认是 10s 上报一次。\n2. kube-controller-manager 会每隔`--node-monitor-period`时间去检查 kubelet 的状态，默认是 5s。\n3. 当 node 失联一段时间后，kubernetes 判定 node 为 `notready` 状态，这段时长通过`--node-monitor-grace-period`参数配置，默认 40s。\n4. 当 node 失联一段时间后，kubernetes 判定 node 为 `unhealthy` 状态，这段时长通过`--node-startup-grace-period`参数配置，默认 1m0s。\n5. 当 node 失联一段时间后，kubernetes 开始删除原 node 上的 pod，这段时长是通过`--pod-eviction-timeout`参数配置，默认 5m0s。\n\n> kube-controller-manager 和 kubelet 是异步工作的，这意味着延迟可能包括任何的网络延迟、apiserver 的延迟、etcd 延迟，一个节点上的负载引起的延迟等等。因此，如果`--node-status-update-frequency`设置为5s，那么实际上 etcd 中的数据变化可能会需要 6-7s，甚至更长时间。\n\nKubelet 在更新状态失败时，会进行`nodeStatusUpdateRetry`次重试，默认为 5 次。\n\nKubelet 会在函数`tryUpdateNodeStatus`中尝试进行状态更新。Kubelet 使用了 Golang 中的`http.Client()`方法，但是没有指定超时时间，因此，如果 API Server 过载，建立 TCP 连接时可能会出现一些故障。\n\n因此，在`nodeStatusUpdateRetry` * `--node-status-update-frequency`时间后才会更新一次节点状态。\n\n同时，Kubernetes 的 controller manager 将尝试每`--node-monitor-period`时间周期内检查`nodeStatusUpdateRetry`次。在`--node-monitor-grace-period`之后，会认为节点 unhealthy，然后会在`--pod-eviction-timeout`后删除 Pod。\n\nkube proxy 有一个 watcher API，一旦 Pod 被驱逐了，kube proxy 将会通知更新节点的 iptables 规则，将 Pod 从 Service 的 Endpoints 中移除，这样就不会访问到来自故障节点的 Pod 了。\n\n## 配置\n\n对于这些参数的配置，需要根据不同的集群规模场景来进行配置。\n\n### 社区默认的配置\n\n| 参数                          | 值   |\n| :---------------------------- | :--- |\n| –node-status-update-frequency | 10s  |\n| –node-monitor-period          | 5s   |\n| –node-monitor-grace-period    | 40s  |\n| –pod-eviction-timeout         | 5m   |\n\n### 快速更新和快速响应\n\n| 参数                          | 值   |\n| :---------------------------- | :--- |\n| –node-status-update-frequency | 4s   |\n| 
–node-monitor-period          | 2s   |\n| –node-monitor-grace-period    | 20s  |\n| –pod-eviction-timeout         | 30s  |\n\n在这种情况下，Pod 将在 50s 被驱逐，因为该节点在 20s 后被视为Down掉了，`--pod-eviction-timeout`在 30s 之后发生。但是，这种情况会给 etcd 产生很大的开销，因为每个节点都会尝试每 4s 更新一次状态。\n\n如果环境有1000个节点，那么每分钟将有15000次节点更新操作，这可能需要大型 etcd 容器甚至是 etcd 的专用节点。\n\n> 如果我们计算尝试次数，则除法将给出5，但实际上每次尝试的 nodeStatusUpdateRetry 尝试将从3到5。 由于所有组件的延迟，尝试总次数将在15到25之间变化。\n\n### 中等更新和平均响应\n\n| 参数                          | 值   |\n| :---------------------------- | :--- |\n| –node-status-update-frequency | 20s  |\n| –node-monitor-period          | 5s   |\n| –node-monitor-grace-period    | 2m   |\n| –pod-eviction-timeout         | 1m   |\n\n这种场景下会 20s 更新一次 node 状态，controller manager 认为 node 状态不正常之前，会有 2*60/20*5=30 次的 node 状态更新，Node 状态为 down 之后 1m，就会触发驱逐操作。\n\n如果有 1000 个节点，1分钟之内就会有 60s/20s*1000=3000 次的节点状态更新操作。\n\n### 低更新和慢响应\n\n| 参数                          | 值   |\n| :---------------------------- | :--- |\n| –node-status-update-frequency | 1m   |\n| –node-monitor-period          | 5s   |\n| –node-monitor-grace-period    | 5m   |\n| –pod-eviction-timeout         | 1m   |\n\nKubelet 将会 1m 更新一次节点的状态，在认为不健康之后会有 5m/1m*5=25 次重试更新的机会。Node为不健康的时候，1m 之后 pod开始被驱逐。\n\n可以有不同的组合，例如快速更新和慢响应以满足特定情况。\n\n原文链接: https://github.com/kubernetes-sigs/kubespray/blob/master/docs/kubernetes-reliability.md"
  },
  {
    "path": "k8s/kcm/2-deployment controller-manager源码分析.md",
    "content": "Table of Contents\n=================\n\n  * [1. deploy基础概念](#1-deploy基础概念)\n     * [1.1. metadata.generation &amp; status.observedGeneration](#11-metadatageneration--statusobservedgeneration)\n     * [1.2. metadata.resourceVersion](#12-metadataresourceversion)\n     * [1.3 status](#13-status)\n  * [2. startDeploymentController](#2-startdeploymentcontroller)\n  * [3. NewDeploymentController](#3-newdeploymentcontroller)\n  * [4. 对deploy, rs, pod的处理](#4-对deploy-rs-pod的处理)\n     * [4.1 add,update, del deploy](#41-addupdate-del-deploy)\n     * [4.2 add,update,del ReplicaSet](#42-addupdatedel-replicaset)\n     * [4.3 del pod](#43-del-pod)\n     * [4.4 getDeploymentForPod](#44-getdeploymentforpod)\n     * [4.5 总结](#45-总结)\n  * [5. syncDeployment](#5-syncdeployment)\n     * [5.1 删除deploy](#51-删除deploy)\n        * [5.1.1 getAllReplicaSetsAndSyncRevision](#511-getallreplicasetsandsyncrevision)\n        * [5.1.2 syncDeploymentStatus](#512-syncdeploymentstatus)\n        * [5.1.3 总结](#513-总结)\n     * [5.2 pause操作](#52-pause操作)\n     * [5.3 Rollback操作](#53-rollback操作)\n     * [5.4 scale操作](#54-scale操作)\n        * [5.4.1 获得最新的一个activeRs](#541-获得最新的一个activers)\n        * [5.4.2 如果newRS已经是期望状态，将所有的oldRS缩到0](#542-如果newrs已经是期望状态将所有的oldrs缩到0)\n     * [5.5 recreate更新](#55-recreate更新)\n     * [5.6 rolloutRolling更新](#56-rolloutrolling更新)\n        * [5.6.1 如果是scaledUp（针对news），返回 syncRolloutStatus](#561-如果是scaledup针对news返回-syncrolloutstatus)\n     * [5.7 scaleReplicaSetAndRecordEvent](#57-scalereplicasetandrecordevent)\n\n### 1. 
deploy基础概念\n\n```\nroot@k8s-master# kubectl get deploy nginx-deployment -oyaml\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  annotations:\n    deployment.kubernetes.io/revision: \"2\"            // 这个是版本号，说明这是第二个版本。\n  generation: 4                                       // 这里有个 generation \n  labels:\n    app: nginx\n  name: nginx-deployment\n  resourceVersion: \"59522723\"\n  selfLink: /apis/apps/v1/namespaces/default/deployments/nginx-deployment\n  uid: a6830e24-a479-452d-bbb2-3cb3cad82ebf\nspec:\n  progressDeadlineSeconds: 600\n  replicas: 2\n  revisionHistoryLimit: 2                            // 这个表明只保留2个版本。\n  selector:\n    matchLabels:\n      app: nginx\n  strategy:\n    rollingUpdate:\n      maxSurge: 25%                          // 滚动更新的时候，不是一次就更新完了，而是一批一批的更新\n      maxUnavailable: 25%                    //  升级过程中最多有多少个 pod 处于无法提供服务的状态\n    type: RollingUpdate\n  template:\n    metadata:\n      labels:\n        app: nginx\n    spec:\n      containers:\n      - image: nginx\n        imagePullPolicy: Always\n        name: nginx\n        ports:\n        - containerPort: 8080\n          name: test1\n          protocol: TCP\n        resources: {}\n        terminationMessagePath: /dev/termination-log\n        terminationMessagePolicy: File\n      dnsPolicy: ClusterFirst\n      restartPolicy: Always\n      schedulerName: default-scheduler\n      securityContext: {}\n      terminationGracePeriodSeconds: 30\nstatus:\n  availableReplicas: 2\n  conditions:\n  - lastTransitionTime: \"2020-11-28T08:35:07Z\"\n    lastUpdateTime: \"2020-12-01T02:36:27Z\"\n    message: ReplicaSet \"nginx-deployment-59bc6679cd\" has successfully progressed.\n    reason: NewReplicaSetAvailable\n    status: \"True\"\n    type: Progressing\n  - lastTransitionTime: \"2020-12-01T02:44:17Z\"\n    lastUpdateTime: \"2020-12-01T02:44:17Z\"\n    message: Deployment has minimum availability.\n    reason: MinimumReplicasAvailable\n    status: \"True\"\n    type: Available\n  
observedGeneration: 4      //这里也有一个\n  readyReplicas: 2\n  replicas: 2\n  updatedReplicas: 2\n```\n\n#### 1.1. metadata.generation & status.observedGeneration\n\n这两个是对应的，metadata.generation 就是这个 Deployment 的元配置数据被修改了多少次。这里就有个版本迭代的概念。每次我们使用 kubectl edit 来修改 Deployment 的配置文件，或者更新镜像，这个generation都会增长1，表示增加了一个版本。\n\n这个版本迭代是配置文件只要有改动就进行版本迭代。observedGeneration就是最近观察到的可用的版本迭代。这两个只有在镜像升级的时候有可能不同，当我们使用 `kubectl rollout status` 来探测一个deployment的状态的时候，就是检查observedGeneration是否大于等于generation。\n\n```\nroot@k8s-master:~# kubectl rollout status deployment kube-hpa -n kube-system\ndeployment \"kube-hpa\" successfully rolled out\n```\n\n<br>\n\n#### 1.2. metadata.resourceVersion\n\n每个资源在底层数据库都有版本的概念，我们可以使用 watch 来看某个资源在某个版本之后的操作。这些操作是存储在 etcd 中的。当然，并不是所有的操作都会永久存储，只会保留一段有限时间内的操作。这个 resourceVersion 就是这个资源对象当前的版本号。\n\n#### 1.3 status\n\nreplicas 实际的 pod 副本数\navailableReplicas 现在可用的 Pod 的副本数量，有的副本可能还处在未准备好，或者初始化状态\nreadyReplicas 是处于 ready 状态的 Pod 的副本数量\nfullyLabeledReplicas 意思是这个 ReplicaSet 的标签 selector 对应的副本数量，不同维度的一种统计\n\n<br>\n\n### 2. startDeploymentController\n\nkcm启动时，NewControllerInitializers里面定义了所有要启动的manager，如下：\n\n```\n// NewControllerInitializers is a public map of named controller groups (you can start more than one in an init func)\n// paired to their InitFunc.  
This allows for structured downstream composition and subdivision.\nfunc NewControllerInitializers(loopMode ControllerLoopMode) map[string]InitFunc {\n\tcontrollers := map[string]InitFunc{}\n\tcontrollers[\"endpoint\"] = startEndpointController\n\tcontrollers[\"endpointslice\"] = startEndpointSliceController\n\tcontrollers[\"replicationcontroller\"] = startReplicationController\n\tcontrollers[\"podgc\"] = startPodGCController\n\tcontrollers[\"resourcequota\"] = startResourceQuotaController\n\tcontrollers[\"namespace\"] = startNamespaceController\n\tcontrollers[\"serviceaccount\"] = startServiceAccountController\n\tcontrollers[\"garbagecollector\"] = startGarbageCollectorController\n\tcontrollers[\"daemonset\"] = startDaemonSetController\n\tcontrollers[\"job\"] = startJobController\n\tcontrollers[\"deployment\"] = startDeploymentController   //启动 deploymentController\n\tcontrollers[\"replicaset\"] = startReplicaSetController\n\tcontrollers[\"horizontalpodautoscaling\"] = startHPAController\n\tcontrollers[\"disruption\"] = startDisruptionController\n\tcontrollers[\"statefulset\"] = startStatefulSetController\n\tcontrollers[\"cronjob\"] = startCronJobController\n\tcontrollers[\"csrsigning\"] = startCSRSigningController\n\tcontrollers[\"csrapproving\"] = startCSRApprovingController\n\tcontrollers[\"csrcleaner\"] = startCSRCleanerController\n\tcontrollers[\"ttl\"] = startTTLController\n\tcontrollers[\"bootstrapsigner\"] = startBootstrapSignerController\n\tcontrollers[\"tokencleaner\"] = startTokenCleanerController\n\tcontrollers[\"nodeipam\"] = startNodeIpamController\n\tcontrollers[\"nodelifecycle\"] = startNodeLifecycleController\n\tif loopMode == IncludeCloudLoops {\n\t\tcontrollers[\"service\"] = startServiceController\n\t\tcontrollers[\"route\"] = startRouteController\n\t\tcontrollers[\"cloud-node-lifecycle\"] = startCloudNodeLifecycleController\n\t\t// TODO: volume controller into the IncludeCloudLoops only set.\n\t}\n\tcontrollers[\"persistentvolume-binder\"] = 
startPersistentVolumeBinderController
	controllers["attachdetach"] = startAttachDetachController
	controllers["persistentvolume-expander"] = startVolumeExpandController
	controllers["clusterrole-aggregation"] = startClusterRoleAggregrationController
	controllers["pvc-protection"] = startPVCProtectionController
	controllers["pv-protection"] = startPVProtectionController
	controllers["ttl-after-finished"] = startTTLAfterFinishedController
	controllers["root-ca-cert-publisher"] = startRootCACertPublisher

	return controllers
}
```

<br>

A Deployment essentially controls ReplicaSets, and each ReplicaSet in turn controls Pods; the controllers then drive these objects toward their desired state. The deployment controller therefore needs to watch changes to three resources: deployments, replicasets, and pods.

```go
cmd/kube-controller-manager/app/apps.go
func startDeploymentController(ctx ControllerContext) (http.Handler, bool, error) {
   // check whether the deployments resource is available in this cluster
   if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}] {
      return nil, false, nil
   }
   dc, err := deployment.NewDeploymentController(
      ctx.InformerFactory.Apps().V1().Deployments(),
      ctx.InformerFactory.Apps().V1().ReplicaSets(),
      ctx.InformerFactory.Core().V1().Pods(),
      ctx.ClientBuilder.ClientOrDie("deployment-controller"),
   )
   if err != nil {
      return nil, true, fmt.Errorf("error creating Deployment controller: %v", err)
   }
   go dc.Run(int(ctx.ComponentConfig.DeploymentController.ConcurrentDeploymentSyncs), ctx.Stop)
   return nil, true, nil
}
```

As with the other controllers, after NewDeploymentController comes Run. The call chain is:

Run -> worker -> processNextWorkItem -> syncHandler

At construction time (NewDeploymentController), dc.syncHandler = dc.syncDeployment.

<br>

### 3. 
NewDeploymentController

```go
// NewDeploymentController creates a new DeploymentController.
func NewDeploymentController(dInformer appsinformers.DeploymentInformer, rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, client clientset.Interface) (*DeploymentController, error) {
   // event recording
   eventBroadcaster := record.NewBroadcaster()
   eventBroadcaster.StartLogging(glog.Infof)
   eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: client.CoreV1().Events("")})

   if client != nil && client.CoreV1().RESTClient().GetRateLimiter() != nil {
      if err := metrics.RegisterMetricAndTrackRateLimiterUsage("deployment_controller", client.CoreV1().RESTClient().GetRateLimiter()); err != nil {
         return nil, err
      }
   }
   dc := &DeploymentController{
      client:        client,
      eventRecorder: eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "deployment-controller"}),
      queue:         workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "deployment"),
   }
   dc.rsControl = controller.RealRSControl{
      KubeClient: client,
      Recorder:   dc.eventRecorder,
   }

   dInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
      AddFunc:    dc.addDeployment,
      UpdateFunc: dc.updateDeployment,
      // This will enter the sync loop and no-op, because the deployment has been deleted from the store.
      DeleteFunc: dc.deleteDeployment,
   })
   rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
      AddFunc:    dc.addReplicaSet,
      UpdateFunc: dc.updateReplicaSet,
      DeleteFunc: dc.deleteReplicaSet,
   })
   
  // for pods, only deletions are handled (the reason is explained in 4.5)
   podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
      DeleteFunc: dc.deletePod,
   })

   dc.syncHandler = dc.syncDeployment
   dc.enqueueDeployment = dc.enqueue

   dc.dLister = dInformer.Lister()
   
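// Note: the Lister()s assigned here expose the informer's local cache as
   // read-only indexed stores, and the HasSynced funcs saved below are what
   // Run waits on (via cache.WaitForCacheSync) before starting any workers,
   // so the handlers never operate against a cold cache.
   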
dc.rsLister = rsInformer.Lister()
   dc.podLister = podInformer.Lister()
   dc.dListerSynced = dInformer.Informer().HasSynced
   dc.rsListerSynced = rsInformer.Informer().HasSynced
   dc.podListerSynced = podInformer.Informer().HasSynced
   return dc, nil
}
```

From this we can see what the controller watches:

add/update/delete of deployments, add/update/delete of replicasets, and deletion of pods.

What follows is Run -> worker -> processNextWorkItem -> syncDeployment.

<br>

### 4. Handling deploy, rs and pod events

The previous section registered handler functions such as addDeployment, deleteDeployment, and addReplicaSet. Let's see what these functions actually do.

#### 4.1 add, update, del deploy

Every deployment change is simply enqueued:

```go
func (dc *DeploymentController) addDeployment(obj interface{}) {
	d := obj.(*apps.Deployment)
	glog.V(4).Infof("Adding deployment %s", d.Name)
	dc.enqueueDeployment(d)
}

func (dc *DeploymentController) updateDeployment(old, cur interface{}) {
	oldD := old.(*apps.Deployment)
	curD := cur.(*apps.Deployment)
	glog.V(4).Infof("Updating deployment %s", oldD.Name)
	dc.enqueueDeployment(curD)
}

func (dc *DeploymentController) deleteDeployment(obj interface{}) {
	d, ok := obj.(*apps.Deployment)
	if !ok {
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			utilruntime.HandleError(fmt.Errorf("Couldn't get object from tombstone %#v", obj))
			return
		}
		d, ok = tombstone.Obj.(*apps.Deployment)
		if !ok {
			utilruntime.HandleError(fmt.Errorf("Tombstone contained object that is not a Deployment %#v", obj))
			return
		}
	}
	glog.V(4).Infof("Deleting deployment %s", d.Name)
	dc.enqueueDeployment(d)
}
```

<br>

#### 4.2 add, update, del ReplicaSet

```go
// addReplicaSet enqueues the deployment that manages a ReplicaSet when the ReplicaSet is created.
func (dc *DeploymentController) addReplicaSet(obj interface{}) {
	rs := obj.(*apps.ReplicaSet)
    // 1. if the rs is already pending deletion, handle it as a delete and return
	if rs.DeletionTimestamp != nil {
		// On a restart of the controller manager, it's possible for an object to
		// show up in a state that is already pending 
deletion.
		dc.deleteReplicaSet(rs)
		return
	}
    
    // 2. if its ownerRef points at a deployment, enqueue that deployment
	// If it has a ControllerRef, that's all that matters.
	if controllerRef := metav1.GetControllerOf(rs); controllerRef != nil {
		d := dc.resolveControllerRef(rs.Namespace, controllerRef)
		if d == nil {
			return
		}
		klog.V(4).Infof("ReplicaSet %s added.", rs.Name)
		dc.enqueueDeployment(d)
		return
	}

    // 3. otherwise it is an orphan rs; use the labels to find deployments that might adopt it
	// Otherwise, it's an orphan. Get a list of all matching Deployments and sync
	// them to see if anyone wants to adopt it.
	ds := dc.getDeploymentsForReplicaSet(rs)
	if len(ds) == 0 {
		return
	}
	klog.V(4).Infof("Orphan ReplicaSet %s added.", rs.Name)
	for _, d := range ds {
		dc.enqueueDeployment(d)
	}
}
```



```go
// updateReplicaSet figures out what deployment(s) manage a ReplicaSet when the ReplicaSet
// is updated and wake them up. If the anything of the ReplicaSets have changed, we need to
// awaken both the old and new deployments. old and cur must be *apps.ReplicaSet
// types.
func (dc *DeploymentController) updateReplicaSet(old, cur interface{}) {
	curRS := cur.(*apps.ReplicaSet)
	oldRS := old.(*apps.ReplicaSet)
	// 1. likewise, comparing ResourceVersion tells us whether the resource actually changed
	if curRS.ResourceVersion == oldRS.ResourceVersion {
		// Periodic resync will send update events for all known replica sets.
		// Two different versions of the same replica set will always have different RVs.
		return
	}

	curControllerRef := metav1.GetControllerOf(curRS)
	oldControllerRef := metav1.GetControllerOf(oldRS)
	controllerRefChanged := !reflect.DeepEqual(curControllerRef, oldControllerRef)
	// 2. if the ControllerRef changed, first wake up the old owning deployment
	if controllerRefChanged && oldControllerRef != nil {
		// The ControllerRef was changed. 
Sync the old controller, if any.
		if d := dc.resolveControllerRef(oldRS.Namespace, oldControllerRef); d != nil {
			dc.enqueueDeployment(d)
		}
	}
   
    // 3. handle the new object: if it is still owned by a deployment, enqueue that deployment
	// If it has a ControllerRef, that's all that matters.
	if curControllerRef != nil {
		d := dc.resolveControllerRef(curRS.Namespace, curControllerRef)
		if d == nil {
			return
		}
		klog.V(4).Infof("ReplicaSet %s updated.", curRS.Name)
		dc.enqueueDeployment(d)
		return
	}
   
    // 4. orphan rs: since this is an update, nothing needs to happen unless the labels (or ownerRef) changed
	// Otherwise, it's an orphan. If anything changed, sync matching controllers
	// to see if anyone wants to adopt it now.
	labelChanged := !reflect.DeepEqual(curRS.Labels, oldRS.Labels)
	if labelChanged || controllerRefChanged {
		ds := dc.getDeploymentsForReplicaSet(curRS)
		if len(ds) == 0 {
			return
		}
		klog.V(4).Infof("Orphan ReplicaSet %s updated.", curRS.Name)
		for _, d := range ds {
			dc.enqueueDeployment(d)
		}
	}
}
```



```go
// deleteReplicaSet enqueues the deployment that manages a ReplicaSet when
// the ReplicaSet is deleted. obj could be an *apps.ReplicaSet, or
// a DeletionFinalStateUnknown marker item.
func (dc *DeploymentController) deleteReplicaSet(obj interface{}) {
	rs, ok := obj.(*apps.ReplicaSet)

	// When a delete is dropped, the relist will notice a pod in the store not
	// in the list, leading to the insertion of a tombstone object which contains
	// the deleted key/value. Note that this value might be stale. 
If the ReplicaSet
	// changed labels the new deployment will not be woken up till the periodic resync.
	if !ok {
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			utilruntime.HandleError(fmt.Errorf("Couldn't get object from tombstone %#v", obj))
			return
		}
		rs, ok = tombstone.Obj.(*apps.ReplicaSet)
		if !ok {
			utilruntime.HandleError(fmt.Errorf("Tombstone contained object that is not a ReplicaSet %#v", obj))
			return
		}
	}

	controllerRef := metav1.GetControllerOf(rs)
	if controllerRef == nil {
		// No controller should care about orphans being deleted.
		return
	}
	d := dc.resolveControllerRef(rs.Namespace, controllerRef)
	if d == nil {
		return
	}
	klog.V(4).Infof("ReplicaSet %s deleted.", rs.Name)
	// enqueue the owning deployment
	dc.enqueueDeployment(d)
}
```

#### 4.3 del pod

```
// deletePod will enqueue a Recreate Deployment once all of its pods have stopped running.
func (dc *DeploymentController) deletePod(obj interface{}) {
	pod, ok := obj.(*v1.Pod)

	// When a delete is dropped, the relist will notice a pod in the store not
	// in the list, leading to the insertion of a tombstone object which contains
	// the deleted key/value. Note that this value might be stale. 
If the Pod
	// changed labels the new deployment will not be woken up till the periodic resync.
	if !ok {
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			utilruntime.HandleError(fmt.Errorf("Couldn't get object from tombstone %#v", obj))
			return
		}
		pod, ok = tombstone.Obj.(*v1.Pod)
		if !ok {
			utilruntime.HandleError(fmt.Errorf("Tombstone contained object that is not a pod %#v", obj))
			return
		}
	}
	glog.V(4).Infof("Pod %s deleted.", pod.Name)
	// the deployment is only enqueued once all of its pods are deleted; this check applies to the Recreate strategy
	if d := dc.getDeploymentForPod(pod); d != nil && d.Spec.Strategy.Type == apps.RecreateDeploymentStrategyType {
		// Sync if this Deployment now has no more Pods.
		rsList, err := util.ListReplicaSets(d, util.RsListFromClient(dc.client.AppsV1()))
		if err != nil {
			return
		}
		podMap, err := dc.getPodMapForDeployment(d, rsList)
		if err != nil {
			return
		}
		numPods := 0
		for _, podList := range podMap {
			numPods += len(podList.Items)
		}
		if numPods == 0 {
			dc.enqueueDeployment(d)
		}
	}
}
```

Deployment upgrade strategies:

Recreate: delete all existing pods, then create new ones.

RollingUpdate: a rolling, step-by-step replacement strategy. It supports additional parameters, such as the maximum number of unavailable pods and the minimum interval between upgrade steps.

<br>

#### 4.4 getDeploymentForPod

Get the rs from the pod, then get the deployment from the rs.

```
// getDeploymentForPod returns the deployment managing the given Pod.
func (dc *DeploymentController) getDeploymentForPod(pod *v1.Pod) *apps.Deployment {
   // Find the owning replica set
   var rs *apps.ReplicaSet
   var err error
   controllerRef := metav1.GetControllerOf(pod)
   if controllerRef == nil {
      // No controller owns this Pod.
      return nil
   }
   if controllerRef.Kind != apps.SchemeGroupVersion.WithKind("ReplicaSet").Kind {
      // Not a pod owned by a replica set.
      return nil
   }
   rs, err = dc.rsLister.ReplicaSets(pod.Namespace).Get(controllerRef.Name)
   if err != nil || rs.UID != 
controllerRef.UID {
      klog.V(4).Infof("Cannot get replicaset %q for pod %q: %v", controllerRef.Name, pod.Name, err)
      return nil
   }

   // Now find the Deployment that owns that ReplicaSet.
   controllerRef = metav1.GetControllerOf(rs)
   if controllerRef == nil {
      return nil
   }
   return dc.resolveControllerRef(rs.Namespace, controllerRef)
}
```

#### 4.5 Summary

From the above we can see that add/delete/update events on deployments and replicasets can all cause a deployment to be enqueued and then enter syncDeployment.

For pods, only deletion is watched. The reason is that with the Recreate strategy, the deployment must wait until all old pods are deleted before it can create new ones.

<br>

### 5. syncDeployment

<br>

```
// syncDeployment will sync the deployment with the given key.
// This function is not meant to be invoked concurrently with the same key.
func (dc *DeploymentController) syncDeployment(key string) error {
	startTime := time.Now()
	klog.V(4).Infof("Started syncing deployment %q (%v)", key, startTime)
	defer func() {
		klog.V(4).Infof("Finished syncing deployment %q (%v)", key, time.Since(startTime))
	}()

	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}
	deployment, err := dc.dLister.Deployments(namespace).Get(name)
	if errors.IsNotFound(err) {
		klog.V(2).Infof("Deployment %v has been deleted", key)
		return nil
	}
	if err != nil {
		return err
	}

	// Deep-copy otherwise we are mutating our cache.
	// TODO: Deep-copy only when needed.
	d := deployment.DeepCopy()
  
  // 1. if the deployment's selector is empty (selects everything), return directly (after updating observedGeneration if needed)
	everything := metav1.LabelSelector{}
	if reflect.DeepEqual(d.Spec.Selector, &everything) {
		dc.eventRecorder.Eventf(d, v1.EventTypeWarning, "SelectingAll", "This deployment is selecting all pods. A non-empty selector is required.")
		if d.Status.ObservedGeneration < d.Generation {
			d.Status.ObservedGeneration = d.Generation
			dc.client.AppsV1().Deployments(d.Namespace).UpdateStatus(d)
		}
		return nil
	}
  
  // 2. 
get the rsList for this deploy, then build a map of all its pods keyed by rs
	// List ReplicaSets owned by this Deployment, while reconciling ControllerRef
	// through adoption/orphaning.
	rsList, err := dc.getReplicaSetsForDeployment(d)
	if err != nil {
		return err
	}
	// List all Pods owned by this Deployment, grouped by their ReplicaSet.
	// Current uses of the podMap are:
	//
	// * check if a Pod is labeled correctly with the pod-template-hash label.
	// * check that no old Pods are running in the middle of Recreate Deployments.
	podMap, err := dc.getPodMapForDeployment(d, rsList)
	if err != nil {
		return err
	}
  
  // 3. if the deployment is being deleted, only sync its status
	if d.DeletionTimestamp != nil {
		return dc.syncStatusOnly(d, rsList)
	}

  // 4. check whether it is in the paused state
	// Update deployment conditions with an Unknown condition when pausing/resuming
	// a deployment. In this way, we can be sure that we won't timeout when a user
	// resumes a Deployment with a set progressDeadlineSeconds.
	if err = dc.checkPausedConditions(d); err != nil {
		return err
	}

  // if paused, just sync status
	if d.Spec.Paused {
		return dc.sync(d, rsList)
	}

	// rollback is not re-entrant in case the underlying replica sets are updated with a new
	// revision so we should ensure that we won't proceed to update replica sets until we
	// make sure that the deployment has cleaned up its rollback spec in subsequent enqueues.
	// 5. if the annotations contain the deprecated.deployment.rollback.to key, perform a rollback
	if getRollbackTo(d) != nil {
		return dc.rollback(d, rsList)
	}

  // 6. check whether the deployment is in the middle of a scaling event
	scalingEvent, err := dc.isScalingEvent(d, rsList)
	if err != nil {
		return err
	}
	if scalingEvent {
		return dc.sync(d, rsList)
	}

  // 7. otherwise roll out according to the update strategy
	switch d.Spec.Strategy.Type {
	case apps.RecreateDeploymentStrategyType:
		return dc.rolloutRecreate(d, rsList, podMap)
	case apps.RollingUpdateDeploymentStrategyType:
		return 
dc.rolloutRolling(d, rsList)
	}
	return fmt.Errorf("unexpected deployment strategy type: %s", d.Spec.Strategy.Type)
}
```

The overall flow of syncDeployment:

(1) If the deployment's selector is empty (selects everything), return directly.

(2) Get the rsList for the deployment, and get all of its pods via the deployment's labels, returned as a podMap keyed by rs.

(3) If the deployment is being deleted, just call syncStatusOnly and return.

(4) Check whether it is paused; if so, sync status and return.

(5) If a rollback is requested, perform the rollback and return.

(6) Check whether the deployment is in a scaling event; if so, sync and return.

(7) Otherwise this is a rolling or recreate update: drive the rollout and return.

<br>

Starting from step (3), let's look at what each branch actually does.

#### 5.1 Deleting a deploy

Deleting a deploy goes through syncStatusOnly.

syncStatusOnly mainly calls the getAllReplicaSetsAndSyncRevision and syncDeploymentStatus functions.

##### 5.1.1 getAllReplicaSetsAndSyncRevision

getAllReplicaSetsAndSyncRevision finds newRS and oldRSs.

newRS is the **oldest** rs satisfying rs.spec.template == deploy.spec.template. **A deterministic choice is needed because more than one rs may match the deployment's template.**

oldRSs is all the remaining rs, i.e. everything except newRS.

```
// syncStatusOnly only updates Deployments Status and doesn't take any mutating actions.
func (dc *DeploymentController) syncStatusOnly(d *apps.Deployment, rsList []*apps.ReplicaSet) error {
	newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false)
	if err != nil {
		return err
	}
  // note: oldRSs + newRS == rsList; allRSs is simply that same list put back together
	allRSs := append(oldRSs, newRS)
	return dc.syncDeploymentStatus(allRSs, newRS, d)
}



// rsList should come from getReplicaSetsForDeployment(d).
//
// 1. Get all old RSes this deployment targets, and calculate the max revision number among them (maxOldV).
// 2. Get new RS this deployment targets (whose pod template matches deployment's), and update new RS's revision number to (maxOldV + 1),
//    only if its revision number is smaller than (maxOldV + 1). If this step failed, we'll update it in the next deployment sync loop.
// 3. Copy new RS's revision number to deployment (update deployment's revision). 
If this step failed, we'll update it in the next deployment sync loop.\n//\n// Note that currently the deployment controller is using caches to avoid querying the server for reads.\n// This may lead to stale reads of replica sets, thus incorrect deployment status.\nfunc (dc *DeploymentController) getAllReplicaSetsAndSyncRevision(d *apps.Deployment, rsList []*apps.ReplicaSet, createIfNotExisted bool) (*apps.ReplicaSet, []*apps.ReplicaSet, error) {\n\t_, allOldRSs := deploymentutil.FindOldReplicaSets(d, rsList)\n\n\t// Get new replica set with the updated revision number\n\tnewRS, err := dc.getNewReplicaSet(d, rsList, allOldRSs, createIfNotExisted)\n\tif err != nil {\n\t\treturn nil, nil, err\n\t}\n\n\treturn newRS, allOldRSs, nil\n}\n\n\n// FindOldReplicaSets returns the old replica sets targeted by the given Deployment, with the given slice of RSes.\n// Note that the first set of old replica sets doesn't include the ones with no pods, and the second set of old replica sets include all old replica sets.\nfunc FindOldReplicaSets(deployment *apps.Deployment, rsList []*apps.ReplicaSet) ([]*apps.ReplicaSet, []*apps.ReplicaSet) {\n\tvar requiredRSs []*apps.ReplicaSet\n\tvar allRSs []*apps.ReplicaSet\n\tnewRS := FindNewReplicaSet(deployment, rsList)\n\tfor _, rs := range rsList {\n\t\t// Filter out new replica set\n\t\tif newRS != nil && rs.UID == newRS.UID {\n\t\t\tcontinue\n\t\t}\n\t\tallRSs = append(allRSs, rs)\n\t\tif *(rs.Spec.Replicas) != 0 {\n\t\t\trequiredRSs = append(requiredRSs, rs)\n\t\t}\n\t}\n\treturn requiredRSs, allRSs\n}\n\n// FindNewReplicaSet returns the new RS this given deployment targets (the one with the same pod template).\nfunc FindNewReplicaSet(deployment *apps.Deployment, rsList []*apps.ReplicaSet) *apps.ReplicaSet {\n\tsort.Sort(controller.ReplicaSetsByCreationTimestamp(rsList))\n\tfor i := range rsList {\n\t\tif EqualIgnoreHash(&rsList[i].Spec.Template, &deployment.Spec.Template) {\n\t\t\t// In rare cases, such as after cluster upgrades, 
Deployment may end up with
			// having more than one new ReplicaSets that have the same template as its template,
			// see https://github.com/kubernetes/kubernetes/issues/40415
			// We deterministically choose the oldest new ReplicaSet.
			return rsList[i]
		}
	}
	// new ReplicaSet does not exist.
	return nil
}
```

##### 5.1.2 syncDeploymentStatus

calculateStatus derives the deployment's latest status from allRSs and newRS; the status is then updated if it differs from the current one.

```
// syncDeploymentStatus checks if the status is up-to-date and sync it if necessary
func (dc *DeploymentController) syncDeploymentStatus(allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, d *apps.Deployment) error {
	newStatus := calculateStatus(allRSs, newRS, d)

	if reflect.DeepEqual(d.Status, newStatus) {
		return nil
	}

	newDeployment := d
	newDeployment.Status = newStatus
	_, err := dc.client.AppsV1().Deployments(newDeployment.Namespace).UpdateStatus(newDeployment)
	return err
}


// calculateStatus calculates the latest status for the provided deployment by looking into the provided replica sets.
func calculateStatus(allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, deployment *apps.Deployment) apps.DeploymentStatus {
	availableReplicas := deploymentutil.GetAvailableReplicaCountForReplicaSets(allRSs)
	totalReplicas := deploymentutil.GetReplicaCountForReplicaSets(allRSs)
	unavailableReplicas := totalReplicas - availableReplicas
	// If unavailableReplicas is negative, then that means the Deployment has more available replicas running than
	// desired, e.g. whenever it scales down. 
In such a case we should simply default unavailableReplicas to zero.
	if unavailableReplicas < 0 {
		unavailableReplicas = 0
	}

	status := apps.DeploymentStatus{
		// TODO: Ensure that if we start retrying status updates, we won't pick up a new Generation value.
		ObservedGeneration:  deployment.Generation,
		Replicas:            deploymentutil.GetActualReplicaCountForReplicaSets(allRSs),
		UpdatedReplicas:     deploymentutil.GetActualReplicaCountForReplicaSets([]*apps.ReplicaSet{newRS}),
		ReadyReplicas:       deploymentutil.GetReadyReplicaCountForReplicaSets(allRSs),
		AvailableReplicas:   availableReplicas,
		UnavailableReplicas: unavailableReplicas,
		CollisionCount:      deployment.Status.CollisionCount,
	}

	// Copy conditions one by one so we won't mutate the original object.
	conditions := deployment.Status.Conditions
	for i := range conditions {
		status.Conditions = append(status.Conditions, conditions[i])
	}

	if availableReplicas >= *(deployment.Spec.Replicas)-deploymentutil.MaxUnavailable(*deployment) {
		minAvailability := deploymentutil.NewDeploymentCondition(apps.DeploymentAvailable, v1.ConditionTrue, deploymentutil.MinimumReplicasAvailable, "Deployment has minimum availability.")
		deploymentutil.SetDeploymentCondition(&status, *minAvailability)
	} else {
		noMinAvailability := deploymentutil.NewDeploymentCondition(apps.DeploymentAvailable, v1.ConditionFalse, deploymentutil.MinimumReplicasUnavailable, "Deployment does not have minimum availability.")
		deploymentutil.SetDeploymentCondition(&status, *noMinAvailability)
	}

	return status
}
```

<br>

##### 5.1.3 Summary

The controller uses DeletionTimestamp to decide whether the deployment is being deleted. The deployment controller does not delete the deployment itself; it only updates its status. The actual deletion is performed by the garbage collector, analyzed in detail later.

One question worth noting: who sets the deployment's DeletionTimestamp?

Answer: the API server. When kubectl delete is issued, the request eventually reaches the DELETE handler in the registry store, and that is where DeletionTimestamp gets assigned.

<br>

#### 5.2 pause

Rarely used in practice; skipped for now.

#### 
5.3 Rollback

(1) Check whether the deployment's annotations contain the "deprecated.deployment.rollback.to" key; if present, a rollback is needed.

(2) Read the value of deprecated.deployment.rollback.to; it indicates which revision (rs) to roll back to.

(3) Copy that rs.spec.template into deployment.spec.template.

(4) Update the deployment and remove the deprecated.deployment.rollback.to key from its annotations.

Special cases: if the value is 0, roll back to the last revision; if the target revision cannot be found, the rollback is abandoned.

```
if getRollbackTo(d) != nil {
   return dc.rollback(d, rsList)
}

// getRollbackTo simply checks whether the deployment's annotations contain the "deprecated.deployment.rollback.to" key
// TODO: Remove this when extensions/v1beta1 and apps/v1beta1 Deployment are dropped.
func getRollbackTo(d *apps.Deployment) *extensions.RollbackConfig {
	// Extract the annotation used for round-tripping the deprecated RollbackTo field.
	revision := d.Annotations[apps.DeprecatedRollbackTo]
	if revision == "" {
		return nil
	}
	revision64, err := strconv.ParseInt(revision, 10, 64)
	if err != nil {
		// If it's invalid, ignore it.
		return nil
	}
	return &extensions.RollbackConfig{
		Revision: revision64,
	}
}

// The core idea: find the rs for the requested revision, copy rs.spec.template into deployment.spec.template,
// then update the deployment and remove deprecated.deployment.rollback.to from its annotations.
// rollback the deployment to the specified revision. 
In any case cleanup the rollback spec.
func (dc *DeploymentController) rollback(d *apps.Deployment, rsList []*apps.ReplicaSet) error {
	newRS, allOldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, true)
	if err != nil {
		return err
	}

	allRSs := append(allOldRSs, newRS)
	rollbackTo := getRollbackTo(d)
	// If rollback revision is 0, rollback to the last revision
	if rollbackTo.Revision == 0 {
		if rollbackTo.Revision = deploymentutil.LastRevision(allRSs); rollbackTo.Revision == 0 {
			// If we still can't find the last revision, gives up rollback
			dc.emitRollbackWarningEvent(d, deploymentutil.RollbackRevisionNotFound, "Unable to find last revision.")
			// Gives up rollback
			return dc.updateDeploymentAndClearRollbackTo(d)
		}
	}
	for _, rs := range allRSs {
		v, err := deploymentutil.Revision(rs)
		if err != nil {
			klog.V(4).Infof("Unable to extract revision from deployment's replica set %q: %v", rs.Name, err)
			continue
		}
		if v == rollbackTo.Revision {
			klog.V(4).Infof("Found replica set %q with desired revision %d", rs.Name, v)
			// rollback by copying podTemplate.Spec from the replica set
			// revision number will be incremented during the next getAllReplicaSetsAndSyncRevision call
			// no-op if the spec matches current deployment's podTemplate.Spec
			performedRollback, err := dc.rollbackToTemplate(d, rs)
			if performedRollback && err == nil {
				dc.emitRollbackNormalEvent(d, fmt.Sprintf("Rolled back deployment %q to revision %d", d.Name, rollbackTo.Revision))
			}
			return err
		}
	}
	dc.emitRollbackWarningEvent(d, deploymentutil.RollbackRevisionNotFound, "Unable to find the revision to rollback to.")
	// Gives up rollback
	return dc.updateDeploymentAndClearRollbackTo(d)
}
```

<br>

#### 5.4 scale

(1) Decide whether scaling is needed: check whether deploy.spec.Replicas equals the desired-replicas annotation on each rs; if they differ, scale.

(2) Call scale to perform the scaling.

```
scalingEvent, err := dc.isScalingEvent(d, rsList)
	if err != nil {
		return err
	}
	if scalingEvent {
		return dc.sync(d, rsList)
	}
	
// this checks whether deploy.spec.Replicas equals the desired-replicas annotation on each rs; if not, a scale is needed
// isScalingEvent checks whether the provided deployment has been updated with a scaling event
// by looking at the desired-replicas annotation in the active replica sets of the deployment.
//
// rsList should come from getReplicaSetsForDeployment(d).
// podMap should come from getPodMapForDeployment(d, rsList).
func (dc *DeploymentController) isScalingEvent(d *apps.Deployment, rsList []*apps.ReplicaSet) (bool, error) {
	newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false)
	if err != nil {
		return false, err
	}
	allRSs := append(oldRSs, newRS)
	for _, rs := range controller.FilterActiveReplicaSets(allRSs) {
		desired, ok := deploymentutil.GetDesiredReplicasAnnotation(rs)
		if !ok {
			continue
		}
		if desired != *(d.Spec.Replicas) {
			return true, nil
		}
	}
	return false, nil
}

// take one rs as an example: the annotations really do contain desired-replicas
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  annotations:
    deployment.kubernetes.io/desired-replicas: "1"
    deployment.kubernetes.io/max-replicas: "2"
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2021-06-12T14:47:22Z"
```

<br>

The call chain sync -> scale performs the scaling. The main logic:

(1) Find the single active (or latest) rs and scale it directly.

(2) If newRS is already saturated (at the desired state), scale all the oldRSs down to 0.

(3) For a rolling update, use MaxSurge and related fields to adjust the oldRSs and newRS step by step; the final state is newRS at the desired size and every oldRS at 0.

For a Recreate update nothing happens here; once the old pods are all deleted, case (1) applies and the scaling is done directly.

```
// sync is responsible for reconciling deployments on scaling events or when they
// are paused.
func (dc *DeploymentController) sync(d *apps.Deployment, rsList []*apps.ReplicaSet) error {
	newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false)
	if err != nil 
{
		return err
	}
	if err := dc.scale(d, newRS, oldRSs); err != nil {
		// If we get an error while trying to scale, the deployment will be requeued
		// so we can abort this resync
		return err
	}

	// Clean up the deployment when it's paused and no rollback is in flight.
	if d.Spec.Paused && getRollbackTo(d) == nil {
		if err := dc.cleanupDeployment(oldRSs, d); err != nil {
			return err
		}
	}

	allRSs := append(oldRSs, newRS)
	return dc.syncDeploymentStatus(allRSs, newRS, d)
}


// scale scales proportionally in order to mitigate risk. Otherwise, scaling up can increase the size
// of the new replica set and scaling down can decrease the sizes of the old ones, both of which would
// have the effect of hastening the rollout progress, which could produce a higher proportion of unavailable
// replicas in the event of a problem with the rolled out template. Should run only on scaling events or
// when a deployment is paused and not during the normal rollout process.
func (dc *DeploymentController) scale(deployment *apps.Deployment, newRS *apps.ReplicaSet, oldRSs []*apps.ReplicaSet) error {
	// If there is only one active replica set then we should scale that up to the full count of the
	// deployment. If there is no active replica set, then we should scale up the newest replica set.
	// 1. find the single active (or latest) rs and scale it directly
	if activeOrLatest := deploymentutil.FindActiveOrLatest(newRS, oldRSs); activeOrLatest != nil {
		if *(activeOrLatest.Spec.Replicas) == *(deployment.Spec.Replicas) {
			return nil
		}
		_, _, err := dc.scaleReplicaSetAndRecordEvent(activeOrLatest, *(deployment.Spec.Replicas), deployment)
		return err
	}
  
  // 2. 
if newRS is already saturated, scale all the oldRSs down to 0
	// If the new replica set is saturated, old replica sets should be fully scaled down.
	// This case handles replica set adoption during a saturated new replica set.
	if deploymentutil.IsSaturated(deployment, newRS) {
		for _, old := range controller.FilterActiveReplicaSets(oldRSs) {
			if _, _, err := dc.scaleReplicaSetAndRecordEvent(old, 0, deployment); err != nil {
				return err
			}
		}
		return nil
	}

  // 3. for a rolling update, use MaxSurge and related fields to adjust the oldRSs and newRS step by step; the final state is newRS at the desired size and every oldRS at 0
	// There are old replica sets with pods and the new replica set is not saturated.
	// We need to proportionally scale all replica sets (new and old) in case of a
	// rolling deployment.
	if deploymentutil.IsRollingUpdate(deployment) {
		allRSs := controller.FilterActiveReplicaSets(append(oldRSs, newRS))
		allRSsReplicas := deploymentutil.GetReplicaCountForReplicaSets(allRSs)

		allowedSize := int32(0)
		if *(deployment.Spec.Replicas) > 0 {
			allowedSize = *(deployment.Spec.Replicas) + deploymentutil.MaxSurge(*deployment)
		}

		// Number of additional replicas that can be either added or removed from the total
		// replicas count. These replicas should be distributed proportionally to the active
		// replica sets.
		deploymentReplicasToAdd := allowedSize - allRSsReplicas

		// The additional replicas should be distributed proportionally amongst the active
		// replica sets from the larger to the smaller in size replica set. 
Scaling direction\n\t\t// drives what happens in case we are trying to scale replica sets of the same size.\n\t\t// In such a case when scaling up, we should scale up newer replica sets first, and\n\t\t// when scaling down, we should scale down older replica sets first.\n\t\tvar scalingOperation string\n\t\tswitch {\n\t\tcase deploymentReplicasToAdd > 0:\n\t\t\tsort.Sort(controller.ReplicaSetsBySizeNewer(allRSs))\n\t\t\tscalingOperation = \"up\"\n\n\t\tcase deploymentReplicasToAdd < 0:\n\t\t\tsort.Sort(controller.ReplicaSetsBySizeOlder(allRSs))\n\t\t\tscalingOperation = \"down\"\n\t\t}\n\n\t\t// Iterate over all active replica sets and estimate proportions for each of them.\n\t\t// The absolute value of deploymentReplicasAdded should never exceed the absolute\n\t\t// value of deploymentReplicasToAdd.\n\t\tdeploymentReplicasAdded := int32(0)\n\t\tnameToSize := make(map[string]int32)\n\t\tfor i := range allRSs {\n\t\t\trs := allRSs[i]\n\n\t\t\t// Estimate proportions if we have replicas to add, otherwise simply populate\n\t\t\t// nameToSize with the current sizes for each replica set.\n\t\t\tif deploymentReplicasToAdd != 0 {\n\t\t\t\tproportion := deploymentutil.GetProportion(rs, *deployment, deploymentReplicasToAdd, deploymentReplicasAdded)\n\n\t\t\t\tnameToSize[rs.Name] = *(rs.Spec.Replicas) + proportion\n\t\t\t\tdeploymentReplicasAdded += proportion\n\t\t\t} else {\n\t\t\t\tnameToSize[rs.Name] = *(rs.Spec.Replicas)\n\t\t\t}\n\t\t}\n\n\t\t// Update all replica sets\n\t\tfor i := range allRSs {\n\t\t\trs := allRSs[i]\n\n\t\t\t// Add/remove any leftovers to the largest replica set.\n\t\t\tif i == 0 && deploymentReplicasToAdd != 0 {\n\t\t\t\tleftover := deploymentReplicasToAdd - deploymentReplicasAdded\n\t\t\t\tnameToSize[rs.Name] = nameToSize[rs.Name] + leftover\n\t\t\t\tif nameToSize[rs.Name] < 0 {\n\t\t\t\t\tnameToSize[rs.Name] = 0\n\t\t\t\t}\n\t\t\t}\n\n\t\t\t// TODO: Use transactions when we have them.\n\t\t\tif _, _, err := dc.scaleReplicaSet(rs, 
nameToSize[rs.Name], deployment, scalingOperation); err != nil {\n\t\t\t\t// Return as soon as we fail, the deployment is requeued\n\t\t\t\treturn err\n\t\t\t}\n\t\t}\n\t}\n\treturn nil\n}\n```\n\n<br>\n\n##### 5.4.1 获得最新的一个activeRs\n\n从这里可以看出来：activeRs 就是 rs.Spec.Replicas>0 的rs。\n\n这里的逻辑就是：\n\n* 如果没有一个rs是active的，那就把newRS作为当前要扩缩容的对象。newRS 就是：**最近的**、满足 rs.spec.template = deploy.spec.template 的rs。\n* 如果只有一个active的rs，那么将其作为要扩缩容的对象。\n* 如果找到多个active的rs, 那么表示这可能是滚动更新等复杂情况，走后面的逻辑。\n* 扩缩容直接调用了scaleReplicaSetAndRecordEvent函数，这个最后分析。\n\n```\n\tif activeOrLatest := deploymentutil.FindActiveOrLatest(newRS, oldRSs); activeOrLatest != nil {\n\t\tif *(activeOrLatest.Spec.Replicas) == *(deployment.Spec.Replicas) {\n\t\t\treturn nil\n\t\t}\n\t\t_, _, err := dc.scaleReplicaSetAndRecordEvent(activeOrLatest, *(deployment.Spec.Replicas), deployment)\n\t\treturn err\n\t}\n\t\n\t\n// FindActiveOrLatest returns the only active or the latest replica set in case there is at most one active\n// replica set. If there are more active replica sets, then we should proportionally scale them.\nfunc FindActiveOrLatest(newRS *apps.ReplicaSet, oldRSs []*apps.ReplicaSet) *apps.ReplicaSet {\n\tif newRS == nil && len(oldRSs) == 0 {\n\t\treturn nil\n\t}\n\n\tsort.Sort(sort.Reverse(controller.ReplicaSetsByCreationTimestamp(oldRSs)))\n\tallRSs := controller.FilterActiveReplicaSets(append(oldRSs, newRS))\n\n\tswitch len(allRSs) {\n\tcase 0:\n\t\t// If there is no active replica set then we should return the newest.\n\t\tif newRS != nil {\n\t\t\treturn newRS\n\t\t}\n\t\treturn oldRSs[0]\n\tcase 1:\n\t\treturn allRSs[0]\n\tdefault:\n\t\treturn nil\n\t}\n}\n\n\n// FilterActiveReplicaSets returns replica sets that have (or at least ought to have) pods.\nfunc FilterActiveReplicaSets(replicaSets []*apps.ReplicaSet) []*apps.ReplicaSet {\n\tactiveFilter := func(rs *apps.ReplicaSet) bool {\n\t\treturn rs != nil && *(rs.Spec.Replicas) > 0\n\t}\n\treturn FilterReplicaSets(replicaSets, activeFilter)\n}\n\ntype filterRS 
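FindActiveOrLatest 的判断逻辑可以用下面这段可独立运行的示意代码来理解（rs 结构体及字段均为示例假设，经过大幅简化，并非 k8s 源码）：

```go
package main

import (
	"fmt"
	"sort"
)

// 简化版 ReplicaSet：只保留本示例需要的字段（示例假设，非 k8s 源码）。
type rs struct {
	name     string
	replicas int32
	created  int // 创建时间，数值越大越新
}

// findActiveOrLatest 模拟 deploymentutil.FindActiveOrLatest 的语义：
// 至多一个 active rs 时返回它（或最新的 rs）；多个 active 时返回 nil，
// 交由后面按比例扩缩容的逻辑处理。
func findActiveOrLatest(newRS *rs, oldRSs []*rs) *rs {
	if newRS == nil && len(oldRSs) == 0 {
		return nil
	}
	// 按创建时间从新到旧排序
	sort.Slice(oldRSs, func(i, j int) bool { return oldRSs[i].created > oldRSs[j].created })

	// 过滤出 active（replicas > 0）的 rs
	var active []*rs
	for _, r := range append(append([]*rs{}, oldRSs...), newRS) {
		if r != nil && r.replicas > 0 {
			active = append(active, r)
		}
	}
	switch len(active) {
	case 0:
		if newRS != nil {
			return newRS
		}
		return oldRSs[0]
	case 1:
		return active[0]
	default:
		return nil // 多个 active，走按比例扩缩容
	}
}

func main() {
	newRS := &rs{name: "rs-new", replicas: 0, created: 3}
	olds := []*rs{
		{name: "rs-old1", replicas: 0, created: 1},
		{name: "rs-old2", replicas: 2, created: 2},
	}
	// 只有 rs-old2 是 active，所以它就是要扩缩容的对象
	fmt.Println(findActiveOrLatest(newRS, olds).name)
}
```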
func(rs *apps.ReplicaSet) bool\n\n// FilterReplicaSets returns replica sets that are filtered by filterFn (all returned ones should match filterFn).\nfunc FilterReplicaSets(RSes []*apps.ReplicaSet, filterFn filterRS) []*apps.ReplicaSet {\n\tvar filtered []*apps.ReplicaSet\n\tfor i := range RSes {\n\t\tif filterFn(RSes[i]) {\n\t\t\tfiltered = append(filtered, RSes[i])\n\t\t}\n\t}\n\treturn filtered\n}\n```\n\n<br>\n\n##### 5.4.2 如果newRS已经是期望状态，将所有的oldRS缩到0\n\n从这里很直观就可以看出来\n\n```\n\t// If the new replica set is saturated, old replica sets should be fully scaled down.\n\t// This case handles replica set adoption during a saturated new replica set.\n\tif deploymentutil.IsSaturated(deployment, newRS) {\n\t\tfor _, old := range controller.FilterActiveReplicaSets(oldRSs) {\n\t\t\tif _, _, err := dc.scaleReplicaSetAndRecordEvent(old, 0, deployment); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t}\n\t\treturn nil\n\t}\n\t\n\t// IsSaturated checks if the new replica set is saturated by comparing its size with its deployment size.\n// Both the deployment and the replica set have to believe this replica set can own all of the desired\n// replicas in the deployment and the annotation helps in achieving that. 
All pods of the ReplicaSet\n// need to be available.\nfunc IsSaturated(deployment *apps.Deployment, rs *apps.ReplicaSet) bool {\n\tif rs == nil {\n\t\treturn false\n\t}\n\tdesiredString := rs.Annotations[DesiredReplicasAnnotation]\n\tdesired, err := strconv.Atoi(desiredString)\n\tif err != nil {\n\t\treturn false\n\t}\n\treturn *(rs.Spec.Replicas) == *(deployment.Spec.Replicas) &&\n\t\tint32(desired) == *(deployment.Spec.Replicas) &&\n\t\trs.Status.AvailableReplicas == *(deployment.Spec.Replicas)\n}\t\n```\n\n<br>\n\n#### 5.5 recreate更新\n\n这种策略非常简单：先将所有旧rs scale down到0，然后再将newRS扩到期望值。这里需要注意的是，如果旧rs还有pod running，这个时候会直接返回、等待再次同步；也就是说，新的rs要等所有旧pod全部删除之后才会开始创建。\n\n```\n// rolloutRecreate implements the logic for recreating a replica set.\nfunc (dc *DeploymentController) rolloutRecreate(d *apps.Deployment, rsList []*apps.ReplicaSet, podMap map[types.UID][]*v1.Pod) error {\n\t// Don't create a new RS if not already existed, so that we avoid scaling up before scaling down.\n\tnewRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false)\n\tif err != nil {\n\t\treturn err\n\t}\n\tallRSs := append(oldRSs, newRS)\n\tactiveOldRSs := controller.FilterActiveReplicaSets(oldRSs)\n\n\t// scale down old replica sets.\n\tscaledDown, err := dc.scaleDownOldReplicaSetsForRecreate(activeOldRSs, d)\n\tif err != nil {\n\t\treturn err\n\t}\n\tif scaledDown {\n\t\t// Update DeploymentStatus.\n\t\treturn dc.syncRolloutStatus(allRSs, newRS, d)\n\t}\n\n   // 如果旧rs还有pod running，直接返回，等待再次同步。\n\t// Do not process a deployment when it has old pods running.\n\tif oldPodsRunning(newRS, oldRSs, podMap) {\n\t\treturn dc.syncRolloutStatus(allRSs, newRS, d)\n\t}\n\n\t// If we need to create a new RS, create it now.\n\tif newRS == nil {\n\t\tnewRS, oldRSs, err = dc.getAllReplicaSetsAndSyncRevision(d, rsList, true)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tallRSs = append(oldRSs, newRS)\n\t}\n\n\t// scale up new replica set.\n\tif _, err := dc.scaleUpNewReplicaSetForRecreate(newRS, d); 
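Recreate 的推进顺序（先缩旧 rs、等旧 pod 全部退出、再扩新 rs）可以用下面的最小模型来模拟。state 结构与 reconcileRecreate 函数均为示例假设，只用来演示多次 sync 之间的状态推进，并非 k8s 实现：

```go
package main

import "fmt"

// 简化模型：用整数副本数模拟 Recreate 更新的推进过程（示意代码，非 k8s 源码）。
type state struct {
	oldReplicas int32 // 旧 rs 的期望副本数
	oldRunning  int32 // 旧 rs 仍在运行的 pod 数
	newReplicas int32 // 新 rs 的期望副本数
	desired     int32 // deployment 的期望副本数
}

// reconcileRecreate 模拟一次 sync：先把旧 rs 缩到 0；只要旧 pod 还在运行就直接返回，
// 等待下一次 sync；旧 pod 全部退出后才把新 rs 扩到期望值。
func reconcileRecreate(s *state) string {
	if s.oldReplicas > 0 {
		s.oldReplicas = 0
		return "scaled down old rs, requeue"
	}
	if s.oldRunning > 0 {
		return "old pods still running, wait"
	}
	if s.newReplicas != s.desired {
		s.newReplicas = s.desired
		return "scaled up new rs"
	}
	return "complete"
}

func main() {
	s := &state{oldReplicas: 3, oldRunning: 3, desired: 3}
	fmt.Println(reconcileRecreate(s)) // 第一次 sync：旧 rs 缩到 0
	s.oldRunning = 1
	fmt.Println(reconcileRecreate(s)) // 旧 pod 未清空，继续等待
	s.oldRunning = 0
	fmt.Println(reconcileRecreate(s)) // 旧 pod 清空后才扩容新 rs
	fmt.Println(reconcileRecreate(s)) // 达到期望状态
}
```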
err != nil {\n\t\treturn err\n\t}\n\n\tif util.DeploymentComplete(d, &d.Status) {\n\t\tif err := dc.cleanupDeployment(oldRSs, d); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\n\t// Sync deployment status.\n\treturn dc.syncRolloutStatus(allRSs, newRS, d)\n}\n\n\n\n// scaleDownOldReplicaSetsForRecreate scales down old replica sets when deployment strategy is \"Recreate\".\nfunc (dc *DeploymentController) scaleDownOldReplicaSetsForRecreate(oldRSs []*apps.ReplicaSet, deployment *apps.Deployment) (bool, error) {\n\tscaled := false\n\tfor i := range oldRSs {\n\t\trs := oldRSs[i]\n\t\t// Scaling not required.\n\t\tif *(rs.Spec.Replicas) == 0 {\n\t\t\tcontinue\n\t\t}\n\t\tscaledRS, updatedRS, err := dc.scaleReplicaSetAndRecordEvent(rs, 0, deployment)\n\t\tif err != nil {\n\t\t\treturn false, err\n\t\t}\n\t\tif scaledRS {\n\t\t\toldRSs[i] = updatedRS\n\t\t\tscaled = true\n\t\t}\n\t}\n\treturn scaled, nil\n}\n```\n\n<br>\n\n#### 5.6 rolloutRolling更新\n\n（1）获得newRS, oldRSs\n\n（2）如果是scaledUp，返回 syncRolloutStatus\n\n（3）如果是scaledDown，返回syncRolloutStatus\n\n（4）如果到了这里，说明不是scaledUp也不是scaledDown，那说明可能是达到了期望值，通过DeploymentComplete判断一下\n\n（5）同步状态\n\n```\n// rolloutRolling implements the logic for rolling a new replica set.\nfunc (dc *DeploymentController) rolloutRolling(d *apps.Deployment, rsList []*apps.ReplicaSet) error {\n   newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, true)\n   if err != nil {\n      return err\n   }\n   allRSs := append(oldRSs, newRS)\n\n   // Scale up, if we can.\n   scaledUp, err := dc.reconcileNewReplicaSet(allRSs, newRS, d)\n   if err != nil {\n      return err\n   }\n   if scaledUp {\n      // Update DeploymentStatus\n      return dc.syncRolloutStatus(allRSs, newRS, d)\n   }\n\n   // Scale down, if we can.\n   scaledDown, err := dc.reconcileOldReplicaSets(allRSs, controller.FilterActiveReplicaSets(oldRSs), newRS, d)\n   if err != nil {\n      return err\n   }\n   if scaledDown {\n      // Update DeploymentStatus\n      return 
dc.syncRolloutStatus(allRSs, newRS, d)\n   }\n\n   if deploymentutil.DeploymentComplete(d, &d.Status) {\n      if err := dc.cleanupDeployment(oldRSs, d); err != nil {\n         return err\n      }\n   }\n\n   // Sync deployment status\n   return dc.syncRolloutStatus(allRSs, newRS, d)\n}\n```\n\n<br>\n\n##### 5.6.1 如果是scaledUp（针对newRS），返回 syncRolloutStatus\n\n这里就是判断是否需要scale up；如果是，还会计算需要扩容的副本数，计算时结合了更新策略以及MaxSurge等因素。\n\n然后通过scaleReplicaSetAndRecordEvent来修改rs并发送事件。\n\n```\nfunc (dc *DeploymentController) reconcileNewReplicaSet(allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet, deployment *apps.Deployment) (bool, error) {\n\tif *(newRS.Spec.Replicas) == *(deployment.Spec.Replicas) {\n\t\t// Scaling not required.\n\t\treturn false, nil\n\t}\n\tif *(newRS.Spec.Replicas) > *(deployment.Spec.Replicas) {\n\t\t// Scale down.\n\t\tscaled, _, err := dc.scaleReplicaSetAndRecordEvent(newRS, *(deployment.Spec.Replicas), deployment)\n\t\treturn scaled, err\n\t}\n\tnewReplicasCount, err := deploymentutil.NewRSNewReplicas(deployment, allRSs, newRS)\n\tif err != nil {\n\t\treturn false, err\n\t}\n\tscaled, _, err := dc.scaleReplicaSetAndRecordEvent(newRS, newReplicasCount, deployment)\n\treturn scaled, err\n}\n\n\n// NewRSNewReplicas calculates the number of replicas a deployment's new RS should have.\n// When one of the followings is true, we're rolling out the deployment; otherwise, we're scaling it.\n// 1) The new RS is saturated: newRS's replicas == deployment's replicas\n// 2) Max number of pods allowed is reached: deployment's replicas + maxSurge == all RSs' replicas\nfunc NewRSNewReplicas(deployment *apps.Deployment, allRSs []*apps.ReplicaSet, newRS *apps.ReplicaSet) (int32, error) {\n\tswitch deployment.Spec.Strategy.Type {\n\tcase apps.RollingUpdateDeploymentStrategyType:\n\t\t// Check if we can scale up.\n\t\tmaxSurge, err := intstrutil.GetValueFromIntOrPercent(deployment.Spec.Strategy.RollingUpdate.MaxSurge, int(*(deployment.Spec.Replicas)), true)\n\t\tif err != nil 
{\n\t\t\treturn 0, err\n\t\t}\n\t\t// Find the total number of pods\n\t\tcurrentPodCount := GetReplicaCountForReplicaSets(allRSs)\n\t\tmaxTotalPods := *(deployment.Spec.Replicas) + int32(maxSurge)\n\t\tif currentPodCount >= maxTotalPods {\n\t\t\t// Cannot scale up.\n\t\t\treturn *(newRS.Spec.Replicas), nil\n\t\t}\n\t\t// Scale up.\n\t\tscaleUpCount := maxTotalPods - currentPodCount\n\t\t// Do not exceed the number of desired replicas.\n\t\tscaleUpCount = int32(integer.IntMin(int(scaleUpCount), int(*(deployment.Spec.Replicas)-*(newRS.Spec.Replicas))))\n\t\treturn *(newRS.Spec.Replicas) + scaleUpCount, nil\n\tcase apps.RecreateDeploymentStrategyType:\n\t\treturn *(deployment.Spec.Replicas), nil\n\tdefault:\n\t\treturn 0, fmt.Errorf(\"deployment type %v isn't supported\", deployment.Spec.Strategy.Type)\n\t}\n}\n```\n\n<br>\n\nscaledown同样也是差不多的逻辑。计算的是当前旧rs应该减少的部分。\n\n<br>\n\n#### 5.7 scaleReplicaSetAndRecordEvent\n\n这个函数作用和名字一样。通过restful 对rs进行 scale。 然后用事件记录。\n\n```\nfunc (dc *DeploymentController) scaleReplicaSetAndRecordEvent(rs *apps.ReplicaSet, newScale int32, deployment *apps.Deployment) (bool, *apps.ReplicaSet, error) {\n\t// No need to scale\n\tif *(rs.Spec.Replicas) == newScale {\n\t\treturn false, rs, nil\n\t}\n\tvar scalingOperation string\n\tif *(rs.Spec.Replicas) < newScale {\n\t\tscalingOperation = \"up\"\n\t} else {\n\t\tscalingOperation = \"down\"\n\t}\n\tscaled, newRS, err := dc.scaleReplicaSet(rs, newScale, deployment, scalingOperation)\n\treturn scaled, newRS, err\n}\n\n\nfunc (dc *DeploymentController) scaleReplicaSet(rs *apps.ReplicaSet, newScale int32, deployment *apps.Deployment, scalingOperation string) (bool, *apps.ReplicaSet, error) {\n\n\tsizeNeedsUpdate := *(rs.Spec.Replicas) != newScale\n\n\tannotationsNeedUpdate := deploymentutil.ReplicasAnnotationsNeedUpdate(rs, *(deployment.Spec.Replicas), *(deployment.Spec.Replicas)+deploymentutil.MaxSurge(*deployment))\n\n\tscaled := false\n\tvar err error\n\tif sizeNeedsUpdate || annotationsNeedUpdate 
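NewRSNewReplicas 在 RollingUpdate 下的核心计算可以抽成下面这个纯函数来验证。这是一段示意代码：maxSurge 直接取绝对值、省略了百分比换算和错误处理，函数签名为示例假设，并非 k8s 源码：

```go
package main

import "fmt"

// newRSNewReplicas 模拟 NewRSNewReplicas 在 RollingUpdate 策略下的计算
// （示意代码：maxSurge 直接用绝对值表示，省略百分比换算）。
func newRSNewReplicas(desired, maxSurge, allRSsReplicas, newRSReplicas int32) int32 {
	// 集群中允许同时存在的 pod 总量上限
	maxTotalPods := desired + maxSurge
	if allRSsReplicas >= maxTotalPods {
		// 已达到 maxSurge 上限，本轮不能再扩
		return newRSReplicas
	}
	scaleUpCount := maxTotalPods - allRSsReplicas
	// 新 rs 最多扩到期望副本数
	if remain := desired - newRSReplicas; scaleUpCount > remain {
		scaleUpCount = remain
	}
	return newRSReplicas + scaleUpCount
}

func main() {
	// desired=10, maxSurge=3：当前所有 rs 共 10 个 pod，newRS 已有 2 个。
	// 总量上限 13，还能再加 3 个，所以 newRS 的目标副本数是 5。
	fmt.Println(newRSNewReplicas(10, 3, 10, 2))
}
```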
{\n\t\trsCopy := rs.DeepCopy()\n\t\t*(rsCopy.Spec.Replicas) = newScale\n\t\tdeploymentutil.SetReplicasAnnotations(rsCopy, *(deployment.Spec.Replicas), *(deployment.Spec.Replicas)+deploymentutil.MaxSurge(*deployment))\n\t\trs, err = dc.client.AppsV1().ReplicaSets(rsCopy.Namespace).Update(rsCopy)\n\t\tif err == nil && sizeNeedsUpdate {\n\t\t\tscaled = true\n\t\t\tdc.eventRecorder.Eventf(deployment, v1.EventTypeNormal, \"ScalingReplicaSet\", \"Scaled %s replica set %s to %d\", scalingOperation, rs.Name, newScale)\n\t\t}\n\t}\n\treturn scaled, rs, err\n}\n```\n\n<br>\n\n"
  },
  {
    "path": "k8s/kcm/3-k8s gc源码分析.md",
    "content": "Table of Contents\n=================\n\n  * [1. K8s 的垃圾回收策略](#1-k8s-的垃圾回收策略)\n  * [2 gc 源码分析](#2-gc-源码分析)\n     * [2.1 初始化 garbageCollector 对象](#21-初始化-garbagecollector-对象)\n        * [2.1.1 garbageCollector包含的结构体对象](#211-garbagecollector包含的结构体对象)\n        * [2.1.2 NewGarbageCollector](#212-newgarbagecollector)\n     * [2.2 启动garbageCollector](#22-启动garbagecollector)\n        * [2.2.1 启动dependencyGraphBuilder](#221-启动dependencygraphbuilder)\n        * [2.2.2 runAttemptToDeleteWorker](#222-runattempttodeleteworker)\n        * [2.2.3 runAttemptToOrphanWorker](#223-runattempttoorphanworker)\n        * [2.2.4 总结](#224-总结)\n     * [2.3  runProcessGraphChanges](#23--runprocessgraphchanges)\n     * [2.4 processTransitions函数的处理逻辑](#24-processtransitions函数的处理逻辑)\n     * [2.5 runAttemptToOrphanWorker](#25-runattempttoorphanworker)\n     * [2.6 attemptToDeleteWorker](#26-attempttodeleteworker)\n     * [2.7 uidToNode到底是什么](#27-uidtonode到底是什么)\n  * [3.总结](#3总结)\n\n### 1. K8s 的垃圾回收策略\n\nk8s目前支持三种回收策略：\n\n**（1）前台级联删除（Foreground Cascading Deletion）**：在这种删除策略中，所有者对象的删除将会持续到其所有从属对象都被删除为止。当所有者被删除时，会进入“正在删除”（deletion in progress）状态，此时：\n\n* 对象仍然可以通过 REST API 查询到（可通过 kubectl 或 kuboard 查询到）\n* 对象的 deletionTimestamp 字段被设置\n* 对象的 metadata.finalizers 包含值 foregroundDeletion\n\n**（2）后台级联删除（Background Cascading Deletion）**：这种删除策略会简单很多，它会立即删除所有者对象，并由垃圾回收器在后台删除其从属对象。这种方式比前台级联删除快得多，因为不用等待从属对象删除完成。\n\n**（3）孤儿（Orphan）**：这种情况下，对所有者进行删除只会将其从集群中删除，并使其所有从属对象处于“孤儿”状态。\n\n举例：已有一个deployA, 对应的rs假设为 rsA,  pod为PodA。\n\n（1）前台删除：先删除podA, 再删除rsA, 再删除deployA。  podA的删除如果卡住，rsA的删除也会被卡住。\n\n（2）后台删除：先删除deployA, 再删除rsA, 再删除podA。 无论podA和rsA是否删除成功，deployA都不受影响。\n\n（3）孤儿删除：只删除deployA。rsA, podA不受影响。 rsA的owner不再是deployA。\n\n<br>\n\n### 2 gc 源码分析\n\n和deployController, rsController一样，GarbageCollectorController也是kube-controller-manager(kcm)中的一个控制器。\n\nGarbageCollectorController 的启动方法为 
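三种删除策略对 deployA -> rsA -> podA 这个例子的删除顺序，可以用下面这段可独立运行的示意代码直观对比。object 结构与 deleteOrder 函数均为示例假设，只演示删除的可见顺序，并非 k8s 实现：

```go
package main

import "fmt"

// 用 deployA -> rsA -> podA 的属主关系演示三种删除策略（示意代码）。
type object struct {
	name       string
	owner      *object
	dependents []*object
}

// deleteOrder 返回删除 root 时各对象从集群中消失的先后顺序。
func deleteOrder(root *object, policy string) []string {
	switch policy {
	case "Foreground":
		// 先递归删除从属对象，最后才真正删除 owner
		var order []string
		for _, d := range root.dependents {
			order = append(order, deleteOrder(d, policy)...)
		}
		return append(order, root.name)
	case "Background":
		// owner 立即删除，从属对象由 gc 在后台继续删除
		order := []string{root.name}
		for _, d := range root.dependents {
			order = append(order, deleteOrder(d, policy)...)
		}
		return order
	default: // Orphan：只删 owner，从属对象解除属主关系后保留
		for _, d := range root.dependents {
			d.owner = nil
		}
		return []string{root.name}
	}
}

func main() {
	podA := &object{name: "podA"}
	rsA := &object{name: "rsA", dependents: []*object{podA}}
	podA.owner = rsA
	deployA := &object{name: "deployA", dependents: []*object{rsA}}
	rsA.owner = deployA

	fmt.Println(deleteOrder(deployA, "Foreground")) // 从属对象先删，owner 最后删
	fmt.Println(deleteOrder(deployA, "Background")) // owner 先删
	fmt.Println(deleteOrder(deployA, "Orphan"))     // 只删 owner，rsA 成为孤儿
}
```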
`startGarbageCollectorController`，主要逻辑如下：\n\n**从第三步开始每一步都深入展开。第三步对应2.1。**\n\n（1）初始化客户端，用于发现集群中的资源。这个先不关注\n\n（2）获得deletableResources，以及ignoredResources。\n\ndeletableResources： 所有支持\"delete\", \"list\", \"watch\" 操作的资源\n\nignoredResources：kcm启动时GarbageCollectorController的config指定\n\n（3）初始化 garbageCollector 对象。 \n\n（4）启动garbageCollector\n\n（5）garbageCollector同步\n\n（6）开启debug模式\n\n```\nfunc startGarbageCollectorController(ctx ControllerContext) (http.Handler, bool, error) {\n  // 1.初始化客户端\n\tif !ctx.ComponentConfig.GarbageCollectorController.EnableGarbageCollector {\n\t\treturn nil, false, nil\n\t}\n\n\tgcClientset := ctx.ClientBuilder.ClientOrDie(\"generic-garbage-collector\")\n\tdiscoveryClient := cacheddiscovery.NewMemCacheClient(gcClientset.Discovery())\n\n\tconfig := ctx.ClientBuilder.ConfigOrDie(\"generic-garbage-collector\")\n\tmetadataClient, err := metadata.NewForConfig(config)\n\tif err != nil {\n\t\treturn nil, true, err\n\t}\n\n  // 2. 获得deletableResources，以及ignoredResources\n\t// Get an initial set of deletable resources to prime the garbage collector.\n\tdeletableResources := garbagecollector.GetDeletableResources(discoveryClient)\n\tignoredResources := make(map[schema.GroupResource]struct{})\n\tfor _, r := range ctx.ComponentConfig.GarbageCollectorController.GCIgnoredResources {\n\t\tignoredResources[schema.GroupResource{Group: r.Group, Resource: r.Resource}] = struct{}{}\n\t}\n\t\n\t// 3. NewGarbageCollector\n\tgarbageCollector, err := garbagecollector.NewGarbageCollector(\n\t\tmetadataClient,\n\t\tctx.RESTMapper,\n\t\tdeletableResources,\n\t\tignoredResources,\n\t\tctx.ObjectOrMetadataInformerFactory,\n\t\tctx.InformersStarted,\n\t)\n\tif err != nil {\n\t\treturn nil, true, fmt.Errorf(\"failed to start the generic garbage collector: %v\", err)\n\t}\n\n  // 4. 
启动garbageCollector\n\t// Start the garbage collector.\n\tworkers := int(ctx.ComponentConfig.GarbageCollectorController.ConcurrentGCSyncs)\n\tgo garbageCollector.Run(workers, ctx.Stop)\n\n\t// Periodically refresh the RESTMapper with new discovery information and sync\n\t// the garbage collector.\n\t// 5. garbageCollector同步\n\tgo garbageCollector.Sync(gcClientset.Discovery(), 30*time.Second, ctx.Stop)\n  \n  // 6. 开启debug模式\n\treturn garbagecollector.NewDebugHandler(garbageCollector), true, nil\n}\n```\n\n<br>\n\n#### 2.1 初始化 garbageCollector 对象\n\n##### 2.1.1 garbageCollector包含的结构体对象\n\n garbageCollector需要额外的结构：\n\nattemptToDelete，attemptToOrphan：限速队列\n\nuidToNode：一个缓存依赖关系的图。一个map结构，key=uid, value是一个node结构。\n\n```\ntype GarbageCollector struct {\n\trestMapper     resettableRESTMapper\n\tmetadataClient metadata.Interface\n\tattemptToDelete workqueue.RateLimitingInterface\n\tattemptToOrphan        workqueue.RateLimitingInterface\n\tdependencyGraphBuilder *GraphBuilder\n\tabsentOwnerCache *UIDCache\n\tworkerLock sync.RWMutex\n}\n\n\n// GraphBuilder: based on the events supplied by the informers, GraphBuilder updates\n// uidToNode, a graph that caches the dependencies as we know, and enqueues\n// items to the attemptToDelete and attemptToOrphan.\ntype GraphBuilder struct {\n\trestMapper meta.RESTMapper\n\n  // 每一个monitor对应一种资源\n\tmonitors    monitors\n\tmonitorLock sync.RWMutex\n\tinformersStarted <-chan struct{}\n\n\tstopCh <-chan struct{}\n\n\trunning bool\n\n\tmetadataClient metadata.Interface\n \n\tgraphChanges workqueue.RateLimitingInterface\n\n\tuidToNode *concurrentUIDToNode\n\tattemptToDelete workqueue.RateLimitingInterface\n\tattemptToOrphan workqueue.RateLimitingInterface\n\n\tabsentOwnerCache *UIDCache\n\tsharedInformers  controller.InformerFactory\n\tignoredResources map[schema.GroupResource]struct{}\n}\n\ntype concurrentUIDToNode struct {\n\tuidToNodeLock sync.RWMutex\n\tuidToNode     map[types.UID]*node\n}\n\ntype node struct {\n\tidentity 
objectReference\n\tdependentsLock sync.RWMutex\n\tdependents map[*node]struct{}            //该节点的所有依赖\n\n\tdeletingDependents     bool\n\tdeletingDependentsLock sync.RWMutex\n\t\n\tbeingDeleted     bool\n\tbeingDeletedLock sync.RWMutex\n\n\tvirtual     bool\n\tvirtualLock sync.RWMutex\n\t\n\towners []metav1.OwnerReference         //该节点的所有owner\n}\n```\n\n举例来说：\n\n假设集群中有：deployA, rsA, podA三个对象。\n\nmonitors 负责监听这三种资源的变化。然后根据情况扔进 attemptToDelete，attemptToOrphan队列。\n\nGraphBuilder负责构建一个图。在这种情况下，图的内容为：\n\nNode1( key=deployA.uid ):   它的owner为空，dependents=rsA。\n\nNode2( key=rsA.uid ):   它的owner=deployA，dependents=podA。\n\nNode3( key=pod.uid ):   它的owner=rsA，dependents为空。\n\n<br>\n\n同时，每个节点还有beingDeleted，deletingDependents等关键字段。这样gc根据这个图就可以很方便地进行各种策略的删除。\n\n##### 2.1.2 NewGarbageCollector\n\nNewGarbageCollector就做了俩件事\n\n（1）初始化GarbageCollector结构体\n\n（2）调用controllerFor定义对象变化的处理事件。无论是监听到add, update, del都是将其打包成一个event事件，然后加入graphChanges队列。\n\n```\nfunc NewGarbageCollector(\n\tmetadataClient metadata.Interface,\n\tmapper resettableRESTMapper,\n\tdeletableResources map[schema.GroupVersionResource]struct{},\n\tignoredResources map[schema.GroupResource]struct{},\n\tsharedInformers controller.InformerFactory,\n\tinformersStarted <-chan struct{},\n) (*GarbageCollector, error) {\n\tattemptToDelete := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), \"garbage_collector_attempt_to_delete\")\n\tattemptToOrphan := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), \"garbage_collector_attempt_to_orphan\")\n\tabsentOwnerCache := NewUIDCache(500)\n\tgc := &GarbageCollector{\n\t\tmetadataClient:   metadataClient,\n\t\trestMapper:       mapper,\n\t\tattemptToDelete:  attemptToDelete,\n\t\tattemptToOrphan:  attemptToOrphan,\n\t\tabsentOwnerCache: absentOwnerCache,\n\t}\n\tgb := &GraphBuilder{\n\t\tmetadataClient:   metadataClient,\n\t\tinformersStarted: informersStarted,\n\t\trestMapper:       mapper,\n\t\tgraphChanges:     
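上面 uidToNode 依赖图里 owners/dependents 的双向关系，可以用下面的简化示例来演示：用 deployA -> rsA -> podA 构建一个图。node/graph 结构与 insertNode 函数均为示例假设；真实实现中 owner 尚不存在时会先插入 virtual node 占位，这里简化为直接创建：

```go
package main

import "fmt"

// 简化版 uidToNode：演示 GraphBuilder 如何维护属主依赖图（示意代码，非 k8s 源码）。
type node struct {
	uid        string
	owners     []string         // 该节点的 owner uid 列表
	dependents map[string]*node // 该节点的所有从属对象
}

type graph struct{ uidToNode map[string]*node }

// insertNode 把节点加入图，并把它挂到每个 owner 的 dependents 下。
func (g *graph) insertNode(n *node) {
	g.uidToNode[n.uid] = n
	for _, owner := range n.owners {
		o, ok := g.uidToNode[owner]
		if !ok {
			// 真实实现会插入一个 virtual node 占位，这里简化为直接创建
			o = &node{uid: owner, dependents: map[string]*node{}}
			g.uidToNode[owner] = o
		}
		o.dependents[n.uid] = n
	}
}

func main() {
	g := &graph{uidToNode: map[string]*node{}}
	g.insertNode(&node{uid: "deployA", dependents: map[string]*node{}})
	g.insertNode(&node{uid: "rsA", owners: []string{"deployA"}, dependents: map[string]*node{}})
	g.insertNode(&node{uid: "podA", owners: []string{"rsA"}, dependents: map[string]*node{}})

	fmt.Println(len(g.uidToNode["deployA"].dependents)) // deployA 的 dependents：rsA
	fmt.Println(len(g.uidToNode["rsA"].dependents))     // rsA 的 dependents：podA
}
```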
workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), \"garbage_collector_graph_changes\"),\n\t\tuidToNode: &concurrentUIDToNode{\n\t\t\tuidToNode: make(map[types.UID]*node),\n\t\t},\n\t\tattemptToDelete:  attemptToDelete,\n\t\tattemptToOrphan:  attemptToOrphan,\n\t\tabsentOwnerCache: absentOwnerCache,\n\t\tsharedInformers:  sharedInformers,\n\t\tignoredResources: ignoredResources,\n\t}\n\t\n\t// \n\tif err := gb.syncMonitors(deletableResources); err != nil {\n\t\tutilruntime.HandleError(fmt.Errorf(\"failed to sync all monitors: %v\", err))\n\t}\n\tgc.dependencyGraphBuilder = gb\n\n\treturn gc, nil\n}\n```\n\n<br>\n\nsyncMonitors就是同步更新哪些资源需要监听，然后调用controllerFor注册事件处理。\n\n```\nfunc (gb *GraphBuilder) syncMonitors(resources map[schema.GroupVersionResource]struct{}) error {\n\tgb.monitorLock.Lock()\n\tdefer gb.monitorLock.Unlock()\n\n\ttoRemove := gb.monitors\n\tif toRemove == nil {\n\t\ttoRemove = monitors{}\n\t}\n\tcurrent := monitors{}\n\terrs := []error{}\n\tkept := 0\n\tadded := 0\n\tfor resource := range resources {\n\t\tif _, ok := gb.ignoredResources[resource.GroupResource()]; ok {\n\t\t\tcontinue\n\t\t}\n\t\tif m, ok := toRemove[resource]; ok {\n\t\t\tcurrent[resource] = m\n\t\t\tdelete(toRemove, resource)\n\t\t\tkept++\n\t\t\tcontinue\n\t\t}\n\t\tkind, err := gb.restMapper.KindFor(resource)\n\t\tif err != nil {\n\t\t\terrs = append(errs, fmt.Errorf(\"couldn't look up resource %q: %v\", resource, err))\n\t\t\tcontinue\n\t\t}\n\t\tc, s, err := gb.controllerFor(resource, kind)\n\t\tif err != nil {\n\t\t\terrs = append(errs, fmt.Errorf(\"couldn't start monitor for resource %q: %v\", resource, err))\n\t\t\tcontinue\n\t\t}\n\t\tcurrent[resource] = &monitor{store: s, controller: c}\n\t\tadded++\n\t}\n\tgb.monitors = current\n\n\tfor _, monitor := range toRemove {\n\t\tif monitor.stopCh != nil {\n\t\t\tclose(monitor.stopCh)\n\t\t}\n\t}\n\n\tklog.V(4).Infof(\"synced monitors; added %d, kept %d, removed %d\", added, kept, len(toRemove))\n\t// 
NewAggregate returns nil if errs is 0-length\n\treturn utilerrors.NewAggregate(errs)\n}\n```\n\ncontrollerFor无论是监听到add, update, del都是将其打包成一个event事件，然后加入graphChanges队列。\n\n```\nfunc (gb *GraphBuilder) controllerFor(resource schema.GroupVersionResource, kind schema.GroupVersionKind) (cache.Controller, cache.Store, error) {\n   handlers := cache.ResourceEventHandlerFuncs{\n      // add the event to the dependencyGraphBuilder's graphChanges.\n      AddFunc: func(obj interface{}) {\n         event := &event{\n            eventType: addEvent,\n            obj:       obj,\n            gvk:       kind,\n         }\n         gb.graphChanges.Add(event)\n      },\n      UpdateFunc: func(oldObj, newObj interface{}) {\n         // TODO: check if there are differences in the ownerRefs,\n         // finalizers, and DeletionTimestamp; if not, ignore the update.\n         event := &event{\n            eventType: updateEvent,\n            obj:       newObj,\n            oldObj:    oldObj,\n            gvk:       kind,\n         }\n         gb.graphChanges.Add(event)\n      },\n      DeleteFunc: func(obj interface{}) {\n         // delta fifo may wrap the object in a cache.DeletedFinalStateUnknown, unwrap it\n         if deletedFinalStateUnknown, ok := obj.(cache.DeletedFinalStateUnknown); ok {\n            obj = deletedFinalStateUnknown.Obj\n         }\n         event := &event{\n            eventType: deleteEvent,\n            obj:       obj,\n            gvk:       kind,\n         }\n         gb.graphChanges.Add(event)\n      },\n   }\n   shared, err := gb.sharedInformers.ForResource(resource)\n   if err != nil {\n      klog.V(4).Infof(\"unable to use a shared informer for resource %q, kind %q: %v\", resource.String(), kind.String(), err)\n      return nil, nil, err\n   }\n   klog.V(4).Infof(\"using a shared informer for resource %q, kind %q\", resource.String(), kind.String())\n   // need to clone because it's from a shared cache\n   
shared.Informer().AddEventHandlerWithResyncPeriod(handlers, ResourceResyncTime)\n   return shared.Informer().GetController(), shared.Informer().GetStore(), nil\n}\n```\n\n<br>\n\n#### 2.2 启动garbageCollector\n\n```\nfunc (gc *GarbageCollector) Run(workers int, stopCh <-chan struct{}) {\n   defer utilruntime.HandleCrash()\n   defer gc.attemptToDelete.ShutDown()\n   defer gc.attemptToOrphan.ShutDown()\n   defer gc.dependencyGraphBuilder.graphChanges.ShutDown()\n\n   klog.Infof(\"Starting garbage collector controller\")\n   defer klog.Infof(\"Shutting down garbage collector controller\")\n   \n   // 1.启动dependencyGraphBuilder\n   go gc.dependencyGraphBuilder.Run(stopCh)\n\n   if !cache.WaitForNamedCacheSync(\"garbage collector\", stopCh, gc.dependencyGraphBuilder.IsSynced) {\n      return\n   }\n\n   klog.Infof(\"Garbage collector: all resource monitors have synced. Proceeding to collect garbage\")\n   \n   // 启动runAttemptToDeleteWorker，runAttemptToOrphanWorker\n   // gc workers\n   for i := 0; i < workers; i++ {\n      go wait.Until(gc.runAttemptToDeleteWorker, 1*time.Second, stopCh)\n      go wait.Until(gc.runAttemptToOrphanWorker, 1*time.Second, stopCh)\n   }\n\n   <-stopCh\n}\n```\n\n<br>\n\n##### 2.2.1 启动dependencyGraphBuilder\n\n```\n// Run sets the stop channel and starts monitor execution until stopCh is\n// closed. Any running monitors will be stopped before Run returns.\nfunc (gb *GraphBuilder) Run(stopCh <-chan struct{}) {\n\tklog.Infof(\"GraphBuilder running\")\n\tdefer klog.Infof(\"GraphBuilder stopping\")\n\n\t// Set up the stop channel.\n\tgb.monitorLock.Lock()\n\tgb.stopCh = stopCh\n\tgb.running = true\n\tgb.monitorLock.Unlock()\n\n\t// Start monitors and begin change processing until the stop channel is\n\t// closed.\n\t// 1. 启动各个资源的监听\n\tgb.startMonitors()\n\t// 2. 
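controllerFor 注册的事件处理与 graphChanges 队列的消费构成一条流水线：informer 的 add/update/delete 统一打包成 event 入队，由单独的循环消费。下面用 channel 代替真实实现中的 workqueue 做一个示意（event 结构与 processAll 函数均为示例假设）：

```go
package main

import "fmt"

// 简化版 event：真实实现中还带有 gvk、oldObj 等字段（示例假设）。
type event struct {
	eventType string // add / update / delete
	obj       string
}

// processAll 模拟 runProcessGraphChanges：循环从队列取出事件处理。
func processAll(graphChanges chan *event) []string {
	var processed []string
	for e := range graphChanges {
		processed = append(processed, e.eventType+":"+e.obj)
	}
	return processed
}

func main() {
	graphChanges := make(chan *event, 8)

	// 三种 handler 做的事情完全一样，只是 eventType 不同
	for _, et := range []string{"add", "update", "delete"} {
		graphChanges <- &event{eventType: et, obj: "podA"}
	}
	close(graphChanges)

	for _, p := range processAll(graphChanges) {
		fmt.Println(p)
	}
}
```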
runProcessGraphChanges开始处理各种事件\n\twait.Until(gb.runProcessGraphChanges, 1*time.Second, stopCh)\n\n  // 这里就是有monitor关闭后的处理\n\t// Stop any running monitors.\n\tgb.monitorLock.Lock()\n\tdefer gb.monitorLock.Unlock()\n\tmonitors := gb.monitors\n\tstopped := 0\n\tfor _, monitor := range monitors {\n\t\tif monitor.stopCh != nil {\n\t\t\tstopped++\n\t\t\tclose(monitor.stopCh)\n\t\t}\n\t}\n\n\t// reset monitors so that the graph builder can be safely re-run/synced.\n\tgb.monitors = nil\n\tklog.Infof(\"stopped %d of %d monitors\", stopped, len(monitors))\n}\n\n\n// 启动各个资源的监听\nfunc (gb *GraphBuilder) startMonitors() {\n\tgb.monitorLock.Lock()\n\tdefer gb.monitorLock.Unlock()\n\n\tif !gb.running {\n\t\treturn\n\t}\n\n\t// we're waiting until after the informer start that happens once all the controllers are initialized.  This ensures\n\t// that they don't get unexpected events on their work queues.\n\t<-gb.informersStarted\n\n\tmonitors := gb.monitors\n\tstarted := 0\n\tfor _, monitor := range monitors {\n\t\tif monitor.stopCh == nil {\n\t\t\tmonitor.stopCh = make(chan struct{})\n\t\t\tgb.sharedInformers.Start(gb.stopCh)\n\t\t\tgo monitor.Run()\n\t\t\tstarted++\n\t\t}\n\t}\n\tklog.V(4).Infof(\"started %d new monitors, %d currently running\", started, len(monitors))\n}\n```\n\n<br>\n\n##### 2.2.2 runAttemptToDeleteWorker\n\nrunAttemptToDeleteWorker就是从attemptToDelete队列中取出来一个对象处理。\n\n```\nfunc (gc *GarbageCollector) runAttemptToDeleteWorker() {\n   for gc.attemptToDeleteWorker() {\n   }\n}\n\nfunc (gc *GarbageCollector) attemptToDeleteWorker() bool {\n   item, quit := gc.attemptToDelete.Get()\n   ...\n   err := gc.attemptToDeleteItem(n)\n   ...\n   return true\n}\n```\n\n##### 2.2.3 runAttemptToOrphanWorker\n\nrunAttemptToOrphanWorker就是从attemptToOrphan队列中取出来一个对象处理。\n\n```\nfunc (gc *GarbageCollector) runAttemptToOrphanWorker() {\n   for gc.attemptToOrphanWorker() {\n   }\n}\n\n\nfunc (gc *GarbageCollector) attemptToOrphanWorker() bool {\n   item, quit := 
gc.attemptToOrphan.Get()\n  \n   defer gc.attemptToOrphan.Done(item)\n   owner, ok := item.(*node)\n   if !ok {\n      utilruntime.HandleError(fmt.Errorf(\"expect *node, got %#v\", item))\n      return true\n   }\n   // we don't need to lock each element, because they never get updated\n   owner.dependentsLock.RLock()\n   dependents := make([]*node, 0, len(owner.dependents))\n   for dependent := range owner.dependents {\n      dependents = append(dependents, dependent)\n   }\n   owner.dependentsLock.RUnlock()\n\n   err := gc.orphanDependents(owner.identity, dependents)\n   if err != nil {\n      utilruntime.HandleError(fmt.Errorf(\"orphanDependents for %s failed with %v\", owner.identity, err))\n      gc.attemptToOrphan.AddRateLimited(item)\n      return true\n   }\n   // update the owner, remove \"orphaningFinalizer\" from its finalizers list\n   err = gc.removeFinalizer(owner, metav1.FinalizerOrphanDependents)\n   if err != nil {\n      utilruntime.HandleError(fmt.Errorf(\"removeOrphanFinalizer for %s failed with %v\", owner.identity, err))\n      gc.attemptToOrphan.AddRateLimited(item)\n   }\n   return true\n}\n```\n\n<br>\n\n##### 2.2.4 总结\n\n（1）NewGarbageCollector初始化了GraphBuilder，以及attemptToDelete, attemptToOrphan两个队列，然后定义了资源变化时的事件处理逻辑\n\n（2）GarbageCollector.Run 做了三个工作。`第一是`，让监控的所有资源都用同一套处理逻辑，即：add, update, del都打包成一个event事件，然后加入graphChanges队列。`第二是`，启动runProcessGraphChanges处理graphChanges队列的对象。`第三是`，启动runAttemptToOrphanWorker，runAttemptToDeleteWorker进行gc处理。\n\n（3）到这里，总的来说逻辑就是：\n\n* NewGarbageCollector监听了所有支持 list, watch, delete操作的资源\n* 然后定义这些对象所有的add, update, del变化都扔进 graphChanges队列\n* 然后启动runProcessGraphChanges，处理graphChanges的对象。runProcessGraphChanges主要做两件事，一是维护图，二是将可能需要删除的对象扔进 attemptToOrphan 或者 attemptToDelete队列进行处理\n* runAttemptToOrphanWorker，runAttemptToDeleteWorker进行具体的gc处理。\n\n<br>\n\n到这里为止，gc的初始化，以及大概的流程都清楚了。接下来具体分析runProcessGraphChanges函数，以及runAttemptToOrphanWorker，runAttemptToDeleteWorker的处理逻辑。\n\n<br>\n\n#### 2.3  
runProcessGraphChanges\n\nrunProcessGraphChanges作用就是两件事：\n\n（1）时刻维护uidToNode图的正确和完整\n\n（2）将可能需要删除的对象扔进attemptToOrphan，attemptToDelete队列\n\n**具体逻辑如下：**\n\n（1）从 graphChanges 取出一个对象（event），然后判断图里面有没有这个对象。如果存在，将该节点标记为 observed，表示这个节点不是virtual节点。\n\n（2）分三种情况进行处理。具体是：\n\n```\nfunc (gb *GraphBuilder) runProcessGraphChanges() {\n\tfor gb.processGraphChanges() {\n\t}\n}\n\n// Dequeueing an event from graphChanges, updating graph, populating dirty_queue.\nfunc (gb *GraphBuilder) processGraphChanges() bool {\n\titem, quit := gb.graphChanges.Get()\n\tif quit {\n\t\treturn false\n\t}\n\tdefer gb.graphChanges.Done(item)\n\tevent, ok := item.(*event)\n\tif !ok {\n\t\tutilruntime.HandleError(fmt.Errorf(\"expect a *event, got %v\", item))\n\t\treturn true\n\t}\n\tobj := event.obj\n\taccessor, err := meta.Accessor(obj)\n\tif err != nil {\n\t\tutilruntime.HandleError(fmt.Errorf(\"cannot access obj: %v\", err))\n\t\treturn true\n\t}\n\tklog.V(5).Infof(\"GraphBuilder process object: %s/%s, namespace %s, name %s, uid %s, event type %v\", event.gvk.GroupVersion().String(), event.gvk.Kind, accessor.GetNamespace(), accessor.GetName(), string(accessor.GetUID()), event.eventType)\n\t// Check if the node already exists\n\t\n  // 1.判断图里面有没有这个对象\n\texistingNode, found := gb.uidToNode.Read(accessor.GetUID())\n\t// 1.1 如果存在，将其标记为 observed，表示这个节点不是virtual节点。\n\tif found {\n\t\t// this marks the node as having been observed via an informer event\n\t\t// 1. this depends on graphChanges only containing add/update events from the actual informer\n\t\t// 2. this allows things tracking virtual nodes' existence to stop polling and rely on informer events\n\t\texistingNode.markObserved()\n\t}\n\t\n\t// 2. 
分三种情况进行处理。\n\tswitch {\n\tcase (event.eventType == addEvent || event.eventType == updateEvent) && !found:\n\t\tnewNode := &node{\n\t\t\tidentity: objectReference{\n\t\t\t\tOwnerReference: metav1.OwnerReference{\n\t\t\t\t\tAPIVersion: event.gvk.GroupVersion().String(),\n\t\t\t\t\tKind:       event.gvk.Kind,\n\t\t\t\t\tUID:        accessor.GetUID(),\n\t\t\t\t\tName:       accessor.GetName(),\n\t\t\t\t},\n\t\t\t\tNamespace: accessor.GetNamespace(),\n\t\t\t},\n\t\t\tdependents:         make(map[*node]struct{}),\n\t\t\towners:             accessor.GetOwnerReferences(),\n\t\t\tdeletingDependents: beingDeleted(accessor) && hasDeleteDependentsFinalizer(accessor),\n\t\t\tbeingDeleted:       beingDeleted(accessor),\n\t\t}\n\t\tgb.insertNode(newNode)\n\t\t// the underlying delta_fifo may combine a creation and a deletion into\n\t\t// one event, so we need to further process the event.\n\t\tgb.processTransitions(event.oldObj, accessor, newNode)\n\tcase (event.eventType == addEvent || event.eventType == updateEvent) && found:\n\t\t// handle changes in ownerReferences\n\t\tadded, removed, changed := referencesDiffs(existingNode.owners, accessor.GetOwnerReferences())\n\t\tif len(added) != 0 || len(removed) != 0 || len(changed) != 0 {\n\t\t\t// check if the changed dependency graph unblock owners that are\n\t\t\t// waiting for the deletion of their dependents.\n\t\t\tgb.addUnblockedOwnersToDeleteQueue(removed, changed)\n\t\t\t// update the node itself\n\t\t\texistingNode.owners = accessor.GetOwnerReferences()\n\t\t\t// Add the node to its new owners' dependent lists.\n\t\t\tgb.addDependentToOwners(existingNode, added)\n\t\t\t// remove the node from the dependent list of node that are no longer in\n\t\t\t// the node's owners list.\n\t\t\tgb.removeDependentFromOwners(existingNode, removed)\n\t\t}\n\n\t\tif beingDeleted(accessor) {\n\t\t\texistingNode.markBeingDeleted()\n\t\t}\n\t\tgb.processTransitions(event.oldObj, accessor, existingNode)\n\tcase event.eventType == 
deleteEvent:\n\t\tif !found {\n\t\t\tklog.V(5).Infof(\"%v doesn't exist in the graph, this shouldn't happen\", accessor.GetUID())\n\t\t\treturn true\n\t\t}\n\t\t// removeNode updates the graph\n\t\tgb.removeNode(existingNode)\n\t\texistingNode.dependentsLock.RLock()\n\t\tdefer existingNode.dependentsLock.RUnlock()\n\t\tif len(existingNode.dependents) > 0 {\n\t\t\tgb.absentOwnerCache.Add(accessor.GetUID())\n\t\t}\n\t\tfor dep := range existingNode.dependents {\n\t\t\tgb.attemptToDelete.Add(dep)\n\t\t}\n\t\tfor _, owner := range existingNode.owners {\n\t\t\townerNode, found := gb.uidToNode.Read(owner.UID)\n\t\t\tif !found || !ownerNode.isDeletingDependents() {\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\t// this is to let attempToDeleteItem check if all the owner's\n\t\t\t// dependents are deleted, if so, the owner will be deleted.\n\t\t\tgb.attemptToDelete.Add(ownerNode)\n\t\t}\n\t}\n\treturn true\n}\n```\n\n<br>\n\n**第一种：** 如果图中不存在这个节点，并且事件为 add或者update，处理方法为：\n\n(1) 初始化一个node节点。然后插入到map中。\n\n```\ncase (event.eventType == addEvent || event.eventType == updateEvent) && !found:\n\t\tnewNode := &node{\n\t\t  // 该对象的标记，由APIVersion，Kind，UID，Name\n\t\t\tidentity: objectReference{\n\t\t\t\tOwnerReference: metav1.OwnerReference{\n\t\t\t\t\tAPIVersion: event.gvk.GroupVersion().String(),\n\t\t\t\t\tKind:       event.gvk.Kind,\n\t\t\t\t\tUID:        accessor.GetUID(),\n\t\t\t\t\tName:       accessor.GetName(),\n\t\t\t\t},\n\t\t\t\tNamespace: accessor.GetNamespace(),\n\t\t\t},\n\t\t\tdependents:         make(map[*node]struct{}),          // 这里现在是空的\n\t\t\towners:             accessor.GetOwnerReferences(),\n\t\t\t// 判断是否是删dependent\n\t\t\tdeletingDependents: beingDeleted(accessor) && hasDeleteDependentsFinalizer(accessor),   \n\t\t\t// 判断是否在正在删除\n\t\t\tbeingDeleted:       beingDeleted(accessor),\n\t\t}\n\t\tgb.insertNode(newNode)\n\t\t// the underlying delta_fifo may combine a creation and a deletion into\n\t\t// one event, so we need to further process the 
event.\n\t\tgb.processTransitions(event.oldObj, accessor, newNode)\n```\n\n（2）insertNode，将这个节点加入map中，并且将这个node加入所有的owner node的dependent中。\n\n假设当前是当前节点是rsA, 这一步会将rsA加入map中，并且增加deployA的一个dependent为rsA.\n\n（3）调用processTransitions进行进一步的处理。processTransitions是一个通用函数，它的作用就是将这个对象放入放到AttemptToOrphan或者AttemptToDelete队列，这个等下具体介绍\n\n<br>\n\n**第二种**，  如果图中存在这个节点，并且事件为 add或者update，处理方法为：\n\n（1）处理references Diff\n\n* 首先根据节点的信息 和 对象最新的信息，判断OwnerReference的变化。这里分为三种变化：\n\n  added 表示该对象的OwnerReference中新增了哪些 owner;  removed表示该对象删除了哪些owner；changed表示哪些改变了\n\n* 针对这三种变化做出的处理如下：\n\n  a. 调用addUnblockedOwnersToDeleteQueue将可能阻塞的owner重新加入队列。具体可以看代码注释中的分析\n\n  b. existingNode.owners = accessor.GetOwnerReferences(), 让节点使用最新的owner\n\n  c. 新增了owner，需要在新增owner中的Dependents增加一个Dependent, 就是该节点\n\n  d. 删除了owner，需要在原来的owner中的Dependents删除这个Dependent, 就是该节点\n\n（2） 如果当前对象有deletionStamp，标记这个节点正在删除\n\n（3）调用processTransitions进行进一步的处理。processTransitions是一个通用函数，它的作用就是将这个对象放入放到AttemptToOrphan或者AttemptToDelete队列，这个等下具体介绍\n\n```\ncase (event.eventType == addEvent || event.eventType == updateEvent) && found:\n\t\t// handle changes in ownerReferences\n\t\tadded, removed, changed := referencesDiffs(existingNode.owners, accessor.GetOwnerReferences())\n\t\tif len(added) != 0 || len(removed) != 0 || len(changed) != 0 {\n\t\t\t// check if the changed dependency graph unblock owners that are\n\t\t\t// waiting for the deletion of their dependents.\n\t\t\t// a.调用addUnblockedOwnersToDeleteQueue将可能阻塞的owner重新加入队列。具体可以看代码注释中的分析\n\t\t\tgb.addUnblockedOwnersToDeleteQueue(removed, changed)\n\t\t\t// update the node itself\n\t\t\t// b.让节点使用最新的owner\n\t\t\texistingNode.owners = accessor.GetOwnerReferences()\n\t\t\t// Add the node to its new owners' dependent lists.\n\t\t\t// c. 新增了owner，需要在新增owner中的Dependents增加一个Dependent, 就是该节点\n\t\t\tgb.addDependentToOwners(existingNode, added)\n\t\t\t// remove the node from the dependent list of node that are no longer in\n\t\t\t// the node's owners list.\n\t\t\t// d. 
删除了owner，需要在原来的owner中的Dependents删除这个Dependent, 就是该节点\n\t\t\tgb.removeDependentFromOwners(existingNode, removed)\n\t\t}\n    \n\t\tif beingDeleted(accessor) {\n\t\t\texistingNode.markBeingDeleted()\n\t\t}\n\t\tgb.processTransitions(event.oldObj, accessor, existingNode)\n\t\t\n\t\t\n\n// TODO: profile this function to see if a naive N^2 algorithm performs better\n// when the number of references is small.\nfunc referencesDiffs(old []metav1.OwnerReference, new []metav1.OwnerReference) (added []metav1.OwnerReference, removed []metav1.OwnerReference, changed []ownerRefPair) {\n   oldUIDToRef := make(map[string]metav1.OwnerReference)\n   for _, value := range old {\n      oldUIDToRef[string(value.UID)] = value\n   }\n   oldUIDSet := sets.StringKeySet(oldUIDToRef)\n   for _, value := range new {\n      newUID := string(value.UID)\n      if oldUIDSet.Has(newUID) {\n         if !reflect.DeepEqual(oldUIDToRef[newUID], value) {\n            changed = append(changed, ownerRefPair{oldRef: oldUIDToRef[newUID], newRef: value})\n         }\n         oldUIDSet.Delete(newUID)\n      } else {\n         added = append(added, value)\n      }\n   }\n   for oldUID := range oldUIDSet {\n      removed = append(removed, oldUIDToRef[oldUID])\n   }\n\n   return added, removed, changed\n}\n\n\n// 以foreground方式删除deployA的时候，deployA会被Block，原因在于它在等 rsA的删除。\n// 这个时候如果改变rsA的OwnerReference，比如删除owner, deployA。这个时候需要通知deployA,你不用等了，可以直接删除了。\n// addUnblockedOwnersToDeleteQueue就是做这样的事情，检测到rsA的OwnerReference变化，将等待的deployA加入删除队列。\n// if an blocking ownerReference points to an object gets removed, or gets set to\n// \"BlockOwnerDeletion=false\", add the object to the attemptToDelete queue.\nfunc (gb *GraphBuilder) addUnblockedOwnersToDeleteQueue(removed []metav1.OwnerReference, changed []ownerRefPair) {\n\tfor _, ref := range removed {\n\t\tif ref.BlockOwnerDeletion != nil && *ref.BlockOwnerDeletion {\n\t\t\tnode, found := gb.uidToNode.Read(ref.UID)\n\t\t\tif !found {\n\t\t\t\tklog.V(5).Infof(\"cannot find 
%s in uidToNode\", ref.UID)\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tgb.attemptToDelete.Add(node)\n\t\t}\n\t}\n\tfor _, c := range changed {\n\t\twasBlocked := c.oldRef.BlockOwnerDeletion != nil && *c.oldRef.BlockOwnerDeletion\n\t\tisUnblocked := c.newRef.BlockOwnerDeletion == nil || (c.newRef.BlockOwnerDeletion != nil && !*c.newRef.BlockOwnerDeletion)\n\t\tif wasBlocked && isUnblocked {\n\t\t\tnode, found := gb.uidToNode.Read(c.newRef.UID)\n\t\t\tif !found {\n\t\t\t\tklog.V(5).Infof(\"cannot find %s in uidToNode\", c.newRef.UID)\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tgb.attemptToDelete.Add(node)\n\t\t}\n\t}\n}\n```\n\n<br>\n\n**第三种**，这个对象已经删除, 处理方法为：\n\n（1）从图中删除这个节点，如果这个节点有dependents，将这个节点加入absentOwnerCache。这个是非常有用的。假如deployA删除了，rsA通过absentOwnerCache能判断，deployA确实存在，并且被删除了。\n\n（2）将所有的依赖加入attemptToDelete队列\n\n（3）如果这个节点有owners，并且处于删除Dependents中，那么很有可能它的owners正在等自己。现在自己删除了，所以将owners再加入删除队列\n\n```\ncase event.eventType == deleteEvent:\n\t\tif !found {\n\t\t\tklog.V(5).Infof(\"%v doesn't exist in the graph, this shouldn't happen\", accessor.GetUID())\n\t\t\treturn true\n\t\t}\n\t\t// removeNode updates the graph\n\t\tgb.removeNode(existingNode)\n\t\texistingNode.dependentsLock.RLock()\n\t\tdefer existingNode.dependentsLock.RUnlock()\n\t\tif len(existingNode.dependents) > 0 {\n\t\t\tgb.absentOwnerCache.Add(accessor.GetUID())\n\t\t}\n\t\tfor dep := range existingNode.dependents {\n\t\t\tgb.attemptToDelete.Add(dep)\n\t\t}\n\t\tfor _, owner := range existingNode.owners {\n\t\t\townerNode, found := gb.uidToNode.Read(owner.UID)\n\t\t\tif !found || !ownerNode.isDeletingDependents() {\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\t// this is to let attempToDeleteItem check if all the owner's\n\t\t\t// dependents are deleted, if so, the owner will be deleted.\n\t\t\tgb.attemptToDelete.Add(ownerNode)\n\t\t}\n\t}\n```\n\n<br>\n\n#### 2.4 
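上面第二种情况的核心是 referencesDiffs：按 UID 对比新旧 owner 列表，得出新增、删除和内容变化的 owner。下面是一个只依赖标准库的简化版演示（OwnerRef 为演示用的假设结构体，只保留 UID 和 BlockOwnerDeletion 两个字段，不是 metav1.OwnerReference）：

```go
package main

import (
	"fmt"
	"reflect"
)

// OwnerRef 是为演示假设的简化结构，对应 metav1.OwnerReference 的关键字段
type OwnerRef struct {
	UID                string
	BlockOwnerDeletion bool
}

// referencesDiffs 简化版：返回新增、被删除、以及 UID 相同但内容变化的 owner
func referencesDiffs(old, new []OwnerRef) (added, removed []OwnerRef, changed [][2]OwnerRef) {
	oldByUID := make(map[string]OwnerRef)
	for _, r := range old {
		oldByUID[r.UID] = r
	}
	for _, r := range new {
		if o, ok := oldByUID[r.UID]; ok {
			// UID 相同但字段变化，记为 changed（old/new 成对返回）
			if !reflect.DeepEqual(o, r) {
				changed = append(changed, [2]OwnerRef{o, r})
			}
			delete(oldByUID, r.UID)
		} else {
			added = append(added, r)
		}
	}
	// 旧列表里剩下的就是被删除的 owner
	for _, r := range oldByUID {
		removed = append(removed, r)
	}
	return added, removed, changed
}

func main() {
	old := []OwnerRef{{UID: "deployA", BlockOwnerDeletion: true}, {UID: "jobB"}}
	new := []OwnerRef{{UID: "deployA", BlockOwnerDeletion: false}, {UID: "dsC"}}
	added, removed, changed := referencesDiffs(old, new)
	fmt.Println(len(added), len(removed), len(changed)) // 1 1 1
}
```

例子中 deployA 的 BlockOwnerDeletion 从 true 变为 false，正是 addUnblockedOwnersToDeleteQueue 关心的那种 changed：owner 从被阻塞变为不再阻塞。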
processTransitions函数的处理逻辑\n\n从上面的分析可以看出来，runProcessGraphChanges就做了两件事情：\n\n（1）时刻维护图的正确和完整\n\n（2）将可能需要删除的对象扔进attemptToOrphan，attemptToDelete队列\n\nprocessTransitions就是做第二件事情，将可能需要删除的对象扔进attemptToOrphan，attemptToDelete队列。\n\n判断的逻辑很简单：\n\n（1）如果这个对象正在删除，并且有orphan这个Finalizer，就将它扔进attemptToOrphan队列\n\n（2）如果这个对象正在删除，并且有foregroundDeletion这个Finalizer，就将它和它的dependents扔进attemptToDelete队列\n\n```\nfunc (gb *GraphBuilder) processTransitions(oldObj interface{}, newAccessor metav1.Object, n *node) {\n\n\tif startsWaitingForDependentsOrphaned(oldObj, newAccessor) {\n\t\tklog.V(5).Infof(\"add %s to the attemptToOrphan\", n.identity)\n\t\tgb.attemptToOrphan.Add(n)\n\t\treturn\n\t}\n\t\n\tif startsWaitingForDependentsDeleted(oldObj, newAccessor) {\n\t\tklog.V(2).Infof(\"add %s to the attemptToDelete, because it's waiting for its dependents to be deleted\", n.identity)\n\t\t// if the n is added as a \"virtual\" node, its deletingDependents field is not properly set, so always set it here.\n\t\tn.markDeletingDependents()\n\t\tfor dep := range n.dependents {\n\t\t\tgb.attemptToDelete.Add(dep)\n\t\t}\n\t\tgb.attemptToDelete.Add(n)\n\t}\n}\n```\n\n<br>\n\n#### 2.5 runAttemptToOrphanWorker\n\nrunAttemptToOrphanWorker逻辑如下：\n\n（1）获得这个节点的所有dependents\n\n（2）调用orphanDependents，删除这些dependents的OwnerReferences中指向该节点的条目\n\n（3）删除orphan这个Finalizer，让该对象可以被删除\n\n```\nfunc (gc *GarbageCollector) runAttemptToOrphanWorker() {\n   for gc.attemptToOrphanWorker() {\n   }\n}\n\n// attemptToOrphanWorker dequeues a node from the attemptToOrphan, then finds its\n// dependents based on the graph maintained by the GC, then removes it from the\n// OwnerReferences of its dependents, and finally updates the owner to remove\n// the \"Orphan\" finalizer. 
The node is added back into the attemptToOrphan if any of\n// these steps fail.\nfunc (gc *GarbageCollector) attemptToOrphanWorker() bool {\n   item, quit := gc.attemptToOrphan.Get()\n   gc.workerLock.RLock()\n   defer gc.workerLock.RUnlock()\n   if quit {\n      return false\n   }\n   defer gc.attemptToOrphan.Done(item)\n   owner, ok := item.(*node)\n   if !ok {\n      utilruntime.HandleError(fmt.Errorf(\"expect *node, got %#v\", item))\n      return true\n   }\n   // we don't need to lock each element, because they never get updated\n   owner.dependentsLock.RLock()\n   dependents := make([]*node, 0, len(owner.dependents))\n   // 1.获得这个节点的所有orphanDependents\n   for dependent := range owner.dependents {\n      dependents = append(dependents, dependent)\n   }\n   owner.dependentsLock.RUnlock()\n   \n   // 2.调用orphanDependents，删除它的orphanDependents的OwnerReferences。\n   // 举例来说，删除deployA时，删除rsA的OwnerReference，这样rsA就不受deployA控制了。\n   err := gc.orphanDependents(owner.identity, dependents)\n   if err != nil {\n      utilruntime.HandleError(fmt.Errorf(\"orphanDependents for %s failed with %v\", owner.identity, err))\n      gc.attemptToOrphan.AddRateLimited(item)\n      return true\n   }\n   // update the owner, remove \"orphaningFinalizer\" from its finalizers list\n   // 3. 
删除orphan这个Finalizer,让deployA可以被删除\n   err = gc.removeFinalizer(owner, metav1.FinalizerOrphanDependents)\n   if err != nil {\n      utilruntime.HandleError(fmt.Errorf(\"removeOrphanFinalizer for %s failed with %v\", owner.identity, err))\n      gc.attemptToOrphan.AddRateLimited(item)\n   }\n   return true\n}\n```\n\n<br>\n\n#### 2.6 attemptToDeleteWorker\n\n主要调用attemptToDeleteItem函数。attemptToDeleteItem的逻辑如下：\n\n（1）如果该对象isBeingDeleted,并且没有在删除Dependents，直接返回\n\n（2）如果该对象正在删除dependents, 将dependents加入attemptToDelete队列\n\n（3）调用classifyReferences，计算solid，dangling，waitingForDependentsDeletion的情况，solid，dangling，waitingForDependentsDeletion是OwnerReferences数组\n\nsolid：当前节点的owner存在，并且owner的状态不是删除Dependents中\n\ndangling：owner不存在\n\nwaitingForDependentsDeletion：owner存在，并且owner的状态是删除Dependents中\n\n（4）根据solid，dangling，waitingForDependentsDeletion的情况进行不同的处理，具体如下：\n\n*  情况1: 如果有至少有一个owner存在，并且不处于删除依赖中。这个时候判断dangling，waitingForDependentsDeletion的数量是否为0。如果为0，说明当前不需要处理；否则，将该节点对应dangling，waitingForDependentsDeletion的节点删除dependents。\n* 情况2: 到这里说明 len(solid)=0，这个时候如果有节点在等待这个节点删除，并且这个节点还有依赖，那么将这个节点的blockOwnerDeletion设置为true。然后后台删除这个节点。\n  这里举一个例子说明：当前台模式删除deployA时，rsA是当前要处理的节点。这个时候rsA发现deployA再等自己删除，但是自己又有依赖podA，所以这里马上将自己设置为前台删除。这样在deployA看来就实现了先删除podA, 再删除rsA，再删除deployA。\n* 情况3: 除了上面的两种情况，根据设置的删除策略删除这个节点。\n\n​       这里举一个例子说明：当后台模式删除deployA时，rsA是当前要处理的节点。这个时候deployA已经删除了，同时没有finalizer，因为只有Orphan, foreGround有finalizer，所以这个时候直接默认以background删除这个节点。\n\n```\nfunc (gc *GarbageCollector) attemptToDeleteWorker() bool {\n   item, quit := gc.attemptToDelete.Get()\n\n   err := gc.attemptToDeleteItem(n)\n\n   return true\n}\n\n\nfunc (gc *GarbageCollector) attemptToDeleteItem(item *node) error {\n\tklog.V(2).Infof(\"processing item %s\", item.identity)\n\t// \"being deleted\" is an one-way trip to the final deletion. 
We'll just wait for the final deletion, and then process the object's dependents.\n\t// 1.如果该对象isBeingDeleted,并且没有在删除Dependents，直接返回\n\tif item.isBeingDeleted() && !item.isDeletingDependents() {\n\t\tklog.V(5).Infof(\"processing item %s returned at once, because its DeletionTimestamp is non-nil\", item.identity)\n\t\treturn nil\n\t}\n\t// TODO: It's only necessary to talk to the API server if this is a\n\t// \"virtual\" node. The local graph could lag behind the real status, but in\n\t// practice, the difference is small.\n\tlatest, err := gc.getObject(item.identity)\n\tswitch {\n\tcase errors.IsNotFound(err):\n\t\t// the GraphBuilder can add \"virtual\" node for an owner that doesn't\n\t\t// exist yet, so we need to enqueue a virtual Delete event to remove\n\t\t// the virtual node from GraphBuilder.uidToNode.\n\t\tklog.V(5).Infof(\"item %v not found, generating a virtual delete event\", item.identity)\n\t\tgc.dependencyGraphBuilder.enqueueVirtualDeleteEvent(item.identity)\n\t\t// since we're manually inserting a delete event to remove this node,\n\t\t// we don't need to keep tracking it as a virtual node and requeueing in attemptToDelete\n\t\titem.markObserved()\n\t\treturn nil\n\tcase err != nil:\n\t\treturn err\n\t}\n\n\tif latest.GetUID() != item.identity.UID {\n\t\tklog.V(5).Infof(\"UID doesn't match, item %v not found, generating a virtual delete event\", item.identity)\n\t\tgc.dependencyGraphBuilder.enqueueVirtualDeleteEvent(item.identity)\n\t\t// since we're manually inserting a delete event to remove this node,\n\t\t// we don't need to keep tracking it as a virtual node and requeueing in attemptToDelete\n\t\titem.markObserved()\n\t\treturn nil\n\t}\n\n\t// TODO: attemptToOrphanWorker() routine is similar. Consider merging\n\t// attemptToOrphanWorker() into attemptToDeleteItem() as well.\n\t// 2. 
如果该对象正在删除dependents, 将dependents加入attemptToDelete队列\n\tif item.isDeletingDependents() {\n\t\treturn gc.processDeletingDependentsItem(item)\n\t}\n  \n\t// compute if we should delete the item\n\townerReferences := latest.GetOwnerReferences()\n\tif len(ownerReferences) == 0 {\n\t\tklog.V(2).Infof(\"object %s's doesn't have an owner, continue on next item\", item.identity)\n\t\treturn nil\n\t}\n  \n  // 3.计算solid，dangling，waitingForDependentsDeletion的情况。\n\tsolid, dangling, waitingForDependentsDeletion, err := gc.classifyReferences(item, ownerReferences)\n\tif err != nil {\n\t\treturn err\n\t}\n\tklog.V(5).Infof(\"classify references of %s.\\nsolid: %#v\\ndangling: %#v\\nwaitingForDependentsDeletion: %#v\\n\", item.identity, solid, dangling, waitingForDependentsDeletion)\n\n\n  // 4.根据solid，dangling，waitingForDependentsDeletion的情况进行不同的处理\n\tswitch {\n\t// 情况1: 如果有至少有一个owner存在，并且不处于删除依赖中。这个时候判断dangling，waitingForDependentsDeletion的数量是否为0。如果为0，说明当前不需要处理；否则，将该节点对应dangling，waitingForDependentsDeletion的节点删除dependents。\n\tcase len(solid) != 0:\n\t\tklog.V(2).Infof(\"object %#v has at least one existing owner: %#v, will not garbage collect\", item.identity, solid)\n\t\tif len(dangling) == 0 && len(waitingForDependentsDeletion) == 0 {\n\t\t\treturn nil\n\t\t}\n\t\tklog.V(2).Infof(\"remove dangling references %#v and waiting references %#v for object %s\", dangling, waitingForDependentsDeletion, item.identity)\n\t\t// waitingForDependentsDeletion needs to be deleted from the\n\t\t// ownerReferences, otherwise the referenced objects will be stuck with\n\t\t// the FinalizerDeletingDependents and never get deleted.\n\t\townerUIDs := append(ownerRefsToUIDs(dangling), ownerRefsToUIDs(waitingForDependentsDeletion)...)\n\t\tpatch := deleteOwnerRefStrategicMergePatch(item.identity.UID, ownerUIDs...)\n\t\t_, err = gc.patch(item, patch, func(n *node) ([]byte, error) {\n\t\t\treturn gc.deleteOwnerRefJSONMergePatch(n, ownerUIDs...)\n\t\t})\n\t\treturn err\n\t// 情况2: 到这里说明 
len(solid)=0，这个时候如果有节点在等待这个节点删除，并且这个节点还有依赖，那么将这个节点的blockOwnerDeletion设置为true。然后后台删除这个节点。\n\tcase len(waitingForDependentsDeletion) != 0 && item.dependentsLength() != 0:\n\t\tdeps := item.getDependents()\n\t\tfor _, dep := range deps {\n\t\t\tif dep.isDeletingDependents() {\n\t\t\t\t// this circle detection has false positives, we need to\n\t\t\t\t// apply a more rigorous detection if this turns out to be a\n\t\t\t\t// problem.\n\t\t\t\t// there are multiple workers run attemptToDeleteItem in\n\t\t\t\t// parallel, the circle detection can fail in a race condition.\n\t\t\t\tklog.V(2).Infof(\"processing object %s, some of its owners and its dependent [%s] have FinalizerDeletingDependents, to prevent potential cycle, its ownerReferences are going to be modified to be non-blocking, then the object is going to be deleted with Foreground\", item.identity, dep.identity)\n\t\t\t\tpatch, err := item.unblockOwnerReferencesStrategicMergePatch()\n\t\t\t\tif err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\t\t\t\tif _, err := gc.patch(item, patch, gc.unblockOwnerReferencesJSONMergePatch); err != nil {\n\t\t\t\t\treturn err\n\t\t\t\t}\n\t\t\t\tbreak\n\t\t\t}\n\t\t}\n\t\tklog.V(2).Infof(\"at least one owner of object %s has FinalizerDeletingDependents, and the object itself has dependents, so it is going to be deleted in Foreground\", item.identity)\n\t\t// the deletion event will be observed by the graphBuilder, so the item\n\t\t// will be processed again in processDeletingDependentsItem. If it\n\t\t// doesn't have dependents, the function will remove the\n\t\t// FinalizerDeletingDependents from the item, resulting in the final\n\t\t// deletion of the item.\n\t\tpolicy := metav1.DeletePropagationForeground\n\t\treturn gc.deleteObject(item.identity, &policy)\n\t// 情况3: 除了上面的两种情况，根据设置的删除策略删除这个节点\n\tdefault:\n\t\t// item doesn't have any solid owner, so it needs to be garbage\n\t\t// collected. 
Also, none of item's owners is waiting for the deletion of\n\t\t// the dependents, so set propagationPolicy based on existing finalizers.\n\t\tvar policy metav1.DeletionPropagation\n\t\tswitch {\n\t\tcase hasOrphanFinalizer(latest):\n\t\t\t// if an existing orphan finalizer is already on the object, honor it.\n\t\t\tpolicy = metav1.DeletePropagationOrphan\n\t\tcase hasDeleteDependentsFinalizer(latest):\n\t\t\t// if an existing foreground finalizer is already on the object, honor it.\n\t\t\tpolicy = metav1.DeletePropagationForeground\n\t\tdefault:\n\t\t\t// otherwise, default to background.\n\t\t\tpolicy = metav1.DeletePropagationBackground\n\t\t}\n\t\tklog.V(2).Infof(\"delete object %s with propagation policy %s\", item.identity, policy)\n\t\treturn gc.deleteObject(item.identity, &policy)\n\t}\n}\n```\n\n<br>\n\n#### 2.7 uidToNode到底是什么\n\n在startGarbageCollectorController的时候 开启debug模式\n\n```\nreturn garbagecollector.NewDebugHandler(garbageCollector), true, nil\n```\n\n利用这个，我们可以看到uidToNode里的数据。数据太多，我这里就看 kube-system命名空间，kube-hpa这个deploy 在uidToNode的数据。\n\nkcm对应的10252端口\n\n```\n。看这个\n// 639d5269-d73d-4964-a7de-d6f386c9c7e4是kube-hpa这个deploy的uid。\n# curl http://127.0.0.1:10252/debug/controllers/garbagecollector/graph?uid=639d5269-d73d-4964-a7de-d6f386c9c7e4\nstrict digraph full {\n  // Node definitions.\n  0 [\n    label=\"\\\"uid=e66e45c0-5695-4c93-82f1-067b20aa035f\\nnamespace=kube-system\\nReplicaSet.v1.apps/kube-hpa-84c884f994\\n\\\"\"\n    group=\"apps\"\n    version=\"v1\"\n    kind=\"ReplicaSet\"\n    namespace=\"kube-system\"\n    name=\"kube-hpa-84c884f994\"\n    uid=\"e66e45c0-5695-4c93-82f1-067b20aa035f\"\n    missing=\"false\"\n    beingDeleted=\"false\"\n    deletingDependents=\"false\"\n    virtual=\"false\"\n  ];\n  1 [\n    label=\"\\\"uid=9833c399-b139-4432-98f7-cec13158f804\\nnamespace=kube-system\\nPod.v1/kube-hpa-84c884f994-7gwpz\\n\\\"\"\n    group=\"\"\n    version=\"v1\"\n    kind=\"Pod\"\n    namespace=\"kube-system\"\n    
name=\"kube-hpa-84c884f994-7gwpz\"\n    uid=\"9833c399-b139-4432-98f7-cec13158f804\"\n    missing=\"false\"\n    beingDeleted=\"false\"\n    deletingDependents=\"false\"\n    virtual=\"false\"\n  ];\n  2 [\n    label=\"\\\"uid=639d5269-d73d-4964-a7de-d6f386c9c7e4\\nnamespace=kube-system\\nDeployment.v1.apps/kube-hpa\\n\\\"\"\n    group=\"apps\"\n    version=\"v1\"\n    kind=\"Deployment\"\n    namespace=\"kube-system\"\n    name=\"kube-hpa\"\n    uid=\"639d5269-d73d-4964-a7de-d6f386c9c7e4\"\n    missing=\"false\"\n    beingDeleted=\"false\"\n    deletingDependents=\"false\"\n    virtual=\"false\"\n  ];\n\n  // Edge definitions.\n  0 -> 2;\n  1 -> 0;\n}\n```\n\n可以看出来，这个图表示了节点间的依赖关系，同时beingDeleted, deletingDependents表示了当前节点的状态。\n\n还可以借助graphviz把图画出来：\n\n```\ncurl http://127.0.0.1:10252/debug/controllers/garbagecollector/graph?uid=639d5269-d73d-4964-a7de-d6f386c9c7e4 > tmp.dot\n\ndot -Tsvg -o graph.svg tmp.dot\n```\n\ngraph.svg如下：\n\n![graph](../images/graph.svg)\n\n\n\n### 3.总结\n\ngc这块的逻辑非常绕，也非常难懂，但是多看几遍就会发现其中的妙处。这里再次总结一下整个流程。\n\n（1）kcm启动时，gc controller随之启动。gc启动时做了以下初始化工作（见下图）：\n\n* 定期获取所有可删除的资源，保存到RESTMapper，然后启动这些资源的监听\n* 对这些资源设置add, update, delete事件的处理逻辑：只要有变化就将其封装成一个event，然后扔进graphChanges队列\n\n（2）runProcessGraphChanges负责处理graphChanges队列中的对象。主要做了两件事情：\n\n* 第一，根据不同的变化，维护uidToNode这个图。一个对象对应uidToNode中的一个节点，同时该节点有owners, dependents字段。\n* 第二，根据节点的beingDeleted, deletingDependents等字段，判断该节点是否可能要删除。如果要删除，将其扔进attemptToDelete, attemptToOrphan队列\n\n（3）attemptToDeleteWorker, attemptToOrphanWorker负责处理attemptToDelete, attemptToOrphan队列，根据不同的情况进行删除\n\n ![gc-1](../images/gc-1.png)\n\n<br>"
  },
  {
    "path": "k8s/kcm/3-k8s中以不同的策略删除资源时发生了什么.md",
    "content": "Table of Contents\n=================\n\n  * [1. 孤儿模式](#1-孤儿模式)\n  * [2. 后台模式](#2-后台模式)\n  * [3. 前台模式](#3-前台模式)\n  * [4. 总结](#4-总结)\n  * [5. 方法论](#5-方法论)\n     * [5.1 看deployA的yaml发生了什么变化](#51-看deploya的yaml发生了什么变化)\n     * [5.2 增大kcm的日志等级，查看gc的日志](#52-增大kcm的日志等级查看gc的日志)\n     * [5.3 增大apiserver的日志等级，查看apiserver的处理](#53-增大apiserver的日志等级查看apiserver的处理)\n     \n     \n\n接上篇gc源码分析，这篇主要总结以在不同的删除策略（孤儿，前台，后台）模式下，删除k8s资源发生了什么。\n\n以下都是以 deployA , rsA, podA作为介绍。（这个可以类比为任何有这种依赖关系的资源）\n\n### 1. 孤儿模式\n\n孤儿模式删除deployA： deployA会被删除，rsA不会删除，但是rsA的OwnerReference里deployA会被删除。\n\n具体的流程如下：\n\n（1)  客户端发起kubectl delete deploy deployA --cascade=false\n\n（2）apiserver接收到请求，发现删除模式是organ。这个时候apiserver会做俩件事情：\n\n* 设置deployA的deletionStamp\n* 增加一个finalizer，organ\n\n**这个时候apiserver会直接返回，不会一直阻塞在这里等**\n\n（3）这个时候由于apiserver对deployA更新了。所以gc收到了deployA的**更新**事件，然后开始处理工作：\n\n* 一，维护uidToNode图，就是删除了deployA这个node节点，并且将rsA节点的onwer删除。\n* 二，将rsA这个对象的OwnerReference中的deployA删除；\n* 三，将deployA这个对象的organ finalizer删除\n\n（4）将deployA这个对象的organ finalizer删除实际上是一个更新事件。这个时候apiserver收到这个更新事件，发现deployA的所以finalizer被删除了，这个时候调用restful接口真正的删除 deployA。\n\n<br>\n\n### 2. 后台模式\n\n后台模式删除deployA： deployA会被马上删除，然后删除rsA，最后删除pod\n\n具体的流程如下：\n\n（1)  客户端发起kubectl delete deployA propagationPolicy\":\"Background\"\n\n（2）apiserver接收到请求，发现删除模式是Background。这个时候apiserver会直接将deployA删除。\n\n（3）这个时候由于apiserver删除了deployA。所以gc收到了deployA的**删除**事件，然后开始处理工作：\n\n* 一，维护uidToNode图，就是删除了deployA这个node节点，并且将rsA扔进attemptToDelete队列\n* 二，处理rsA时，发现它的owner已经不存在了，所以马上以backgroud的方式，再删除rsA。\n* 三，然后就是同样的操作，先删除了rsA，然后删除了pod。\n\n<br>\n\n### 3. 
前台模式\n\n前台模式删除deployA：podA会先删除，然后是rsA，最后是deployA。\n\n具体的流程如下：\n\n（1）客户端发起删除请求，DeleteOptions中的propagationPolicy为Foreground（新版kubectl中对应 --cascade=foreground）\n\n（2）apiserver接收到请求，发现删除策略是Foreground。这个时候apiserver会做两件事情：\n\n* 设置deployA的deletionTimestamp\n* 增加一个finalizer：foregroundDeletion\n\n**这个时候apiserver会直接返回，不会一直阻塞在这里等**\n\n（3）由于apiserver更新了deployA，gc收到了deployA的更新事件，然后开始处理工作。\n\n具体为：\n\n一，维护uidToNode图。\n\n首先deployA这个node节点会被标记为“删除dependents中”，然后将deployA的依赖（rsA）加入attemptToDelete队列。\n\n处理rsA时，发现rsA的owner在等待删除dependents，并且rsA还有自己的dependents，所以这个时候就调用**前台删除**接口来删除rsA。\n\n同样，前台删除rsA时，先将rsA这个node节点标记为“删除dependents中”，然后将rsA的依赖（podA）加入attemptToDelete队列。\n\n处理podA时，发现podA的owner在等待删除dependents，但是podA没有自己的dependents，所以这个时候就调用**后台删除**接口来删除podA。\n\n后台删除podA后，apiserver会直接将podA这个对象删除，所以gc收到了删除事件。这个时候会将podA这个节点从图中删除，然后再将rsA加入attemptToDelete队列。\n\n接下来rsA发现自己的dependents都删除了，所以rsA的foregroundDeletion finalizer会被摘掉，然后apiserver就会将rsA删除。\n\n然后gc收到了rsA的删除事件，同样的操作，最后deployA被删除。这样在deployA看来就实现了先删除podA，再删除rsA，最后删除deployA。\n\n<br>\n\n### 4. 总结\n\ngc的机制非常巧妙，而且和apiserver进行了联动。在实际工作中运用这种gc机制也非常有用。比如有两个本来不相关的对象，通过设置OwnerReference，就可以实现两个对象的级联删除。\n\n<br>\n\n### 5. 
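上面三种策略最终都体现在删除请求的 propagationPolicy 上。attemptToDeleteItem 的默认分支根据对象现有 finalizer 选择策略，这个逻辑可以简化成如下示意（这里用字符串代替 metav1.DeletionPropagation 类型，finalizer 名字 orphan / foregroundDeletion 与 k8s 中一致）：

```go
package main

import "fmt"

// 对应 metav1.DeletionPropagation 的三个取值，这里用字符串简化表示
const (
	policyOrphan     = "Orphan"
	policyForeground = "Foreground"
	policyBackground = "Background"
)

// choosePolicy 模拟 attemptToDeleteItem 默认分支的选择逻辑：
// 对象上已有 orphan finalizer 则沿用 Orphan；
// 已有 foregroundDeletion finalizer 则沿用 Foreground；否则默认 Background
func choosePolicy(finalizers []string) string {
	for _, f := range finalizers {
		if f == "orphan" {
			return policyOrphan
		}
	}
	for _, f := range finalizers {
		if f == "foregroundDeletion" {
			return policyForeground
		}
	}
	return policyBackground
}

func main() {
	fmt.Println(choosePolicy([]string{"orphan"}))             // Orphan
	fmt.Println(choosePolicy([]string{"foregroundDeletion"})) // Foreground
	fmt.Println(choosePolicy(nil))                            // Background
}
```

这也解释了后台模式下为什么 rsA 会被“默认以 background 删除”：deployA 直接消失，rsA 上没有任何 finalizer，于是落入默认分支。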
方法论\n\n以上的流程，通过代码和实践进行验证。\n\n代码分析见上一篇。实践就是通过实验，主要做了以下观察：\n\n（1）看deployA的yaml发生了什么变化\n\n（2）增大kcm的日志等级，查看gc的日志\n\n（3）增大apiserver的日志等级，查看apiserver的处理\n\n#### 5.1 看deployA的yaml发生了什么变化\n\n```\n// -w 一直监控删除前后的变化\nroot@k8s-master:~/testyaml/hpa# kubectl get deploy zx-hpa -oyaml -w\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  annotations:\n    deployment.kubernetes.io/revision: \"1\"\n  creationTimestamp: \"2021-07-09T07:21:48Z\"\n  generation: 1\n  labels:\n    app: zx-hpa-test\n  name: zx-hpa\n  namespace: default\n  resourceVersion: \"6975175\"\n  selfLink: /apis/apps/v1/namespaces/default/deployments/zx-hpa\n  uid: 6ccbe990-e4d3-4ba1-b67f-56a9bfbd69a0\nspec:\n  progressDeadlineSeconds: 600\n  replicas: 2\n  revisionHistoryLimit: 10\n  selector:\n    matchLabels:\n      app: zx-hpa-test\n  strategy:\n    rollingUpdate:\n      maxSurge: 1\n      maxUnavailable: 25%\n    type: RollingUpdate\n  template:\n    metadata:\n      creationTimestamp: null\n      labels:\n        app: zx-hpa-test\n      name: zx-hpa-test\n    spec:\n      containers:\n      - command:\n        - sleep\n        - \"3600\"\n        image: busybox:latest\n        imagePullPolicy: IfNotPresent\n        name: busybox\n        resources: {}\n        terminationMessagePath: /dev/termination-log\n        terminationMessagePolicy: File\n      dnsPolicy: ClusterFirst\n      restartPolicy: Always\n      schedulerName: default-scheduler\n      securityContext: {}\n      terminationGracePeriodSeconds: 5\nstatus:\n  availableReplicas: 2\n  conditions:\n  - lastTransitionTime: \"2021-07-09T07:21:50Z\"\n    lastUpdateTime: \"2021-07-09T07:21:50Z\"\n    message: Deployment has minimum availability.\n    reason: MinimumReplicasAvailable\n    status: \"True\"\n    type: Available\n  - lastTransitionTime: \"2021-07-09T07:21:49Z\"\n    lastUpdateTime: \"2021-07-09T07:21:50Z\"\n    message: ReplicaSet \"zx-hpa-7b56cddd95\" has successfully progressed.\n    reason: NewReplicaSetAvailable\n    status: \"True\"\n 
   type: Progressing\n  observedGeneration: 1\n  readyReplicas: 2\n  replicas: 2\n  updatedReplicas: 2\n\n\n\n\n\n\n\n\n\n\n\n\n---\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  annotations:\n    deployment.kubernetes.io/revision: \"1\"\n  creationTimestamp: \"2021-07-09T07:21:48Z\"\n  generation: 1\n  labels:\n    app: zx-hpa-test\n  name: zx-hpa\n  namespace: default\n  resourceVersion: \"6975316\"\n  selfLink: /apis/apps/v1/namespaces/default/deployments/zx-hpa\n  uid: 6ccbe990-e4d3-4ba1-b67f-56a9bfbd69a0\nspec:\n  progressDeadlineSeconds: 600\n  replicas: 2\n  revisionHistoryLimit: 10\n  selector:\n    matchLabels:\n      app: zx-hpa-test\n  strategy:\n    rollingUpdate:\n      maxSurge: 1\n      maxUnavailable: 25%\n    type: RollingUpdate\n  template:\n    metadata:\n      creationTimestamp: null\n      labels:\n        app: zx-hpa-test\n      name: zx-hpa-test\n    spec:\n      containers:\n      - command:\n        - sleep\n        - \"3600\"\n        image: busybox:latest\n        imagePullPolicy: IfNotPresent\n        name: busybox\n        resources: {}\n        terminationMessagePath: /dev/termination-log\n        terminationMessagePolicy: File\n      dnsPolicy: ClusterFirst\n      restartPolicy: Always\n      schedulerName: default-scheduler\n      securityContext: {}\n      terminationGracePeriodSeconds: 5\nstatus:\n  availableReplicas: 2\n  conditions:\n  - lastTransitionTime: \"2021-07-09T07:21:50Z\"\n    lastUpdateTime: \"2021-07-09T07:21:50Z\"\n    message: Deployment has minimum availability.\n    reason: MinimumReplicasAvailable\n    status: \"True\"\n    type: Available\n  - lastTransitionTime: \"2021-07-09T07:21:49Z\"\n    lastUpdateTime: \"2021-07-09T07:21:50Z\"\n    message: ReplicaSet \"zx-hpa-7b56cddd95\" has successfully progressed.\n    reason: NewReplicaSetAvailable\n    status: \"True\"\n    type: Progressing\n  observedGeneration: 1\n  readyReplicas: 2\n  replicas: 2\n  updatedReplicas: 2\n```\n\n#### 5.2 
增大kcm的日志等级，查看gc的日志\n\n```\nI0709 15:17:45.089271    3183 resource_quota_monitor.go:354] QuotaMonitor process object: apps/v1, Resource=deployments, namespace kube-system, name kube-hpa, uid 639d5269-d73d-4964-a7de-d6f386c9c7e4, event type delete\nI0709 15:17:45.089320    3183 graph_builder.go:543] GraphBuilder process object: apps/v1/Deployment, namespace kube-system, name kube-hpa, uid 639d5269-d73d-4964-a7de-d6f386c9c7e4, event type delete\nI0709 15:17:45.089346    3183 garbagecollector.go:404] processing item [apps/v1/ReplicaSet, namespace: kube-system, name: kube-hpa-84c884f994, uid: e66e45c0-5695-4c93-82f1-067b20aa035f]\nI0709 15:17:45.089576    3183 deployment_controller.go:193] Deleting deployment kube-hpa\nI0709 15:17:45.089591    3183 deployment_controller.go:564] Started syncing deployment \"kube-system/kube-hpa\" (2021-07-09 15:17:45.089588305 +0800 CST m=+38.708727198)\nI0709 15:17:45.089611    3183 deployment_controller.go:575] Deployment kube-system/kube-hpa has been deleted\nI0709 15:17:45.089615    3183 deployment_controller.go:566] Finished syncing deployment \"kube-system/kube-hpa\" (24.606µs)\nI0709 15:17:45.093463    3183 garbagecollector.go:329] according to the absentOwnerCache, object e66e45c0-5695-4c93-82f1-067b20aa035f's owner apps/v1/Deployment, kube-hpa does not exist\nI0709 15:17:45.093480    3183 garbagecollector.go:455] classify references of [apps/v1/ReplicaSet, namespace: kube-system, name: kube-hpa-84c884f994, uid: e66e45c0-5695-4c93-82f1-067b20aa035f].\nsolid: []v1.OwnerReference(nil)\ndangling: []v1.OwnerReference{v1.OwnerReference{APIVersion:\"apps/v1\", Kind:\"Deployment\", Name:\"kube-hpa\", UID:\"639d5269-d73d-4964-a7de-d6f386c9c7e4\", Controller:(*bool)(0xc000ab3817), BlockOwnerDeletion:(*bool)(0xc000ab3818)}}\nwaitingForDependentsDeletion: []v1.OwnerReference(nil)\nI0709 15:17:45.093517    3183 garbagecollector.go:517] delete object [apps/v1/ReplicaSet, namespace: kube-system, name: kube-hpa-84c884f994, uid: 
e66e45c0-5695-4c93-82f1-067b20aa035f] with propagation policy Background\nI0709 15:17:45.107563    3183 resource_quota_monitor.go:354] QuotaMonitor process object: apps/v1, Resource=replicasets, namespace kube-system, name kube-hpa-84c884f994, uid e66e45c0-5695-4c93-82f1-067b20aa035f, event type delete\nI0709 15:17:45.107635    3183 replica_set.go:349] Deleting ReplicaSet \"kube-system/kube-hpa-84c884f994\"\nI0709 15:17:45.107687    3183 replica_set.go:658] ReplicaSet kube-system/kube-hpa-84c884f994 has been deleted\nI0709 15:17:45.107692    3183 replica_set.go:649] Finished syncing ReplicaSet \"kube-system/kube-hpa-84c884f994\" (16.069µs)\nI0709 15:17:45.107720    3183 graph_builder.go:543] GraphBuilder process object: apps/v1/ReplicaSet, namespace kube-system, name kube-hpa-84c884f994, uid e66e45c0-5695-4c93-82f1-067b20aa035f, event type delete\nI0709 15:17:45.107753    3183 garbagecollector.go:404] processing item [v1/Pod, namespace: kube-system, name: kube-hpa-84c884f994-7gwpz, uid: 9833c399-b139-4432-98f7-cec13158f804]\nI0709 15:17:45.111155    3183 garbagecollector.go:329] according to the absentOwnerCache, object 9833c399-b139-4432-98f7-cec13158f804's owner apps/v1/ReplicaSet, kube-hpa-84c884f994 does not exist\nI0709 15:17:45.111174    3183 garbagecollector.go:455] classify references of [v1/Pod, namespace: kube-system, name: kube-hpa-84c884f994-7gwpz, uid: 9833c399-b139-4432-98f7-cec13158f804].\nsolid: []v1.OwnerReference(nil)\ndangling: []v1.OwnerReference{v1.OwnerReference{APIVersion:\"apps/v1\", Kind:\"ReplicaSet\", Name:\"kube-hpa-84c884f994\", UID:\"e66e45c0-5695-4c93-82f1-067b20aa035f\", Controller:(*bool)(0xc000bde7bf), BlockOwnerDeletion:(*bool)(0xc000bde800)}}\nwaitingForDependentsDeletion: []v1.OwnerReference(nil)\nI0709 15:17:45.111213    3183 garbagecollector.go:517] delete object [v1/Pod, namespace: kube-system, name: kube-hpa-84c884f994-7gwpz, uid: 9833c399-b139-4432-98f7-cec13158f804] with propagation policy Background\nI0709 15:17:45.124112 
   3183 graph_builder.go:543] GraphBuilder process object: v1/Pod, namespace kube-system, name kube-hpa-84c884f994-7gwpz, uid 9833c399-b139-4432-98f7-cec13158f804, event type update\nI0709 15:17:45.124236    3183 endpoints_controller.go:385] About to update endpoints for service \"kube-system/kube-hpa\"\nI0709 15:17:45.124275    3183 endpoints_controller.go:420] Pod is being deleted kube-system/kube-hpa-84c884f994-7gwpz\nI0709 15:17:45.124293    3183 endpoints_controller.go:512] Update endpoints for kube-system/kube-hpa, ready: 0 not ready: 0\nI0709 15:17:45.124481    3183 disruption.go:394] updatePod called on pod \"kube-hpa-84c884f994-7gwpz\"\nI0709 15:17:45.124523    3183 disruption.go:457] No PodDisruptionBudgets found for pod kube-hpa-84c884f994-7gwpz, PodDisruptionBudget controller will avoid syncing.\nI0709 15:17:45.124527    3183 disruption.go:397] No matching pdb for pod \"kube-hpa-84c884f994-7gwpz\"\nI0709 15:17:45.131011    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-hpa, uid 17a8623b-2bd6-4253-b7cd-88a7af615220, event type update\nI0709 15:17:45.132261    3183 endpoints_controller.go:353] Finished syncing service \"kube-system/kube-hpa\" endpoints. 
(8.020508ms)\nI0709 15:17:45.132951    3183 graph_builder.go:543] GraphBuilder process object: events.k8s.io/v1beta1/Event, namespace kube-system, name kube-hpa-84c884f994-7gwpz.16900e30134087ab, uid 7c55e936-801b-4eb9-a828-085d92983134, event type add\nI0709 15:17:45.310041    3183 graph_builder.go:543] GraphBuilder process object: apiregistration.k8s.io/v1/APIService, namespace , name v1beta1.custom.metrics.k8s.io, uid 71617a10-8136-4a2a-af65-d64bcd6c78c3, event type update\nI0709 15:17:45.660593    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-scheduler, uid d1e00c1e-7803-4c0f-ab8a-b3eeb0644879, event type update\nI0709 15:17:45.668379    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-scheduler, uid 9aed1771-031a-4fce-826a-11d98ee81740, event type update\nI0709 15:17:46.143691    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-controller-manager, uid 5d530096-9b10-45bb-a11e-43f1f8733fa5, event type update\nI0709 15:17:46.143962    3183 graph_builder.go:543] GraphBuilder process object: v1/Pod, namespace kube-system, name kube-hpa-84c884f994-7gwpz, uid 9833c399-b139-4432-98f7-cec13158f804, event type update\nI0709 15:17:46.144055    3183 endpoints_controller.go:385] About to update endpoints for service \"kube-system/kube-hpa\"\nI0709 15:17:46.144095    3183 endpoints_controller.go:420] Pod is being deleted kube-system/kube-hpa-84c884f994-7gwpz\nI0709 15:17:46.144126    3183 endpoints_controller.go:512] Update endpoints for kube-system/kube-hpa, ready: 0 not ready: 0\nI0709 15:17:46.144329    3183 disruption.go:394] updatePod called on pod \"kube-hpa-84c884f994-7gwpz\"\nI0709 15:17:46.144347    3183 disruption.go:457] No PodDisruptionBudgets found for pod kube-hpa-84c884f994-7gwpz, PodDisruptionBudget controller will avoid syncing.\nI0709 15:17:46.144350    3183 disruption.go:397] 
No matching pdb for pod \"kube-hpa-84c884f994-7gwpz\"\nI0709 15:17:46.144361    3183 pvc_protection_controller.go:342] Enqueuing PVCs for Pod kube-system/kube-hpa-84c884f994-7gwpz (UID=9833c399-b139-4432-98f7-cec13158f804)\nI0709 15:17:46.150410    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-hpa, uid 17a8623b-2bd6-4253-b7cd-88a7af615220, event type update\nI0709 15:17:46.150749    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-controller-manager, uid 036d9292-1152-4f8c-8a85-0879c5424cfb, event type update\nI0709 15:17:46.151231    3183 leaderelection.go:283] successfully renewed lease kube-system/kube-controller-manager\nI0709 15:17:46.151321    3183 endpoints_controller.go:353] Finished syncing service \"kube-system/kube-hpa\" endpoints. (7.269404ms)\n\n\n\nI0709 15:17:46.978486    3183 cronjob_controller.go:129] Found 4 jobs\nI0709 15:17:46.978503    3183 cronjob_controller.go:135] Found 1 groups\nI0709 15:17:46.982118    3183 event.go:281] Event(v1.ObjectReference{Kind:\"CronJob\", Namespace:\"default\", Name:\"hello\", UID:\"b9648456-0b0a-44a4-b4c7-4c1db9be4085\", APIVersion:\"batch/v1beta1\", ResourceVersion:\"6974347\", FieldPath:\"\"}): type: 'Normal' reason: 'SawCompletedJob' Saw completed job: hello-1625815020, status: Complete\nI0709 15:17:46.986941    3183 graph_builder.go:543] GraphBuilder process object: batch/v1beta1/CronJob, namespace default, name hello, uid b9648456-0b0a-44a4-b4c7-4c1db9be4085, event type update\nI0709 15:17:46.987073    3183 cronjob_controller.go:278] No unmet start times for default/hello\nI0709 15:17:46.987091    3183 cronjob_controller.go:203] Cleaning up 1/4 jobs from default/hello\nI0709 15:17:46.987096    3183 cronjob_controller.go:207] Removing job hello-1625814840 from default/hello\nI0709 15:17:46.987694    3183 graph_builder.go:543] GraphBuilder process object: events.k8s.io/v1beta1/Event, 
namespace default, name hello.16900e3081ed9288, uid 21dc6f32-9c3b-479a-8a69-c71946be3b7a, event type add\nI0709 15:17:46.998396    3183 job_controller.go:452] Job has been deleted: default/hello-1625814840\nI0709 15:17:46.998407    3183 job_controller.go:439] Finished syncing job \"default/hello-1625814840\" (42.057µs)\nI0709 15:17:46.998436    3183 graph_builder.go:543] GraphBuilder process object: batch/v1/Job, namespace default, name hello-1625814840, uid ce65b016-b3c4-4a65-b01d-f81381fca20a, event type delete\nI0709 15:17:46.998463    3183 garbagecollector.go:404] processing item [v1/Pod, namespace: default, name: hello-1625814840-9tmbk, uid: 7aabf04b-31c5-4602-af5e-87a7e0079d1a]\nI0709 15:17:46.998715    3183 resource_quota_monitor.go:354] QuotaMonitor process object: batch/v1, Resource=jobs, namespace default, name hello-1625814840, uid ce65b016-b3c4-4a65-b01d-f81381fca20a, event type delete\nI0709 15:17:46.999144    3183 event.go:281] Event(v1.ObjectReference{Kind:\"CronJob\", Namespace:\"default\", Name:\"hello\", UID:\"b9648456-0b0a-44a4-b4c7-4c1db9be4085\", APIVersion:\"batch/v1beta1\", ResourceVersion:\"6974464\", FieldPath:\"\"}): type: 'Normal' reason: 'SuccessfulDelete' Deleted job hello-1625814840\nI0709 15:17:47.002267    3183 garbagecollector.go:329] according to the absentOwnerCache, object 7aabf04b-31c5-4602-af5e-87a7e0079d1a's owner batch/v1/Job, hello-1625814840 does not exist\nI0709 15:17:47.002298    3183 garbagecollector.go:455] classify references of [v1/Pod, namespace: default, name: hello-1625814840-9tmbk, uid: 7aabf04b-31c5-4602-af5e-87a7e0079d1a].\nsolid: []v1.OwnerReference(nil)\ndangling: []v1.OwnerReference{v1.OwnerReference{APIVersion:\"batch/v1\", Kind:\"Job\", Name:\"hello-1625814840\", UID:\"ce65b016-b3c4-4a65-b01d-f81381fca20a\", Controller:(*bool)(0xc000bdf480), BlockOwnerDeletion:(*bool)(0xc000bdf481)}}\nwaitingForDependentsDeletion: []v1.OwnerReference(nil)\nI0709 15:17:47.002325    3183 garbagecollector.go:517] delete object 
[v1/Pod, namespace: default, name: hello-1625814840-9tmbk, uid: 7aabf04b-31c5-4602-af5e-87a7e0079d1a] with propagation policy Background\nI0709 15:17:47.005713    3183 graph_builder.go:543] GraphBuilder process object: events.k8s.io/v1beta1/Event, namespace default, name hello.16900e3082f15365, uid 903283d1-63da-4ba7-b200-69d6a30a1d5c, event type add\nI0709 15:17:47.011868    3183 graph_builder.go:543] GraphBuilder process object: v1/Pod, namespace default, name hello-1625814840-9tmbk, uid 7aabf04b-31c5-4602-af5e-87a7e0079d1a, event type update\nI0709 15:17:47.011938    3183 disruption.go:394] updatePod called on pod \"hello-1625814840-9tmbk\"\nI0709 15:17:47.011960    3183 disruption.go:457] No PodDisruptionBudgets found for pod hello-1625814840-9tmbk, PodDisruptionBudget controller will avoid syncing.\nI0709 15:17:47.011964    3183 disruption.go:397] No matching pdb for pod \"hello-1625814840-9tmbk\"\nI0709 15:17:47.011977    3183 pvc_protection_controller.go:342] Enqueuing PVCs for Pod default/hello-1625814840-9tmbk (UID=7aabf04b-31c5-4602-af5e-87a7e0079d1a)\nI0709 15:17:47.026287    3183 graph_builder.go:543] GraphBuilder process object: v1/Pod, namespace default, name hello-1625814840-9tmbk, uid 7aabf04b-31c5-4602-af5e-87a7e0079d1a, event type delete\nI0709 15:17:47.026312    3183 deployment_controller.go:356] Pod hello-1625814840-9tmbk deleted.\nI0709 15:17:47.026350    3183 taint_manager.go:383] Noticed pod deletion: types.NamespacedName{Namespace:\"default\", Name:\"hello-1625814840-9tmbk\"}\nI0709 15:17:47.026389    3183 disruption.go:423] deletePod called on pod \"hello-1625814840-9tmbk\"\nI0709 15:17:47.026409    3183 disruption.go:457] No PodDisruptionBudgets found for pod hello-1625814840-9tmbk, PodDisruptionBudget controller will avoid syncing.\nI0709 15:17:47.026413    3183 disruption.go:426] No matching pdb for pod \"hello-1625814840-9tmbk\"\nI0709 15:17:47.026425    3183 pvc_protection_controller.go:342] Enqueuing PVCs for Pod 
default/hello-1625814840-9tmbk (UID=7aabf04b-31c5-4602-af5e-87a7e0079d1a)\nI0709 15:17:47.026449    3183 resource_quota_monitor.go:354] QuotaMonitor process object: /v1, Resource=pods, namespace default, name hello-1625814840-9tmbk, uid 7aabf04b-31c5-4602-af5e-87a7e0079d1a, event type delete\nI0709 15:17:47.164797    3183 graph_builder.go:543] GraphBuilder process object: v1/Pod, namespace kube-system, name kube-hpa-84c884f994-7gwpz, uid 9833c399-b139-4432-98f7-cec13158f804, event type update\nI0709 15:17:47.164886    3183 endpoints_controller.go:385] About to update endpoints for service \"kube-system/kube-hpa\"\nI0709 15:17:47.164929    3183 endpoints_controller.go:420] Pod is being deleted kube-system/kube-hpa-84c884f994-7gwpz\nI0709 15:17:47.164945    3183 endpoints_controller.go:512] Update endpoints for kube-system/kube-hpa, ready: 0 not ready: 0\nI0709 15:17:47.165093    3183 disruption.go:394] updatePod called on pod \"kube-hpa-84c884f994-7gwpz\"\nI0709 15:17:47.165108    3183 disruption.go:457] No PodDisruptionBudgets found for pod kube-hpa-84c884f994-7gwpz, PodDisruptionBudget controller will avoid syncing.\nI0709 15:17:47.165111    3183 disruption.go:397] No matching pdb for pod \"kube-hpa-84c884f994-7gwpz\"\nI0709 15:17:47.165122    3183 pvc_protection_controller.go:342] Enqueuing PVCs for Pod kube-system/kube-hpa-84c884f994-7gwpz (UID=9833c399-b139-4432-98f7-cec13158f804)\nI0709 15:17:47.165142    3183 resource_quota_monitor.go:354] QuotaMonitor process object: /v1, Resource=pods, namespace kube-system, name kube-hpa-84c884f994-7gwpz, uid 9833c399-b139-4432-98f7-cec13158f804, event type update\nI0709 15:17:47.169973    3183 endpoints_controller.go:353] Finished syncing service \"kube-system/kube-hpa\" endpoints. 
(5.082912ms)\nI0709 15:17:47.172446    3183 graph_builder.go:543] GraphBuilder process object: v1/Pod, namespace kube-system, name kube-hpa-84c884f994-7gwpz, uid 9833c399-b139-4432-98f7-cec13158f804, event type delete\nI0709 15:17:47.172467    3183 deployment_controller.go:356] Pod kube-hpa-84c884f994-7gwpz deleted.\nI0709 15:17:47.172474    3183 deployment_controller.go:424] Cannot get replicaset \"kube-hpa-84c884f994\" for pod \"kube-hpa-84c884f994-7gwpz\": replicaset.apps \"kube-hpa-84c884f994\" not found\nI0709 15:17:47.172507    3183 taint_manager.go:383] Noticed pod deletion: types.NamespacedName{Namespace:\"kube-system\", Name:\"kube-hpa-84c884f994-7gwpz\"}\nI0709 15:17:47.172564    3183 endpoints_controller.go:385] About to update endpoints for service \"kube-system/kube-hpa\"\nI0709 15:17:47.172614    3183 endpoints_controller.go:512] Update endpoints for kube-system/kube-hpa, ready: 0 not ready: 0\nI0709 15:17:47.172779    3183 disruption.go:423] deletePod called on pod \"kube-hpa-84c884f994-7gwpz\"\nI0709 15:17:47.172796    3183 disruption.go:457] No PodDisruptionBudgets found for pod kube-hpa-84c884f994-7gwpz, PodDisruptionBudget controller will avoid syncing.\nI0709 15:17:47.172799    3183 disruption.go:426] No matching pdb for pod \"kube-hpa-84c884f994-7gwpz\"\nI0709 15:17:47.172808    3183 pvc_protection_controller.go:342] Enqueuing PVCs for Pod kube-system/kube-hpa-84c884f994-7gwpz (UID=9833c399-b139-4432-98f7-cec13158f804)\nI0709 15:17:47.172843    3183 resource_quota_monitor.go:354] QuotaMonitor process object: /v1, Resource=pods, namespace kube-system, name kube-hpa-84c884f994-7gwpz, uid 9833c399-b139-4432-98f7-cec13158f804, event type delete\nI0709 15:17:47.173978    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-hpa, uid 17a8623b-2bd6-4253-b7cd-88a7af615220, event type update\nI0709 15:17:47.178093    3183 endpoints_controller.go:353] Finished syncing service \"kube-system/kube-hpa\" 
endpoints. (5.525822ms)\nI0709 15:17:47.178107    3183 endpoints_controller.go:340] Error syncing endpoints for service \"kube-system/kube-hpa\", retrying. Error: Operation cannot be fulfilled on endpoints \"kube-hpa\": the object has been modified; please apply your changes to the latest version and try again\nI0709 15:17:47.178372    3183 event.go:281] Event(v1.ObjectReference{Kind:\"Endpoints\", Namespace:\"kube-system\", Name:\"kube-hpa\", UID:\"17a8623b-2bd6-4253-b7cd-88a7af615220\", APIVersion:\"v1\", ResourceVersion:\"6974462\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedToUpdateEndpoint' Failed to update endpoint kube-system/kube-hpa: Operation cannot be fulfilled on endpoints \"kube-hpa\": the object has been modified; please apply your changes to the latest version and try again\nI0709 15:17:47.182381    3183 graph_builder.go:543] GraphBuilder process object: events.k8s.io/v1beta1/Event, namespace kube-system, name kube-hpa.16900e308da0917a, uid d136415c-0a51-40e2-b1ba-f63587af89a6, event type add\nI0709 15:17:47.183280    3183 endpoints_controller.go:385] About to update endpoints for service \"kube-system/kube-hpa\"\nI0709 15:17:47.183318    3183 endpoints_controller.go:512] Update endpoints for kube-system/kube-hpa, ready: 0 not ready: 0\nI0709 15:17:47.186538    3183 endpoints_controller.go:353] Finished syncing service \"kube-system/kube-hpa\" endpoints. 
(3.266428ms)\nI0709 15:17:47.679672    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-scheduler, uid d1e00c1e-7803-4c0f-ab8a-b3eeb0644879, event type update\nI0709 15:17:47.686259    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-scheduler, uid 9aed1771-031a-4fce-826a-11d98ee81740, event type update\nI0709 15:17:48.166708    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-controller-manager, uid 5d530096-9b10-45bb-a11e-43f1f8733fa5, event type update\nI0709 15:17:48.175956    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-controller-manager, uid 036d9292-1152-4f8c-8a85-0879c5424cfb, event type update\nI0709 15:17:48.176356    3183 leaderelection.go:283] successfully renewed lease kube-system/kube-controller-manager\nI0709 15:17:49.277193    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-node-lease, name 192.168.0.5, uid 71ce7519-2999-4dbf-9118-227e5cb6d9ef, event type update\nI0709 15:17:49.701416    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-scheduler, uid d1e00c1e-7803-4c0f-ab8a-b3eeb0644879, event type update\nI0709 15:17:49.721102    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-scheduler, uid 9aed1771-031a-4fce-826a-11d98ee81740, event type update\nI0709 15:17:50.189139    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-controller-manager, uid 5d530096-9b10-45bb-a11e-43f1f8733fa5, event type update\nI0709 15:17:50.199890    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-controller-manager, uid 
036d9292-1152-4f8c-8a85-0879c5424cfb, event type update\nI0709 15:17:50.200028    3183 leaderelection.go:283] successfully renewed lease kube-system/kube-controller-manager\nI0709 15:17:51.046632    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-node-lease, name 192.168.0.4, uid a6c1c902-8d7f-442e-89d2-407f1677247e, event type update\nI0709 15:17:51.734474    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-scheduler, uid d1e00c1e-7803-4c0f-ab8a-b3eeb0644879, event type update\nI0709 15:17:51.742571    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-scheduler, uid 9aed1771-031a-4fce-826a-11d98ee81740, event type update\nI0709 15:17:51.949675    3183 reflector.go:268] k8s.io/client-go/informers/factory.go:135: forcing resync\nE0709 15:17:51.960736    3183 horizontal.go:214] failed to query scale subresource for Deployment/default/zx-hpa: deployments/scale.apps \"zx-hpa\" not found\nI0709 15:17:51.961135    3183 event.go:281] Event(v1.ObjectReference{Kind:\"HorizontalPodAutoscaler\", Namespace:\"default\", Name:\"nginx-hpa-zx-1\", UID:\"d49c5146-c5ef-4ac8-8039-c9b15f094360\", APIVersion:\"autoscaling/v2beta2\", ResourceVersion:\"4763928\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedGetScale' deployments/scale.apps \"zx-hpa\" not found\nI0709 15:17:51.965206    3183 graph_builder.go:543] GraphBuilder process object: events.k8s.io/v1beta1/Event, namespace default, name nginx-hpa-zx-1.16900e31aab074d5, uid 3c9d8d3b-d63f-463c-8f8f-b8d2ba3f4fb3, event type add\nI0709 15:17:52.215733    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-controller-manager, uid 5d530096-9b10-45bb-a11e-43f1f8733fa5, event type update\nI0709 15:17:52.224070    3183 leaderelection.go:283] successfully renewed lease kube-system/kube-controller-manager\nI0709 
15:17:52.224234    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-controller-manager, uid 036d9292-1152-4f8c-8a85-0879c5424cfb, event type update\nI0709 15:17:52.461003    3183 pv_controller_base.go:514] resyncing PV controller\nI0709 15:17:53.755870    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-scheduler, uid d1e00c1e-7803-4c0f-ab8a-b3eeb0644879, event type update\nI0709 15:17:53.766095    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-scheduler, uid 9aed1771-031a-4fce-826a-11d98ee81740, event type update\nI0709 15:17:53.886970    3183 discovery.go:214] Invalidating discovery information\nI0709 15:17:54.236384    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-controller-manager, uid 5d530096-9b10-45bb-a11e-43f1f8733fa5, event type update\nI0709 15:17:54.244313    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-controller-manager, uid 036d9292-1152-4f8c-8a85-0879c5424cfb, event type update\nI0709 15:17:54.244924    3183 leaderelection.go:283] successfully renewed lease kube-system/kube-controller-manager\nI0709 15:17:55.778133    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-scheduler, uid d1e00c1e-7803-4c0f-ab8a-b3eeb0644879, event type update\nI0709 15:17:55.785242    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-scheduler, uid 9aed1771-031a-4fce-826a-11d98ee81740, event type update\nI0709 15:17:56.264037    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-controller-manager, uid 5d530096-9b10-45bb-a11e-43f1f8733fa5, event type update\nI0709 15:17:56.271400  
  3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-controller-manager, uid 036d9292-1152-4f8c-8a85-0879c5424cfb, event type update\nI0709 15:17:56.271774    3183 leaderelection.go:283] successfully renewed lease kube-system/kube-controller-manager\nI0709 15:17:57.011460    3183 cronjob_controller.go:129] Found 3 jobs\nI0709 15:17:57.011484    3183 cronjob_controller.go:135] Found 1 groups\nI0709 15:17:57.018598    3183 cronjob_controller.go:278] No unmet start times for default/hello\nI0709 15:17:57.436623    3183 gc_controller.go:163] GC'ing orphaned\nI0709 15:17:57.436642    3183 gc_controller.go:226] GC'ing unscheduled pods which are terminating.\nI0709 15:17:57.799012    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-scheduler, uid d1e00c1e-7803-4c0f-ab8a-b3eeb0644879, event type update\nI0709 15:17:57.807268    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-scheduler, uid 9aed1771-031a-4fce-826a-11d98ee81740, event type update\nI0709 15:17:58.282260    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-controller-manager, uid 5d530096-9b10-45bb-a11e-43f1f8733fa5, event type update\nI0709 15:17:58.288233    3183 leaderelection.go:283] successfully renewed lease kube-system/kube-controller-manager\nI0709 15:17:58.288746    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-controller-manager, uid 036d9292-1152-4f8c-8a85-0879c5424cfb, event type update\nI0709 15:17:59.286621    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-node-lease, name 192.168.0.5, uid 71ce7519-2999-4dbf-9118-227e5cb6d9ef, event type update\nI0709 15:17:59.819587    3183 graph_builder.go:543] GraphBuilder process object: 
v1/Endpoints, namespace kube-system, name kube-scheduler, uid d1e00c1e-7803-4c0f-ab8a-b3eeb0644879, event type update\nI0709 15:17:59.827855    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-scheduler, uid 9aed1771-031a-4fce-826a-11d98ee81740, event type update\nI0709 15:18:00.301289    3183 graph_builder.go:543] GraphBuilder process object: v1/Endpoints, namespace kube-system, name kube-controller-manager, uid 5d530096-9b10-45bb-a11e-43f1f8733fa5, event type update\nI0709 15:18:00.310096    3183 leaderelection.go:283] successfully renewed lease kube-system/kube-controller-manager\nI0709 15:18:00.310445    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-system, name kube-controller-manager, uid 036d9292-1152-4f8c-8a85-0879c5424cfb, event type update\nI0709 15:18:01.054003    3183 graph_builder.go:543] GraphBuilder process object: coordination.k8s.io/v1/Lease, namespace kube-node-lease, name 192.168.0.4, uid a6c1c902-8d7f-442e-89d2-407f1677247e, event type update\n```\n\n<br>\n\n#### 5.3 增大apiserver的日志等级，查看apiserver的处理\n\n日志等级至少要开到 5（--v=5），才能看到下面的请求处理日志\n\n```\nI0709 16:43:48.411395   28901 handler.go:143] kube-apiserver: PUT \"/apis/apps/v1/namespaces/default/deployments/zx-hpa/status\" satisfied by gorestful with webservice /apis/apps/v1\nI0709 16:43:48.413431   28901 httplog.go:90] GET /apis/apps/v1/namespaces/default/deployments/zx-hpa: (2.677854ms) 200 [kube-controller-manager/v1.17.4 (linux/amd64) kubernetes/8d8aa39/generic-garbage-collector 192.168.0.4:48978]\nI0709 16:43:48.414076   28901 handler.go:153] kube-aggregator: GET \"/apis/apps/v1/namespaces/default/deployments/zx-hpa\" satisfied by nonGoRestful\nI0709 16:43:48.414089   28901 pathrecorder.go:247] kube-aggregator: \"/apis/apps/v1/namespaces/default/deployments/zx-hpa\" satisfied by prefix /apis/apps/v1/\nI0709 16:43:48.414119   28901 handler.go:143] kube-apiserver: GET 
\"/apis/apps/v1/namespaces/default/deployments/zx-hpa\" satisfied by gorestful with webservice /apis/apps/v1\nI0709 16:43:48.418663   28901 httplog.go:90] PUT /apis/apps/v1/namespaces/default/deployments/zx-hpa/status: (7.370204ms) 200 [kube-controller-manager/v1.17.4 (linux/amd64) kubernetes/8d8aa39/deployment-controller 192.168.0.4:49000]\nI0709 16:43:48.420303   28901 httplog.go:90] GET /apis/apps/v1/namespaces/default/deployments/zx-hpa: (6.309997ms) 200 [kube-controller-manager/v1.17.4 (linux/amd64) kubernetes/8d8aa39/generic-garbage-collector 192.168.0.4:48978]\nI0709 16:43:48.420817   28901 handler.go:153] kube-aggregator: PATCH \"/apis/apps/v1/namespaces/default/deployments/zx-hpa\" satisfied by nonGoRestful\nI0709 16:43:48.420828   28901 pathrecorder.go:247] kube-aggregator: \"/apis/apps/v1/namespaces/default/deployments/zx-hpa\" satisfied by prefix /apis/apps/v1/\nI0709 16:43:48.420855   28901 handler.go:143] kube-apiserver: PATCH \"/apis/apps/v1/namespaces/default/deployments/zx-hpa\" satisfied by gorestful with webservice /apis/apps/v1\nI0709 16:43:48.425221   28901 store.go:428] going to delete zx-hpa from registry, triggered by update\n```\n\n"
  },
  {
    "path": "k8s/kcm/4-hpa-自定义metric server.md",
    "content": "Table of Contents\n=================\n\n  * [1. custom-metrics-apiserver简介](#1-custom-metrics-apiserver简介)\n  * [2. 定制自己的metric server](#2-定制自己的metric-server)\n     * [2.1 代码部署和编译](#21-代码部署和编译)\n     * [2.2 创建 Sv and APIService](#22-创建-sv-and-apiservice)\n     * [2.3 system:anonymous授权](#23-systemanonymous授权)\n  * [3. 创建hpa验证是否成功](#3-创建hpa验证是否成功)\n  * [4. 追踪整个过程](#4-追踪整个过程)\n  * [5. 总结](#5-总结)\n\n**本章重点：** 如何基于 custom-metrics-apiserver 项目，打造自己的 metric server\n\n### 1. custom-metrics-apiserver简介\n\n项目地址：https://github.com/kubernetes-sigs/custom-metrics-apiserver/tree/master\n\n**自定义 metric server，具体来说需要做以下几件事情：**\n\n（1）实现 custom-metrics-apiserver 的三个接口，如下：\n\n```\ntype CustomMetricsProvider interface {\n    // 定义 metric，例如 pod_cpu_used_1m\n    ListAllMetrics() []CustomMetricInfo\n\n    // 如何根据 metric 的信息，得到具体的值\n    GetMetricByName(name types.NamespacedName, info CustomMetricInfo) (*custom_metrics.MetricValue, error)\n\n    // 如何根据 metric selector 的信息，得到具体的值\n    GetMetricBySelector(namespace string, selector labels.Selector, info CustomMetricInfo) (*custom_metrics.MetricValueList, error)\n}\n```\n\nGetMetricBySelector 和 GetMetricByName 在 reststorage.go 中被使用。\n\nhttps://github.com/kubernetes-sigs/custom-metrics-apiserver/blob/master/pkg/registry/custom_metrics/reststorage.go\n\nRESTful 接口在 installer.go 中被定义。\n\nhttps://github.com/kubernetes-sigs/custom-metrics-apiserver/blob/master/pkg/apiserver/installer/installer.go\n\n**总的来说，可以认为：**\n\n（1）基于 custom-metrics-apiserver 这个项目，你只要实现上述三个接口就行，其他的事情这个包在你 new provider 的时候都自动完成了。\n\n（2）ListAllMetrics 注册了所有的 metric，让 apiserver 知道有哪些自定义 metric。\n\n（3）GetMetricByName、GetMetricBySelector 都是返回具体的 metric 数据。\n\n（4）一般 apiserver 调用的都是 GetMetricBySelector，因为 hpa 的对象基本都是 deployment；GetMetricBySelector 会循环调用 GetMetricByName，取得 deployment 下所有 pod 的 metric 信息。\n\n<br>\n\n### 2. 
定制自己的metric server\n\n#### 2.1 代码部署和编译\n\n这里我做了如下的修改：对于 metric server 而言，无论访问什么 metric，都返回 10。\n\n```\nfunc (p *monitorProvider) GetMetricByName(\n\tname types.NamespacedName,\n\tinfo provider.CustomMetricInfo,\n\tmetricSelector labels.Selector,\n) (*custom_metrics.MetricValue, error) {\n\tref, err := helpers.ReferenceFor(p.mapper, name, info)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\treturn &custom_metrics.MetricValue{\n\t\tDescribedObject: ref,\n\t\t// MetricName:      info.Metric,\n\t\tMetric: custom_metrics.MetricIdentifier{\n\t\t\tName: info.Metric,\n\t\t},\n\t\tTimestamp: metav1.Time{time.Unix(int64(10), 0)},\n\t\tValue:     *resource.NewMilliQuantity(int64(10*1000.0), resource.DecimalSI),\n\t}, nil\n}\n```\n\n更详细的可以参考我的 github 项目。\n\n<br>\n\n编译生成自己的镜像：zoux/hpa:v1。然后创建如下的 deployment。\n\n```\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  labels:\n    app: kube-hpa\n  name: kube-hpa\n  namespace: kube-system\nspec:\n  replicas: 1\n  selector:\n    matchLabels:\n      app: kube-hpa\n  template:\n    metadata:\n      labels:\n        app: kube-hpa\n      name: kube-hpa\n    spec:\n      hostNetwork: true\n      containers:\n      - name: kube-hpa\n        image: zoux/hpa:v1\n        imagePullPolicy: IfNotPresent\n        command:\n        - /metric-server\n        args:\n        - --master-url=XXX\n        - --kube-config=/pkc/config\n        - --tls-private-key-file=/pkc/server-key.pem\n        - --secure-port=9997\n        - --v=10\n        ports:\n        - containerPort: 9997\n        resources:\n          limits:\n            cpu: 2\n            memory: 2048Mi\n          requests:\n            cpu: 0.5\n            memory: 500Mi\n        volumeMounts:\n        - name: pkc\n          mountPath: /pkc\n          readOnly: true\n      volumes:\n      - name: pkc\n        hostPath:\n          path: /opt/kubernetes/ssl\n```\n\n<br>\n\n验证部署成功：\n\n```\nroot@k8s-master:~/testyaml/hpa# kubectl get pod -n kube-system -o wide\nNAME                        READY   
STATUS    RESTARTS   AGE     IP            NODE          NOMINATED NODE   READINESS GATES\nkube-hpa-84c884f994-gd5fl   1/1     Running   0          3d13h   192.168.0.5   192.168.0.5   <none>           <none>\n```\n\n#### 2.2 创建 Sv and APIService\n\n上面虽然部署成功了，但是 apiserver 还是访问不到：\n\n```\nk8s-master:~/testyaml/hpa# kubectl get --raw \"/apis/custom.metrics.k8s.io/v1\"\nError from server (NotFound): the server could not find the requested resource\n```\n\n原因在于，apiserver 不知道如何找到 kube-hpa-84c884f994-gd5fl 这个 pod 进行访问，所以需要创建下面的 Service 和 APIService。\n\n```\nroot@k8s-master:~/testyaml/hpa# cat tls.yaml \napiVersion: v1\nkind: Service\nmetadata:\n  name: kube-hpa\n  namespace: kube-system\nspec:\n  clusterIP: None\n  ports:\n  - name: https-hpa-dont-edit-it\n    port: 9997\n    targetPort: 9997\n  selector:\n    app: kube-hpa\n---\napiVersion: apiregistration.k8s.io/v1beta1\nkind: APIService\nmetadata:\n  name: v1beta1.custom.metrics.k8s.io\nspec:\n  service:\n    name: kube-hpa\n    namespace: kube-system\n    port: 9997\n  group: custom.metrics.k8s.io\n  version: v1beta1\n  insecureSkipTLSVerify: true\n  groupPriorityMinimum: 100\n  versionPriority: 100\n```\n\n**创建完成，验证是否成功：**\n\n```\nroot@k8s-master:~/testyaml/hpa# kubectl get --raw \"/apis/custom.metrics.k8s.io/v1beta1\"\n{\"kind\":\"APIResourceList\",\"apiVersion\":\"v1\",\"groupVersion\":\"custom.metrics.k8s.io/v1beta1\",\"resources\":[{\"name\":\"pods/pod_cpu_used_1m\",\"singularName\":\"\",\"namespaced\":true,\"kind\":\"MetricValueList\",\"verbs\":[\"get\"]},{\"name\":\"pods/pod_cpu_used_5m\",\"singularName\":\"\",\"namespaced\":true,\"kind\":\"MetricValueList\",\"verbs\":[\"get\"]},{\"name\":\"pods/container_cpu_used_1m\",\"singularName\":\"\",\"namespaced\":true,\"kind\":\"MetricValueList\",\"verbs\":...}\n```\n\n如果报错，查看该 APIService 哪里报错了：\n\n```\nroot@k8s-master:~/testyaml/hpa# kubectl get APIService v1beta1.custom.metrics.k8s.io  -oyaml\napiVersion: apiregistration.k8s.io/v1\nkind: APIService\nmetadata:\n  creationTimestamp: 
\"2021-06-13T13:22:01Z\"\n  name: v1beta1.custom.metrics.k8s.io\n  resourceVersion: \"1590641\"\n  selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v1beta1.custom.metrics.k8s.io\n  uid: d488d6a8-7e79-4311-a1e9-0b12e4591375\nspec:\n  group: custom.metrics.k8s.io\n  groupPriorityMinimum: 100\n  insecureSkipTLSVerify: true\n  service:\n    name: kube-hpa\n    namespace: kube-system\n    port: 9997\n  version: v1beta1\n  versionPriority: 100\nstatus:\n  conditions:\n  - lastTransitionTime: \"2021-06-13T13:42:17Z\"\n    message: all checks passed\n    reason: Passed\n    status: \"True\"\n    type: Available\n```\n\n或者直接curl访问：\n\n```\ncurl -k  https://nodeip:9997/apis/custom.metrics.k8s.io/v1beta1\n```\n\n<br>\n\n####  2.3 system:anonymous授权\n\n如果没有出现类似问题，这一步直接跳过。\n\n有时会出现如下的错误，或者上述的APIService没有运行成功，都是因为system:anonymous权限不够：\n\n```\nannotations:\n    autoscaling.alpha.kubernetes.io/conditions: '[{\"type\":\"AbleToScale\",\"status\":\"True\",\"lastTransitionTime\":\"2021-06-13T13:33:12Z\",\"reason\":\"SucceededGetScale\",\"message\":\"the\n      HPA controller was able to get the target''s current scale\"},{\"type\":\"ScalingActive\",\"status\":\"False\",\"lastTransitionTime\":\"2021-06-13T13:33:12Z\",\"reason\":\"FailedGetPodsMetric\",\"message\":\"the\n      HPA was unable to compute the replica count: unable to get metric pod_cpu_usage_for_limit_1m:\n      unable to fetch metrics from custom metrics API: pods.custom.metrics.k8s.io\n      \\\"*\\\" is forbidden: User \\\"system:anonymous\\\" cannot get resource \\\"pods/pod_cpu_usage_for_limit_1m\\\"\n      in API group \\\"custom.metrics.k8s.io\\\" in the namespace \\\"default\\\"\"}]'\n    autoscaling.alpha.kubernetes.io/metrics: '[{\"type\":\"Pods\",\"pods\":{\"metricName\":\"pod_cpu_usage_for_limit_1m\",\"targetAverageValue\":\"60\"}}]'\n    metric-containerName: zx-hpa\n  creationTimestamp: \"2021-06-13T13:32:56Z\"\n  name: nginx-hpa-zx-1\n  namespace: default\n  resourceVersion: \"1589301\"\n  selfLink: 
/apis/autoscaling/v1/namespaces/default/horizontalpodautoscalers/\n```\n\n这时可以直接绑定clusterrole，参考 https://github.com/kubernetes-sigs/metrics-server/issues/81\n\n我这里是直接给了 cluster-admin 权限，实际情况可以按照需求赋权。\n\n```\nkubectl create clusterrolebinding anonymous-role-binding --clusterrole=cluster-admin --user=system:anonymous\n```\n\n<br>\n\n### 3. 创建hpa验证是否成功\n\n可以看出来metric值都是10。\n\n```\nroot@k8s-master:~/testyaml/hpa# kubectl get hpa\nNAME             REFERENCE           TARGETS   MINPODS   MAXPODS   REPLICAS   AGE\nnginx-hpa-zx-1   Deployment/zx-hpa   10/60     1         3         3          9m55s\nroot@k8s-master:~/testyaml/hpa# kubectl get hpa\nNAME             REFERENCE           TARGETS   MINPODS   MAXPODS   REPLICAS   AGE\nnginx-hpa-zx-1   Deployment/zx-hpa   10/60     1         3         3          9m57s\n```\n\n<br>\n\n### 4. 追踪整个过程\n\n**第一步**  kcm（hpa controller）发送的请求。\n\n```\nI0613 23:12:36.498740    9879 httplog.go:90] GET /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/pod_aa_100m?labelSelector=app%3Dzx-hpa-test: (35.302304ms) 200 [kube-controller-manager/v1.17.4 (linux/amd64) kubernetes/8d8aa39/horizontal-pod-autoscaler 192.168.0.4:42750]\n```\n\n要在url里使用不安全字符，就需要使用转义。\n\n%2A = *（星号）\n\n%3D = =（等号）\n\n<br>\n\n**第二步**  apiserver进行了 url转换。\n\nkcm访问的是： /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/pod_aa_100m?labelSelector=app%3Dzx-hpa-test\n\n但是由于前面创建了 Svc 和 APIService，所以访问这个url会被转换为：\n\nhttps://192.168.0.5:9997/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/pod_aa_100m?labelSelector=app%3Dzx-hpa-test\n\n192.168.0.5是pod kube-hpa-84c884f994-gd5fl 所在的节点ip，也是pod ip（hostNetwork模式）。 9997是定义的端口。\n\n<br>\n\n**第三步：**  访问 https://192.168.0.5:9997/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/pod_aa_100m?labelSelector=app%3Dzx-hpa-test\n\n直接在master节点上（masterip=192.168.0.4）通过curl模拟\n\n```\nroot@k8s-master:~/testyaml/hpa# curl -k  
https://192.168.0.5:9997/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/pod_aa_100m?labelSelector=app%3Dzx-hpa-test\n{\n  \"kind\": \"MetricValueList\",\n  \"apiVersion\": \"custom.metrics.k8s.io/v1beta1\",\n  \"metadata\": {\n    \"selfLink\": \"/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/pod_aa_100m\"\n  },\n  \"items\": [\n    {\n      \"describedObject\": {\n        \"kind\": \"Pod\",\n        \"namespace\": \"default\",\n        \"name\": \"zx-hpa-7b56cddd95-5j6r4|\",\n        \"apiVersion\": \"/v1\"\n      },\n      \"metricName\": \"pod_aa_100m\",\n      \"timestamp\": \"1970-01-01T00:00:10Z\",\n      \"value\": \"10\",\n      \"selector\": null\n    },\n    {\n      \"describedObject\": {\n        \"kind\": \"Pod\",\n        \"namespace\": \"default\",\n        \"name\": \"zx-hpa-7b56cddd95-lthbz|\",\n        \"apiVersion\": \"/v1\"\n      },\n      \"metricName\": \"pod_aa_100m\",\n      \"timestamp\": \"1970-01-01T00:00:10Z\",\n      \"value\": \"10\",\n      \"selector\": null\n    },\n    {\n      \"describedObject\": {\n        \"kind\": \"Pod\",\n        \"namespace\": \"default\",\n        \"name\": \"zx-hpa-7b56cddd95-n9ft9|\",\n        \"apiVersion\": \"/v1\"\n      },\n      \"metricName\": \"pod_aa_100m\",\n      \"timestamp\": \"1970-01-01T00:00:10Z\",\n      \"value\": \"10\",\n      \"selector\": null\n    }\n  ]\n}\n```\n\n<br>\n\n### 5. 总结\n\n（1）如何定制自己的metric-server，包括代码编写和环境搭建\n\n（2）Kubernetes 里的 Custom Metrics 机制，也是借助 Aggregator APIServer 扩展机制来实现的。这里的具体原理是，当你把 Custom Metrics APIServer 启动之后，Kubernetes 里就会出现一个叫作custom.metrics.k8s.io的 API。而当你访问这个 URL 时，Aggregator 就会把你的请求转发给 Custom Metrics APIServer。\n\n这里一定要注意： kube-apiserver启动参数一定要包含： --enable-swagger-ui=true\n\n（3）ListAllMetrics()并没有将metric注册到apiserver，apiserver也没有对metric进行验证。上文中，我的metric server的ListAllMetrics()并没有注册 pod_aa_100m这个metric，但是可以正常使用。\n\n原因：apiserver并没有进行验证，apiserver只进行url转发，只要有返回数据，apiserver就认为这个metric是正确的。所以这一点可以用来自定义metric。"
  },
  {
    "path": "k8s/kcm/4-hpa源码分析.md",
    "content": "Table of Contents\n=================\n\n  * [1. hpa介绍](#1-hpa介绍)\n     * [1.1 hpa是什么](#11-hpa是什么)\n     * [1.2 hpa如何用起来](#12-hpa如何用起来)\n  * [2. hpa 源码分析](#2-hpa-源码分析)\n     * [2.1 启动参数介绍](#21-启动参数介绍)\n     * [2.2 启动流程](#22-启动流程)\n     * [2.3 核心计算逻辑](#23-核心计算逻辑)\n     * [2.4  计算期望副本数量](#24--计算期望副本数量)\n        * [2.4.1 GetRawMetric-具体的metric值](#241-getrawmetric-具体的metric值)\n        * [2.4.2 calcPlainMetricReplicas-计算期望副本值](#242-calcplainmetricreplicas-计算期望副本值)\n  * [3. 举例说明计算过程](#3-举例说明计算过程)\n     * [3.1 hpa扩容计算逻辑](#31-hpa扩容计算逻辑)\n     * [3.2 场景1](#32-场景1)\n     * [3.3 场景2](#33-场景2)\n  * [4. 总结](#4-总结)\n\n**本章重点：** 从源码角度分析hpa的计算逻辑\n\n### 1. hpa介绍\n\n#### 1.1 hpa是什么\n\nhpa指的是 Pod 水平自动扩缩，全名是Horizontal Pod Autoscaler，简称HPA。它可以基于 CPU 利用率或其他指标自动扩缩 ReplicationController、Deployment 和 ReplicaSet 中的 Pod 数量。\n\n**用处：**  用户可以通过设置hpa，实现deploy pod数量的自动扩缩容。比如流量大的时候，pod数量多一些；流量小的时候，Pod数量降下来，避免资源浪费。\n\n![image-20210521145344510](../images/hap-1.png)\n\n<br>\n\n####  1.2 hpa如何用起来\n\n（1）需要一个deploy/svc等，可以参考社区\n\n（2）需要对应的hpa\n\n举例：\n\n(1) 创建1个deploy。这里设置了2个副本\n\n```\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  labels:\n    app: zx-hpa-test\n  name: zx-hpa\nspec:\n  strategy:\n    type: RollingUpdate\n    rollingUpdate:\n      maxSurge: 1\n  replicas: 2\n  selector:\n    matchLabels:\n      app: zx-hpa-test\n  template:\n    metadata:\n      labels:\n        app: zx-hpa-test\n      name: zx-hpa-test\n    spec:\n      terminationGracePeriodSeconds: 5\n      containers:\n        - name: busybox\n          image: busybox:latest\n          imagePullPolicy: IfNotPresent\n          command:\n            - sleep\n            - \"3600\"\n```\n\n（2）创建对应的hpa。\n\n```\napiVersion: autoscaling/v2beta1\nkind: HorizontalPodAutoscaler\nmetadata:\n  name: nginx-hpa-zx-1\n  annotations:\n    metric-containerName: zx-hpa\nspec:\n  scaleTargetRef:\n    apiVersion: apps/v1   // 这里必须指定需要监控哪个对象\n    kind: Deployment\n    name: zx-hpa\n  minReplicas: 1          // deploy最小的Pod数量\n  
maxReplicas: 3          // deploy最大的Pod数量\n  metrics:\n    - type: Pods\n      pods:\n        metricName: pod_cpu_1m\n        targetAverageValue: 60\n```\n\nhpa是从同命名空间下，找对应的deploy。所以yaml中指定deploy的时候不要指定namespaces。这也就要求，hpa 和deploy必须在同一命名空间。\n\n<br>\n\n这里我使用的 pod_cpu_1m这个指标。这是一个自定义指标。接下来就是分析\n\n创建好之后，观察hpa，当deploy的cpu利用率变化时，deploy的副本会随之改变。\n\n<br>\n\n###  2. hpa 源码分析\n\n#### 2.1 启动参数介绍\n\nhpa controller随controller manager的初始化而启动，hpa controller将以下flag添加到controller manager的flag中，通过controller manager的CLI端暴露给用户：\n\n```\n// AddFlags adds flags related to HPAController for controller manager to the specified FlagSet.\nfunc (o *HPAControllerOptions) AddFlags(fs *pflag.FlagSet) {\n\tif o == nil {\n\t\treturn\n\t}\n\n\tfs.DurationVar(&o.HorizontalPodAutoscalerSyncPeriod.Duration, \"horizontal-pod-autoscaler-sync-period\", o.HorizontalPodAutoscalerSyncPeriod.Duration, \"The period for syncing the number of pods in horizontal pod autoscaler.\")\n\tfs.DurationVar(&o.HorizontalPodAutoscalerUpscaleForbiddenWindow.Duration, \"horizontal-pod-autoscaler-upscale-delay\", o.HorizontalPodAutoscalerUpscaleForbiddenWindow.Duration, \"The period since last upscale, before another upscale can be performed in horizontal pod autoscaler.\")\n\tfs.MarkDeprecated(\"horizontal-pod-autoscaler-upscale-delay\", \"This flag is currently no-op and will be deleted.\")\n\tfs.DurationVar(&o.HorizontalPodAutoscalerDownscaleStabilizationWindow.Duration, \"horizontal-pod-autoscaler-downscale-stabilization\", o.HorizontalPodAutoscalerDownscaleStabilizationWindow.Duration, \"The period for which autoscaler will look backwards and not scale down below any recommendation it made during that period.\")\n\tfs.DurationVar(&o.HorizontalPodAutoscalerDownscaleForbiddenWindow.Duration, \"horizontal-pod-autoscaler-downscale-delay\", o.HorizontalPodAutoscalerDownscaleForbiddenWindow.Duration, \"The period since last downscale, before another downscale can be performed in horizontal pod 
autoscaler.\")\n\tfs.MarkDeprecated(\"horizontal-pod-autoscaler-downscale-delay\", \"This flag is currently no-op and will be deleted.\")\n\tfs.Float64Var(&o.HorizontalPodAutoscalerTolerance, \"horizontal-pod-autoscaler-tolerance\", o.HorizontalPodAutoscalerTolerance, \"The minimum change (from 1.0) in the desired-to-actual metrics ratio for the horizontal pod autoscaler to consider scaling.\")\n\tfs.BoolVar(&o.HorizontalPodAutoscalerUseRESTClients, \"horizontal-pod-autoscaler-use-rest-clients\", o.HorizontalPodAutoscalerUseRESTClients, \"If set to true, causes the horizontal pod autoscaler controller to use REST clients through the kube-aggregator, instead of using the legacy metrics client through the API server proxy.  This is required for custom metrics support in the horizontal pod autoscaler.\")\n\tfs.DurationVar(&o.HorizontalPodAutoscalerCPUInitializationPeriod.Duration, \"horizontal-pod-autoscaler-cpu-initialization-period\", o.HorizontalPodAutoscalerCPUInitializationPeriod.Duration, \"The period after pod start when CPU samples might be skipped.\")\n\tfs.MarkDeprecated(\"horizontal-pod-autoscaler-use-rest-clients\", \"Heapster is no longer supported as a source for Horizontal Pod Autoscaler metrics.\")\n\tfs.DurationVar(&o.HorizontalPodAutoscalerInitialReadinessDelay.Duration, \"horizontal-pod-autoscaler-initial-readiness-delay\", o.HorizontalPodAutoscalerInitialReadinessDelay.Duration, \"The period after pod start during which readiness changes will be treated as initial readiness.\")\n}\n```\n\n| 参数                                                | 默认 | 说明                                                         |\n| :-------------------------------------------------- | :--- | :----------------------------------------------------------- |\n| horizontal-pod-autoscaler-sync-period               | 15s  | controller同步HPA信息的同步周期                              |\n| horizontal-pod-autoscaler-downscale-stabilization   | 5m   | 缩容稳定窗口，缩容间隔时间（v1.12支持）                  
    |\n| horizontal-pod-autoscaler-tolerance                 | 0.1  | 最小缩放容忍度：计算出的期望值和实际值的比率<最小容忍比率，则不进行扩缩容 |\n| horizontal-pod-autoscaler-cpu-initialization-period | 5m   | pod刚启动时，一定时间内的CPU使用率数据不参与计算。           |\n| horizontal-pod-autoscaler-initial-readiness-delay   | 30s  | 扩容等待pod ready的时间（无法得知pod何时就绪）               |\n\nkcm中需要设置 --horizontal-pod-autoscaler-use-rest-clients=true，才能启用自定义的rest client。\n\n<br>\n\n#### 2.2 启动流程\n\n**代码流程：**\n\nstartHPAController -> startHPAControllerWithMetricsClient -> Run -> worker -> processNextWorkItem -> reconcileKey -> reconcileAutoscaler\n\n```\nfunc (a *HorizontalController) reconcileKey(key string) (deleted bool, err error) {\n\tnamespace, name, err := cache.SplitMetaNamespaceKey(key)\n\tif err != nil {\n\t\treturn true, err\n\t}\n\n\thpa, err := a.hpaLister.HorizontalPodAutoscalers(namespace).Get(name)\n\tif errors.IsNotFound(err) {\n\t\tklog.Infof(\"Horizontal Pod Autoscaler %s has been deleted in %s\", name, namespace)\n\t\tdelete(a.recommendations, key)\n\t\treturn true, nil\n\t}\n\n\treturn false, a.reconcileAutoscaler(hpa, key)\n}\n```\n\n<br>\n\n####  2.3 核心计算逻辑\n\n**metric的定义类型分为3种，resource、pods和external，这里只分析pods类型的metric。**\n\nreconcileAutoscaler函数就是hpa的核心函数。该函数主要逻辑如下：\n\n* 1.做一些类型转换，用于接下来的Hpa计算\n* 2.计算hpa 的期望副本数量。\n* 3.根据计算的结果判断是否需要改变副本数，需要改变的话，调用接口修改，然后做错误处理。\n\n```go\nfunc (a *HorizontalController) reconcileAutoscaler(hpav1Shared *autoscalingv1.HorizontalPodAutoscaler, key string) error {\n\t// 1. 通过client向apiserver获取scale子资源，然后做各种数据类型转换，以及一些backup、把错误写入hpa event的操作\n\t// ...... 代码省略\n\n  // 2. 
判断是否需要计算副本数，如果需要，就调用computeReplicasForMetrics函数计算当前hpa的副本数。\n\tdesiredReplicas := int32(0)\n\trescaleReason := \"\"\n\n\tvar minReplicas int32\n\n\tif hpa.Spec.MinReplicas != nil {\n\t\tminReplicas = *hpa.Spec.MinReplicas\n\t} else {\n\t\t// Default value\n\t\tminReplicas = 1\n\t}\n\n\trescale := true\n\n\tif scale.Spec.Replicas == 0 && minReplicas != 0 {\n\t\t// Autoscaling is disabled for this resource\n\t\tdesiredReplicas = 0\n\t\trescale = false\n\t\tsetCondition(hpa, autoscalingv2.ScalingActive, v1.ConditionFalse, \"ScalingDisabled\", \"scaling is disabled since the replica count of the target is zero\")\n\t} else if currentReplicas > hpa.Spec.MaxReplicas {\n\t\trescaleReason = \"Current number of replicas above Spec.MaxReplicas\"\n\t\tdesiredReplicas = hpa.Spec.MaxReplicas\n\t} else if currentReplicas < minReplicas {\n\t\trescaleReason = \"Current number of replicas below Spec.MinReplicas\"\n\t\tdesiredReplicas = minReplicas\n\t} else {\n\t\tvar metricTimestamp time.Time\n\t\tmetricDesiredReplicas, metricName, metricStatuses, metricTimestamp, err = a.computeReplicasForMetrics(hpa, scale, hpa.Spec.Metrics)\n\t\tif err != nil {\n\t\t\ta.setCurrentReplicasInStatus(hpa, currentReplicas)\n\t\t\tif err := a.updateStatusIfNeeded(hpaStatusOriginal, hpa); err != nil {\n\t\t\t\tutilruntime.HandleError(err)\n\t\t\t}\n\t\t\ta.eventRecorder.Event(hpa, v1.EventTypeWarning, \"FailedComputeMetricsReplicas\", err.Error())\n\t\t\treturn fmt.Errorf(\"failed to compute desired number of replicas based on listed metrics for %s: %v\", reference, err)\n\t\t}\n\n\t\tklog.V(4).Infof(\"proposing %v desired replicas (based on %s from %s) for %s\", metricDesiredReplicas, metricName, metricTimestamp, reference)\n\n\t\trescaleMetric := \"\"\n\t\tif metricDesiredReplicas > desiredReplicas {\n\t\t\tdesiredReplicas = metricDesiredReplicas\n\t\t\trescaleMetric = metricName\n\t\t}\n\t\tif desiredReplicas > currentReplicas {\n\t\t\trescaleReason = fmt.Sprintf(\"%s above target\", 
rescaleMetric)\n\t\t}\n\t\tif desiredReplicas < currentReplicas {\n\t\t\trescaleReason = \"All metrics below target\"\n\t\t}\n\t\tdesiredReplicas = a.normalizeDesiredReplicas(hpa, key, currentReplicas, desiredReplicas, minReplicas)\n\t\trescale = desiredReplicas != currentReplicas\n\t}\n  \n  // 3.进行扩缩容，并进行错误处理。\n\tif rescale {\n\t\tscale.Spec.Replicas = desiredReplicas\n\t\t_, err = a.scaleNamespacer.Scales(hpa.Namespace).Update(targetGR, scale)\n\t\tif err != nil {\n\t\t\ta.eventRecorder.Eventf(hpa, v1.EventTypeWarning, \"FailedRescale\", \"New size: %d; reason: %s; error: %v\", desiredReplicas, rescaleReason, err.Error())\n\t\t\tsetCondition(hpa, autoscalingv2.AbleToScale, v1.ConditionFalse, \"FailedUpdateScale\", \"the HPA controller was unable to update the target scale: %v\", err)\n\t\t\ta.setCurrentReplicasInStatus(hpa, currentReplicas)\n\t\t\tif err := a.updateStatusIfNeeded(hpaStatusOriginal, hpa); err != nil {\n\t\t\t\tutilruntime.HandleError(err)\n\t\t\t}\n\t\t\treturn fmt.Errorf(\"failed to rescale %s: %v\", reference, err)\n\t\t}\n\t\tsetCondition(hpa, autoscalingv2.AbleToScale, v1.ConditionTrue, \"SucceededRescale\", \"the HPA controller was able to update the target scale to %d\", desiredReplicas)\n\t\ta.eventRecorder.Eventf(hpa, v1.EventTypeNormal, \"SuccessfulRescale\", \"New size: %d; reason: %s\", desiredReplicas, rescaleReason)\n\t\tklog.Infof(\"Successful rescale of %s, old size: %d, new size: %d, reason: %s\",\n\t\t\thpa.Name, currentReplicas, desiredReplicas, rescaleReason)\n\t} else {\n\t\tklog.V(4).Infof(\"decided not to scale %s to %v (last scale time was %s)\", reference, desiredReplicas, hpa.Status.LastScaleTime)\n\t\tdesiredReplicas = currentReplicas\n\t}\n\n\ta.setStatus(hpa, currentReplicas, desiredReplicas, metricStatuses, rescale)\n\treturn a.updateStatusIfNeeded(hpaStatusOriginal, hpa)\n}\n```\n\n<br>\n\n**这里主要关心第二个步骤：hpa如何计算期望副本数量**\n\n#### 2.4  计算期望副本数量\n\n概念：\n\n最小值：minReplicas。 
这个是用户在hpa里面的yaml设置的。这个是可选的，如果不设置，默认是1。\n\n最大值：MaxReplicas。 这个是用户在hpa里面的yaml设置的。这个是必填的，如果不设置，会报错，如下。\n\n当前值：currentReplicas。这个是hpa获得的当前deploy的副本数量。\n\n期望值：desiredReplicas。 这个是hpa希望deploy的副本数量。\n\n```\nerror: error validating \"nginx-deployment-hpa-test.yaml\": error validating data: ValidationError(HorizontalPodAutoscaler.spec): missing required field \"maxReplicas\" in io.k8s.api.autoscaling.v2beta1.HorizontalPodAutoscalerSpec; if you choose to ignore these errors, turn validation off with --validate=false\n```\n\n计算逻辑分为两部分，第一种情况是不需要算，就可以直接得出期望值。 第二种情况需要调用函数计算。\n\n**情况1：不需要计算**\n\n（1）当前值等于0。        期望值=0，不扩缩容（该资源禁用了自动扩缩）。\n\n（2）当前值 > 最大值。    没必要计算期望值。 期望值=最大值，需要扩缩容。\n\n（3）当前值 < 最小值。   没必要计算期望值。 期望值=最小值，需要扩缩容。\n\n<br>\n\n**情况2：**    最小值  <= 当前值  <= 最大值。 需要调用函数计算 期望值。\n\n这里的调用链为  computeReplicasForMetrics -> computeReplicasForMetric -> GetMetricReplicas  \n\n这里computeReplicasForMetrics有一个需要注意的点：它可以处理多个metric的情况。例如一个hpa有多个指标：\n\n```text\n  - type: Resource\n    resource:\n      name: cpu\n      # Utilization类型的目标值，Resource类型的指标只支持Utilization和AverageValue类型的目标值\n      target:\n        type: Utilization\n        averageUtilization: 50\n  # Pods类型的指标\n  - type: Pods\n    pods:\n      metric:\n        name: packets-per-second\n      # AverageValue类型的目标值，Pods指标类型下只支持AverageValue类型的目标值\n      target:\n        type: AverageValue\n        averageValue: 1k\n```\n\n这里hpa的逻辑是，谁最大取谁。例如, 通过cpu.Utilization hpa算出来应该需要 4个pod。 但是packets-per-second算出来需要5个。这个时候就以5个为准。见下面代码：\n\n```\n// computeReplicasForMetrics computes the desired number of replicas for the metric specifications listed in the HPA,\n// returning the maximum  of the computed replica counts, a description of the associated metric, and the statuses of\n// all metrics computed.\nfunc (a *HorizontalController) computeReplicasForMetrics(hpa *autoscalingv2.HorizontalPodAutoscaler, scale *autoscalingv1.Scale,\n\tmetricSpecs []autoscalingv2.MetricSpec) (replicas int32, metric string, statuses []autoscalingv2.MetricStatus, timestamp 
time.Time, err error) {\n\n\tfor i, metricSpec := range metricSpecs {\n\t\treplicaCountProposal, metricNameProposal, timestampProposal, condition, err := a.computeReplicasForMetric(hpa, metricSpec, specReplicas, statusReplicas, selector, &statuses[i])\n\n\t\tif err != nil {\n\t\t\tif invalidMetricsCount <= 0 {\n\t\t\t\tinvalidMetricCondition = condition\n\t\t\t\tinvalidMetricError = err\n\t\t\t}\n\t\t\tinvalidMetricsCount++\n\t\t}\n\t\tif err == nil && (replicas == 0 || replicaCountProposal > replicas) {\n\t\t\ttimestamp = timestampProposal\n\t\t\treplicas = replicaCountProposal\n\t\t\tmetric = metricNameProposal\n\t\t}\n\t}\n\n\t// If all metrics are invalid return error and set condition on hpa based on first invalid metric.\n\tif invalidMetricsCount >= len(metricSpecs) {\n\t\tsetCondition(hpa, invalidMetricCondition.Type, invalidMetricCondition.Status, invalidMetricCondition.Reason, invalidMetricCondition.Message)\n\t\treturn 0, \"\", statuses, time.Time{}, fmt.Errorf(\"invalid metrics (%v invalid out of %v), first error is: %v\", invalidMetricsCount, len(metricSpecs), invalidMetricError)\n\t}\n\tsetCondition(hpa, autoscalingv2.ScalingActive, v1.ConditionTrue, \"ValidMetricFound\", \"the HPA was able to successfully calculate a replica count from %s\", metric)\n\treturn replicas, metric, statuses, timestamp, nil\n}\n```\n\n<br>\n\n针对具体某个metric指标。计算分为俩步：\n\n（1）GetRawMetric函数： 得到 具体的metric值\n\n（2）calcPlainMetricReplicas ：计算期望副本值\n\n这里需要注意一点就是targetUtilization进行了数据转换。乘以了10^3。\n\n```\n// GetMetricReplicas calculates the desired replica count based on a target metric utilization\n// (as a milli-value) for pods matching the given selector in the given namespace, and the\n// current replica count\nfunc (c *ReplicaCalculator) GetMetricReplicas(currentReplicas int32, targetUtilization int64, metricName string, namespace string, selector labels.Selector, metricSelector labels.Selector) (replicaCount int32, utilization int64, timestamp time.Time, err error) {\n\tmetrics, 
timestamp, err := c.metricsClient.GetRawMetric(metricName, namespace, selector, metricSelector)\n\tif err != nil {\n\t\treturn 0, 0, time.Time{}, fmt.Errorf(\"unable to get metric %s: %v\", metricName, err)\n\t}\n\n\treplicaCount, utilization, err = c.calcPlainMetricReplicas(metrics, currentReplicas, targetUtilization, namespace, selector, v1.ResourceName(\"\"))\n\treturn replicaCount, utilization, timestamp, err\n}\n```\n\n<br>\n\n##### 2.4.1 GetRawMetric-具体的metric值\n\n```\n// GetRawMetric gets the given metric (and an associated oldest timestamp)\n// for all pods matching the specified selector in the given namespace\nfunc (c *customMetricsClient) GetRawMetric(metricName string, namespace string, selector labels.Selector, metricSelector labels.Selector) (PodMetricsInfo, time.Time, error) {\n  // 1.这里直接调用 GetForObjects，发送restful请求获取数据\n\tmetrics, err := c.client.NamespacedMetrics(namespace).GetForObjects(schema.GroupKind{Kind: \"Pod\"}, selector, metricName, metricSelector)\n\tif err != nil {\n\t\treturn nil, time.Time{}, fmt.Errorf(\"unable to fetch metrics from custom metrics API: %v\", err)\n\t}\n\n\tif len(metrics.Items) == 0 {\n\t\treturn nil, time.Time{}, fmt.Errorf(\"no metrics returned from custom metrics API\")\n\t}\n  \n  // 2. 
对获取的数据进行处理。这里通过MilliValue()把metric值乘以了10^3\n\tres := make(PodMetricsInfo, len(metrics.Items))\n\tfor _, m := range metrics.Items {\n\t\twindow := metricServerDefaultMetricWindow\n\t\tif m.WindowSeconds != nil {\n\t\t\twindow = time.Duration(*m.WindowSeconds) * time.Second\n\t\t}\n\t\tres[m.DescribedObject.Name] = PodMetric{\n\t\t\tTimestamp: m.Timestamp.Time,\n\t\t\tWindow:    window,\n\t\t\tValue:     int64(m.Value.MilliValue()),\n\t\t}\n\t}\n\n\ttimestamp := metrics.Items[0].Timestamp.Time\n\n\treturn res, timestamp, nil\n}\n```\n\n<br>\n\n##### 2.4.2 calcPlainMetricReplicas-计算期望副本值\n\n这里代码省略，直接贴逻辑。\n\n3.1 先从apiserver端拿到所有相关的pod，将这些pod分为三类：\n\n ```\na.missingPods用于记录处于running状态，但不提供该metric的pod\n\nb.ignoredPods 用于处理resource类型cpu相关metric的延迟（就是pod未就绪），这里不深入讨论\n\nc.readyPodCount记录状态为running，且能提供该metric的pod\n ```\n\n3.2 调用GetMetricUtilizationRatio计算实际值与期望值的对比情况。计算时，对于所有可获取到metric的pod，取它们metric value的平均值得到：usageRatio=实际值/期望值；utilization=实际值（平均）\n\n3.3 计算期望pod数量DesiredReplicas。对于missingPods为0，即所有target pod都处于running可获取metric value的情况:\n\n a.如果实际值与期望值的对比usageRatio处于可容忍范围内，不执行scale操作。默认情况下c.tolerance=0.1，即usageRatio处于[0.9,1.1]时pod数量不变化\n\n```\nif math.Abs(1.0-usageRatio) <= c.tolerance {\n    // return the current replicas if the change would be too small\n    return currentReplicas, utilization, nil\n}\n```\n\nb.实际值与期望值的对比usageRatio不在可容忍范围内，向上取整得到desiredReplicas\n `return int32(math.Ceil(usageRatio * float64(readyPodCount))), utilization, nil`\n\n对于missingPods>0，即有target pod的metric value没有获取到的情况。 缩容时，对于找不到metric的pod，`视为`正好用了desired value\n\n```\nif usageRatio < 1.0 {\n\t// on a scale-down, treat missing pods as using 100% of the resource request\n\tfor podName := range missingPods {\n\t\tmetrics[podName] = metricsclient.PodMetric{Value: targetUtilization}\n\t}\n}\n```\n\n扩容时，对于找不到metric的pod，`视为`该pod对指定metric的使用量为0\n\n```\nfor podName := range missingPods {\n\tmetrics[podName] = metricsclient.PodMetric{Value: 
0}\n}\n```\n\n经过上面的处理后，重新计算实际值与期望值的对比newUsageRatio。\n\n在下面两种情况下，不执行scale操作：新的实际值与期望值的对比newUsageRatio在容忍范围内； 赋值处理前后，一个需要scale up，另一个需要scale down。\n\n其它情况下，同样地执行向上取整操作\n\n```\nif math.Abs(1.0-newUsageRatio) <= c.tolerance || (usageRatio < 1.0 && newUsageRatio > 1.0) || (usageRatio > 1.0 && newUsageRatio < 1.0) {\n\t\t// return the current replicas if the change would be too small,\n\t\t// or if the new usage ratio would cause a change in scale direction\n\t\treturn currentReplicas, utilization, nil\n\t}\nreturn int32(math.Ceil(newUsageRatio * float64(len(metrics)))), utilization, nil\n```\n\n<br>\n\n最后，Hpa将desiredReplicas写到scale.Spec.Replicas，调用a.scaleNamespacer.Scales(hpa.Namespace).Update(targetGR, scale)向apiserver发送更新hpa的请求，对某个hpa的一轮更新操作就完成了。\n\n<br>\n\n### 3. 举例说明计算过程\n\n#### 3.1 hpa扩容计算逻辑\n\n**关键概念**：tolerance（hpa扩容容忍度）， 默认为0.1。\n\nCustom server: 自定义metric服务。这里是一个抽象，用于给hpa提供具体的metric值。Custom server具体可以是prometheus，或者其他的监控系统。下一篇文章会讲如何将Custom server和hpa联系起来。\n\n <br>\n\n#### 3.2 场景1\n\n当前有deployA,  运行着两个pod, A1和A2。 deploy设置了hpa，指标是内存使用量，并且规定，当平均使用量大于60就要扩容。\n\n![image-20210616210849063](../images/hpa-2.png)\n\nhpa扩容计算步骤：\n\n**第一步:**  往monitor-adaptor发送请求， 要求获得deployA下所有pod的metric值。  这里收到了 A1=50; A2=100\n\n**第二步:**  补全metric值，给获取不到metric值的pod赋值。  这里hpa会查看集群状态，发现deployA 下有两个pod，A1,A2。并且这两个pod的metric值都获取到了。  这个时候就不用补全。（下面例子就介绍需要补全metric的情况）\n\n**第三步:**  开始计算\n\n（1）计算 平均pod metric值和 target的比例。也可以叫扩容比例系数\n\n      ratio = (A1+A2)/(2*target) = (50+100)/120 = 1.25\n\n   按理说不用再除target值，直接(50+100)/2=75，然后拿75和60比就行。 75比60大就应该扩容。\n\n这里使用系数表示主要有两个原因：\n\n* 有容忍度的概念，使用比例方便计算是否超出了容忍度\n* 用于扩缩容计算\n\n（2）判断是否超过容忍度\n\n这里 1.25-1 > 0.1(默认容忍度)。 因此这种情况是需要扩容的。\n\n这里就体现了容忍度的作用。有了容忍度, 平均metric需要大于 66才会扩容（60*1.1）\n\n（3）计算真正的副本数量\n\n   向上取整： 扩容比例系数*当前的副本数\n\n这里就是： 1.25*2 = 2.5 , 取整后就是3。\n\n<br>\n\n#### 3.3 场景2\n\n和场景1不同在于：由于某种原因，导致 monitor-adaptor往hpa发送的时候，只有  A1=2。 A2的数据丢失。\n\n\n![image-20210616211337218](../images/hpa-3.png)hpa扩容计算步骤：\n\n**第一步:**  往monitor-adaptor发送请求， 
要求获得deployA下所有pod的metric值。  这里收到了 A1=2;\n\n**第二步:**  补全metric值，给获取不到metric值的pod赋值。  这里hpa会查看集群状态，发现deployA 下有两个pod，A1,A2。但是这里发现只有A1的值，这个时候hpa就认为A2 有数据，但是获取失败。所以会给A2赋一个值：0 或 target。\n\n赋值逻辑如下：  当 A1 > target的时候，A2=0;  当A1<= target的时候，赋值为 target。\n\n这里由于 A1=2, 比target(60)小，所以最终hpa计算时:\n\nA1=2; A2=60; target=60;  \n\n**第三步:**  开始计算\n\n（1）计算 平均pod metric值和 target的比例。也可以叫扩容比例系数\n\n      ratio = (A1+A2)/(2*target) = (2+60)/120 = 0.517\n\n（2）判断是否超过容忍度\n\n这里 1-0.517 > 0.1(默认容忍度)。 因此这种情况是需要缩容的。\n\n（3）计算真正的副本数量\n\n向上取整： 扩容比例系数*当前的副本数（这里就是metric数量，A1,A2）\n\n对应就是： 0.517*2 = 1.034 , 取整后就是2。由于期望值和当前副本数（2）相同，所以副本数实际不变。\n\n<br>\n\n### 4. 总结\n\n（1）hpa可以设置多个metric。当有多个metric时，谁算出来的副本值最大，取谁的值\n\n（2）针对具体的metric而言（这里是以pods这种为例），首先获得用户定义的hpa指标。比如最大值，最小值，阈值等。\n\n这里有一个点在于：阈值乘以了1000用于计算。\n\n（3）获取metric的值，这里是使用了自定义rest服务。hpa只要发送rest请求，就有数据。这种情况非常适用于公司使用自己的监控数据做扩缩容。 注意：这里每个值也乘以了1000。这样和阈值就是相互抵消了。\n\n（4）利用公式计算期望副本数：取满足 X*阈值 >= 所有pod的metric值之和 的最小正整数X，即 X = ceil(metric之和/阈值)。具体逻辑可以看上文的计算过程。\n\n"
  },
  {
    "path": "k8s/kcm/5-job controller-manager源码分析.md",
    "content": "Table of Contents\n=================\n\n  * [1.  job简介](#1--job简介)\n  * [2. job controller源码分析-初始化](#2-job-controller源码分析-初始化)\n     * [2.1 startJobController](#21-startjobcontroller)\n     * [2.2 NewJobController](#22-newjobcontroller)\n     * [2.3 对Pod的监听事件](#23-对pod的监听事件)\n        * [2.3.1 job的expectations机制](#231-job的expectations机制)\n        * [2.3.2 addPod](#232-addpod)\n        * [2.3.3 updatePod](#233-updatepod)\n        * [2.3.4 deletePod](#234-deletepod)\n        * [2.3.5 总结](#235-总结)\n  * [3. 如何处理队列中的job](#3-如何处理队列中的job)\n     * [3.1 sycnjob](#31-sycnjob)\n     * [3.2 判断job是否完成的标准：  completed,  failed，c.Status == v1.ConditionTrue](#32-判断job是否完成的标准--completed--failedcstatus--v1conditiontrue)\n     * [3.3 如何获得该job对应的pods](#33-如何获得该job对应的pods)\n     * [3.4  jm.manageJob](#34--jmmanagejob)\n  * [4.总结](#4总结)\n\n### 1.  job简介\n\njob 在 kubernetes 中主要用来处理离线任务，job 直接管理 pod，可以创建一个或多个 pod 并会确保指定数量的 pod 运行完成。\n\njob 的一个示例如下所示：\n\n```\napiVersion: batch/v1\nkind: Job\nmetadata:\n  labels:\n    job-name: hello-1626526800\n  name: hello-1626526800\n  namespace: default\nspec:\n  backoffLimit: 6     //标记为 failed 前的重试次数（运行多少个pod failed），默认为 6\n  completions: 4      //当成功的 Pod 个数达到 .spec.completions 时，Job 被视为完成\n  parallelism: 1      // 并行度。这里就是每次1个1个pod的运行，4个pod运行完后，job完成\n  selector:\n    matchLabels:\n      controller-uid: 52f8d25f-6bbf-4439-ab6d-02876c52baea\n  template:\n    metadata:\n      creationTimestamp: null\n      labels:\n        job-name: hello-1626526800\n    spec:\n      containers:\n      - args:\n        - /bin/sh\n        - -c\n        - date; echo \"Hello, World!\"\n        image: busybox\n        imagePullPolicy: Always\n        name: hello\n```\n\n更多关于job的描述，可以参考社区介绍：https://kubernetes.io/zh/docs/concepts/workloads/controllers/job/\n\n<br>\n\n### 2. 
job controller源码分析-初始化\n\n#### 2.1 startJobController\n\n这个就是 startControllers里面kcm启动时各个controller对应的init函数。\n\ncmd\\kube-controller-manager\\app\\batch.go\n\n```\nfunc startJobController(ctx ControllerContext) (http.Handler, bool, error) {\n\tif !ctx.AvailableResources[schema.GroupVersionResource{Group: \"batch\", Version: \"v1\", Resource: \"jobs\"}] {\n\t\treturn nil, false, nil\n\t}\n\tgo job.NewJobController(\n\t\tctx.InformerFactory.Core().V1().Pods(),\n\t\tctx.InformerFactory.Batch().V1().Jobs(),\n\t\tctx.ClientBuilder.ClientOrDie(\"job-controller\"),\n\t).Run(int(ctx.ComponentConfig.JobController.ConcurrentJobSyncs), ctx.Stop)\n\treturn nil, true, nil\n}\n```\n\n<br>\n\n#### 2.2 NewJobController\n\npkg\\controller\\job\\job_controller.go\n\n这里就是定义好  informer和处理函数。\n\n可以看出来，job的add,delete, update最终都是入队列了。\n\n```\nfunc NewJobController(podInformer coreinformers.PodInformer, jobInformer batchinformers.JobInformer, kubeClient clientset.Interface) *JobController {\n    // 1.定义event上传\n\teventBroadcaster := record.NewBroadcaster()\n\teventBroadcaster.StartLogging(glog.Infof)\n\teventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events(\"\")})\n\n\tif kubeClient != nil && kubeClient.CoreV1().RESTClient().GetRateLimiter() != nil {\n\t\tmetrics.RegisterMetricAndTrackRateLimiterUsage(\"job_controller\", kubeClient.CoreV1().RESTClient().GetRateLimiter())\n\t}\n    \n    \n\tjm := &JobController{\n\t\tkubeClient: kubeClient,\n\t\tpodControl: controller.RealPodControl{\n\t\t\tKubeClient: kubeClient,\n\t\t\tRecorder:   eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: \"job-controller\"}),\n\t\t},\n\t\texpectations: controller.NewControllerExpectations(),\n\t\tqueue:        workqueue.NewNamedRateLimitingQueue(workqueue.NewItemExponentialFailureRateLimiter(DefaultJobBackOff, MaxJobBackOff), \"job\"),\n\t\trecorder:     eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: 
\"job-controller\"}),\n\t}\n\n\tjobInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{\n\t\tAddFunc: func(obj interface{}) {\n\t\t\tjm.enqueueController(obj, true)\n\t\t},\n\t\tUpdateFunc: jm.updateJob,            // 这个其实也是放入队列的，见下面的函数\n\t\tDeleteFunc: func(obj interface{}) {\n\t\t\tjm.enqueueController(obj, true)\n\t\t},\n\t})\n\tjm.jobLister = jobInformer.Lister()\n\tjm.jobStoreSynced = jobInformer.Informer().HasSynced\n\n\tpodInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{\n\t\tAddFunc:    jm.addPod,\n\t\tUpdateFunc: jm.updatePod,  \n\t\tDeleteFunc: jm.deletePod,\n\t})\n\tjm.podStore = podInformer.Lister()\n\tjm.podStoreSynced = podInformer.Informer().HasSynced\n\n\tjm.updateHandler = jm.updateJobStatus\n\tjm.syncHandler = jm.syncJob\n\n\treturn jm\n}\n```\n\nupdateJob进行了一些判断，最后还是入队列了。\n\n```\nfunc (jm *JobController) updateJob(old, cur interface{}) {\n\toldJob := old.(*batch.Job)\n\tcurJob := cur.(*batch.Job)\n\n\t// never return error\n\tkey, err := controller.KeyFunc(curJob)\n\tif err != nil {\n\t\treturn\n\t}\n\tjm.enqueueController(curJob, true)\n\t// check if need to add a new rsync for ActiveDeadlineSeconds\n\tif curJob.Status.StartTime != nil {\n\t\tcurADS := curJob.Spec.ActiveDeadlineSeconds\n\t\tif curADS == nil {\n\t\t\treturn\n\t\t}\n\t\toldADS := oldJob.Spec.ActiveDeadlineSeconds\n\t\tif oldADS == nil || *oldADS != *curADS {\n\t\t\tnow := metav1.Now()\n\t\t\tstart := curJob.Status.StartTime.Time\n\t\t\tpassed := now.Time.Sub(start)\n\t\t\ttotal := time.Duration(*curADS) * time.Second\n\t\t\t// AddAfter will handle total < passed\n\t\t\tjm.queue.AddAfter(key, total-passed)\n\t\t\tglog.V(4).Infof(\"job ActiveDeadlineSeconds updated, will rsync after %d seconds\", total-passed)\n\t\t}\n\t}\n}\n```\n\n<br>\n\n#### 2.3 对Pod的监听事件\n\n##### 2.3.1 job的expectations机制\n\n和rs的机制其实是一样的。更详细的可以参考rs那篇博客的介绍。\n\nexpectations可以理解为一个map。举例来说，这个map可以认为有四个关键字段。\n\nkey:  有rs的ns和 rs的name组成\n\nAdd: 表示这个rs还需要增加多少个rs\n\ndel: 
how many pods this job still needs to delete\n\nTime: the time the entry was last set, used for the expectation timeout check\n\n| Key         | Add  | Del  | Time                |\n| ----------- | ---- | ---- | ------------------- |\n| Default/zx1 | 0    | 0    | 2021.07.04 16:00:00 |\n| zx/zx1      | 1    | 0    | 2021.07.04 16:00:00 |\n\n<br>\n\n**GetExpectations**: input key, returns the expectations entry for that key;\n\n**SatisfiedExpectations**: input key, returns bool; decides whether a job meets expectations. It does when add<=0 && del<=0, or when the sync period has been exceeded; every other case does not.\n\n**DeleteExpectations**: input key, no output; removes the key from the map (cache)\n\n**SetExpectations**: input (key, add, del); adds a row to the map. **This refreshes the timestamp, assigning time = time.Now**\n\n**ExpectCreations**: input (key, add); overwrites the row with del=0 and add set to the argument. **This refreshes the timestamp, assigning time = time.Now**\n\n**ExpectDeletions**: input (key, del); overwrites the row with add=0 and del set to the argument. **This refreshes the timestamp, assigning time = time.Now**\n\n**CreationObserved**: input (key); decrements add by 1 in the matching row\n\n**DeletionObserved**: input (key); decrements del by 1 in the matching row\n\n**RaiseExpectations**: input (key, add, del); in the matching row, Add+add, Del+del\n\n**LowerExpectations**: input (key, add, del); in the matching row, Add-add, Del-del\n\n<br>\n\n##### 2.3.2 addPod\n\n(1) If the pod is being deleted, deletePod is called, which eventually invokes DeletionObserved, decrementing del for the owning job in the map\n\n(2) If the pod has an owner and it is a job, decrement add for that job in the map\n\n(3) If the pod is an orphan, enqueue every job that matches it and let syncJob reconcile\n\n```\n// When a pod is created, enqueue the controller that manages it and update it's expectations.\nfunc (jm *JobController) addPod(obj interface{}) {\n\tpod := obj.(*v1.Pod)\n\tif pod.DeletionTimestamp != nil {\n\t\t// on a restart of the controller controller, it's possible a new pod shows up in a state that\n\t\t// is already pending deletion. Prevent the pod from being a creation observation.\n\t\t// 1. 
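To make the operation list above concrete, here is a toy model of the expectations map. This is an illustration only: the real type is `controller.ControllerExpectations` and its implementation differs; names such as `store` and `expectation` are mine.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// expectation mirrors one row of the table above: how many creations (add)
// and deletions (del) the controller still expects to observe for a job key,
// plus the time the row was last set.
type expectation struct {
	add, del int64
	ts       time.Time
}

// store is a toy stand-in for controller.ControllerExpectations.
type store struct {
	mu sync.Mutex
	m  map[string]*expectation
}

func newStore() *store { return &store{m: map[string]*expectation{}} }

// SetExpectations overwrites the row and refreshes the timestamp.
func (s *store) SetExpectations(key string, add, del int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = &expectation{add: add, del: del, ts: time.Now()}
}

// CreationObserved decrements add when the informer reports a created pod.
func (s *store) CreationObserved(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if e, ok := s.m[key]; ok {
		e.add--
	}
}

// SatisfiedExpectations: true when add<=0 && del<=0, when the row is older
// than the 5-minute timeout, or when the key is unknown (a new object).
func (s *store) SatisfiedExpectations(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.m[key]
	if !ok {
		return true
	}
	return (e.add <= 0 && e.del <= 0) || time.Since(e.ts) > 5*time.Minute
}

func main() {
	s := newStore()
	s.SetExpectations("zx/zx1", 2, 0)              // expect 2 pod creations
	fmt.Println(s.SatisfiedExpectations("zx/zx1")) // false: still waiting
	s.CreationObserved("zx/zx1")
	s.CreationObserved("zx/zx1")
	fmt.Println(s.SatisfiedExpectations("zx/zx1")) // true: all observed
}
```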
deletePod eventually calls DeletionObserved.\n\t\tjm.deletePod(pod)\n\t\treturn\n\t}\n\n\t// If it has a ControllerRef, that's all that matters.\n\tif controllerRef := metav1.GetControllerOf(pod); controllerRef != nil {\n\t\tjob := jm.resolveControllerRef(pod.Namespace, controllerRef)\n\t\tif job == nil {\n\t\t\treturn\n\t\t}\n\t\tjobKey, err := controller.KeyFunc(job)\n\t\tif err != nil {\n\t\t\treturn\n\t\t}\n\t\t// 2. the pod has an owner and it is a job: decrement add for that job in the map\n\t\tjm.expectations.CreationObserved(jobKey)\n\t\tjm.enqueueController(job, true)\n\t\treturn\n\t}\n\n\t// Otherwise, it's an orphan. Get a list of all matching controllers and sync\n\t// them to see if anyone wants to adopt it.\n\t// DO NOT observe creation because no controller should be waiting for an\n\t// orphan.\n\t// 3. the pod is an orphan: enqueue the matching jobs and let syncJob reconcile\n\tfor _, job := range jm.getPodJobs(pod) {\n\t\tjm.enqueueController(job, true)\n\t}\n}\n```\n\n<br>\n\n##### 2.3.3 updatePod\n\n(1) If the pod is being deleted, deletePod is called, which eventually invokes DeletionObserved, decrementing del for the owning job in the map\n\n(2) If the pod's owner changed, enqueue the old job first, because every enqueued job gets synced\n\n(3) If the new owner is still a job, enqueue that job\n\n(4) If the pod is an orphan, enqueue every job that matches it and let syncJob reconcile\n\n```\n// When a pod is updated, figure out what job/s manage it and wake them up.\n// If the labels of the pod have changed we need to awaken both the old\n// and new job. old and cur must be *v1.Pod types.\nfunc (jm *JobController) updatePod(old, cur interface{}) {\n\tcurPod := cur.(*v1.Pod)\n\toldPod := old.(*v1.Pod)\n\tif curPod.ResourceVersion == oldPod.ResourceVersion {\n\t\t// Periodic resync will send update events for all known pods.\n\t\t// Two different versions of the same pod will always have different RVs.\n\t\treturn\n\t}\n\t// 1. deletePod eventually calls DeletionObserved, decrementing del for this job in the map\n\tif curPod.DeletionTimestamp != nil {\n\t\t// when a pod is deleted gracefully it's deletion timestamp is first modified to reflect a grace period,\n\t\t// and after such time has passed, the kubelet actually deletes it from the store. 
We receive an update\n\t\t// for modification of the deletion timestamp and expect an job to create more pods asap, not wait\n\t\t// until the kubelet actually deletes the pod.\n\t\tjm.deletePod(curPod)\n\t\treturn\n\t}\n\n\t// the only time we want the backoff to kick-in, is when the pod failed\n\timmediate := curPod.Status.Phase != v1.PodFailed\n\n\tcurControllerRef := metav1.GetControllerOf(curPod)\n\toldControllerRef := metav1.GetControllerOf(oldPod)\n\t// 2. if the pod's owner changed, enqueue the old job first, because every enqueued job gets synced\n\tcontrollerRefChanged := !reflect.DeepEqual(curControllerRef, oldControllerRef)\n\tif controllerRefChanged && oldControllerRef != nil {\n\t\t// The ControllerRef was changed. Sync the old controller, if any.\n\t\tif job := jm.resolveControllerRef(oldPod.Namespace, oldControllerRef); job != nil {\n\t\t\tjm.enqueueController(job, immediate)\n\t\t}\n\t}\n  \n  // 3. if the new owner is still a job, enqueue that job\n\t// If it has a ControllerRef, that's all that matters.\n\tif curControllerRef != nil {\n\t\tjob := jm.resolveControllerRef(curPod.Namespace, curControllerRef)\n\t\tif job == nil {\n\t\t\treturn\n\t\t}\n\t\tjm.enqueueController(job, immediate)\n\t\treturn\n\t}\n  \n  \n  // 4. if the pod is an orphan, enqueue every job that matches it and let syncJob reconcile\n\t// Otherwise, it's an orphan. 
If anything changed, sync matching controllers\n\t// to see if anyone wants to adopt it now.\n\tlabelChanged := !reflect.DeepEqual(curPod.Labels, oldPod.Labels)\n\tif labelChanged || controllerRefChanged {\n\t\tfor _, job := range jm.getPodJobs(curPod) {\n\t\t\tjm.enqueueController(job, immediate)\n\t\t}\n\t}\n}\n```\n\n<br>\n\n##### 2.3.4 deletePod\n\nThe tombstone shows up again here.\n\ndeletePod's logic is simpler still: decrement del for the owning job in the map\n\n```\n// When a pod is deleted, enqueue the job that manages the pod and update its expectations.\n// obj could be an *v1.Pod, or a DeletionFinalStateUnknown marker item.\nfunc (jm *JobController) deletePod(obj interface{}) {\n\tpod, ok := obj.(*v1.Pod)\n\n\t// When a delete is dropped, the relist will notice a pod in the store not\n\t// in the list, leading to the insertion of a tombstone object which contains\n\t// the deleted key/value. Note that this value might be stale. If the pod\n\t// changed labels the new job will not be woken up till the periodic resync.\n\tif !ok {\n\t\ttombstone, ok := obj.(cache.DeletedFinalStateUnknown)\n\t\tif !ok {\n\t\t\tutilruntime.HandleError(fmt.Errorf(\"couldn't get object from tombstone %+v\", obj))\n\t\t\treturn\n\t\t}\n\t\tpod, ok = tombstone.Obj.(*v1.Pod)\n\t\tif !ok {\n\t\t\tutilruntime.HandleError(fmt.Errorf(\"tombstone contained object that is not a pod %+v\", obj))\n\t\t\treturn\n\t\t}\n\t}\n\n\tcontrollerRef := metav1.GetControllerOf(pod)\n\tif controllerRef == nil {\n\t\t// No controller should care about orphans being deleted.\n\t\treturn\n\t}\n\tjob := jm.resolveControllerRef(pod.Namespace, controllerRef)\n\tif job == nil {\n\t\treturn\n\t}\n\tjobKey, err := controller.KeyFunc(job)\n\tif err != nil {\n\t\treturn\n\t}\n\tjm.expectations.DeletionObserved(jobKey)\n\tjm.enqueueController(job, true)\n}\n```\n\n<br>\n\n##### 2.3.5 Summary\n\n* For pod add, update, and delete events, the core work is maintaining the per-job numbers in the map, and then enqueueing the corresponding job\n* For job add, update, and delete events, everything simply ends up in the queue\n\n### 3. 
Processing jobs from the queue\n\n```\n// Run the main goroutine responsible for watching and syncing jobs.\nfunc (jm *JobController) Run(workers int, stopCh <-chan struct{}) {\n\tdefer utilruntime.HandleCrash()\n\tdefer jm.queue.ShutDown()\n\n\tglog.Infof(\"Starting job controller\")\n\tdefer glog.Infof(\"Shutting down job controller\")\n\n\tif !controller.WaitForCacheSync(\"job\", stopCh, jm.podStoreSynced, jm.jobStoreSynced) {\n\t\treturn\n\t}\n\n\tfor i := 0; i < workers; i++ {\n\t\tgo wait.Until(jm.worker, time.Second, stopCh)\n\t}\n\n\t<-stopCh\n}\n```\n\n```\n// worker runs a worker thread that just dequeues items, processes them, and marks them done.\n// It enforces that the syncHandler is never invoked concurrently with the same key.\nfunc (jm *JobController) worker() {\n\tfor jm.processNextWorkItem() {\n\t}\n}\n\nfunc (jm *JobController) processNextWorkItem() bool {\n\tkey, quit := jm.queue.Get()\n\tif quit {\n\t\treturn false\n\t}\n\tdefer jm.queue.Done(key)\n\n\tforget, err := jm.syncHandler(key.(string))\n\tif err == nil {\n\t\tif forget {\n\t\t\tjm.queue.Forget(key)\n\t\t}\n\t\treturn true\n\t}\n\n\tutilruntime.HandleError(fmt.Errorf(\"Error syncing job: %v\", err))\n\tjm.queue.AddRateLimited(key)\n\n\treturn true\n}\n```\n\nLike every controller, the flow is: Run -> worker -> processNextWorkItem -> syncHandler.\n\nAnd NewJobController already set jm.syncHandler = jm.syncJob\n\n<br>\n\n#### 3.1 syncJob\n\n(1) Check whether the job has already finished; if so, return immediately.\n\n* A job counts as finished when its `.status.conditions` contains a condition of type `Complete` or `Failed` whose status is true\n\n(2) Get the job's retry count, and use expectations to decide whether a sync is needed. A sync is needed in any of these three cases:\n\n- the job's adds and dels in the map are both <= 0\n- the job's entry in the map has not been updated for more than 5 minutes;\n- the job has no entry in the map, i.e. the object was just created;\n\n(3) Get all pods owned by the job, split into three classes: active, succeeded, failed\n\n(4) If this is the job's first start, set its start time to now; if ActiveDeadlineSeconds is also set, re-enqueue the job after that duration\n\n(5) Decide whether the job has failed. There are two cases:\n\n* one: the job's retry count has reached `Spec.BackoffLimit` (default 6)\n* two: the job's running time has reached the value set in `job.Spec.ActiveDeadlineSeconds`\n\n(6) If the job has failed, 
delete all of its pods, then emit an event saying the job has failed and why\n\n(7) If the job needs a sync and has no deletionTimestamp, call manageJob to drive the number of active pods toward parallelism\n\n(8) Check `job.Spec.Completions` to decide whether the job has finished; if `job.Spec.Completions` is unset, a single successfully finished pod marks the job complete.\n\n(9) Finally, if the job's status has changed, push the update to the apiserver;\n\n```go\n// syncJob will sync the job with the given key if it has had its expectations fulfilled, meaning\n// it did not expect to see any more of its pods created or deleted. This function is not meant to be invoked\n// concurrently with the same key.\nfunc (jm *JobController) syncJob(key string) (bool, error) {\n    // 1. time this sync run\n\tstartTime := time.Now()\n\tdefer func() {\n\t\tglog.V(4).Infof(\"Finished syncing job %q (%v)\", key, time.Since(startTime))\n\t}()\n    \n    // 2. fetch the job object from the lister\n\tns, name, err := cache.SplitMetaNamespaceKey(key)\n\tif err != nil {\n\t\treturn false, err\n\t}\n\tif len(ns) == 0 || len(name) == 0 {\n\t\treturn false, fmt.Errorf(\"invalid job key %q: either namespace or name is missing\", key)\n\t}\n\tsharedJob, err := jm.jobLister.Jobs(ns).Get(name)\n\tif err != nil {\n\t\tif errors.IsNotFound(err) {\n\t\t\tglog.V(4).Infof(\"Job has been deleted: %v\", key)\n\t\t\tjm.expectations.DeleteExpectations(key)\n\t\t\treturn true, nil\n\t\t}\n\t\treturn false, err\n\t}\n\tjob := *sharedJob\n\n\t// if job was finished previously, we don't want to redo the termination\n    // 3. if the job already finished, return right away\n\tif IsJobFinished(&job) {\n\t\treturn true, nil\n\t}\n\n\t// retrieve the previous number of retry\n\t// 4. the retry count comes for free from the workqueue itself\n\tpreviousRetry := jm.queue.NumRequeues(key)\n\n\t// Check the expectations of the job before counting active pods, otherwise a new pod can sneak in\n\t// and update the expectations after we've retrieved active pods from the store. 
If a new pod enters\n\t// the store after we've checked the expectation, the job sync is just deferred till the next relist.\n\t// 5. use expectations to decide whether this job may be synced\n\tjobNeedsSync := jm.expectations.SatisfiedExpectations(key)\n\n    // 6. fetch all pods owned by the job\n\tpods, err := jm.getPodsForJob(&job)\n\tif err != nil {\n\t\treturn false, err\n\t}\n    \n\n   \n    // 7. count the pods in active, succeeded and failed states\n\tactivePods := controller.FilterActivePods(pods)\n\tactive := int32(len(activePods))\n\tsucceeded, failed := getStatus(pods)\n\tconditions := len(job.Status.Conditions)\n\t// job first start\n   // 8. is this the job's first start?\n\tif job.Status.StartTime == nil {\n\t\tnow := metav1.Now()\n\t\tjob.Status.StartTime = &now\n\t\t// enqueue a sync to check if job past ActiveDeadlineSeconds\n         // 9. if ActiveDeadlineSeconds is set, re-enqueue once that much time has passed.\n\t\tif job.Spec.ActiveDeadlineSeconds != nil {\n\t\t\tglog.V(4).Infof(\"Job %s have ActiveDeadlineSeconds will sync after %d seconds\",\n\t\t\t\tkey, *job.Spec.ActiveDeadlineSeconds)\n\t\t\tjm.queue.AddAfter(key, time.Duration(*job.Spec.ActiveDeadlineSeconds)*time.Second)\n\t\t}\n\t}\n\n\tvar manageJobErr error\n\tjobFailed := false\n\tvar failureReason string\n\tvar failureMessage string\n\n    // 10. use the failed-pod count to check the job's retry ceiling, BackoffLimit (6 in our example).\n\tjobHaveNewFailure := failed > job.Status.Failed\n\t// new failures happen when status does not reflect the failures and active\n\t// is different than parallelism, otherwise the previous controller loop\n\t// failed updating status so even if we pick up failure it is not a new one\n  // some pods may be deleted before the job finishes, so the check uses previousRetry+1 (counting this failure).\n\texceedsBackoffLimit := jobHaveNewFailure && (active != *job.Spec.Parallelism) &&\n\t\t(int32(previousRetry)+1 > *job.Spec.BackoffLimit)\n\n\tif exceedsBackoffLimit || pastBackoffLimitOnFailure(&job, pods) {\n\t\t// check if the number of pod restart exceeds backoff (for restart OnFailure only)\n\t\t// OR if the number of failed jobs increased since 
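The `exceedsBackoffLimit` expression reduces to a small predicate; the helper name below is mine, not the controller's:

```go
package main

import "fmt"

// exceedsBackoffLimit mirrors the expression in syncJob: the job is treated
// as failed once a *new* pod failure is observed, the job is not at full
// parallelism (so the failure is not just a status lag), and the retry
// count -- including the current failure -- exceeds spec.backoffLimit.
func exceedsBackoffLimit(newFailure bool, active, parallelism, previousRetry, backoffLimit int32) bool {
	return newFailure && active != parallelism && previousRetry+1 > backoffLimit
}

func main() {
	// 7th retry with backoffLimit 6: the job is marked BackoffLimitExceeded.
	fmt.Println(exceedsBackoffLimit(true, 0, 2, 6, 6))
	// Same counts but no new failure seen during this sync: keep going.
	fmt.Println(exceedsBackoffLimit(false, 0, 2, 6, 6))
}
```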
the last syncJob\n\t\tjobFailed = true\n\t\tfailureReason = \"BackoffLimitExceeded\"\n\t\tfailureMessage = \"Job has reached the specified backoff limit\"\n\t} else if pastActiveDeadline(&job) {\n\t\tjobFailed = true\n\t\tfailureReason = \"DeadlineExceeded\"\n\t\tfailureMessage = \"Job was active longer than specified deadline\"\n\t}\n    \n    // 11. if the job failed, call jm.deleteJobPods to delete all active pods concurrently\n\tif jobFailed {\n\t\terrCh := make(chan error, active)\n\t\tjm.deleteJobPods(&job, activePods, errCh)\n\t\tselect {\n\t\tcase manageJobErr = <-errCh:\n\t\t\tif manageJobErr != nil {\n\t\t\t\tbreak\n\t\t\t}\n\t\tdefault:\n\t\t}\n\n\t\t// update status values accordingly\n\t\tfailed += active\n\t\tactive = 0\n\t\tjob.Status.Conditions = append(job.Status.Conditions, newCondition(batch.JobFailed, failureReason, failureMessage))\n\t\tjm.recorder.Event(&job, v1.EventTypeWarning, failureReason, failureMessage)\n\t} else {\n           \n        // 12. if not failed, sync via manageJob when jobNeedsSync says so\n\t\tif jobNeedsSync && job.DeletionTimestamp == nil {\n\t\t\tactive, manageJobErr = jm.manageJob(activePods, succeeded, &job)\n\t\t}\n        \n        // 13. check job.Spec.Completions to decide whether the job has finished\n\t\tcompletions := succeeded\n\t\tcomplete := false\n\t\tif job.Spec.Completions == nil {\n\t\t\t// This type of job is complete when any pod exits with success.\n\t\t\t// Each pod is capable of\n\t\t\t// determining whether or not the entire Job is done.  Subsequent pods are\n\t\t\t// not expected to fail, but if they do, the failure is ignored.  Once any\n\t\t\t// pod succeeds, the controller waits for remaining pods to finish, and\n\t\t\t// then the job is complete.\n            \n\t\t\tif succeeded > 0 && active == 0 {\n\t\t\t\tcomplete = true\n\t\t\t}\n\t\t} else {\n\t\t\t// Job specifies a number of completions.  This type of job signals\n\t\t\t// success by having that number of successes.  
Since we do not\n\t\t\t// start more pods than there are remaining completions, there should\n\t\t\t// not be any remaining active pods once this count is reached.\n\t\t\tif completions >= *job.Spec.Completions {\n\t\t\t\tcomplete = true\n\t\t\t\tif active > 0 {\n\t\t\t\t\tjm.recorder.Event(&job, v1.EventTypeWarning, \"TooManyActivePods\", \"Too many active pods running after completion count reached\")\n\t\t\t\t}\n\t\t\t\tif completions > *job.Spec.Completions {\n\t\t\t\t\tjm.recorder.Event(&job, v1.EventTypeWarning, \"TooManySucceededPods\", \"Too many succeeded pods running after completion count reached\")\n\t\t\t\t}\n\t\t\t}\n\t\t}\n                \n        // 14. if the job has finished, update job.Status.Conditions and job.Status.CompletionTime\n\t\tif complete {\n\t\t\tjob.Status.Conditions = append(job.Status.Conditions, newCondition(batch.JobComplete, \"\", \"\"))\n\t\t\tnow := metav1.Now()\n\t\t\tjob.Status.CompletionTime = &now\n\t\t}\n\t}\n\n\tforget := false\n\t// Check if the number of jobs succeeded increased since the last check. 
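The two completion rules in this branch can be restated as a pure function (`jobComplete` is my name, not the controller's; `completions == nil` means "any single success completes the job"):

```go
package main

import "fmt"

// jobComplete mirrors the completion logic in syncJob: with no
// spec.completions set, one success (and no pods still active) completes
// the job; otherwise the job completes once succeeded >= *spec.completions.
func jobComplete(completions *int32, succeeded, active int32) bool {
	if completions == nil {
		return succeeded > 0 && active == 0
	}
	return succeeded >= *completions
}

func main() {
	fmt.Println(jobComplete(nil, 1, 0)) // true: any single success completes it
	four := int32(4)
	fmt.Println(jobComplete(&four, 2, 2)) // false: only 2 of 4 completions done
	fmt.Println(jobComplete(&four, 4, 0)) // true: completion count reached
}
```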
If yes \"forget\" should be true\n\t// This logic is linked to the issue: https://github.com/kubernetes/kubernetes/issues/56853 that aims to\n\t// improve the Job backoff policy when parallelism > 1 and few Jobs failed but others succeed.\n\t// In this case, we should clear the backoff delay.\n\tif job.Status.Succeeded < succeeded {\n\t\tforget = true\n\t}\n   \n    // 15. if the job's status has changed, push it to the apiserver\n\t// no need to update the job if the status hasn't changed since last time\n\tif job.Status.Active != active || job.Status.Succeeded != succeeded || job.Status.Failed != failed || len(job.Status.Conditions) != conditions {\n\t\tjob.Status.Active = active\n\t\tjob.Status.Succeeded = succeeded\n\t\tjob.Status.Failed = failed\n\n\t\tif err := jm.updateHandler(&job); err != nil {\n\t\t\treturn forget, err\n\t\t}\n\n\t\tif jobHaveNewFailure && !IsJobFinished(&job) {\n\t\t\t// returning an error will re-enqueue Job after the backoff period\n\t\t\treturn forget, fmt.Errorf(\"failed pod(s) detected for job key %q\", key)\n\t\t}\n\n\t\tforget = true\n\t}\n\n\treturn forget, manageJobErr\n}\n```\n\n\n\n<br>\n\n#### 3.2 The criteria for a finished job: Complete or Failed, with c.Status == v1.ConditionTrue\n\n```\nfunc IsJobFinished(j *batch.Job) bool {\n   for _, c := range j.Status.Conditions {\n      if (c.Type == batch.JobComplete || c.Type == batch.JobFailed) && c.Status == v1.ConditionTrue {\n         return true\n      }\n   }\n   return false\n}\n```\n\nComparing two jobs shows that a completed job really does carry a status: \"True\" condition; that is the third check above.\n\n```\nA Complete job\nstatus:\n  completionTime: \"2021-01-20T07:27:11Z\"\n  conditions:\n  - lastProbeTime: \"2021-01-20T07:27:11Z\"\n    lastTransitionTime: \"2021-01-20T07:27:11Z\"\n    status: \"True\"\n    type: Complete\n  startTime: \"2021-01-20T07:27:04Z\"\n  succeeded: 1\n\n```\n\n```\nA running job\nstatus:\n  active: 1\n  startTime: \"2021-01-20T07:32:06Z\"\n\n```\n\n<br>\n\n#### 3.3 How the job's pods are obtained\n\n```\n// getPodsForJob returns the set of pods that this Job 
should manage.\n// It also reconciles ControllerRef by adopting/orphaning.\n// Note that the returned Pods are pointers into the cache.\nfunc (jm *JobController) getPodsForJob(j *batch.Job) ([]*v1.Pod, error) {\n   selector, err := metav1.LabelSelectorAsSelector(j.Spec.Selector)\n   if err != nil {\n      return nil, fmt.Errorf(\"couldn't convert Job selector: %v\", err)\n   }\n   // List all pods to include those that don't match the selector anymore\n   // but have a ControllerRef pointing to this controller.\n   pods, err := jm.podStore.Pods(j.Namespace).List(labels.Everything())\n   if err != nil {\n      return nil, err\n   }\n   // If any adoptions are attempted, we should first recheck for deletion\n   // with an uncached quorum read sometime after listing Pods (see #42639).\n   canAdoptFunc := controller.RecheckDeletionTimestamp(func() (metav1.Object, error) {\n      fresh, err := jm.kubeClient.BatchV1().Jobs(j.Namespace).Get(j.Name, metav1.GetOptions{})\n      if err != nil {\n         return nil, err\n      }\n      if fresh.UID != j.UID {\n         return nil, fmt.Errorf(\"original Job %v/%v is gone: got uid %v, wanted %v\", j.Namespace, j.Name, fresh.UID, j.UID)\n      }\n      return fresh, nil\n   })\n   cm := controller.NewPodControllerRefManager(jm.podControl, j, selector, controllerKind, canAdoptFunc)\n   return cm.ClaimPods(pods)\n}\n```\n\n<br>\n\n```\n// NewPodControllerRefManager returns a PodControllerRefManager that exposes\n// methods to manage the controllerRef of pods.\n//\n// The CanAdopt() function can be used to perform a potentially expensive check\n// (such as a live GET from the API server) prior to the first adoption.\n// It will only be called (at most once) if an adoption is actually attempted.\n// If CanAdopt() returns a non-nil error, all adoptions will fail.\n//\n// NOTE: Once CanAdopt() is called, it will not be called again by the same\n//       PodControllerRefManager instance. 
Create a new instance if it makes\n//       sense to check CanAdopt() again (e.g. in a different sync pass).\nfunc NewPodControllerRefManager(\n\tpodControl PodControlInterface,\n\tcontroller metav1.Object,\n\tselector labels.Selector,\n\tcontrollerKind schema.GroupVersionKind,\n\tcanAdopt func() error,\n) *PodControllerRefManager {\n\treturn &PodControllerRefManager{\n\t\tBaseControllerRefManager: BaseControllerRefManager{\n\t\t\tController:   controller,\n\t\t\tSelector:     selector,\n\t\t\tCanAdoptFunc: canAdopt,\n\t\t},\n\t\tcontrollerKind: controllerKind,\n\t\tpodControl:     podControl,\n\t}\n}\n```\n\n```\n最终还是通过 labels 匹配\n// If the error is nil, either the reconciliation succeeded, or no\n// reconciliation was necessary. The list of Pods that you now own is returned.\nfunc (m *PodControllerRefManager) ClaimPods(pods []*v1.Pod, filters ...func(*v1.Pod) bool) ([]*v1.Pod, error) {\n\tvar claimed []*v1.Pod\n\tvar errlist []error\n\n\tmatch := func(obj metav1.Object) bool {\n\t\tpod := obj.(*v1.Pod)\n\t\t// Check selector first so filters only run on potentially matching Pods.\n\t\tif !m.Selector.Matches(labels.Set(pod.Labels)) {\n\t\t\treturn false\n\t\t}\n\t\tfor _, filter := range filters {\n\t\t\tif !filter(pod) {\n\t\t\t\treturn false\n\t\t\t}\n\t\t}\n\t\treturn true\n\t}\n\tadopt := func(obj metav1.Object) error {\n\t\treturn m.AdoptPod(obj.(*v1.Pod))\n\t}\n\trelease := func(obj metav1.Object) error {\n\t\treturn m.ReleasePod(obj.(*v1.Pod))\n\t}\n\n\tfor _, pod := range pods {\n\t\tok, err := m.ClaimObject(pod, match, adopt, release)\n\t\tif err != nil {\n\t\t\terrlist = append(errlist, err)\n\t\t\tcontinue\n\t\t}\n\t\tif ok {\n\t\t\tclaimed = append(claimed, pod)\n\t\t}\n\t}\n\treturn claimed, utilerrors.NewAggregate(errlist)\n}\n```\n\n<br>\n\n这是 job zx-testip1-1611142680产生的一个pod.\n\n```\nkind: Pod\nmetadata:\n  labels:\n    controller-uid: ecff8cf1-7523-4d90-9559-22c9e994f726  //这个是job的 uuid\n    job-name: zx-testip1-1611142680\n  name: 
zx-testip1-1611142680-4s9z8\n  namespace: zx\n```\n\n<br>\n\n#### 3.4  jm.manageJob\n\nNote: this is where the map entries get set and initialized\n\nThe core job of `jm.manageJob` is to check, against the job's parallelism, whether the current number of active pods is right, and to adjust it if not.\n\nConcretely:\n\n- If active > parallelism, there are too many active pods and some need to be deleted.\n\n  The pod-deletion logic is covered in the rs post; in short, pods are ranked by run time, state, and similar signals.\n\n- If active < parallelism, there are too few active pods and some need to be created.\n\n```\n// manageJob is the core method responsible for managing the number of running\n// pods according to what is specified in the job.Spec.\n// Does NOT modify <activePods>.\nfunc (jm *JobController) manageJob(activePods []*v1.Pod, succeeded int32, job *batch.Job) (int32, error) {\n\tvar activeLock sync.Mutex\n\tactive := int32(len(activePods))\n\tparallelism := *job.Spec.Parallelism\n\tjobKey, err := controller.KeyFunc(job)\n\tif err != nil {\n\t\tutilruntime.HandleError(fmt.Errorf(\"Couldn't get key for job %#v: %v\", job, err))\n\t\treturn 0, nil\n\t}\n\n\tvar errCh chan error\n\tif active > parallelism {\n\t\tdiff := active - parallelism\n\t\terrCh = make(chan error, diff)\n\t\t// note: the map entry is set here\n\t\tjm.expectations.ExpectDeletions(jobKey, int(diff))\n\t\tglog.V(4).Infof(\"Too many pods running job %q, need %d, deleting %d\", jobKey, parallelism, diff)\n\t\t// Sort the pods in the order such that not-ready < ready, unscheduled\n\t\t// < scheduled, and pending < running. 
This ensures that we delete pods\n\t\t// in the earlier stages whenever possible.\n\t\tsort.Sort(controller.ActivePods(activePods))\n\n\t\tactive -= diff\n\t\twait := sync.WaitGroup{}\n\t\twait.Add(int(diff))\n\t\tfor i := int32(0); i < diff; i++ {\n\t\t\tgo func(ix int32) {\n\t\t\t\tdefer wait.Done()\n\t\t\t\tif err := jm.podControl.DeletePod(job.Namespace, activePods[ix].Name, job); err != nil {\n\t\t\t\t\tdefer utilruntime.HandleError(err)\n\t\t\t\t\t// Decrement the expected number of deletes because the informer won't observe this deletion\n\t\t\t\t\tglog.V(2).Infof(\"Failed to delete %v, decrementing expectations for job %q/%q\", activePods[ix].Name, job.Namespace, job.Name)\n\t\t\t\t\tjm.expectations.DeletionObserved(jobKey)\n\t\t\t\t\tactiveLock.Lock()\n\t\t\t\t\tactive++\n\t\t\t\t\tactiveLock.Unlock()\n\t\t\t\t\terrCh <- err\n\t\t\t\t}\n\t\t\t}(i)\n\t\t}\n\t\twait.Wait()\n\n\t} else if active < parallelism {\n\t\twantActive := int32(0)\n\t\tif job.Spec.Completions == nil {\n\t\t\t// Job does not specify a number of completions.  Therefore, number active\n\t\t\t// should be equal to parallelism, unless the job has seen at least\n\t\t\t// once success, in which leave whatever is running, running.\n\t\t\tif succeeded > 0 {\n\t\t\t\twantActive = active\n\t\t\t} else {\n\t\t\t\twantActive = parallelism\n\t\t\t}\n\t\t} else {\n\t\t\t// Job specifies a specific number of completions.  
Therefore, number\n\t\t\t// active should not ever exceed number of remaining completions.\n\t\t\twantActive = *job.Spec.Completions - succeeded\n\t\t\tif wantActive > parallelism {\n\t\t\t\twantActive = parallelism\n\t\t\t}\n\t\t}\n\t\tdiff := wantActive - active\n\t\tif diff < 0 {\n\t\t\tutilruntime.HandleError(fmt.Errorf(\"More active than wanted: job %q, want %d, have %d\", jobKey, wantActive, active))\n\t\t\tdiff = 0\n\t\t}\n\t\tjm.expectations.ExpectCreations(jobKey, int(diff))\n\t\terrCh = make(chan error, diff)\n\t\tglog.V(4).Infof(\"Too few pods running job %q, need %d, creating %d\", jobKey, wantActive, diff)\n\n\t\tactive += diff\n\t\twait := sync.WaitGroup{}\n\n\t\t// Batch the pod creates. Batch sizes start at SlowStartInitialBatchSize\n\t\t// and double with each successful iteration in a kind of \"slow start\".\n\t\t// This handles attempts to start large numbers of pods that would\n\t\t// likely all fail with the same error. For example a project with a\n\t\t// low quota that attempts to create a large number of pods will be\n\t\t// prevented from spamming the API service with the pod create requests\n\t\t// after one of its pods fails.  
Conveniently, this also prevents the\n\t\t// event spam that those failures would generate.\n\t\tfor batchSize := int32(integer.IntMin(int(diff), controller.SlowStartInitialBatchSize)); diff > 0; batchSize = integer.Int32Min(2*batchSize, diff) {\n\t\t\terrorCount := len(errCh)\n\t\t\twait.Add(int(batchSize))\n\t\t\tfor i := int32(0); i < batchSize; i++ {\n\t\t\t\tgo func() {\n\t\t\t\t\tdefer wait.Done()\n\t\t\t\t\terr := jm.podControl.CreatePodsWithControllerRef(job.Namespace, &job.Spec.Template, job, metav1.NewControllerRef(job, controllerKind))\n\t\t\t\t\tif err != nil && errors.IsTimeout(err) {\n\t\t\t\t\t\t// Pod is created but its initialization has timed out.\n\t\t\t\t\t\t// If the initialization is successful eventually, the\n\t\t\t\t\t\t// controller will observe the creation via the informer.\n\t\t\t\t\t\t// If the initialization fails, or if the pod keeps\n\t\t\t\t\t\t// uninitialized for a long time, the informer will not\n\t\t\t\t\t\t// receive any update, and the controller will create a new\n\t\t\t\t\t\t// pod when the expectation expires.\n\t\t\t\t\t\treturn\n\t\t\t\t\t}\n\t\t\t\t\tif err != nil {\n\t\t\t\t\t\tdefer utilruntime.HandleError(err)\n\t\t\t\t\t\t// Decrement the expected number of creates because the informer won't observe this pod\n\t\t\t\t\t\tglog.V(2).Infof(\"Failed creation, decrementing expectations for job %q/%q\", job.Namespace, job.Name)\n\t\t\t\t\t\tjm.expectations.CreationObserved(jobKey)\n\t\t\t\t\t\tactiveLock.Lock()\n\t\t\t\t\t\tactive--\n\t\t\t\t\t\tactiveLock.Unlock()\n\t\t\t\t\t\terrCh <- err\n\t\t\t\t\t}\n\t\t\t\t}()\n\t\t\t}\n\t\t\twait.Wait()\n\t\t\t// any skipped pods that we never attempted to start shouldn't be expected.\n\t\t\tskippedPods := diff - batchSize\n\t\t\tif errorCount < len(errCh) && skippedPods > 0 {\n\t\t\t\tglog.V(2).Infof(\"Slow-start failure. 
Skipping creation of %d pods, decrementing expectations for job %q/%q\", skippedPods, job.Namespace, job.Name)\n\t\t\t\tactive -= skippedPods\n\t\t\t\tfor i := int32(0); i < skippedPods; i++ {\n\t\t\t\t\t// Decrement the expected number of creates because the informer won't observe this pod\n\t\t\t\t\tjm.expectations.CreationObserved(jobKey)\n\t\t\t\t}\n\t\t\t\t// The skipped pods will be retried later. The next controller resync will\n\t\t\t\t// retry the slow start process.\n\t\t\t\tbreak\n\t\t\t}\n\t\t\tdiff -= batchSize\n\t\t}\n\t}\n\n\tselect {\n\tcase err := <-errCh:\n\t\t// all errors have been reported before, we only need to inform the controller that there was an error and it should re-try this job once more next time.\n\t\tif err != nil {\n\t\t\treturn active, err\n\t\t}\n\tdefault:\n\t}\n\n\treturn active, nil\n}\n```\n\n<br>\n\n### 4. Summary\n\n(1) The jobController also uses the expectations mechanism, setting the entry each time a sync computes the current number of active pods.\n\n(2) Pod add, update, and delete events then modify the map\n\nFor example, take a job with completions=4, parallelism=2. When the job is created:\n\n(1) There is no entry for the job in the map (expectations), so a sync is needed.\n\n(2) manageJob sets add=2, del=0 in the map\n\n(3) and then creates 2 pods\n\n(4) Each pod creation decrements add; once both pods are created, add=0, del=0, meaning the job is ready to sync again.\n\nThis is the essence of expectations: until those 2 pods are created, the job simply does not need to sync.\n\n(5) The next sync then confirms that only 2 pods may run at a time, waits for them to finish, which triggers the next round and creates 2 more pods\n\n(6) Finally the job completes"
  },
  {
    "path": "k8s/kcm/6-namespaces controller-manager源码分析.md",
"content": "Table of Contents\n=================\n\n  * [1. startNamespaceController](#1-startnamespacecontroller)\n  * [2. NewNamespaceController](#2-newnamespacecontroller)\n  * [3. Run](#3-run)\n     * [3.1 syncNamespaceFromKey](#31-syncnamespacefromkey)\n     * [3.2. deleteAllContent](#32-deleteallcontent)\n  * [4 Summary](#4-总结)\n\nLike every other kcm controller, nsController is registered in NewControllerInitializers and started from there.\n\n### 1. startNamespaceController\n\ncmd\kube-controller-manager\app\core.go\n\nThis file holds many startController functions. This one is the init function for the namespace controller among those startControllers invokes.\n\nns also uses a token bucket for rate limiting.\n\n```\nfunc startNamespaceController(ctx ControllerContext) (http.Handler, bool, error) {\n\t// the namespace cleanup controller is very chatty.  It makes lots of discovery calls and then it makes lots of delete calls\n\t// the ratelimiter negatively affects its speed.  Deleting 100 total items in a namespace (that's only a few of each resource\n\t// including events), takes ~10 seconds by default.\n\tnsKubeconfig := ctx.ClientBuilder.ConfigOrDie(\"namespace-controller\")\n\tnsKubeconfig.QPS *= 20\n\tnsKubeconfig.Burst *= 100\n\tnamespaceKubeClient := clientset.NewForConfigOrDie(nsKubeconfig)\n\treturn startModifiedNamespaceController(ctx, namespaceKubeClient, nsKubeconfig)\n}\n\n\nfunc startModifiedNamespaceController(ctx ControllerContext, namespaceKubeClient clientset.Interface, nsKubeconfig *restclient.Config) (http.Handler, bool, error) {\n\n\tmetadataClient, err := metadata.NewForConfig(nsKubeconfig)\n\tif err != nil {\n\t\treturn nil, true, err\n\t}\n\n\tdiscoverResourcesFn := namespaceKubeClient.Discovery().ServerPreferredNamespacedResources\n\n\tnamespaceController := namespacecontroller.NewNamespaceController(\n\t\tnamespaceKubeClient,\n\t\tmetadataClient,\n\t\tdiscoverResourcesFn,\n\t\tctx.InformerFactory.Core().V1().Namespaces(),\n\t\tctx.ComponentConfig.NamespaceController.NamespaceSyncPeriod.Duration,\n\t\tv1.FinalizerKubernetes,\n\t)\n\tgo 
namespaceController.Run(int(ctx.ComponentConfig.NamespaceController.ConcurrentNamespaceSyncs), ctx.Stop)\n\n\treturn nil, true, nil\n}\n```\n\n(1) NewNamespaceController\n\n(2) nsController.Run\n\n<br>\n\n### 2. NewNamespaceController\n\n```\n// NewNamespaceController creates a new NamespaceController\nfunc NewNamespaceController(\n\tkubeClient clientset.Interface,\n\tmetadataClient metadata.Interface,\n\tdiscoverResourcesFn func() ([]*metav1.APIResourceList, error),\n\tnamespaceInformer coreinformers.NamespaceInformer,\n\tresyncPeriod time.Duration,\n\tfinalizerToken v1.FinalizerName) *NamespaceController {\n\n\t// create the controller so we can inject the enqueue function\n\tnamespaceController := &NamespaceController{\n\t\tqueue:                      workqueue.NewNamedRateLimitingQueue(nsControllerRateLimiter(), \"namespace\"),\n\t\tnamespacedResourcesDeleter: deletion.NewNamespacedResourcesDeleter(kubeClient.CoreV1().Namespaces(), metadataClient, kubeClient.CoreV1(), discoverResourcesFn, finalizerToken),\n\t}\n\n\tif kubeClient != nil && kubeClient.CoreV1().RESTClient().GetRateLimiter() != nil {\n\t\tratelimiter.RegisterMetricAndTrackRateLimiterUsage(\"namespace_controller\", kubeClient.CoreV1().RESTClient().GetRateLimiter())\n\t}\n\n\t// configure the namespace informer event handlers\n\tnamespaceInformer.Informer().AddEventHandlerWithResyncPeriod(\n\t\tcache.ResourceEventHandlerFuncs{\n\t\t\tAddFunc: func(obj interface{}) {\n\t\t\t\tnamespace := obj.(*v1.Namespace)\n\t\t\t\tnamespaceController.enqueueNamespace(namespace)\n\t\t\t},\n\t\t\tUpdateFunc: func(oldObj, newObj interface{}) {\n\t\t\t\tnamespace := newObj.(*v1.Namespace)\n\t\t\t\tnamespaceController.enqueueNamespace(namespace)\n\t\t\t},\n\t\t},\n\t\tresyncPeriod,\n\t)\n\tnamespaceController.lister = namespaceInformer.Lister()\n\tnamespaceController.listerSynced = namespaceInformer.Informer().HasSynced\n\n\treturn 
namespaceController\n}\n```\n\nNewNamespaceController就是初始化了限速队列和namespacedResourcesDeleter，然后定义好需要监听对象的处理函数。这里只注册了AddFunc, UpdateFunc。\n\nQ：delete为啥不监听？\n\nA：因为ns都delete掉了，再监听没什么意义。ns被删除时会先打上DeletionTimestamp，这是一个update事件。\n\n<br>\n\n### 3. Run\n\n定义好控制器之后就开始运行了。\n\n```go\nfunc (nm *NamespaceController) Run(workers int, stopCh <-chan struct{}) {\n\tdefer utilruntime.HandleCrash()\n\tdefer nm.queue.ShutDown()\n\n\tklog.Infof(\"Starting namespace controller\")\n\tdefer klog.Infof(\"Shutting down namespace controller\")\n\n\tif !cache.WaitForNamedCacheSync(\"namespace\", stopCh, nm.listerSynced) {\n\t\treturn\n\t}\n\n\tklog.V(5).Info(\"Starting workers of namespace controller\")\n\tfor i := 0; i < workers; i++ {\n\t\tgo wait.Until(nm.worker, time.Second, stopCh)\n\t}\n\t<-stopCh\n}\n```\n\n还是一样，Run里面启动多个worker，并发处理队列中的namespace。这里并没有单独的processNextItem()函数。\n\n可以看出来，worker函数会一直循环处理队列中的ns，直到队列被关闭。\n\n```\n// worker processes the queue of namespace objects.\n// Each namespace can be in the queue at most once.\n// The system ensures that no two workers can process\n// the same namespace at the same time.\nfunc (nm *NamespaceController) worker() {\n\tworkFunc := func() bool {\n\t\tkey, quit := nm.queue.Get()\n\t\tif quit {\n\t\t\treturn true\n\t\t}\n\t\tdefer nm.queue.Done(key)\n\n\t\terr := nm.syncNamespaceFromKey(key.(string))\n\t\tif err == nil {\n\t\t\t// no error, forget this entry and return\n\t\t\tnm.queue.Forget(key)\n\t\t\treturn false\n\t\t}\n\n\t\tif estimate, ok := err.(*deletion.ResourcesRemainingError); ok {\n\t\t\tt := estimate.Estimate/2 + 1\n\t\t\tklog.V(4).Infof(\"Content remaining in namespace %s, waiting %d seconds\", key, t)\n\t\t\tnm.queue.AddAfter(key, time.Duration(t)*time.Second)\n\t\t} else {\n\t\t\t// rather than wait for a full resync, re-add the namespace to the queue to be processed\n\t\t\tnm.queue.AddRateLimited(key)\n\t\t\tutilruntime.HandleError(fmt.Errorf(\"deletion of namespace %v failed: %v\", key, err))\n\t\t}\n\t\treturn false\n\t}\n\n\tfor {\n\t\tquit := workFunc()\n\n\t\tif 
quit {\n\t\t\treturn\n\t\t}\n\t}\n}\n```\n\n<br>\n\n#### 3.1 syncNamespaceFromKey\n\nsyncNamespaceFromKey主要调用了nm.namespacedResourcesDeleter.Delete，它们的逻辑如下：\n\n（1）如果namespace不存在，返回nil\n\n（2）如果namespace没有DeletionTimestamp字段，返回nil\n\n（3）可以删除的话，先删除namespace下所有的资源，如果某一个资源删除需要等待，返回一个ResourcesRemainingError\n\n（4）所有的资源删除完后，删除namespace。\n\n```\n// syncNamespaceFromKey looks for a namespace with the specified key in its store and synchronizes it\nfunc (nm *NamespaceController) syncNamespaceFromKey(key string) (err error) {\n\tstartTime := time.Now()\n\tdefer func() {\n\t\tglog.V(4).Infof(\"Finished syncing namespace %q (%v)\", key, time.Since(startTime))\n\t}()\n  \n  // 1.如果namespace不存在，返回nil\n\tnamespace, err := nm.lister.Get(key)\n\tif errors.IsNotFound(err) {\n\t\tglog.Infof(\"Namespace has been deleted %v\", key)\n\t\treturn nil\n\t}\n\tif err != nil {\n\t\tutilruntime.HandleError(fmt.Errorf(\"Unable to retrieve namespace %v from store: %v\", key, err))\n\t\treturn err\n\t}\n\treturn nm.namespacedResourcesDeleter.Delete(namespace.Name)\n}\n```\n\nDelete函数的主要逻辑就是：如果ns不需要删除就返回，需要删除就先删除资源，再删除ns。\n\n具体为：\n\n（1）ns没有DeletionTimestamp不做任何操作\n\n（2）如果没有Finalizers,也不处理\n\n（3）删除ns下的所有资源，如果还有资源未删除完，返回ResourcesRemainingError；全部删除后移除finalizer，apiserver随后就能真正删除这个ns\n\n```\n// Delete deletes all resources in the given namespace.\n// Before deleting resources:\n// * It ensures that deletion timestamp is set on the\n//   namespace (does nothing if deletion timestamp is missing).\n// * Verifies that the namespace is in the \"terminating\" phase\n//   (updates the namespace phase if it is not yet marked terminating)\n// After deleting the resources:\n// * It removes finalizer token from the given namespace.\n//\n// Returns an error if any of those steps fail.\n// Returns ResourcesRemainingError if it deleted some resources but needs\n// to wait for them to go away.\n// Caller is expected to keep calling this until it succeeds.\nfunc (d *namespacedResourcesDeleter) Delete(nsName string) error {\n\t// Multiple controllers may edit a namespace during termination\n\t// first 
get the latest state of the namespace before proceeding\n\t// if the namespace was deleted already, don't do anything\n\tnamespace, err := d.nsClient.Get(nsName, metav1.GetOptions{})\n\tif err != nil {\n\t\tif errors.IsNotFound(err) {\n\t\t\treturn nil\n\t\t}\n\t\treturn err\n\t}\n\t// 1.ns没有DeletionTimestamp不做任何操作。\n\tif namespace.DeletionTimestamp == nil {\n\t\treturn nil\n\t}\n\n\tklog.V(5).Infof(\"namespace controller - syncNamespace - namespace: %s, finalizerToken: %s\", namespace.Name, d.finalizerToken)\n\n\t// ensure that the status is up to date on the namespace\n\t// if we get a not found error, we assume the namespace is truly gone\n\t// 2. 对ns的状态进行修改。retryOnConflictError是一个通用的重试函数，updateNamespaceStatusFunc是实际修改的函数，这里就是把ns的phase修改为Terminating\n\tnamespace, err = d.retryOnConflictError(namespace, d.updateNamespaceStatusFunc)\n\tif err != nil {\n\t\tif errors.IsNotFound(err) {\n\t\t\treturn nil\n\t\t}\n\t\treturn err\n\t}\n\n\t// the latest view of the namespace asserts that namespace is no longer deleting..\n\tif namespace.DeletionTimestamp.IsZero() {\n\t\treturn nil\n\t}\n  \n  \n  // 3.如果已经finalized（没有Finalizers），也不处理\n\t// return if it is already finalized.\n\tif finalized(namespace) {\n\t\treturn nil\n\t}\n  \n  // 4.开始删除ns下的所有资源，estimate表示预计还需要等待多少秒\n\t// there may still be content for us to remove\n\testimate, err := d.deleteAllContent(namespace)\n\tif err != nil {\n\t\treturn err\n\t}\n\tif estimate > 0 {\n\t\treturn &ResourcesRemainingError{estimate}\n\t}\n   \n   // 5.移除finalizer，然后apiserver就能真正删除这个ns\n\t// we have removed content, so mark it finalized by us\n\t_, err = d.retryOnConflictError(namespace, d.finalizeNamespace)\n\tif err != nil {\n\t\t// in normal practice, this should not be possible, but if a deployment is running\n\t\t// two controllers to do namespace deletion that share a common finalizer token it's\n\t\t// possible that a not found could occur since the other controller would have finished the delete.\n\t\tif errors.IsNotFound(err) 
{\n\t\t\treturn nil\n\t\t}\n\t\treturn err\n\t}\n\treturn nil\n}\n```\n\n<br>\n\n#### 3.2. deleteAllContent\n\n这里是用 dynamic client 逐个删除namespace下的所有对象。从代码中的for循环可以看出来，各个GroupVersionResource是串行删除的\n\n```\n// deleteAllContent will use the dynamic client to delete each resource identified in groupVersionResources.\n// It returns an estimate of the time remaining before the remaining resources are deleted.\n// If estimate > 0, not all resources are guaranteed to be gone.\nfunc (d *namespacedResourcesDeleter) deleteAllContent(ns *v1.Namespace) (int64, error) {\n\tnamespace := ns.Name\n\tnamespaceDeletedAt := *ns.DeletionTimestamp\n\tvar errs []error\n\tconditionUpdater := namespaceConditionUpdater{}\n\testimate := int64(0)\n\tklog.V(4).Infof(\"namespace controller - deleteAllContent - namespace: %s\", namespace)\n\n\tresources, err := d.discoverResourcesFn()\n\tif err != nil {\n\t\t// discovery errors are not fatal.  We often have some set of resources we can operate against even if we don't have a complete list\n\t\terrs = append(errs, err)\n\t\tconditionUpdater.ProcessDiscoverResourcesErr(err)\n\t}\n\t// TODO(sttts): get rid of opCache and pass the verbs (especially \"deletecollection\") down into the deleter\n\tdeletableResources := discovery.FilteredBy(discovery.SupportsAllVerbs{Verbs: []string{\"delete\"}}, resources)\n\tgroupVersionResources, err := discovery.GroupVersionResources(deletableResources)\n\tif err != nil {\n\t\t// discovery errors are not fatal.  
We often have some set of resources we can operate against even if we don't have a complete list\n\t\terrs = append(errs, err)\n\t\tconditionUpdater.ProcessGroupVersionErr(err)\n\t}\n\n\tnumRemainingTotals := allGVRDeletionMetadata{\n\t\tgvrToNumRemaining:        map[schema.GroupVersionResource]int{},\n\t\tfinalizersToNumRemaining: map[string]int{},\n\t}\n\tfor gvr := range groupVersionResources {\n\t\tgvrDeletionMetadata, err := d.deleteAllContentForGroupVersionResource(gvr, namespace, namespaceDeletedAt)\n\t\tif err != nil {\n\t\t\t// If there is an error, hold on to it but proceed with all the remaining\n\t\t\t// groupVersionResources.\n\t\t\terrs = append(errs, err)\n\t\t\tconditionUpdater.ProcessDeleteContentErr(err)\n\t\t}\n\t\tif gvrDeletionMetadata.finalizerEstimateSeconds > estimate {\n\t\t\testimate = gvrDeletionMetadata.finalizerEstimateSeconds\n\t\t}\n\t\tif gvrDeletionMetadata.numRemaining > 0 {\n\t\t\tnumRemainingTotals.gvrToNumRemaining[gvr] = gvrDeletionMetadata.numRemaining\n\t\t\tfor finalizer, numRemaining := range gvrDeletionMetadata.finalizersToNumRemaining {\n\t\t\t\tif numRemaining == 0 {\n\t\t\t\t\tcontinue\n\t\t\t\t}\n\t\t\t\tnumRemainingTotals.finalizersToNumRemaining[finalizer] = numRemainingTotals.finalizersToNumRemaining[finalizer] + numRemaining\n\t\t\t}\n\t\t}\n\t}\n\tconditionUpdater.ProcessContentTotals(numRemainingTotals)\n\n\t// we always want to update the conditions because if we have set a condition to \"it worked\" after it was previously, \"it didn't work\",\n\t// we need to reflect that information.  
Recall that additional finalizers can be set on namespaces, so this finalizer may clear itself and\n\t// NOT remove the resource instance.\n\tif hasChanged := conditionUpdater.Update(ns); hasChanged {\n\t\tif _, err = d.nsClient.UpdateStatus(ns); err != nil {\n\t\t\tutilruntime.HandleError(fmt.Errorf(\"couldn't update status condition for namespace %q: %v\", namespace, err))\n\t\t}\n\t}\n\n\t// if len(errs)==0, NewAggregate returns nil.\n\tklog.V(4).Infof(\"namespace controller - deleteAllContent - namespace: %s, estimate: %v, errors: %v\", namespace, estimate, utilerrors.NewAggregate(errs))\n\treturn estimate, utilerrors.NewAggregate(errs)\n}\n```\n\n<br>\n\n### 4 总结\n\n（1）nsController只处理ns的add, update事件，而且只针对设置了DeletionTimestamp的ns进行处理。\n\n（2）如果ns有DeletionTimestamp，nsController做的操作为：\n\n* 第一，修改ns的phase为Terminating（删除中）\n* 第二，删除该ns下的所有资源\n* 第三，移除ns的finalizer\n\n<br>\n\n为什么会这样，这就得补充一下ns的基本特征：\n\n（1）ns在删除之前，有一个finalizers：kubernetes，并且phase: Active\n\nfinalizers：kubernetes的作用就是：删除ns的时候会卡住，得等nsController删除了该命名空间下的所有资源之后，这个finalizer才会被移除\n\n```\n//删除前\nroot@k8s-master:~/testyaml# kubectl get ns zoux -oyaml -w\napiVersion: v1\nkind: Namespace\nmetadata:\n  creationTimestamp: \"2021-07-20T03:18:44Z\"\n  name: zoux\n  resourceVersion: \"9449396\"\n  selfLink: /api/v1/namespaces/zoux\n  uid: 12fab759-0cda-4d98-97db-330cbf407e15\nspec:\n  finalizers:\n  - kubernetes\nstatus:\n  phase: Active\n  \n  \n//删除中  \napiVersion: v1\nkind: Namespace\nmetadata:\n  creationTimestamp: \"2021-07-20T03:18:44Z\"\n  deletionTimestamp: \"2021-07-20T03:20:07Z\"\n  name: zoux\n  resourceVersion: \"9449629\"\n  selfLink: /api/v1/namespaces/zoux\n  uid: 12fab759-0cda-4d98-97db-330cbf407e15\nspec:\n  finalizers:\n  - kubernetes\nstatus:\n  phase: Terminating\n```"
  },
  {
    "path": "k8s/kcm/9-kubernetes污点和容忍度概念介绍.md",
"content": "### 1. 概念介绍\n\n**污点（Taint）** 应用于node身上，表示该节点有污点，不能容忍这个污点的pod，就不应该被调度或继续运行到这个节点上。如果是已经运行的pod被赶走，那就是污点驱逐了。\n\n**容忍度（Toleration）** 是应用于 Pod 上的。容忍度允许调度器把 Pod 调度到带有对应污点的节点上，或者允许这个pod继续运行在这个节点上。\n\n可以看出来，污点和容忍度（Toleration）相互配合，可以用来避免 Pod 被分配/运行到不合适的节点上。每个节点上都可以应用一个或多个污点，每个pod也可以应用一个或多个容忍度。\n\n### 2. 污点详解\n\n污点总共由4个字段组成：\n\n**key, value字段**：可以是任意字符串，用户自定义。\n\n**Effect**：NoExecute，PreferNoSchedule，NoSchedule 三选一\n\n* NoExecute表示不能继续运行，意思是如果该节点有这种污点，但是pod没有对应的容忍度，那么这个pod是会被驱逐的\n* NoSchedule表示不能调度，意思是如果该节点有这种污点，pod没有对应的容忍度，那么在调度的时候，这个pod是不会考虑这个节点的\n* PreferNoSchedule 是NoSchedule的软化版。意思是如果该节点有这种污点，pod没有对应的容忍度，那么在调度的时候，这个pod不会优先考虑这个节点，但是如果实在没有节点可用，它还是接受调度到该节点上的\n\n**TimeAdded**：这个污点是什么时候加的（只对NoExecute污点记录）\n\n```\n// The node this Taint is attached to has the \"effect\" on\n// any pod that does not tolerate the Taint.\ntype Taint struct {\n\t// Required. The taint key to be applied to a node.\n\tKey string `json:\"key\" protobuf:\"bytes,1,opt,name=key\"`\n\t// Required. The taint value corresponding to the taint key.\n\t// +optional\n\tValue string `json:\"value,omitempty\" protobuf:\"bytes,2,opt,name=value\"`\n\t// Required. 
The effect of the taint on pods\n\t// that do not tolerate the taint.\n\t// Valid effects are NoSchedule, PreferNoSchedule and NoExecute.\n\tEffect TaintEffect `json:\"effect\" protobuf:\"bytes,3,opt,name=effect,casttype=TaintEffect\"`\n\t// TimeAdded represents the time at which the taint was added.\n\t// It is only written for NoExecute taints.\n\t// +optional\n\tTimeAdded *metav1.Time `json:\"timeAdded,omitempty\" protobuf:\"bytes,4,opt,name=timeAdded\"`\n}\n```\n\n添加污点的方式也很简单：\n\n```\nkubectl taint nodes node1 key1=value1:NoSchedule\nkubectl taint nodes node1 key1=value1:NoExecute\n```\n\n**k8s默认污点**\n\n- node.kubernetes.io/not-ready：节点未准备好，相当于节点状态Ready的值为False。\n- node.kubernetes.io/unreachable：Node Controller访问不到节点，相当于节点状态Ready的值为Unknown\n- node.kubernetes.io/out-of-disk：节点磁盘耗尽\n- node.kubernetes.io/memory-pressure：节点存在内存压力\n- node.kubernetes.io/disk-pressure：节点存在磁盘压力\n- node.kubernetes.io/network-unavailable：节点网络不可达\n- node.kubernetes.io/unschedulable：节点不可调度\n- node.cloudprovider.kubernetes.io/uninitialized：如果Kubelet启动时指定了一个外部的cloudprovider，它将给当前节点添加一个Taint将其标记为不可用。在cloud-controller-manager的一个controller初始化这个节点后，Kubelet将删除这个Taint\n\n<br>\n\n### 3. 容忍度详解\n\n```\n// Toleration represents the toleration object that can be attached to a pod.\n// The pod this Toleration is attached to tolerates any taint that matches\n// the triple <key,value,effect> using the matching operator <operator>.\ntype Toleration struct {\n\t// Key is the taint key that the toleration applies to. Empty means match all taint keys.\n\t// If the key is empty, operator must be Exists; this combination means to match all values and all keys.\n\t// +optional\n\tKey string\n\t// Operator represents a key's relationship to the value.\n\t// Valid operators are Exists and Equal. 
Defaults to Equal.\n\t// Exists is equivalent to wildcard for value, so that a pod can\n\t// tolerate all taints of a particular category.\n\t// +optional\n\tOperator TolerationOperator\n\t// Value is the taint value the toleration matches to.\n\t// If the operator is Exists, the value should be empty, otherwise just a regular string.\n\t// +optional\n\tValue string\n\t// Effect indicates the taint effect to match. Empty means match all taint effects.\n\t// When specified, allowed values are NoSchedule, PreferNoSchedule and NoExecute.\n\t// +optional\n\tEffect TaintEffect\n\t// TolerationSeconds represents the period of time the toleration (which must be\n\t// of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default,\n\t// it is not set, which means tolerate the taint forever (do not evict). Zero and\n\t// negative values will be treated as 0 (evict immediately) by the system.\n\t// +optional\n\tTolerationSeconds *int64\n}\n```\n\n容忍度应用在pod身上，可以看出来，相比污点，多了2个字段：\n\n**Operator**： string类型，Exists，Equal 二选一\n\n`operator` 的默认值是 `Equal`。\n\n一个容忍度和一个污点相“匹配”是指它们有一样的键名和效果，并且：\n\n- 如果 `operator` 是 `Exists` （此时容忍度不能指定 `value`）, 例如这种\n\n  ```\n  tolerations:\n  - key: \"key1\"\n    operator: \"Exists\"\n    effect: \"NoSchedule\"\n  ```\n\n- 如果 `operator` 是 `Equal` ，则它们的 `value` 应该相等。例如这种\n\n```\ntolerations:\n- key: \"key1\"\n  operator: \"Equal\"\n  value: \"value1\"\n  effect: \"NoSchedule\"\n```\n\n**TolerationSeconds**：容忍时间。表示pod在被驱逐之前，还能容忍该污点多久。只针对NoExecute类型的effect生效。\n\n<br>\n\n**说明：**\n\n存在两种特殊情况：\n\n如果一个容忍度的 `key` 为空且 `operator` 为 `Exists`， 表示这个容忍度与任意的 key、value 和 effect 都匹配，即这个容忍度能容忍任何污点。\n\n如果 `effect` 为空，则可以与键名为 `key1` 的污点的所有效果相匹配。\n\n**TolerationSeconds** 如果没有设置，默认是永远容忍（不驱逐）；设置为0或负值会被系统当作0处理（立即驱逐）。\n\n### 4. 污点驱逐\n\n污点驱逐：node在运行过程中，被设置了NoExecute的污点，但是运行中的pod没有对应的容忍度，因此需要将这些pod删除。\n\nkcm中是NodeLifecycleController控制污点驱逐的。默认是开启的。如下参数默认是true。\n\n```\n--enable-taint-manager=true --feature-gates=TaintBasedEvictions=true\n```"
  },
  {
    "path": "k8s/kcm/kcm篇源码分析总结.md",
"content": "目前为止，kcm篇源码分析共覆盖了hpa, gc, deploy, rs, job, ns 6个主要的控制器。通过这些源码分析，总结下目前的工作:\n\n（1）更了解kcm的机制。kcm就是一堆控制器的集合。每个控制器只干自己相关的事情，通过控制器的共同操作，让集群中的资源达到期望状态\n\n（2）对以后问题的排查，或者需求的开发积累了经验。\n\n* 例如，通过gc篇，以后k8s集群中gc资源出现了问题，可以马上定位修复\n* rs的expectations机制，informer机制等都可以借鉴代码\n\n目前kcm打算就分析到这里，原因在于：\n\n* 通过这6个控制器已经了解了kcm控制器的主体运行逻辑，以后有需求，分析其他控制器的源码也非常容易，分析了再补充\n\n* 将精力放到其他组件的源码分析上"
  },
  {
    "path": "k8s/kube-apiserver/0-apiserver笔记规划.md",
"content": "本章节的目标就是弄懂kube-apiserver的实现细节。从本质来说，kube-apiserver就是一个go实现的server服务器端。\n\n假设我要实现kube-apiserver，我想到要考虑以下的事情：\n\n（1）apiserver的启动流程是怎么样的\n\n（2）k8s这么多资源，是怎么注册的，如何进行多版本的资源管理\n\n（3）如何和etcd存储打通\n\n（4）一个request，经历了哪些流程\n\n（5）认证，授权，Admission是如何实现的\n\n（6）apiserver是如何处理create, update, delete请求的\n\n<br>\n\n因此这章节的目标就是弄清楚上述的问题\n"
  },
  {
    "path": "k8s/kube-apiserver/1-v1.17 kube-apiserver启动参数介绍.md",
    "content": "\n\n摘自：https://v1-17.docs.kubernetes.io/zh/docs/reference/command-line-tools-reference/kube-apiserver/\n\n\n\n- –etcd-servers：etcd集群地址\n\n- –bind-address：监听地址\n\n- –secure-port：https安全端口\n\n- –advertise-address：集群通告地址\n\n- –allow-privileged：启用授权\n\n- –service-cluster-ip-range：Service虚拟IP地址段\n\n- –enable-admission-plugins：准入控制模块\n\n- –authorization-mode：认证授权，启用RBAC授权和节点自管理\n\n- –enable-bootstrap-token-auth：启用TLS bootstrap机制\n\n- –token-auth-file：bootstrap token文件\n\n- –service-node-port-range：Service nodeport类型默认分配端口范围\n\n- –kubelet-client-xxx：apiserver访问kubelet客户端证书\n\n- –tls-xxx-file：apiserver https证书\n\n- –etcd-xxxfile：连接Etcd集群证书\n\n- –audit-log-xxx：审计日志\n\n  \n\n```f\n      --admission-control stringSlice                           控制资源进入集群的准入控制插件的顺序列表。逗号分隔的 NamespaceLifecycle 列表。（默认值 [AlwaysAdmit]）\n\n      --admission-control-config-file string                    包含准入控制配置的文件。\n\n      --advertise-address ip                                    向集群成员通知 apiserver 消息的 IP 地址。这个地址必须能够被集群中其他成员访问。如果 IP 地址为空，将会使用 --bind-address，如果未指定 --bind-address，将会使用主机的默认接口地址。\n\n      --allow-privileged                                        如果为 true, 将允许特权容器。\n\n      --anonymous-auth                                          启用到 API server 的安全端口的匿名请求。未被其他认证方法拒绝的请求被当做匿名请求。匿名请求的用户名为 system:anonymous，用户组名为 system:unauthenticated。（默认值 true）\n\n      --apiserver-count int                                     集群中运行的 apiserver 数量，必须为正数。（默认值 1）\n\n      --audit-log-maxage int                                    基于文件名中的时间戳，旧审计日志文件的最长保留天数。\n\n      --audit-log-maxbackup int                                 旧审计日志文件的最大保留个数。\n\n      --audit-log-maxsize int                                   审计日志被轮转前的最大兆字节数。\n\n      --audit-log-path string                                   如果设置该值，所有到 apiserver 的请求都将会被记录到这个文件。'-' 表示记录到标准输出。\n\n      --audit-policy-file string                                定义审计策略配置的文件的路径。需要打开 'AdvancedAuditing' 特性开关。AdvancedAuditing 需要一个配置来启用审计功能。\n\n      
--audit-webhook-config-file string                        一个具有 kubeconfig 格式的文件的路径，该文件定义了审计的 webhook 配置。需要打开 'AdvancedAuditing' 特性开关。\n\n      --audit-webhook-mode string                               发送审计事件的策略。 Blocking 模式表示正在发送事件时应该阻塞服务器的响应。 Batch 模式使 webhook 异步缓存和发送事件。可选的模式为 batch,blocking。（默认值 \"batch\")\n\n      --authentication-token-webhook-cache-ttl duration         从 webhook 令牌认证者获取的响应的缓存时长。( 默认值 2m0s)\n\n      --authentication-token-webhook-config-file string         包含 webhook 配置的文件，用于令牌认证，具有 kubeconfig 格式。API server 将查询远程服务来决定对 bearer 令牌的认证。\n\n      --authorization-mode string                               在安全端口上进行权限验证的插件的顺序列表。以逗号分隔的列表，包括：AlwaysAllow,AlwaysDeny,ABAC,Webhook,RBAC,Node.（默认值 \"AlwaysAllow\"）\n \n      --authorization-policy-file string                        包含权限验证策略的 csv 文件，和 --authorization-mode=ABAC 一起使用，作用在安全端口上。\n \n      --authorization-webhook-cache-authorized-ttl duration     从 webhook 授权者获得的 'authorized' 响应的缓存时长。（默认值 5m0s） \n \n      --authorization-webhook-cache-unauthorized-ttl duration   从 webhook 授权者获得的 'unauthorized' 响应的缓存时长。（默认值 30s）\n \n      --authorization-webhook-config-file string                包含 webhook 配置的 kubeconfig 格式文件，和 --authorization-mode=Webhook 一起使用。API server 将查询远程服务来决定对 API server 安全端口的访问。\n \n      --azure-container-registry-config string                  包含 Azure 容器注册表配置信息的文件的路径。\n\n      --basic-auth-file string                                  如果设置该值，这个文件将会被用于准许通过 http 基本认证到 API server 安全端口的请求。\n\n      --bind-address ip                                         监听 --secure-port 的 IP 地址。被关联的接口必须能够被集群其它节点和 CLI/web 客户端访问。如果为空，则将使用所有接口（0.0.0.0）。（默认值 0.0.0.0）\n\n      --cert-dir string                                         存放 TLS 证书的目录。如果提供了 --tls-cert-file 和 --tls-private-key-file 选项，该标志将被忽略。（默认值 \"/var/run/kubernetes\"）\n\n      --client-ca-file string                                   如果设置此标志，对于任何请求，如果存在由 client-ca-file 中的 authorities 签名的客户端证书，将会使用客户端证书中的 CommonName 对应的身份进行认证。\n \n      
--cloud-config string                                     云服务提供商配置文件路径。空字符串表示无配置文件 .\n \n      --cloud-provider string                                   云服务提供商，空字符串表示无提供商。\n \n      --contention-profiling                                    如果已经启用 profiling，则启用锁竞争 profiling。\n\n      --cors-allowed-origins stringSlice                        CORS 的域列表，以逗号分隔。合法的域可以是一个匹配子域名的正则表达式。如果这个列表为空则不会启用 CORS.\n\n      --delete-collection-workers int                           用于 DeleteCollection 调用的工作者数量。这被用于加速 namespace 的清理。( 默认值 1)\n\n      --deserialization-cache-size int                          在内存中缓存的反序列化 json 对象的数量。\n\n      --enable-aggregator-routing                               打开到 endpoints IP 的 aggregator 路由请求，替换 cluster IP。\n\n      --enable-garbage-collector                                启用通用垃圾回收器 . 必须与 kube-controller-manager 对应的标志保持同步。 （默认值 true）\n\n      --enable-logs-handler                                     如果为 true，则为 apiserver 日志功能安装一个 /logs 处理器。（默认值 true）\n\n      --enable-swagger-ui                                       在 apiserver 的 /swagger-ui 路径启用 swagger ui。\n\n      --etcd-cafile string                                      用于保护 etcd 通信的 SSL CA 文件。\n\n      --etcd-certfile string                                    用于保护 etcd 通信的的 SSL 证书文件。\n\n      --etcd-keyfile string                                     用于保护 etcd 通信的 SSL 密钥文件 .\n\n      --etcd-prefix string                                      附加到所有 etcd 中资源路径的前缀。 （默认值 \"/registry\"）\n\n      --etcd-quorum-read                                        如果为 true, 启用 quorum 读。\n\n      --etcd-servers stringSlice                                连接的 etcd 服务器列表 , 形式为（scheme://ip:port)，使用逗号分隔。\n\n      --etcd-servers-overrides stringSlice                      针对单个资源的 etcd 服务器覆盖配置 , 以逗号分隔。 单个配置覆盖格式为 : group/resource#servers, 其中 servers 形式为 http://ip:port, 以分号分隔。\n\n      --event-ttl duration                                      事件驻留时间。（默认值 1h0m0s)\n\n      --enable-bootstrap-token-auth                         
    启用此选项以允许 'kube-system' 命名空间中的 'bootstrap.kubernetes.io/token' 类型密钥可以被用于 TLS 的启动认证。\n\n      --experimental-encryption-provider-config string          包含加密提供程序的配置的文件，该加密提供程序被用于在 etcd 中保存密钥。\n\n      --external-hostname string                                为此 master 生成外部 URL 时使用的主机名 ( 例如 Swagger API 文档 )。\n\n      --feature-gates mapStringBool                             一个描述 alpha/experimental 特性开关的键值对列表。 选项包括 :\nAccelerators=true|false (ALPHA - default=false)\nAdvancedAuditing=true|false (ALPHA - default=false)\nAffinityInAnnotations=true|false (ALPHA - default=false)\nAllAlpha=true|false (ALPHA - default=false)\nAllowExtTrafficLocalEndpoints=true|false (default=true)\nAppArmor=true|false (BETA - default=true)\nDynamicKubeletConfig=true|false (ALPHA - default=false)\nDynamicVolumeProvisioning=true|false (ALPHA - default=true)\nExperimentalCriticalPodAnnotation=true|false (ALPHA - default=false)\nExperimentalHostUserNamespaceDefaulting=true|false (BETA - default=false)\nLocalStorageCapacityIsolation=true|false (ALPHA - default=false)\nPersistentLocalVolumes=true|false (ALPHA - default=false)\nRotateKubeletClientCertificate=true|false (ALPHA - default=false)\nRotateKubeletServerCertificate=true|false (ALPHA - default=false)\nStreamingProxyRedirects=true|false (BETA - default=true)\nTaintBasedEvictions=true|false (ALPHA - default=false)\n\n      --google-json-key string                                  用于认证的 Google Cloud Platform 服务账号的 JSON 密钥。\n\n      --insecure-allow-any-token username/group1,group2         如果设置该值 , 你的服务将处于非安全状态。任何令牌都将会被允许，并将从令牌中把用户信息解析成为 username/group1,group2。\n\n      --insecure-bind-address ip                                用于监听 --insecure-port 的 IP 地址 ( 设置成 0.0.0.0 表示监听所有接口 )。（默认值 127.0.0.1)\n\n      --insecure-port int                                       用于监听不安全和为认证访问的端口。这个配置假设你已经设置了防火墙规则，使得这个端口不能从集群外访问。对集群的公共地址的 443 端口的访问将被代理到这个端口。默认设置中使用 nginx 实现。（默认值 8080）\n\n      --kubelet-certificate-authority string                    证书 
authority 的文件路径。\n\n      --kubelet-client-certificate string                       用于 TLS 的客户端证书文件路径。\n\n      --kubelet-client-key string                               用于 TLS 的客户端证书密钥文件路径 .\n\n      --kubelet-https                                           为 kubelet 启用 https。 （默认值 true）\n\n      --kubelet-preferred-address-types stringSlice             用于 kubelet 连接的首选 NodeAddressTypes 列表。 ( 默认值[Hostname,InternalDNS,InternalIP,ExternalDNS,ExternalIP])\n\n      --kubelet-read-only-port uint                             已废弃 : kubelet 端口 . （默认值 10255）\n\n      --kubelet-timeout duration                                kubelet 操作超时时间。（默认值 \n      5s）\n\n      --kubernetes-service-node-port int                        如果不为 0，Kubernetes master 服务（用于创建 / 管理 apiserver）将会使用 NodePort 类型，并将这个值作为端口号。如果为 0，Kubernetes master 服务将会使用 ClusterIP 类型。\n\n      --master-service-namespace string                         已废弃 : 注入到 pod 中的 kubernetes master 服务的命名空间。（默认值 \"default\"） \n\n      --max-connection-bytes-per-sec int                        如果不为 0，每个用户连接将会被限速为该值（bytes/sec）。当前只应用于长时间运行的请求。\n\n      --max-mutating-requests-inflight int                      在给定时间内进行中可变请求的最大数量。当超过该值时，服务将拒绝所有请求。0 值表示没有限制。（默认值 200）\n\n      --max-requests-inflight int                               在给定时间内进行中不可变请求的最大数量。当超过该值时，服务将拒绝所有请求。0 值表示没有限制。（默认值 400）\n\n      --min-request-timeout int                                 一个可选字段，表示一个 handler 在一个请求超时前，必须保持它处于打开状态的最小秒数。当前只对监听请求 handler 有效，它基于这个值选择一个随机数作为连接超时值，以达到分散负载的目的（默认值 1800）。\n      \n      --oidc-ca-file string                                    如果设置该值，将会使用 oidc-ca-file 中的任意一个 authority 对 OpenID 服务的证书进行验证，否则将会使用主机的根 CA 对其进行验证。\n      \n      --oidc-client-id string                                   使用 OpenID 连接的客户端的 ID，如果设置了 oidc-issuer-url，则必须设置这个值。\n       \n      --oidc-groups-claim string                                如果提供该值，这个自定义 OpenID 连接名将指定给特定的用户组。该声明值需要是一个字符串或字符串数组。此标志为实验性的，请查阅验证相关文档进一步了解详细信息。\n\n      --oidc-issuer-url string                   
               OpenID 颁发者 URL，只接受 HTTPS 方案。如果设置该值，它将被用于验证 OIDC JSON Web Token(JWT)。\n      \n      --oidc-username-claim string                              用作用户名的 OpenID 声明值。注意，不保证除默认 ('sub') 外的其他声明值的唯一性和不变性。此标志为实验性的，请查阅验证相关文档进一步了解详细信息。\n       \n      --profiling                                               在 web 接口 host:port/debug/pprof/ 上启用 profiling。（默认值 true）\n       \n      --proxy-client-cert-file string                           当必须调用外部程序时，用于证明 aggregator 或者 kube-apiserver 的身份的客户端证书。包括代理到用户 api-server 的请求和调用 webhook 准入控制插件的请求。它期望这个证书包含一个来自于 CA 中的 --requestheader-client-ca-file 标记的签名。该 CA 在 kube-system 命名空间的 'extension-apiserver-authentication' configmap 中发布。从 Kube-aggregator 收到调用的组件应该使用该 CA 进行他们部分的双向 TLS 验证。\n \n      --proxy-client-key-file string                            当必须调用外部程序时，用于证明 aggregator 或者 kube-apiserver 的身份的客户端证书密钥。包括代理到用户 api-server 的请求和调用 webhook 准入控制插件的请求。\n      \n      --repair-malformed-updates                                如果为 true，服务将会尽力修复更新请求以通过验证，例如：将更新请求 UID 的当前值设置为空。在我们修复了所有发送错误格式请求的客户端后，可以关闭这个标志。\n\n      --requestheader-allowed-names stringSlice                 使用 --requestheader-username-headers 指定的，允许在头部提供用户名的客户端证书通用名称列表。如果为空，任何通过 --requestheader-client-ca-file 中 authorities 验证的客户端证书都是被允许的。\n      \n      --requestheader-client-ca-file string                     在信任请求头中以 --requestheader-username-headers 指示的用户名之前，用于验证接入请求中客户端证书的根证书捆绑。\n      \n      --requestheader-extra-headers-prefix stringSlice          用于检查的请求头的前缀列表。建议使用 X-Remote-Extra-。\n\n      --requestheader-group-headers stringSlice                 用于检查群组的请求头列表。建议使用 X-Remote-Group.\n       \n      --requestheader-username-headers stringSlice              用于检查用户名的请求头列表。建议使用 X-Remote-User。\n       \n      --runtime-config mapStringString                          传递给 apiserver 用于描述运行时配置的键值对集合。 apis/<groupVersion> 键可以被用来打开 / 关闭特定的 api 版本。apis/<groupVersion>/<resource> 键被用来打开 / 关闭特定的资源 . 
api/all 和 api/legacy 键分别用于控制所有的和遗留的 api 版本 .\n        \n      --secure-port int                                         用于监听具有认证授权功能的 HTTPS 协议的端口。如果为 0，则不会监听 HTTPS 协议。 （默认值 6443)\n       \n      --service-account-key-file stringArray                    包含 PEM 加密的 x509 RSA 或 ECDSA 私钥或公钥的文件，用于验证 ServiceAccount 令牌。如果设置该值，--tls-private-key-file 将会被使用。指定的文件可以包含多个密钥，并且这个标志可以和不同的文件一起多次使用。\n      \n      --service-cluster-ip-range ipNet                          CIDR 表示的 IP 范围，服务的 cluster ip 将从中分配。 一定不要和分配给 nodes 和 pods 的 IP 范围产生重叠。\n      \n      --ssh-keyfile string                                      如果不为空，在使用安全的 SSH 代理访问节点时，将这个文件作为用户密钥文件。\n      \n      --storage-backend string                                  持久化存储后端。 选项为 : 'etcd3' ( 默认 ), 'etcd2'.\n     \n      --storage-media-type string                               在存储中保存对象的媒体类型。某些资源或者存储后端可能仅支持特定的媒体类型，并且忽略该配置项。（默认值 \"application/vnd.kubernetes.protobuf\")\n      \n      --storage-versions string                                 按组划分资源存储的版本。 以 \"group1/version1,group2/version2,...\" 的格式指定。当对象从一组移动到另一组时 , 你可以指定 \"group1=group2/v1beta1,group3/v1beta1,...\" 的格式。你只需要传入你希望从结果中改变的组的列表。默认为从 KUBE_API_VERSIONS 环境变量集成而来，所有注册组的首选版本列表。 （默认值 \"admission.k8s.io/v1alpha1,admissionregistration.k8s.io/v1alpha1,apps/v1beta1,authentication.k8s.io/v1,authorization.k8s.io/v1,autoscaling/v1,batch/v1,certificates.k8s.io/v1beta1,componentconfig/v1alpha1,extensions/v1beta1,federation/v1beta1,imagepolicy.k8s.io/v1alpha1,networking.k8s.io/v1,policy/v1beta1,rbac.authorization.k8s.io/v1beta1,settings.k8s.io/v1alpha1,storage.k8s.io/v1,v1\")\n      \n      --target-ram-mb int                                       apiserver 内存限制，单位为 MB( 用于配置缓存大小等 )。\n      \n      --tls-ca-file string                                      如果设置该值，这个证书 authority 将会被用于从 Admission Controllers 过来的安全访问。它必须是一个 PEM 加密的合法 CA 捆绑包。此外 , 该证书 authority 可以被添加到以 --tls-cert-file 提供的证书文件中 .\n      \n      --tls-cert-file string                                    包含用于 HTTPS 的默认 x509 
证书的文件。（如果有 CA 证书，则附加于 server 证书之后）。如果启用了 HTTPS 服务，并且没有提供 --tls-cert-file 和 --tls-private-key-file，则将为公共地址生成一个自签名的证书和密钥并保存于 /var/run/kubernetes 目录。\n      \n      --tls-private-key-file string                             包含匹配 --tls-cert-file 的 x509 证书私钥的文件。\n      \n      --tls-sni-cert-key namedCertKey                           一对 x509 证书和私钥的文件路径 , 可以使用符合正式域名的域形式作为后缀。 如果没有提供域形式后缀 , 则将提取证书名。 非通配符版本优先于通配符版本 , 显示的域形式优先于证书中提取的名字。 对于多个密钥 / 证书对, 请多次使用 --tls-sni-cert-key。例如 : \"example.crt,example.key\" or \"foo.crt,foo.key:*.foo.com,foo.com\". （默认值[])\n      \n      --token-auth-file string                                  如果设置该值，这个文件将被用于通过令牌认证来保护 API 服务的安全端口。\n      \n      --version version[=true]                                  打印版本信息并退出。\n      \n      --watch-cache                                             启用 apiserver 的监视缓存。（默认值 true）\n      \n      --watch-cache-sizes stringSlice                           每种资源（pods, nodes 等）的监视缓存大小列表，以逗号分隔。每个缓存配置的形式为：resource#size，size 是一个数字。在 watch-cache 启用时生效。\n```"
  },
  {
    "path": "k8s/kube-apiserver/10-kube-apiserver创建AggregatorServer.md",
"content": "* [1\\. kube\\-apiserver 背景介绍](#1-kube-apiserver-背景介绍)\n* [2\\. CreateAggregatorServer源码分析](#2-createaggregatorserver源码分析)\n  * [2\\.1 NewWithDelegate](#21-newwithdelegate)\n    * [2\\.1\\.1 apiserviceRegistrationController\\-处理APIService对象的增删改](#211-apiserviceregistrationcontroller-处理apiservice对象的增删改)\n    * [2\\.1\\.2 availableController](#212-availablecontroller)\n  * [2\\.2 创建autoRegistrationController](#22-创建autoregistrationcontroller)\n    * [2\\.2\\.1 checkAPIService](#221-checkapiservice)\n    * [2\\.2\\.2 为什么需要这个](#222-为什么需要这个)\n  * [2\\.3 crdRegistrationController](#23-crdregistrationcontroller)\n  * [2\\.4 openAPIAggregationController](#24-openapiaggregationcontroller)\n* [3\\. 总结](#3-总结)\n\n**本章重点：**分析第六个流程，创建AggregatorServer\n\nkube-apiserver整体启动流程如下：\n\n（1）资源注册。\n\n（2）Cobra命令行参数解析\n\n（3）创建APIServer通用配置\n\n（4）创建APIExtensionsServer\n\n（5）创建KubeAPIServer\n\n（6）创建AggregatorServer\n\n（7）启动HTTP服务。\n\n（8）启动HTTPS服务\n\n### 1. kube-apiserver 背景介绍\n\nkube-apiserver其实包含了3个server: aggregatorServer、kubeAPIServer、apiExtensionsServer。通过聚合的方式，对外作为一个kube-apiserver提供服务。\n\n举例说明，如下图：\n\n![image-20220703162508208](../images/apiserver-73.png)\n\n当一个请求来的时候，首先经过的是aggregatorServer，aggregatorServer会判断这个服务是否需要本地处理，如果需要本地处理，就放行到apiserver这一层，处理内置资源（pod, node, svc等等）。如果不是内置资源，那就是CRD资源，转到apiExtensionsServer处理。\n\n**怎么判断是否是本地服务呢？-通过APIService对象**\n\nK8s 有个资源对象叫做APIService, 这个资源就表示当前支持的服务类型。\n\n```\nroot@cld-kmaster1-1051:/home/ngadm# kubectl explain APIService\nKIND:     APIService\nVERSION:  apiregistration.k8s.io/v1\n\nDESCRIPTION:\n     APIService represents a server for a particular GroupVersion. Name must be\n     \"version.group\".\n\nFIELDS:\n   apiVersion\t<string>\n     APIVersion defines the versioned schema of this representation of an\n     object. Servers should convert recognized schemas to the latest internal\n     value, and may reject unrecognized values. 
More info:\n     https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources\n\n   kind\t<string>\n     Kind is a string value representing the REST resource this object\n     represents. Servers may infer this from the endpoint the client submits\n     requests to. Cannot be updated. In CamelCase. More info:\n     https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds\n\n   metadata\t<Object>\n\n   spec\t<Object>\n     Spec contains information for locating and communicating with a server\n\n   status\t<Object>\n     Status contains derived information about an API server\n     \n     \n// Local means the request is handled locally; a non-Local entry points to a backing server (Service)\nroot:/home/zoux# kubectl get APIService\nNAME                                       SERVICE                AVAILABLE   AGE\nv1.                                        Local                  True        369d\nv1.admissionregistration.k8s.io            Local                  True        369d\nv1.apiextensions.k8s.io                    Local                  True        369d\nv1.apps                                    Local                  True        369d\nv1.authentication.k8s.io                   Local                  True        369d\nv1.authorization.k8s.io                    Local                  True        369d\nv1.autoscaling                             Local                  True        369d\nv1.autoscaling.k8s.io                      Local                  True        35d\nv1.batch                                   Local                  True        369d\n.... 
\nv1beta1.batch                              Local                  True        369d\nv1beta1.certificates.k8s.io                Local                  True        369d\nv1beta1.coordination.k8s.io                Local                  True        369d\nv1beta1.custom.metrics.k8s.io              kube-system/kube-hpa   True        369d\n...\n\n\nroot:/home/zoux# kubectl get APIService v2alpha1.batch  -oyaml\napiVersion: apiregistration.k8s.io/v1\nkind: APIService\nmetadata:\n  creationTimestamp: \"2021-06-28T10:02:30Z\"\n  labels:\n    kube-aggregator.kubernetes.io/automanaged: onstart\n  name: v2alpha1.batch\n  resourceVersion: \"34\"\n  selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v2alpha1.batch\n  uid: 30aae086-ca97-4b90-8097-435561d1e56d\nspec:\n  group: batch\n  groupPriorityMinimum: 17400\n  service: null\n  version: v2alpha1\n  versionPriority: 9\nstatus:\n  conditions:\n  - lastTransitionTime: \"2021-06-28T10:02:30Z\"\n    message: Local APIServices are always available\n    reason: Local\n    status: \"True\"\n    type: Available\n```\n\n**So accessing a resource under the batch group (i.e. Job) is a local access that goes straight to the apiserver, while accessing v1beta1.custom.metrics.k8s.io (HPA) goes to the kube-hpa Service in kube-system.**\n\n<br>\n\nFrom the above we know that the AggregatorServer handles APIService resource requests under the `apiregistration.k8s.io` group, and since every served group/version is registered as an APIService object, the AggregatorServer acts as the single entry point.\n\nThis is also why the servers are created in the order: apiExtensionsServer, apiserver, aggregator.\n\n<br>\n\n### 2. 
CreateAggregatorServer source code analysis\n\nGoal: work out the concrete flow from the source code.\n\nThe core of CreateAggregatorServer is running the following controllers:\n\n- `apiserviceRegistrationController`: handles registration and removal of resources in APIServices;\n- `availableConditionController`: maintains the availability status of APIServices, including whether the Service they reference is reachable;\n- `autoRegistrationController`: keeps a specific set of APIServices present in the API;\n- `crdRegistrationController`: automatically registers CRD GroupVersions as APIServices;\n- `openAPIAggregationController`: syncs changes of APIService resources into the served OpenAPI document;\n\n```\naggregatorServer, err := createAggregatorServer(aggregatorConfig, kubeAPIServer.GenericAPIServer, apiExtensionsServer.Informers)\n\tif err != nil {\n\t\t// we don't need special handling for innerStopCh because the aggregator server doesn't create any go routines\n\t\treturn nil, err\n\t}\n\t\n\t\n\tfunc createAggregatorServer(aggregatorConfig *aggregatorapiserver.Config, delegateAPIServer genericapiserver.DelegationTarget, apiExtensionInformers apiextensionsinformers.SharedInformerFactory) (*aggregatorapiserver.APIAggregator, error) {\n\t// 1. create the aggregatorServer\n\taggregatorServer, err := aggregatorConfig.Complete().NewWithDelegate(delegateAPIServer)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\t// create controllers for auto-registration\n\tapiRegistrationClient, err := apiregistrationclient.NewForConfig(aggregatorConfig.GenericConfig.LoopbackClientConfig)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\t// 2. create the autoRegistrationController\n\tautoRegistrationController := autoregister.NewAutoRegisterController(aggregatorServer.APIRegistrationInformers.Apiregistration().V1().APIServices(), apiRegistrationClient)\n\tapiServices := apiServicesToRegister(delegateAPIServer, autoRegistrationController)\n\n\tcrdRegistrationController := crdregistration.NewCRDRegistrationController(\n\t\tapiExtensionInformers.Apiextensions().InternalVersion().CustomResourceDefinitions(),\n\t\tautoRegistrationController)\n\n\terr = aggregatorServer.GenericAPIServer.AddPostStartHook(\"kube-apiserver-autoregistration\", func(context 
genericapiserver.PostStartHookContext) error {\n\t\tgo crdRegistrationController.Run(5, context.StopCh)\n\t\tgo func() {\n\t\t\t// let the CRD controller process the initial set of CRDs before starting the autoregistration controller.\n\t\t\t// this prevents the autoregistration controller's initial sync from deleting APIServices for CRDs that still exist.\n\t\t\t// we only need to do this if CRDs are enabled on this server.  We can't use discovery because we are the source for discovery.\n\t\t\tif aggregatorConfig.GenericConfig.MergedResourceConfig.AnyVersionForGroupEnabled(\"apiextensions.k8s.io\") {\n\t\t\t\tcrdRegistrationController.WaitForInitialSync()\n\t\t\t}\n\t\t\tautoRegistrationController.Run(5, context.StopCh)\n\t\t}()\n\t\treturn nil\n\t})\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\terr = aggregatorServer.GenericAPIServer.AddBootSequenceHealthChecks(\n\t\tmakeAPIServiceAvailableHealthCheck(\n\t\t\t\"autoregister-completion\",\n\t\t\tapiServices,\n\t\t\taggregatorServer.APIRegistrationInformers.Apiregistration().V1().APIServices(),\n\t\t),\n\t)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\treturn aggregatorServer, nil\n}\n```\n\n#### 2.1 NewWithDelegate\n\nCore logic:\n\n(1) Build the genericServer on top of the apiserver delegate, the same way the apiserver itself is built on top of the extensionServer\n\n(2) Register the routes\n\n(3) Create the apiserviceRegistrationController and start watching APIService objects\n\n(4) Run the availableController\n\nAs you can see, the core here is running the two controllers apiserviceRegistrationController and availableController.\n\n```\n// NewWithDelegate returns a new instance of APIAggregator from the given config.\nfunc (c completedConfig) NewWithDelegate(delegationTarget genericapiserver.DelegationTarget) (*APIAggregator, error) {\n\t// Prevent generic API server to install OpenAPI handler. Aggregator server\n\t// has its own customized OpenAPI handler.\n\topenAPIConfig := c.GenericConfig.OpenAPIConfig\n\tc.GenericConfig.OpenAPIConfig = nil\n  \n  // 1. 
build the genericServer on top of the apiserver delegate\n\tgenericServer, err := c.GenericConfig.New(\"kube-aggregator\", delegationTarget)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\tapiregistrationClient, err := clientset.NewForConfig(c.GenericConfig.LoopbackClientConfig)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tinformerFactory := informers.NewSharedInformerFactory(\n\t\tapiregistrationClient,\n\t\t5*time.Minute, // this is effectively used as a refresh interval right now.  Might want to do something nicer later on.\n\t)\n\n\ts := &APIAggregator{\n\t\tGenericAPIServer:         genericServer,\n\t\tdelegateHandler:          delegationTarget.UnprotectedHandler(),\n\t\tproxyClientCert:          c.ExtraConfig.ProxyClientCert,\n\t\tproxyClientKey:           c.ExtraConfig.ProxyClientKey,\n\t\tproxyTransport:           c.ExtraConfig.ProxyTransport,\n\t\tproxyHandlers:            map[string]*proxyHandler{},\n\t\thandledGroups:            sets.String{},\n\t\tlister:                   informerFactory.Apiregistration().V1().APIServices().Lister(),\n\t\tAPIRegistrationInformers: informerFactory,\n\t\tserviceResolver:          c.ExtraConfig.ServiceResolver,\n\t\topenAPIConfig:            openAPIConfig,\n\t}\n  \n  // 2. register the routes\n\tapiGroupInfo := apiservicerest.NewRESTStorage(c.GenericConfig.MergedResourceConfig, c.GenericConfig.RESTOptionsGetter)\n\tif err := s.GenericAPIServer.InstallAPIGroup(&apiGroupInfo); err != nil {\n\t\treturn nil, err\n\t}\n\n\tenabledVersions := sets.NewString()\n\tfor v := range apiGroupInfo.VersionedResourcesStorageMap {\n\t\tenabledVersions.Insert(v)\n\t}\n\tif !enabledVersions.Has(v1.SchemeGroupVersion.Version) {\n\t\treturn nil, fmt.Errorf(\"API group/version %s must be enabled\", v1.SchemeGroupVersion.String())\n\t}\n\n\tapisHandler := &apisHandler{\n\t\tcodecs:         aggregatorscheme.Codecs,\n\t\tlister:         s.lister,\n\t\tdiscoveryGroup: discoveryGroup(enabledVersions),\n\t}\n\ts.GenericAPIServer.Handler.NonGoRestfulMux.Handle(\"/apis\", 
apisHandler)\n\ts.GenericAPIServer.Handler.NonGoRestfulMux.UnlistedHandle(\"/apis/\", apisHandler)\n  \n  // 3. create the apiserviceRegistrationController, which watches APIService objects\n\tapiserviceRegistrationController := NewAPIServiceRegistrationController(informerFactory.Apiregistration().V1().APIServices(), s)\n\tavailableController, err := statuscontrollers.NewAvailableConditionController(\n\t\tinformerFactory.Apiregistration().V1().APIServices(),\n\t\tc.GenericConfig.SharedInformerFactory.Core().V1().Services(),\n\t\tc.GenericConfig.SharedInformerFactory.Core().V1().Endpoints(),\n\t\tapiregistrationClient.ApiregistrationV1(),\n\t\tc.ExtraConfig.ProxyTransport,\n\t\tc.ExtraConfig.ProxyClientCert,\n\t\tc.ExtraConfig.ProxyClientKey,\n\t\ts.serviceResolver,\n\t)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n  \n  // start the informers\n\ts.GenericAPIServer.AddPostStartHookOrDie(\"start-kube-aggregator-informers\", func(context genericapiserver.PostStartHookContext) error {\n\t\tinformerFactory.Start(context.StopCh)\n\t\tc.GenericConfig.SharedInformerFactory.Start(context.StopCh)\n\t\treturn nil\n\t})\n\t\n// 4. run the apiserviceRegistrationController\n\ts.GenericAPIServer.AddPostStartHookOrDie(\"apiservice-registration-controller\", func(context genericapiserver.PostStartHookContext) error {\n\t\tgo apiserviceRegistrationController.Run(context.StopCh)\n\t\treturn nil\n\t})\n\t\n\t// 5. 
run the availableController\n\ts.GenericAPIServer.AddPostStartHookOrDie(\"apiservice-status-available-controller\", func(context genericapiserver.PostStartHookContext) error {\n\t\t// if we end up blocking for long periods of time, we may need to increase threadiness.\n\t\tgo availableController.Run(5, context.StopCh)\n\t\treturn nil\n\t})\n\n\treturn s, nil\n}\n\n// watches add, update, and delete of APIService objects\n// NewAPIServiceRegistrationController returns a new APIServiceRegistrationController.\nfunc NewAPIServiceRegistrationController(apiServiceInformer informers.APIServiceInformer, apiHandlerManager APIHandlerManager) *APIServiceRegistrationController {\n\tc := &APIServiceRegistrationController{\n\t\tapiHandlerManager: apiHandlerManager,\n\t\tapiServiceLister:  apiServiceInformer.Lister(),\n\t\tapiServiceSynced:  apiServiceInformer.Informer().HasSynced,\n\t\tqueue:             workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), \"APIServiceRegistrationController\"),\n\t}\n\n\tapiServiceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{\n\t\tAddFunc:    c.addAPIService,\n\t\tUpdateFunc: c.updateAPIService,\n\t\tDeleteFunc: c.deleteAPIService,\n\t})\n\n\tc.syncFn = c.sync\n\n\treturn c\n}\n```\n\n##### 2.1.1 apiserviceRegistrationController: handling APIService add/update/delete\n\naddAPIService, updateAPIService, and deleteAPIService all just enqueue the object; processing goes through Run->runWorker->processNextWorkItem->sync.\n\n```\nfunc (c *APIServiceRegistrationController) sync(key string) error {\n\t// if the APIService object no longer exists, remove its handler\n\tapiService, err := c.apiServiceLister.Get(key)\n\tif apierrors.IsNotFound(err) {\n\t\tc.apiHandlerManager.RemoveAPIService(key)\n\t\treturn nil\n\t}\n\tif err != nil {\n\t\treturn err\n\t}\n  // the core is the AddAPIService function\n\treturn c.apiHandlerManager.AddAPIService(apiService)\n}\n```\n\n<br>\n\nCore logic of AddAPIService:\n\n(1) If a proxyHandler already exists, the routes don't need to change; just update the proxy\n\n(2) If not, wire up the mapping between the RESTful URL and the handler\n\n```\n// AddAPIService adds an API service.  
It is not thread-safe, so only call it on one thread at a time please.\n// It's a slow moving API, so its ok to run the controller on a single thread\nfunc (s *APIAggregator) AddAPIService(apiService *v1.APIService) error {\n\t// if the proxyHandler already exists, it needs to be updated. The aggregation bits do not\n\t// since they are wired against listers because they require multiple resources to respond\n\t// 1. if the handler exists, the routes don't need to change; just update the proxy\n\tif proxyHandler, exists := s.proxyHandlers[apiService.Name]; exists {\n\t\tproxyHandler.updateAPIService(apiService)\n\t\tif s.openAPIAggregationController != nil {\n\t\t\ts.openAPIAggregationController.UpdateAPIService(proxyHandler, apiService)\n\t\t}\n\t\treturn nil\n\t}\n  \n  // 2. if not, wire up the mapping between the RESTful URL and the handler\n\tproxyPath := \"/apis/\" + apiService.Spec.Group + \"/\" + apiService.Spec.Version\n\t// v1. is a special case for the legacy API.  It proxies to a wider set of endpoints.\n\tif apiService.Name == legacyAPIServiceName {\n\t\tproxyPath = \"/api\"\n\t}\n  \n\t// register the proxy handler\n\tproxyHandler := &proxyHandler{\n\t\tlocalDelegate:   s.delegateHandler,\n\t\tproxyClientCert: s.proxyClientCert,\n\t\tproxyClientKey:  s.proxyClientKey,\n\t\tproxyTransport:  s.proxyTransport,\n\t\tserviceResolver: s.serviceResolver,\n\t}\n\tproxyHandler.updateAPIService(apiService)\n\tif s.openAPIAggregationController != nil {\n\t\ts.openAPIAggregationController.AddAPIService(proxyHandler, apiService)\n\t}\n\ts.proxyHandlers[apiService.Name] = proxyHandler\n\ts.GenericAPIServer.Handler.NonGoRestfulMux.Handle(proxyPath, proxyHandler)\n\ts.GenericAPIServer.Handler.NonGoRestfulMux.UnlistedHandlePrefix(proxyPath+\"/\", proxyHandler)\n\n\t// if we're dealing with the legacy group, we're done here\n\tif apiService.Name == legacyAPIServiceName {\n\t\treturn nil\n\t}\n\n\t// if we've already registered the path with the handler, we don't want to do it again.\n\tif s.handledGroups.Has(apiService.Spec.Group) {\n\t\treturn 
nil\n\t}\n\n\t// it's time to register the group aggregation endpoint\n\tgroupPath := \"/apis/\" + apiService.Spec.Group\n\tgroupDiscoveryHandler := &apiGroupHandler{\n\t\tcodecs:    aggregatorscheme.Codecs,\n\t\tgroupName: apiService.Spec.Group,\n\t\tlister:    s.lister,\n\t\tdelegate:  s.delegateHandler,\n\t}\n\t// aggregation is protected\n\ts.GenericAPIServer.Handler.NonGoRestfulMux.Handle(groupPath, groupDiscoveryHandler)\n\ts.GenericAPIServer.Handler.NonGoRestfulMux.UnlistedHandle(groupPath+\"/\", groupDiscoveryHandler)\n\ts.handledGroups.Insert(apiService.Spec.Group)\n\treturn nil\n}\n```\n\n<br>\n\nCore logic of updateAPIService:\n\n(1) If the APIService is of the Local type, no proxy needs to be set up\n\n(2) Otherwise set up a route proxy, so that requests to this RESTful path are handled by the backend referenced by the APIService\n\n```\nfunc (r *proxyHandler) updateAPIService(apiService *apiregistrationv1api.APIService) {\n\tif apiService.Spec.Service == nil {\n\t\tr.handlingInfo.Store(proxyHandlingInfo{local: true})\n\t\treturn\n\t}\n\n\tnewInfo := proxyHandlingInfo{\n\t\tname: apiService.Name,\n\t\trestConfig: &restclient.Config{\n\t\t\tTLSClientConfig: restclient.TLSClientConfig{\n\t\t\t\tInsecure:   apiService.Spec.InsecureSkipTLSVerify,\n\t\t\t\tServerName: apiService.Spec.Service.Name + \".\" + apiService.Spec.Service.Namespace + \".svc\",\n\t\t\t\tCertData:   r.proxyClientCert,\n\t\t\t\tKeyData:    r.proxyClientKey,\n\t\t\t\tCAData:     apiService.Spec.CABundle,\n\t\t\t},\n\t\t},\n\t\tserviceName:      apiService.Spec.Service.Name,\n\t\tserviceNamespace: apiService.Spec.Service.Namespace,\n\t\tservicePort:      *apiService.Spec.Service.Port,\n\t\tserviceAvailable: apiregistrationv1apihelper.IsAPIServiceConditionTrue(apiService, apiregistrationv1api.Available),\n\t}\n\tif r.proxyTransport != nil && r.proxyTransport.DialContext != nil {\n\t\tnewInfo.restConfig.Dial = r.proxyTransport.DialContext\n\t}\n\tnewInfo.proxyRoundTripper, newInfo.transportBuildingError = restclient.TransportFor(newInfo.restConfig)\n\tif newInfo.transportBuildingError != nil 
{\n\t\tklog.Warning(newInfo.transportBuildingError.Error())\n\t}\n\tr.handlingInfo.Store(newInfo)\n}\n```\n\n<br>\n\n这里大家可以参考我的一篇文章，就能更清楚理解了\n\n[hpa-自定义metric-server](https://zoux86.github.io/post/2021-6-18-hpa-%E8%87%AA%E5%AE%9A%E4%B9%89metric-server/)\n\n```\nroot@k8s-master:~/testyaml/hpa# cat tls.yaml \napiVersion: v1\nkind: Service\nmetadata:\n  name: kube-hpa\n  namespace: kube-system\nspec:\n  clusterIP: None\n  ports:\n  - name: https-hpa-dont-edit-it\n    port: 9997\n    targetPort: 9997\n  selector:\n    app: kube-hpa\n---\napiVersion: apiregistration.k8s.io/v1beta1\nkind: APIService\nmetadata:\n  name: v1beta1.custom.metrics.k8s.io\nspec:\n  service:\n    name: kube-hpa\n    namespace: kube-system\n    port: 9997\n  group: custom.metrics.k8s.io\n  version: v1beta1\n  insecureSkipTLSVerify: true\n  groupPriorityMinimum: 100\n  versionPriority: 100\n```\n\n##### 2.1.2 availableController\n\navailableController 核心工作就是判断APIService对应的service是否能工作。所以处理监听APIService外，还要监听svc, ep资源。\n\n```\n// NewAvailableConditionController returns a new AvailableConditionController.\nfunc NewAvailableConditionController(\n\tapiServiceInformer informers.APIServiceInformer,\n\tserviceInformer v1informers.ServiceInformer,\n\tendpointsInformer v1informers.EndpointsInformer,\n\tapiServiceClient apiregistrationclient.APIServicesGetter,\n\tproxyTransport *http.Transport,\n\tproxyClientCert []byte,\n\tproxyClientKey []byte,\n\tserviceResolver ServiceResolver,\n) (*AvailableConditionController, error) {\n\tc := &AvailableConditionController{\n\t\tapiServiceClient: apiServiceClient,\n\t\tapiServiceLister: apiServiceInformer.Lister(),\n\t\tapiServiceSynced: apiServiceInformer.Informer().HasSynced,\n\t\tserviceLister:    serviceInformer.Lister(),\n\t\tservicesSynced:   serviceInformer.Informer().HasSynced,\n\t\tendpointsLister:  endpointsInformer.Lister(),\n\t\tendpointsSynced:  endpointsInformer.Informer().HasSynced,\n\t\tserviceResolver:  serviceResolver,\n\t\tqueue: 
workqueue.NewNamedRateLimitingQueue(\n\t\t\t// We want a fairly tight requeue time.  The controller listens to the API, but because it relies on the routability of the\n\t\t\t// service network, it is possible for an external, non-watchable factor to affect availability.  This keeps\n\t\t\t// the maximum disruption time to a minimum, but it does prevent hot loops.\n\t\t\tworkqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 30*time.Second),\n\t\t\t\"AvailableConditionController\"),\n\t}\n\n\t// if a particular transport was specified, use that otherwise build one\n\t// construct an http client that will ignore TLS verification (if someone owns the network and messes with your status\n\t// that's not so bad) and sets a very short timeout.  This is a best effort GET that provides no additional information\n\trestConfig := &rest.Config{\n\t\tTLSClientConfig: rest.TLSClientConfig{\n\t\t\tInsecure: true,\n\t\t\tCertData: proxyClientCert,\n\t\t\tKeyData:  proxyClientKey,\n\t\t},\n\t}\n\tif proxyTransport != nil && proxyTransport.DialContext != nil {\n\t\trestConfig.Dial = proxyTransport.DialContext\n\t}\n\ttransport, err := rest.TransportFor(restConfig)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tc.discoveryClient = &http.Client{\n\t\tTransport: transport,\n\t\t// the request should happen quickly.\n\t\tTimeout: 5 * time.Second,\n\t}\n\n\t// resync on this one because it is low cardinality and rechecking the actual discovery\n\t// allows us to detect health in a more timely fashion when network connectivity to\n\t// nodes is snipped, but the network still attempts to route there.  
See\n\t// https://github.com/openshift/origin/issues/17159#issuecomment-341798063\n\tapiServiceInformer.Informer().AddEventHandlerWithResyncPeriod(\n\t\tcache.ResourceEventHandlerFuncs{\n\t\t\tAddFunc:    c.addAPIService,\n\t\t\tUpdateFunc: c.updateAPIService,\n\t\t\tDeleteFunc: c.deleteAPIService,\n\t\t},\n\t\t30*time.Second)\n\n\tserviceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{\n\t\tAddFunc:    c.addService,\n\t\tUpdateFunc: c.updateService,\n\t\tDeleteFunc: c.deleteService,\n\t})\n\n\tendpointsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{\n\t\tAddFunc:    c.addEndpoints,\n\t\tUpdateFunc: c.updateEndpoints,\n\t\tDeleteFunc: c.deleteEndpoints,\n\t})\n\n\tc.syncFn = c.sync\n\n\treturn c, nil\n}\n```\n\n<br>\n\nThe core here is AvailableConditionController.sync. We won't expand on the details; essentially it updates the APIService status, i.e. whether the Service behind the APIService object is available.\n\n```\nfunc (c *AvailableConditionController) sync(key string) error {\n\toriginalAPIService, err := c.apiServiceLister.Get(key)\n```\n\n<br>\n\n```\nroot@k8s-master:~/testyaml/hpa# kubectl get APIService v1beta1.custom.metrics.k8s.io  -oyaml\napiVersion: apiregistration.k8s.io/v1\nkind: APIService\nmetadata:\n  creationTimestamp: \"2021-06-13T13:22:01Z\"\n  name: v1beta1.custom.metrics.k8s.io\n  resourceVersion: \"1590641\"\n  selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v1beta1.custom.metrics.k8s.io\n  uid: d488d6a8-7e79-4311-a1e9-0b12e4591375\nspec:\n  group: custom.metrics.k8s.io\n  groupPriorityMinimum: 100\n  insecureSkipTLSVerify: true\n  service:\n    name: kube-hpa\n    namespace: kube-system\n    port: 9997\n  version: v1beta1\n  versionPriority: 100\nstatus:     // this is the part maintained by the controller\n  conditions:\n  - lastTransitionTime: \"2021-06-13T13:42:17Z\"\n    message: all checks passed\n    reason: Passed\n    status: \"True\"\n    type: Available\n```\n\n#### 2.2 
Creating the autoRegistrationController\n\nThe autoRegistrationController also watches APIService objects. Everything is processed through Run->runWorker->processNextWorkItem->checkAPIService; the core is the checkAPIService function.\n\n```\n// NewAutoRegisterController creates a new autoRegisterController.\nfunc NewAutoRegisterController(apiServiceInformer informers.APIServiceInformer, apiServiceClient apiregistrationclient.APIServicesGetter) *autoRegisterController {\n   c := &autoRegisterController{\n      apiServiceLister:  apiServiceInformer.Lister(),\n      apiServiceSynced:  apiServiceInformer.Informer().HasSynced,\n      apiServiceClient:  apiServiceClient,\n      apiServicesToSync: map[string]*v1.APIService{},\n\n      apiServicesAtStart: map[string]bool{},\n\n      syncedSuccessfullyLock: &sync.RWMutex{},\n      syncedSuccessfully:     map[string]bool{},\n\n      queue: workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), \"autoregister\"),\n   }\n   c.syncHandler = c.checkAPIService\n\n   apiServiceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{\n      AddFunc: func(obj interface{}) {\n         cast := obj.(*v1.APIService)\n         c.queue.Add(cast.Name)\n      },\n      UpdateFunc: func(_, obj interface{}) {\n         cast := obj.(*v1.APIService)\n         c.queue.Add(cast.Name)\n      },\n      DeleteFunc: func(obj interface{}) {\n         cast, ok := obj.(*v1.APIService)\n         if !ok {\n            tombstone, ok := obj.(cache.DeletedFinalStateUnknown)\n            if !ok {\n               klog.V(2).Infof(\"Couldn't get object from tombstone %#v\", obj)\n               return\n            }\n            cast, ok = tombstone.Obj.(*v1.APIService)\n            if !ok {\n               klog.V(2).Infof(\"Tombstone contained unexpected object: %#v\", obj)\n               return\n            }\n         }\n         c.queue.Add(cast.Name)\n      },\n   })\n\n   return c\n}\n```\n\n##### 2.2.1 checkAPIService\n\nAPIServices fall into two sync types, manageOnStart and manageContinuously 
, marked by the AutoRegisterManagedLabel label.\n\n```\nconst (\n\t// AutoRegisterManagedLabel is a label attached to the APIService that identifies how the APIService wants to be synced.\n\tAutoRegisterManagedLabel = \"kube-aggregator.kubernetes.io/automanaged\"\n\n\t// manageOnStart is a value for the AutoRegisterManagedLabel that indicates the APIService wants to be synced one time when the controller starts.\n\tmanageOnStart = \"onstart\"\n\t// manageContinuously is a value for the AutoRegisterManagedLabel that indicates the APIService wants to be synced continuously.\n\tmanageContinuously = \"true\"\n)\n```\n\nAs the comment table of checkAPIService shows, the function takes a different sync action depending on the combination of desired and current state.\n\n<br>\n\n```\n// checkAPIService syncs the current APIService against a list of desired APIService objects\n//\n//                                                 | A. desired: not found | B. desired: sync on start | C. desired: sync always\n// ------------------------------------------------|-----------------------|---------------------------|------------------------\n// 1. current: lookup error                        | error                 | error                     | error\n// 2. current: not found                           | -                     | create once               | create\n// 3. current: no sync                             | -                     | -                         | -\n// 4. current: sync on start, not present at start | -                     | -                         | -\n// 5. current: sync on start, present at start     | delete once           | update once               | update once\n// 6. 
current: sync always                         | delete                | update once               | update\nfunc (c *autoRegisterController) checkAPIService(name string) (err error) {\n\tdesired := c.GetAPIServiceToSync(name)\n\tcurr, err := c.apiServiceLister.Get(name)\n\n\t// if we've never synced this service successfully, record a successful sync.\n\thasSynced := c.hasSyncedSuccessfully(name)\n\tif !hasSynced {\n\t\tdefer func() {\n\t\t\tif err == nil {\n\t\t\t\tc.setSyncedSuccessfully(name)\n\t\t\t}\n\t\t}()\n\t}\n\n\tswitch {\n\t// we had a real error, just return it (1A,1B,1C)\n\tcase err != nil && !apierrors.IsNotFound(err):\n\t\treturn err\n\n\t// we don't have an entry and we don't want one (2A)\n\tcase apierrors.IsNotFound(err) && desired == nil:\n\t\treturn nil\n\n\t// the local object only wants to sync on start and has already synced (2B,5B,6B \"once\" enforcement)\n\tcase isAutomanagedOnStart(desired) && hasSynced:\n\t\treturn nil\n\n\t// we don't have an entry and we do want one (2B,2C)\n\tcase apierrors.IsNotFound(err) && desired != nil:\n\t\t_, err := c.apiServiceClient.APIServices().Create(desired)\n\t\tif apierrors.IsAlreadyExists(err) {\n\t\t\t// created in the meantime, we'll get called again\n\t\t\treturn nil\n\t\t}\n\t\treturn err\n\n\t// we aren't trying to manage this APIService (3A,3B,3C)\n\tcase !isAutomanaged(curr):\n\t\treturn nil\n\n\t// the remote object only wants to sync on start, but was added after we started (4A,4B,4C)\n\tcase isAutomanagedOnStart(curr) && !c.apiServicesAtStart[name]:\n\t\treturn nil\n\n\t// the remote object only wants to sync on start and has already synced (5A,5B,5C \"once\" enforcement)\n\tcase isAutomanagedOnStart(curr) && hasSynced:\n\t\treturn nil\n\n\t// we have a spurious APIService that we're managing, delete it (5A,6A)\n\tcase desired == nil:\n\t\topts := &metav1.DeleteOptions{Preconditions: metav1.NewUIDPreconditions(string(curr.UID))}\n\t\terr := c.apiServiceClient.APIServices().Delete(curr.Name, 
opts)\n\t\tif apierrors.IsNotFound(err) || apierrors.IsConflict(err) {\n\t\t\t// deleted or changed in the meantime, we'll get called again\n\t\t\treturn nil\n\t\t}\n\t\treturn err\n\n\t// if the specs already match, nothing for us to do\n\tcase reflect.DeepEqual(curr.Spec, desired.Spec):\n\t\treturn nil\n\t}\n\n\t// we have an entry and we have a desired, now we deconflict.  Only a few fields matter. (5B,5C,6B,6C)\n\tapiService := curr.DeepCopy()\n\tapiService.Spec = desired.Spec\n\t_, err = c.apiServiceClient.APIServices().Update(apiService)\n\tif apierrors.IsNotFound(err) || apierrors.IsConflict(err) {\n\t\t// deleted or changed in the meantime, we'll get called again\n\t\treturn nil\n\t}\n\treturn err\n}\n```\n\n##### 2.2.2 为什么需要这个\n\n作用：用于保持 API 中存在的一组特定的 APIServices\n\n内置资源的APIService都会有标签`kube-aggregator.kubernetes.io/automanaged: onstart`，例如：v1.apps apiService。autoRegistrationController创建并维护这些列表中的APIService，也即我们看到的Local apiService；\n\nCRD资源则是automanaged=true，表示always\n\n而自定义service类型的APIService是没有的这个标签，因为自己会更新路由。\n\n```\nroo # kubectl get APIService --show-labels\nNAME                                       SERVICE                AVAILABLE   AGE    LABELS\nv1.                                        
Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1.admissionregistration.k8s.io            Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1.apiextensions.k8s.io                    Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1.apps                                    Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1.authentication.k8s.io                   Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1.authorization.k8s.io                    Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1.autoscaling                             Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1.autoscaling.k8s.io                      Local                  True        35d    kube-aggregator.kubernetes.io/automanaged=true\nv1.batch                                   Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1.coordination.k8s.io                     Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1.messaging.k8s.io                        Local                  True        44d    kube-aggregator.kubernetes.io/automanaged=true\nv1.networking.k8s.io                       Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1.rbac.authorization.k8s.io               Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1.schedular.istio.io                      Local                  True        41d    kube-aggregator.kubernetes.io/automanaged=true\nv1.scheduling.k8s.io                       Local                  True        369d   
kube-aggregator.kubernetes.io/automanaged=onstart\nv1.security.symphony.netease.com           Local                  True        44d    kube-aggregator.kubernetes.io/automanaged=true\nv1.storage.k8s.io                          Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1alpha1.argoproj.io                       Local                  True        41d    kube-aggregator.kubernetes.io/automanaged=true\nv1alpha1.auditregistration.k8s.io          Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1alpha1.authentication.istio.io           Local                  True        41d    kube-aggregator.kubernetes.io/automanaged=true\nv1alpha1.certmanager.k8s.io                Local                  True        44d    kube-aggregator.kubernetes.io/automanaged=true\nv1alpha1.crdlbcontroller.k8s.io            Local                  True        44d    kube-aggregator.kubernetes.io/automanaged=true\nv1alpha1.loadbalancer.k8s.io               Local                  True        35d    kube-aggregator.kubernetes.io/automanaged=true\nv1alpha1.multicluster.admiralty.io         Local                  True        35d    kube-aggregator.kubernetes.io/automanaged=true\nv1alpha1.networking.symphony.netease.com   Local                  True        41d    kube-aggregator.kubernetes.io/automanaged=true\nv1alpha1.node.k8s.io                       Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1alpha1.rbac.authorization.k8s.io         Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1alpha1.rbac.istio.io                     Local                  True        35d    kube-aggregator.kubernetes.io/automanaged=true\nv1alpha1.resources.symphony.netease.com    Local                  True        44d    kube-aggregator.kubernetes.io/automanaged=true\nv1alpha1.scheduling.k8s.io                 Local                  True     
   369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1alpha1.settings.k8s.io                   Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1alpha1.storage.k8s.io                    Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1alpha2.config.istio.io                   Local                  True        35d    kube-aggregator.kubernetes.io/automanaged=true\nv1alpha3.networking.istio.io               Local                  True        41d    kube-aggregator.kubernetes.io/automanaged=true\nv1beta1.admissionregistration.k8s.io       Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1beta1.apiextensions.k8s.io               Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1beta1.apps                               Local                  True        172d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1beta1.authentication.k8s.io              Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1beta1.authorization.k8s.io               Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1beta1.batch                              Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1beta1.certificates.k8s.io                Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1beta1.coordination.k8s.io                Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1beta1.custom.metrics.k8s.io              kube-system/kube-hpa   True        369d   <none>\nv1beta1.discovery.k8s.io                   Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1beta1.events.k8s.io                      Local                  True        369d  
 kube-aggregator.kubernetes.io/automanaged=onstart\nv1beta1.extensions                         Local                  True        369d   kube-aggregator.kubernetes.io/automanaged=onstart\nv1beta1.kustomize.toolkit.fluxcd.io        Local                  True        44d    kube-aggregator.kubernetes.io/automanaged=true\n```\n\n#### 2.3 crdRegistrationController\n\ncrdRegistrationController监听的是CRD资源的增删改操作。调用链路也是Run->runWorker->processNextWorkItem->handleVersionUpdate，核心看handleVersionUpdate。\n\n从这里可以看出来：crdRegistrationController根据CRD资源的增删改，同步出对应的APIService对象。由CRD生成的APIService带有automanaged=true标签，表示持续同步（manageContinuously）。\n\n```\nfunc (c *crdRegistrationController) handleVersionUpdate(groupVersion schema.GroupVersion) error {\n\tapiServiceName := groupVersion.Version + \".\" + groupVersion.Group\n\n\t// check all CRDs.  There shouldn't that many, but if we have problems later we can index them\n\tcrds, err := c.crdLister.List(labels.Everything())\n\tif err != nil {\n\t\treturn err\n\t}\n\tfor _, crd := range crds {\n\t\tif crd.Spec.Group != groupVersion.Group {\n\t\t\tcontinue\n\t\t}\n\t\tfor _, version := range crd.Spec.Versions {\n\t\t\tif version.Name != groupVersion.Version || !version.Served {\n\t\t\t\tcontinue\n\t\t\t}\n\n\t\t\tc.apiServiceRegistration.AddAPIServiceToSync(&v1.APIService{\n\t\t\t\tObjectMeta: metav1.ObjectMeta{Name: apiServiceName},\n\t\t\t\tSpec: v1.APIServiceSpec{\n\t\t\t\t\tGroup:                groupVersion.Group,\n\t\t\t\t\tVersion:              groupVersion.Version,\n\t\t\t\t\tGroupPriorityMinimum: 1000, // CRDs should have relatively low priority\n\t\t\t\t\tVersionPriority:      100,  // CRDs will be sorted by kube-like versions like any other APIService with the same VersionPriority\n\t\t\t\t},\n\t\t\t})\n\t\t\treturn nil\n\t\t}\n\t}\n\n\tc.apiServiceRegistration.RemoveAPIServiceToSync(apiServiceName)\n\treturn nil\n}\n\n// 由CRD生成的APIService要跟着CRD资源的变化持续同步\n// AddAPIServiceToSync registers an API service to sync continuously.\nfunc (c *autoRegisterController) AddAPIServiceToSync(in 
*v1.APIService) {\n\tc.addAPIServiceToSync(in, manageContinuously)\n}\n```\n\n#### 2.4 openAPIAggregationController\n\nopenAPIAggregationController 是在PrepareRun中运行的，核心也是监听APIService对象。然后Run->runWorker->processNextWorkItem->sync，同步OpenAPI文档。\n\n```\n// PrepareRun prepares the aggregator to run, by setting up the OpenAPI spec and calling\n// the generic PrepareRun.\nfunc (s *APIAggregator) PrepareRun() (preparedAPIAggregator, error) {\n\t// add post start hook before generic PrepareRun in order to be before /healthz installation\n\tif s.openAPIConfig != nil {\n\t\ts.GenericAPIServer.AddPostStartHookOrDie(\"apiservice-openapi-controller\", func(context genericapiserver.PostStartHookContext) error {\n\t\t\tgo s.openAPIAggregationController.Run(context.StopCh)\n\t\t\treturn nil\n\t\t})\n\t}\n\n\tprepared := s.GenericAPIServer.PrepareRun()\n\n\t// delay OpenAPI setup until the delegate had a chance to setup their OpenAPI handlers\n\tif s.openAPIConfig != nil {\n\t\tspecDownloader := openapiaggregator.NewDownloader()\n\t\topenAPIAggregator, err := openapiaggregator.BuildAndRegisterAggregator(\n\t\t\t&specDownloader,\n\t\t\ts.GenericAPIServer.NextDelegate(),\n\t\t\ts.GenericAPIServer.Handler.GoRestfulContainer.RegisteredWebServices(),\n\t\t\ts.openAPIConfig,\n\t\t\ts.GenericAPIServer.Handler.NonGoRestfulMux)\n\t\tif err != nil {\n\t\t\treturn preparedAPIAggregator{}, err\n\t\t}\n\t\ts.openAPIAggregationController = openapicontroller.NewAggregationController(&specDownloader, openAPIAggregator)\n\t}\n\n\treturn preparedAPIAggregator{APIAggregator: s, runnable: prepared}, nil\n}\n```\n\n### 3. 总结\n\n可以看出来AggregatorServer做了很多事情。kube-apiserver实现聚合的关键就是它，通过APIService资源扩展了API。\n\n可以利用这个机制做很多事情，比如自定义metrics-server，或者通过添加APIService实现类似CRD的扩展API效果。\n\n社区也有一个专业的工具，详见：https://github.com/kubernetes-sigs/apiserver-builder-alpha/\n\napiserver-builder-alpha是一系列工具和库的集合，它能够：\n\n1. 为新的API资源创建Go类型、控制器（基于controller-runtime）、测试用例、文档\n2. 构建、（独立、在Minikube或者在K8S中）运行扩展的控制平面组件（APIServer）\n3. 
让在控制器中watch/update资源更简单\n4. 让创建新的资源/子资源更简单\n5. 提供大部分合理的默认值"
  },
  {
    "path": "k8s/kube-apiserver/11-kube-apiserver 启动http和https服务.md",
"content": "* [1\\. 启动http服务](#1-启动http服务)\n  * [1\\.1 链路流程](#11-链路流程)\n  * [1\\.2 insecureHandlerChain](#12-insecurehandlerchain)\n* [2\\. 启动https服务](#2-启动https服务)\n  * [2\\.1 启动过程](#21-启动过程)\n  * [2\\.2 DefaultBuildHandlerChain](#22-defaultbuildhandlerchain)\n  * [2\\.3 调用链路](#23-调用链路)\n    * [2\\.3\\.1\\. NewConfig 指定了server\\.Config\\.BuildHandlerChainFunc=DefaultBuildHandlerChain](#231-newconfig-指定了serverconfigbuildhandlerchainfuncdefaultbuildhandlerchain)\n    * [2\\.3\\.2\\. completedConfig\\.new 使用这个func](#232-completedconfignew-使用这个func)\n    * [2\\.3\\.3\\. createAggregatorServer调用了NewWithDelegate，调用了第二步的New函数](#233-createaggregatorserver调用了newwithdelegate调用了第二步的new函数)\n    * [2\\.3\\.4\\. Run函数调用了NonBlockingRun函数](#234-run函数调用了nonblockingrun函数)\n* [3 总结](#3-总结)\n\n**本章重点：** 分析最后两个流程：启动HTTP、HTTPS服务\n\nkube-apiserver整体启动流程如下：\n\n（1）资源注册\n\n（2）Cobra命令行参数解析\n\n（3）创建APIServer通用配置\n\n（4）创建APIExtensionsServer\n\n（5）创建KubeAPIServer\n\n（6）创建AggregatorServer\n\n（7）启动HTTP服务\n\n（8）启动HTTPS服务\n\n### 1. 启动http服务\n\n#### 1.1 链路流程\n\nGo语言提供的HTTP标准库非常强大，Kubernetes API Server在其基础上并没有过多的封装，因为它的功能和性能已经很完善了，可直接拿来用。在Go语言中开启HTTP服务有很多种方法，例如通过http.ListenAndServe函数可以直接启动HTTP服务，其内部实现了创建Socket、监听端口等操作。下面看看Kubernetes APIServer通过自定义http.Server的方式创建HTTP服务的过程，代码示例如下：\n\n```\nif insecureServingInfo != nil {\n\t\tinsecureHandlerChain := kubeserver.BuildInsecureHandlerChain(aggregatorServer.GenericAPIServer.UnprotectedHandler(), kubeAPIServerConfig.GenericConfig)\n\t\tif err := insecureServingInfo.Serve(insecureHandlerChain, kubeAPIServerConfig.GenericConfig.RequestTimeout, stopCh); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n\t\n\t\n// Serve starts an insecure http server with the given handler. It fails only if\n// the initial listen call fails. 
It does not block.\nfunc (s *DeprecatedInsecureServingInfo) Serve(handler http.Handler, shutdownTimeout time.Duration, stopCh <-chan struct{}) error {\n\tinsecureServer := &http.Server{\n\t\tAddr:           s.Listener.Addr().String(),\n\t\tHandler:        handler,\n\t\tMaxHeaderBytes: 1 << 20,\n\t}\n\n\tif len(s.Name) > 0 {\n\t\tklog.Infof(\"Serving %s insecurely on %s\", s.Name, s.Listener.Addr())\n\t} else {\n\t\tklog.Infof(\"Serving insecurely on %s\", s.Listener.Addr())\n\t}\n\t_, err := RunServer(insecureServer, s.Listener, shutdownTimeout, stopCh)\n\t// NOTE: we do not handle stoppedCh returned by RunServer for graceful termination here\n\treturn err\n}\n\n\n// RunServer spawns a go-routine continuously serving until the stopCh is\n// closed.\n// It returns a stoppedCh that is closed when all non-hijacked active requests\n// have been processed.\n// This function does not block\n// TODO: make private when insecure serving is gone from the kube-apiserver\nfunc RunServer(\n\tserver *http.Server,\n\tln net.Listener,\n\tshutDownTimeout time.Duration,\n\tstopCh <-chan struct{},\n) (<-chan struct{}, error) {\n\tif ln == nil {\n\t\treturn nil, fmt.Errorf(\"listener must not be nil\")\n\t}\n\n\t// Shutdown server gracefully.\n\tstoppedCh := make(chan struct{})\n\tgo func() {\n\t\tdefer close(stoppedCh)\n\t\t<-stopCh\n\t\tctx, cancel := context.WithTimeout(context.Background(), shutDownTimeout)\n\t\tserver.Shutdown(ctx)\n\t\tcancel()\n\t}()\n\n\tgo func() {\n\t\tdefer utilruntime.HandleCrash()\n\n\t\tvar listener net.Listener\n\t\tlistener = tcpKeepAliveListener{ln.(*net.TCPListener)}\n\t\tif server.TLSConfig != nil {\n\t\t\tlistener = tls.NewListener(listener, server.TLSConfig)\n\t\t}\n\n\t\terr := server.Serve(listener)\n\n\t\tmsg := fmt.Sprintf(\"Stopped listening on %s\", ln.Addr().String())\n\t\tselect {\n\t\tcase <-stopCh:\n\t\t\tklog.Info(msg)\n\t\tdefault:\n\t\t\tpanic(fmt.Sprintf(\"%s due to error: %v\", msg, err))\n\t\t}\n\t}()\n\n\treturn stoppedCh, 
nil\n}\n```\n\n在RunServer函数中，通过Go语言标准库的server.Serve监听listener，并在运行过程中为每个连接创建一个goroutine。goroutine读取请求，然后调用Handler函数来处理并响应请求。另外，在Kubernetes API Server的代码中还实现了平滑关闭HTTP服务的功能，利用Go语言标准库的HTTP Server.Shutdown函数可以在不干扰任何活跃连接的情况下关闭服务。其原理是，首先关闭所有的监听listener，然后关闭所有的空闲连接，接着无限期地等待所有连接变成空闲状态并关闭。如果设置带有超时的Context，将在HTTP服务关闭之前返回Context超时错误。\n\n#### 1.2 insecureHandlerChain\n\n所以如果是http请求的话，处理函数的链路为：\n\n**WithMaxInFlightLimit：** apiserver限流策略，通过go chan实现限流。--max-requests-inflight=1000 --max-mutating-requests-inflight=1000指定的是读/写请求的最大并发数（并非QPS）。\n\n**WithAudit：** 开启审计，日志以event格式输出\n\n**WithAuthentication：** 进行认证，其实是为了方便审计\n\n**WithCORS：** 跨域资源共享（CORS）支持，通过kube-apiserver的--cors-allowed-origins参数指定允许的来源。例如：\n\n--cors-allowed-origins=http://www.example.com,https://*.example.com\n\n**WithTimeoutForNonLongRunningRequests：** 设置超时时间，默认是1min\n\n**WithRequestInfo：** 根据请求信息，补充完整requestInfo结构体信息\n\n**WithCacheControl：** 给request设置Cache-Control信息\n\n**WithPanicRecovery：** 如果一个请求在处理过程中造成了panic，恢复并返回http.StatusInternalServerError\n\n函数介绍：\n\n```\n// BuildInsecureHandlerChain sets up the server to listen to http. 
Should be removed.\nfunc BuildInsecureHandlerChain(apiHandler http.Handler, c *server.Config) http.Handler {\n\thandler := apiHandler\n\thandler = genericfilters.WithMaxInFlightLimit(handler, c.MaxRequestsInFlight, c.MaxMutatingRequestsInFlight, c.LongRunningFunc, c.EventQpsRatio, c.RequestTimeout)\n\thandler = genericapifilters.WithAudit(handler, c.AuditBackend, c.AuditPolicyChecker, c.LongRunningFunc)\n\thandler = genericapifilters.WithAuthentication(handler, server.InsecureSuperuser{}, nil, nil)\n\thandler = genericfilters.WithCORS(handler, c.CorsAllowedOriginList, nil, nil, nil, \"true\")\n\thandler = genericfilters.WithTimeoutForNonLongRunningRequests(handler, c.LongRunningFunc, c.RequestTimeout)\n\thandler = genericfilters.WithWaitGroup(handler, c.LongRunningFunc, c.HandlerChainWaitGroup)\n\thandler = genericapifilters.WithRequestInfo(handler, server.NewRequestInfoResolver(c))\n\thandler = genericapifilters.WithCacheControl(handler)\n\thandler = genericfilters.WithPanicRecovery(handler)\n\n\treturn handler\n}\n```\n\n请求是从下到上的，所以顺序为：Panic recovery -> TimeOut -> Authentication -> Audit -> MaxInFlightLimit\n\n<br>\n\n### 2. 启动https服务\n\n#### 2.1 启动过程\n\n在NonBlockingRun函数，启动了https服务\n\n```\n// NonBlockingRun spawns the secure http server. An error is\n// returned if the secure port cannot be listened on.\nfunc (s preparedGenericAPIServer) NonBlockingRun(stopCh <-chan struct{}) error {\n\t// Use an stop channel to allow graceful shutdown without dropping audit events\n\t// after http server shutdown.\n\tauditStopCh := make(chan struct{})\n\n\t// Start the audit backend before any request comes in. This means we must call Backend.Run\n\t// before http server start serving. 
Otherwise the Backend.ProcessEvents call might block.\n\tif s.AuditBackend != nil {\n\t\tif err := s.AuditBackend.Run(auditStopCh); err != nil {\n\t\t\treturn fmt.Errorf(\"failed to run the audit backend: %v\", err)\n\t\t}\n\t}\n   \n  // 开启https服务\n\t// Use an internal stop channel to allow cleanup of the listeners on error.\n\tinternalStopCh := make(chan struct{})\n\tvar stoppedCh <-chan struct{}\n\tif s.SecureServingInfo != nil && s.Handler != nil {\n\t\tvar err error\n\t\tstoppedCh, err = s.SecureServingInfo.Serve(s.Handler, s.ShutdownTimeout, internalStopCh)\n\t\tif err != nil {\n\t\t\tclose(internalStopCh)\n\t\t\tclose(auditStopCh)\n\t\t\treturn err\n\t\t}\n\t}\n\n\t// Now that listener have bound successfully, it is the\n\t// responsibility of the caller to close the provided channel to\n\t// ensure cleanup.\n\tgo func() {\n\t\t<-stopCh\n\t\tclose(internalStopCh)\n\t\tif stoppedCh != nil {\n\t\t\t<-stoppedCh\n\t\t}\n\t\ts.HandlerChainWaitGroup.Wait()\n\t\tclose(auditStopCh)\n\t}()\n\n\ts.RunPostStartHooks(stopCh)\n\n\tif _, err := systemd.SdNotify(true, \"READY=1\\n\"); err != nil {\n\t\tklog.Errorf(\"Unable to send systemd daemon successful start message: %v\\n\", err)\n\t}\n\n\treturn nil\n}\n```\n\nHTTPS服务在http.Server上增加了TLSConfig配置，TLSConfig用于配置相关证书，可以通过命令行相关参数（--client-ca-file、--tls-private-key-file、--tls-cert-file参数）进行配置。具体过程不再赘述。\n\n#### 2.2 DefaultBuildHandlerChain\n\n很多人在网上看见的都是这个图。左边的handler-chain其实就是https服务的handlers\n\n![handler-chian](../images/handler-chian.jpg)\n\n调用函数为**DefaultBuildHandlerChain**:\n\nDefaultBuildHandlerChain比insecureHandlerChain多了授权（WithAuthorization）、用户伪装（WithImpersonation）等Handler，其他基本一致。\n\n```\nfunc DefaultBuildHandlerChain(apiHandler http.Handler, c *Config) http.Handler {\n\thandler := genericapifilters.WithAuthorization(apiHandler, c.Authorization.Authorizer, c.Serializer)\n\thandler = genericfilters.WithMaxInFlightLimit(handler, c.MaxRequestsInFlight, c.MaxMutatingRequestsInFlight, c.LongRunningFunc, c.EventQpsRatio, c.RequestTimeout)\n\thandler = 
genericapifilters.WithImpersonation(handler, c.Authorization.Authorizer, c.Serializer)\n\thandler = genericapifilters.WithAudit(handler, c.AuditBackend, c.AuditPolicyChecker, c.LongRunningFunc)\n\tfailedHandler := genericapifilters.Unauthorized(c.Serializer, c.Authentication.SupportsBasicAuth)\n\tfailedHandler = genericapifilters.WithFailedAuthenticationAudit(failedHandler, c.AuditBackend, c.AuditPolicyChecker)\n\thandler = genericapifilters.WithAuthentication(handler, c.Authentication.Authenticator, failedHandler, c.Authentication.APIAudiences)\n\thandler = genericfilters.WithCORS(handler, c.CorsAllowedOriginList, nil, nil, nil, \"true\")\n\thandler = genericfilters.WithTimeoutForNonLongRunningRequests(handler, c.LongRunningFunc, c.RequestTimeout)\n\thandler = genericfilters.WithWaitGroup(handler, c.LongRunningFunc, c.HandlerChainWaitGroup)\n\thandler = genericapifilters.WithRequestInfo(handler, c.RequestInfoResolver)\n\thandler = genericfilters.WithPanicRecovery(handler)\n\treturn handler\n}\n```\n\n\n\n#### 2.3 调用链路\n\n**调用链路**： createAggregatorServer -> NewWithDelegate -> NewConfig -> DefaultBuildHandlerChain\n\n##### 2.3.1. NewConfig 指定了server.Config.BuildHandlerChainFunc=DefaultBuildHandlerChain\n\n```\n// NewConfig returns a Config struct with the default values\nfunc NewConfig(codecs serializer.CodecFactory) *Config {\n\tdefaultHealthChecks := []healthz.HealthChecker{healthz.PingHealthz, healthz.LogHealthz}\n\treturn &Config{\n\t\tSerializer:                  codecs,\n\t\tBuildHandlerChainFunc:       DefaultBuildHandlerChain,\n```\n\n##### 2.3.2. completedConfig.new 使用这个func\n\n最终APIServerHandler = DefaultBuildHandlerChain\n\n最终GenericAPIServer.Handler = DefaultBuildHandlerChain\n\n```\n// New creates a new server which logically combines the handling chain with the passed server.\n// name is used to differentiate for logging. 
The handler chain in particular can be difficult as it starts delgating.\n// delegationTarget may not be nil.\nfunc (c completedConfig) New(name string, delegationTarget DelegationTarget) (*GenericAPIServer, error) {\n\tif c.Serializer == nil {\n\t\treturn nil, fmt.Errorf(\"Genericapiserver.New() called with config.Serializer == nil\")\n\t}\n\tif c.LoopbackClientConfig == nil {\n\t\treturn nil, fmt.Errorf(\"Genericapiserver.New() called with config.LoopbackClientConfig == nil\")\n\t}\n\tif c.EquivalentResourceRegistry == nil {\n\t\treturn nil, fmt.Errorf(\"Genericapiserver.New() called with config.EquivalentResourceRegistry == nil\")\n\t}\n  \n  // \n\thandlerChainBuilder := func(handler http.Handler) http.Handler {\n\t\treturn c.BuildHandlerChainFunc(handler, c.Config)\n\t}\n\tapiServerHandler := NewAPIServerHandler(name, c.Serializer, handlerChainBuilder, delegationTarget.UnprotectedHandler())\n\n}\n\n\nfunc NewAPIServerHandler(name string, s runtime.NegotiatedSerializer, handlerChainBuilder HandlerChainBuilderFn, notFoundHandler http.Handler) *APIServerHandler {\n\tnonGoRestfulMux := mux.NewPathRecorderMux(name)\n\tif notFoundHandler != nil {\n\t\tnonGoRestfulMux.NotFoundHandler(notFoundHandler)\n\t}\n\n\tgorestfulContainer := restful.NewContainer()\n\tgorestfulContainer.ServeMux = http.NewServeMux()\n\tgorestfulContainer.Router(restful.CurlyRouter{}) // e.g. 
for proxy/{kind}/{name}/{*}\n\tgorestfulContainer.RecoverHandler(func(panicReason interface{}, httpWriter http.ResponseWriter) {\n\t\tlogStackOnRecover(s, panicReason, httpWriter)\n\t})\n\tgorestfulContainer.ServiceErrorHandler(func(serviceErr restful.ServiceError, request *restful.Request, response *restful.Response) {\n\t\tserviceErrorHandler(s, serviceErr, request, response)\n\t})\n\n\tdirector := director{\n\t\tname:               name,\n\t\tgoRestfulContainer: gorestfulContainer,\n\t\tnonGoRestfulMux:    nonGoRestfulMux,\n\t}\n\n\treturn &APIServerHandler{\n\t\tFullHandlerChain:   handlerChainBuilder(director),\n\t\tGoRestfulContainer: gorestfulContainer,\n\t\tNonGoRestfulMux:    nonGoRestfulMux,\n\t\tDirector:           director,\n\t}\n}\n```\n\n##### 2.3.3. createAggregatorServer调用了NewWithDelegate，调用了第二步的New函数\n\n```\nfunc createAggregatorServer(aggregatorConfig *aggregatorapiserver.Config, delegateAPIServer genericapiserver.DelegationTarget, apiExtensionInformers apiextensionsinformers.SharedInformerFactory) (*aggregatorapiserver.APIAggregator, error) {\n\taggregatorServer, err := aggregatorConfig.Complete().NewWithDelegate(delegateAPIServer)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\t// create controller\n```\n\n\n\n所以APIAggregator=DefaultBuildHandlerChain\n\n<br>\n\n最终还调用了server.GenericAPIServer.PrepareRun().Run(stopCh) \n\n```\n// RunAggregator runs the API Aggregator.\nfunc (o AggregatorOptions) RunAggregator(stopCh <-chan struct{}) error {\n\t\n\n\n\tserver, err := config.Complete().NewWithDelegate(genericapiserver.NewEmptyDelegate())\n\tif err != nil {\n\t\treturn err\n\t}\n\treturn server.GenericAPIServer.PrepareRun().Run(stopCh)\n}\n\n\n// NewWithDelegate returns a new instance of APIAggregator from the given config.\nfunc (c completedConfig) NewWithDelegate(delegationTarget genericapiserver.DelegationTarget) (*APIAggregator, error) {\n\t// Prevent generic API server to install OpenAPI handler. 
Aggregator server\n\t// has its own customized OpenAPI handler.\n\topenAPIConfig := c.GenericConfig.OpenAPIConfig\n\tc.GenericConfig.OpenAPIConfig = nil\n\n\tgenericServer, err := c.GenericConfig.New(\"kube-aggregator\", delegationTarget)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n```\n\n##### 2.3.4. Run函数调用了NonBlockingRun函数\n\nNonBlockingRun调用了SecureServingInfo.Serve，传入的handler是s.Handler（即前面构建的完整handler chain）\n\n```\nif s.SecureServingInfo != nil && s.Handler != nil {\n   var err error\n   stoppedCh, err = s.SecureServingInfo.Serve(s.Handler, s.ShutdownTimeout, internalStopCh)\n   if err != nil {\n      close(internalStopCh)\n      close(auditStopCh)\n      return err\n   }\n}\n```\n\n### 3 总结\n\n到这里, kube-apiserver对一个请求的处理就非常清楚了。\n\n（1）先是通过统一的handler chain处理（http、https是不同的chain，https多了授权相关的处理）\n\n（2）然后判断是否是需要交给aggregated server处理的请求（APIService）\n\n（3）如果是内置资源或者CRD资源，则通过kube-apiserver处理（MUX后面的流程，接下来进行分析）\n\n![handler-chian](../images/handler-chian.jpg)"
  },
  {
    "path": "k8s/kube-apiserver/12-k8s之Authentication.md",
"content": "Table of Contents\n=================\n\n  * [1. 简介](#1-简介)\n  * [2. 认证器的生成](#2-认证器的生成)\n     * [2.1 调用链路](#21-调用链路)\n     * [2.2 BuildAuthenticator](#22-buildauthenticator)\n     * [2.3 ToAuthenticationConfig](#23-toauthenticationconfig)\n     * [2.4 New](#24-new)\n  * [3. 具体的认证过程](#3-具体的认证过程)\n     * [3.1 调用链路](#31-调用链路)\n     * [3.2 t.handler到底是谁](#32-thandler到底是谁)\n     * [3.3 DefaultBuildHandlerChain](#33-defaultbuildhandlerchain)\n     * [3.4 WithAuthentication](#34-withauthentication)\n  * [4. 9种认证方式介绍](#4-9种认证方式介绍)\n     * [4.1 BasicAuth认证](#41-basicauth认证)\n     * [4.2 ClientCA认证](#42-clientca认证)\n     * [4.3 TokenAuth认证](#43-tokenauth认证)\n     * [4.4 BootstrapToken认证](#44-bootstraptoken认证)\n     * [4.5  RequestHeader认证](#45--requestheader认证)\n     * [4.6 WebhookTokenAuth认证](#46-webhooktokenauth认证)\n     * [4.7 Anonymous认证](#47-anonymous认证)\n     * [4.8 OIDC认证](#48-oidc认证)\n     * [4.9 ServiceAccountAuth认证](#49-serviceaccountauth认证)\n     * [4.10 总结](#410-总结)\n  * [5.参考链接：](#5参考链接)\n\n### 1. 简介\n\nkube-apiserver作为服务器端，每次请求到来时都需要经过认证、授权，以及一系列的访问控制。k8s 1.17版本中，共提供9种认证方式：Anonymous，BootstrapToken，ClientCert，OIDC，PasswordFile，RequestHeader，ServiceAccounts，TokenFile，WebHook。\n\n认证和授权的区别：\n\n假设apiserver收到了一个请求，一个名叫张三的用户想删除namespaceA下的一个pod。\n\n**认证：** apiserver判断你到底是不是张三\n\n**授权：** 张三到底有没有删除这个pod的权限\n\n<br>\n\n### 2. 认证器的生成\n\n#### 2.1 调用链路\n\nRun函数经过一系列的调用，最终调用BuildAuthenticator函数来生成认证器。\n\ncmd/kube-apiserver/app/server.go Run -> CreateServerChain -> CreateKubeAPIServerConfig -> buildGenericConfig -> BuildAuthenticator\n\n```\n以下函数只显示关键的代码\n// Run runs the specified APIServer.  
This should never exit.\nfunc Run(completeOptions completedServerRunOptions, stopCh <-chan struct{}) error {\n\tserver, err := CreateServerChain(completeOptions, stopCh)\n}\n\n\n// CreateServerChain creates the apiservers connected via delegation.\nfunc CreateServerChain(completedOptions completedServerRunOptions, stopCh <-chan struct{}) (*aggregatorapiserver.APIAggregator, error) {\n\n\n\tkubeAPIServerConfig, insecureServingInfo, serviceResolver, pluginInitializer, err := CreateKubeAPIServerConfig(completedOptions, nodeTunneler, proxyTransport)\n\t\n}\n\n\n// CreateKubeAPIServerConfig creates all the resources for running the API server, but runs none of them\nfunc CreateKubeAPIServerConfig(\n\ts completedServerRunOptions,\n\tnodeTunneler tunneler.Tunneler,\n\tproxyTransport *http.Transport,\n) (\n\t*master.Config,\n\t*genericapiserver.DeprecatedInsecureServingInfo,\n\taggregatorapiserver.ServiceResolver,\n\t[]admission.PluginInitializer,\n\terror,\n) {\n\n\tgenericConfig, versionedInformers, insecureServingInfo, serviceResolver, pluginInitializers, admissionPostStartHook, storageFactory, err := buildGenericConfig(s.ServerRunOptions, proxyTransport)\n\n}\n\n\n// BuildGenericConfig takes the master server options and produces the genericapiserver.Config associated with it\nfunc buildGenericConfig(\n\ts *options.ServerRunOptions,\n\tproxyTransport *http.Transport,\n) (\n\tgenericConfig *genericapiserver.Config,\n\tversionedInformers clientgoinformers.SharedInformerFactory,\n\tinsecureServingInfo *genericapiserver.DeprecatedInsecureServingInfo,\n\tserviceResolver aggregatorapiserver.ServiceResolver,\n\tpluginInitializers []admission.PluginInitializer,\n\tadmissionPostStartHook genericapiserver.PostStartHookFunc,\n\tstorageFactory *serverstorage.DefaultStorageFactory,\n\tlastErr error,\n) {\n\t\n  // 认证\n\tgenericConfig.Authentication.Authenticator, genericConfig.OpenAPIConfig.SecurityDefinitions, err = BuildAuthenticator(s, clientgoExternalClient, 
versionedInformers)\n\tif err != nil {\n\t\tlastErr = fmt.Errorf(\"invalid authentication config: %v\", err)\n\t\treturn\n\t}\n\n  //授权\n\tgenericConfig.Authorization.Authorizer, genericConfig.RuleResolver, err = BuildAuthorizer(s, versionedInformers)\n}\n```\n\n<br>\n\n#### 2.2 BuildAuthenticator\n\nBuildAuthenticator是关键的函数，从这里开始进行代码分析\n\n```\n// BuildAuthenticator constructs the authenticator\nfunc BuildAuthenticator(s *options.ServerRunOptions, extclient clientgoclientset.Interface, versionedInformer clientgoinformers.SharedInformerFactory) (authenticator.Request, *spec.SecurityDefinitions, error) {\n   \n   // 1.生成config\n   authenticatorConfig, err := s.Authentication.ToAuthenticationConfig()\n   if err != nil {\n      return nil, nil, err\n   }\n   if s.Authentication.ServiceAccounts.Lookup || utilfeature.DefaultFeatureGate.Enabled(features.TokenRequest) {\n      authenticatorConfig.ServiceAccountTokenGetter = serviceaccountcontroller.NewGetterFromClient(\n         extclient,\n         versionedInformer.Core().V1().Secrets().Lister(),\n         versionedInformer.Core().V1().ServiceAccounts().Lister(),\n         versionedInformer.Core().V1().Pods().Lister(),\n      )\n   }\n   authenticatorConfig.BootstrapTokenAuthenticator = bootstrap.NewTokenAuthenticator(\n      versionedInformer.Core().V1().Secrets().Lister().Secrets(v1.NamespaceSystem),\n   )\n   \n   // 2.根据config，生成每个认证方式的handler\n   return authenticatorConfig.New()\n}\n```\n\n<br>\n\n#### 2.3 ToAuthenticationConfig\n\n可以看出来ToAuthenticationConfig就是根据输入的配置，判断哪些认证方式（上面说的九种）需要生成config\n\n这里直接看代码就行。\n\n<br>\n\n#### 2.4 New\n\n（1）New函数根据认证的配置信息，针对9种认证方法，生成对应的handler。具体做法就是将各种认证生成authenticator，加入authenticators数组\n\n（2）将authenticators数组生成一个union handler\n\n（3）最终得到认证器AuthenticatedGroupAdder\n\n```\n// New returns an authenticator.Request or an error that supports the standard\n// Kubernetes authentication mechanisms.\nfunc (config Config) New() (authenticator.Request, *spec.SecurityDefinitions, error) 
{\n\tvar authenticators []authenticator.Request\n\tvar tokenAuthenticators []authenticator.Token\n\tsecurityDefinitions := spec.SecurityDefinitions{}\n\n\t// front-proxy, BasicAuth methods, local first, then remote\n\t// Add the front proxy authenticator if requested\n\tif config.RequestHeaderConfig != nil {\n\t\trequestHeaderAuthenticator := headerrequest.NewDynamicVerifyOptionsSecure(\n\t\t\tconfig.RequestHeaderConfig.CAContentProvider.VerifyOptions,\n\t\t\tconfig.RequestHeaderConfig.AllowedClientNames,\n\t\t\tconfig.RequestHeaderConfig.UsernameHeaders,\n\t\t\tconfig.RequestHeaderConfig.GroupHeaders,\n\t\t\tconfig.RequestHeaderConfig.ExtraHeaderPrefixes,\n\t\t)\n\t\tauthenticators = append(authenticators, authenticator.WrapAudienceAgnosticRequest(config.APIAudiences, requestHeaderAuthenticator))\n\t}\n  \n  // 1.将各种认证生成authenticator，加入authenticators数组\n\t// basic auth\n\tif len(config.BasicAuthFile) > 0 {\n\t\tbasicAuth, err := newAuthenticatorFromBasicAuthFile(config.BasicAuthFile)\n\t\tif err != nil {\n\t\t\treturn nil, nil, err\n\t\t}\n\t\tauthenticators = append(authenticators, authenticator.WrapAudienceAgnosticRequest(config.APIAudiences, basicAuth))\n\n\t\tsecurityDefinitions[\"HTTPBasic\"] = &spec.SecurityScheme{\n\t\t\tSecuritySchemeProps: spec.SecuritySchemeProps{\n\t\t\t\tType:        \"basic\",\n\t\t\t\tDescription: \"HTTP Basic authentication\",\n\t\t\t},\n\t\t}\n\t}\n\n\t// X509 methods\n\tif config.ClientCAContentProvider != nil {\n\t\tcertAuth := x509.NewDynamic(config.ClientCAContentProvider.VerifyOptions, x509.CommonNameUserConversion)\n\t\tauthenticators = append(authenticators, certAuth)\n\t}\n\n\t// Bearer token methods, local first, then remote\n\tif len(config.TokenAuthFile) > 0 {\n\t\ttokenAuth, err := newAuthenticatorFromTokenFile(config.TokenAuthFile)\n\t\tif err != nil {\n\t\t\treturn nil, nil, err\n\t\t}\n\t\ttokenAuthenticators = append(tokenAuthenticators, authenticator.WrapAudienceAgnosticToken(config.APIAudiences, 
tokenAuth))\n\t}\n\tif len(config.ServiceAccountKeyFiles) > 0 {\n\t\tserviceAccountAuth, err := newLegacyServiceAccountAuthenticator(config.ServiceAccountKeyFiles, config.ServiceAccountLookup, config.APIAudiences, config.ServiceAccountTokenGetter)\n\t\tif err != nil {\n\t\t\treturn nil, nil, err\n\t\t}\n\t\ttokenAuthenticators = append(tokenAuthenticators, serviceAccountAuth)\n\t}\n\tif utilfeature.DefaultFeatureGate.Enabled(features.TokenRequest) && config.ServiceAccountIssuer != \"\" {\n\t\tserviceAccountAuth, err := newServiceAccountAuthenticator(config.ServiceAccountIssuer, config.ServiceAccountKeyFiles, config.APIAudiences, config.ServiceAccountTokenGetter)\n\t\tif err != nil {\n\t\t\treturn nil, nil, err\n\t\t}\n\t\ttokenAuthenticators = append(tokenAuthenticators, serviceAccountAuth)\n\t}\n\tif config.BootstrapToken {\n\t\tif config.BootstrapTokenAuthenticator != nil {\n\t\t\t// TODO: This can sometimes be nil because of\n\t\t\ttokenAuthenticators = append(tokenAuthenticators, authenticator.WrapAudienceAgnosticToken(config.APIAudiences, config.BootstrapTokenAuthenticator))\n\t\t}\n\t}\n\t// NOTE(ericchiang): Keep the OpenID Connect after Service Accounts.\n\t//\n\t// Because both plugins verify JWTs whichever comes first in the union experiences\n\t// cache misses for all requests using the other. 
While the service account plugin\n\t// simply returns an error, the OpenID Connect plugin may query the provider to\n\t// update the keys, causing performance hits.\n\tif len(config.OIDCIssuerURL) > 0 && len(config.OIDCClientID) > 0 {\n\t\toidcAuth, err := newAuthenticatorFromOIDCIssuerURL(oidc.Options{\n\t\t\tIssuerURL:            config.OIDCIssuerURL,\n\t\t\tClientID:             config.OIDCClientID,\n\t\t\tAPIAudiences:         config.APIAudiences,\n\t\t\tCAFile:               config.OIDCCAFile,\n\t\t\tUsernameClaim:        config.OIDCUsernameClaim,\n\t\t\tUsernamePrefix:       config.OIDCUsernamePrefix,\n\t\t\tGroupsClaim:          config.OIDCGroupsClaim,\n\t\t\tGroupsPrefix:         config.OIDCGroupsPrefix,\n\t\t\tSupportedSigningAlgs: config.OIDCSigningAlgs,\n\t\t\tRequiredClaims:       config.OIDCRequiredClaims,\n\t\t})\n\t\tif err != nil {\n\t\t\treturn nil, nil, err\n\t\t}\n\t\ttokenAuthenticators = append(tokenAuthenticators, oidcAuth)\n\t}\n\tif len(config.WebhookTokenAuthnConfigFile) > 0 {\n\t\twebhookTokenAuth, err := newWebhookTokenAuthenticator(config.WebhookTokenAuthnConfigFile, config.WebhookTokenAuthnVersion, config.WebhookTokenAuthnCacheTTL, config.APIAudiences)\n\t\tif err != nil {\n\t\t\treturn nil, nil, err\n\t\t}\n\t\ttokenAuthenticators = append(tokenAuthenticators, webhookTokenAuth)\n\t}\n\n\tif len(tokenAuthenticators) > 0 {\n\t\t// Union the token authenticators\n\t\ttokenAuth := tokenunion.New(tokenAuthenticators...)\n\t\t// Optionally cache authentication results\n\t\tif config.TokenSuccessCacheTTL > 0 || config.TokenFailureCacheTTL > 0 {\n\t\t\ttokenAuth = tokencache.New(tokenAuth, true, config.TokenSuccessCacheTTL, config.TokenFailureCacheTTL)\n\t\t}\n\t\tauthenticators = append(authenticators, bearertoken.New(tokenAuth), websocket.NewProtocolAuthenticator(tokenAuth))\n\t\tsecurityDefinitions[\"BearerToken\"] = &spec.SecurityScheme{\n\t\t\tSecuritySchemeProps: spec.SecuritySchemeProps{\n\t\t\t\tType:        \"apiKey\",\n\t\t\t\tName:  
      \"authorization\",\n\t\t\t\tIn:          \"header\",\n\t\t\t\tDescription: \"Bearer Token authentication\",\n\t\t\t},\n\t\t}\n\t}\n\n\tif len(authenticators) == 0 {\n\t\tif config.Anonymous {\n\t\t\treturn anonymous.NewAuthenticator(), &securityDefinitions, nil\n\t\t}\n\t\treturn nil, &securityDefinitions, nil\n\t}\n\n  // 2. 生成一个union handler\n\tauthenticator := union.New(authenticators...)\n \n \n  // 3.最终得认证器AuthenticatedGroupAdder\n\tauthenticator = group.NewAuthenticatedGroupAdder(authenticator)\n\n\tif config.Anonymous {\n\t\t// If the authenticator chain returns an error, return an error (don't consider a bad bearer token\n\t\t// or invalid username/password combination anonymous).\n\t\tauthenticator = union.NewFailOnError(authenticator, anonymous.NewAuthenticator())\n\t}\n\n\treturn authenticator, &securityDefinitions, nil\n}\n\n\n// union.New函数。\n// New returns a request authenticator that validates credentials using a chain of authenticator.Request objects.\n// The entire chain is tried until one succeeds. 
If all fail, an aggregate error is returned.\nfunc New(authRequestHandlers ...authenticator.Request) authenticator.Request {\n\tif len(authRequestHandlers) == 1 {\n\t\treturn authRequestHandlers[0]\n\t}\n\treturn &unionAuthRequestHandler{Handlers: authRequestHandlers, FailOnError: false}\n}\n```\n\n为什么要包装成unionAuthRequestHandler？因为它实现了AuthenticateRequest方法：依次调用链上各个认证handler，只要其中一种认证成功，就立即返回相应的用户信息；全部失败时返回聚合后的错误。\n\n```\n// AuthenticateRequest authenticates the request using a chain of authenticator.Request objects.\nfunc (authHandler *unionAuthRequestHandler) AuthenticateRequest(req *http.Request) (*authenticator.Response, bool, error) {\n\tvar errlist []error\n\tfor _, currAuthRequestHandler := range authHandler.Handlers {\n\t\tresp, ok, err := currAuthRequestHandler.AuthenticateRequest(req)\n\t\tif err != nil {\n\t\t\tif authHandler.FailOnError {\n\t\t\t\treturn resp, ok, err\n\t\t\t}\n\t\t\terrlist = append(errlist, err)\n\t\t\tcontinue\n\t\t}\n\n\t\tif ok {\n\t\t\treturn resp, ok, err\n\t\t}\n\t}\n\n\treturn nil, false, utilerrors.NewAggregate(errlist)\n}\n```\n\n<br>\n\n### 3. 
具体的认证过程\n\nAuthenticator步骤的输入是整个HTTP请求，但是，它通常只是检查HTTP Headers and/or client certificate。\n\n可以指定多个Authenticator模块，在这种情况下，每个认证模块都按顺序尝试，直到其中一个成功即可。\n\n如果认证成功，则用户的`username`会传入授权模块做进一步授权验证；而对于认证失败的请求则返回HTTP 401。\n\nKubernetes使用client certificates, bearer tokens, an authenticating proxy, or HTTP basic auth, 通过身份验证插件对API请求进行身份验证。 当向API服务器发出一个HTTP请求，Authentication plugin会尝试将以下属性与请求关联：\n\n- Username: 标识终端用户的字符串, 常用值可能是kube-admin或[jane@example.com](mailto:jane@example.com)。\n- UID: 标识终端用户的字符串,比Username更具有唯一性。\n- Groups: a set of strings which associate users with a set of commonly grouped users.\n- Extra fields: 可能有用的额外信息\n\n系统中把这4个属性封装成一个type DefaultInfo struct ，见/pkg/auth/user/user.go。\n\n```\n// DefaultInfo provides a simple user information exchange object\n// for components that implement the UserInfo interface.\ntype DefaultInfo struct {\n\tName   string\n\tUID    string\n\tGroups []string\n\tExtra  map[string][]string\n}\n```\n\n\n\n#### 3.1 调用链路\n\nstaging/src/k8s.io/apiserver/pkg/server/filters/timeout.go ServeHTTP -> ServeHTTP -> \n\n```\nfunc (t *timeoutHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {\n  ...\n   go func() {\n      defer func() {\n         err := recover()\n         // do not wrap the sentinel ErrAbortHandler panic value\n         if err != nil && err != http.ErrAbortHandler {\n            // Same as stdlib http server code. 
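\n            // （笔记）recover()在这里捕获业务handler的panic，避免单个请求的panic打垮整个apiserver进程；\n            // 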
Manually allocate stack\n            // trace buffer size to prevent excessively large logs\n            const size = 64 << 10\n            buf := make([]byte, size)\n            buf = buf[:runtime.Stack(buf, false)]\n            err = fmt.Sprintf(\"%v\\n%s\", err, buf)\n         }\n         resultCh <- err\n      }()\n      t.handler.ServeHTTP(tw, r)\n   }()\n}\n\nServeHTTP是接口方法，具体执行哪段逻辑取决于t.handler的实际类型。例如标准库的HandlerFunc：\n// ServeHTTP calls f(w, r).\nfunc (f HandlerFunc) ServeHTTP(w ResponseWriter, r *Request) {\n\tf(w, r)\n}\n```\n\n<br>\n\n#### 3.2 t.handler到底是谁\n\nstaging/src/k8s.io/apiserver/pkg/server/config.go\n\nApiserver的Config中指定了构建这串链式handler的函数：BuildHandlerChainFunc默认为DefaultBuildHandlerChain。\n\n```\n// NewConfig returns a Config struct with the default values\nfunc NewConfig(codecs serializer.CodecFactory) *Config {\n\tdefaultHealthChecks := []healthz.HealthChecker{healthz.PingHealthz, healthz.LogHealthz}\n\treturn &Config{\n\t\tSerializer:                  codecs,\n\t\tBuildHandlerChainFunc:       DefaultBuildHandlerChain,\n\t\t// ...（其余字段省略）\n\t}\n}\n```\n\n<br>\n\n#### 3.3 DefaultBuildHandlerChain\n\nDefaultBuildHandlerChain定义了这条链式handler，其中负责认证的就是WithAuthentication函数。\n\n```\nfunc DefaultBuildHandlerChain(apiHandler http.Handler, c *Config) http.Handler {\n\thandler := genericapifilters.WithAuthorization(apiHandler, c.Authorization.Authorizer, c.Serializer)\n\thandler = genericfilters.WithMaxInFlightLimit(handler, c.MaxRequestsInFlight, c.MaxMutatingRequestsInFlight, c.LongRunningFunc)\n\thandler = genericapifilters.WithImpersonation(handler, c.Authorization.Authorizer, c.Serializer)\n\thandler = genericapifilters.WithAudit(handler, c.AuditBackend, c.AuditPolicyChecker, c.LongRunningFunc)\n\t// 认证失败时的处理handler\n\tfailedHandler := genericapifilters.Unauthorized(c.Serializer, c.Authentication.SupportsBasicAuth)\n\tfailedHandler = genericapifilters.WithFailedAuthenticationAudit(failedHandler, c.AuditBackend, c.AuditPolicyChecker)\n\t// 认证的handler\n\thandler = genericapifilters.WithAuthentication(handler, c.Authentication.Authenticator, failedHandler, 
c.Authentication.APIAudiences)\n\thandler = genericfilters.WithCORS(handler, c.CorsAllowedOriginList, nil, nil, nil, \"true\")\n\thandler = genericfilters.WithTimeoutForNonLongRunningRequests(handler, c.LongRunningFunc, c.RequestTimeout)\n\thandler = genericfilters.WithWaitGroup(handler, c.LongRunningFunc, c.HandlerChainWaitGroup)\n\thandler = genericapifilters.WithRequestInfo(handler, c.RequestInfoResolver)\n\thandler = genericfilters.WithPanicRecovery(handler)\n\treturn handler\n}\n```\n\n<br>\n\n#### 3.4 WithAuthentication\n\nWithAuthentication主要做了三件事：\n\n（1）调用AuthenticateRequest进行认证，实际调用的就是前面的unionAuthRequestHandler.AuthenticateRequest：遍历所有认证handler，只要有一个认证成功，就返回ok。\n\n（2）如果认证失败，调用failed.ServeHTTP(w, req)走认证失败的处理链。\n\n（3）如果认证成功，req.Header.Del(\"Authorization\")删除请求头中的Authorization（认证通过后不再需要该头部），并把用户信息写入请求的context，供后续的授权等handler使用。\n\n```\n// WithAuthentication creates an http handler that tries to authenticate the given request as a user, and then\n// stores any such user found onto the provided context for the request. If authentication fails or returns an error\n// the failed handler is used. 
On success, \"Authorization\" header is removed from the request and handler\n// is invoked to serve the request.\nfunc WithAuthentication(handler http.Handler, auth authenticator.Request, failed http.Handler, apiAuds authenticator.Audiences) http.Handler {\n   if auth == nil {\n      klog.Warningf(\"Authentication is disabled\")\n      return handler\n   }\n   return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {\n      authenticationStart := time.Now()\n\n      if len(apiAuds) > 0 {\n         req = req.WithContext(authenticator.WithAudiences(req.Context(), apiAuds))\n      }\n      // 这里调用了AuthenticateRequest进行认证\n      resp, ok, err := auth.AuthenticateRequest(req)\n      if err != nil || !ok {\n         if err != nil {\n            klog.Errorf(\"Unable to authenticate the request due to an error: %v\", err)\n            authenticatedAttemptsCounter.WithLabelValues(errorLabel).Inc()\n            authenticationLatency.WithLabelValues(errorLabel).Observe(time.Since(authenticationStart).Seconds())\n         } else if !ok {\n            authenticatedAttemptsCounter.WithLabelValues(failureLabel).Inc()\n            authenticationLatency.WithLabelValues(failureLabel).Observe(time.Since(authenticationStart).Seconds())\n         }\n\n         failed.ServeHTTP(w, req)\n         return\n      }\n\n      if len(apiAuds) > 0 && len(resp.Audiences) > 0 && len(authenticator.Audiences(apiAuds).Intersect(resp.Audiences)) == 0 {\n         klog.Errorf(\"Unable to match the audience: %v , accepted: %v\", resp.Audiences, apiAuds)\n         failed.ServeHTTP(w, req)\n         return\n      }\n\n      // authorization header is not required anymore in case of a successful authentication.\n      req.Header.Del(\"Authorization\")\n\n      req = req.WithContext(genericapirequest.WithUser(req.Context(), resp.User))\n\n      authenticatedUserCounter.WithLabelValues(compressUsername(resp.User.GetName())).Inc()\n      
authenticatedAttemptsCounter.WithLabelValues(successLabel).Inc()\n      authenticationLatency.WithLabelValues(successLabel).Observe(time.Since(authenticationStart).Seconds())\n\n      handler.ServeHTTP(w, req)\n   })\n}\n```\n\n### 4. 9种认证方式介绍\n\n#### 4.1 BasicAuth认证\n\nBasicAuth是HTTP协议上一种简单的认证机制：客户端把用户名、密码写入请求头，服务端从请求头中验证用户名、密码，从而完成身份验证。客户端发送的请求头示例如下：\n\n```\nAuthorization: Basic BASE64ENCODED(USER:PASSWORD)\n```\n\n请求头的key为Authorization，value为Basic BASE64ENCODED(USER:PASSWORD)，即经过Base64编码的“用户名:密码”字符串。\n\n<br>\n\n**启用BasicAuth认证:**\n\nkube-apiserver通过指定--basic-auth-file=AUTH_FILE参数启用BasicAuth认证。AUTH_FILE是一个CSV文件，每个用户在CSV中的表现形式为password,username,uid，文件内容示例如下：\n\n```\na0d175cf548f665938498,derk,1\n```\n\n<br>\n\n**认证函数:**\n\nstaging/src/k8s.io/apiserver/plugin/pkg/authenticator/request/basicauth/basicauth.go\n\n```\n// AuthenticateRequest authenticates the request using the \"Authorization: Basic\" header in the request\nfunc (a *Authenticator) AuthenticateRequest(req *http.Request) (*authenticator.Response, bool, error) {\n\tusername, password, found := req.BasicAuth()\n\tif !found {\n\t\treturn nil, false, nil\n\t}\n\n\tresp, ok, err := a.auth.AuthenticatePassword(req.Context(), username, password)\n\n\t// If the password authenticator didn't error, provide a default error\n\tif !ok && err == nil {\n\t\terr = errInvalidAuth\n\t}\n\n\treturn resp, ok, err\n}\n```\n\n<br>\n\n#### 4.2 ClientCA认证\n\nClientCA认证，也被称为TLS双向认证，即服务端与客户端互相验证证书的正确性。使用ClientCA认证时，只要是该CA签名过的证书都可以通过验证。\n\n**启用ClientCA认证:**\n\nkube-apiserver通过指定--client-ca-file参数启用ClientCA认证，这也是目前比较常用的方式。\n\nClientCA认证接口定义了AuthenticateRequest方法，该方法接收客户端请求。若验证失败，bool值会为false；若验证成功，bool值会为true，并返回*authenticator.Response，*authenticator.Response中携带了身份验证用户的信息，例如Name、UID、Groups、Extra等信息。\n\nstaging/src/k8s.io/apiserver/pkg/authentication/request/x509/x509.go\n\n```\n// AuthenticateRequest authenticates the request using presented client certificates\nfunc (a *Authenticator) AuthenticateRequest(req *http.Request) 
(*authenticator.Response, bool, error) {\n\tif req.TLS == nil || len(req.TLS.PeerCertificates) == 0 {\n\t\treturn nil, false, nil\n\t}\n\n\t// Use intermediates, if provided\n\toptsCopy, ok := a.verifyOptionsFn()\n\t// if there are intentionally no verify options, then we cannot authenticate this request\n\tif !ok {\n\t\treturn nil, false, nil\n\t}\n\tif optsCopy.Intermediates == nil && len(req.TLS.PeerCertificates) > 1 {\n\t\toptsCopy.Intermediates = x509.NewCertPool()\n\t\tfor _, intermediate := range req.TLS.PeerCertificates[1:] {\n\t\t\toptsCopy.Intermediates.AddCert(intermediate)\n\t\t}\n\t}\n\n\tremaining := req.TLS.PeerCertificates[0].NotAfter.Sub(time.Now())\n\tclientCertificateExpirationHistogram.Observe(remaining.Seconds())\n\tchains, err := req.TLS.PeerCertificates[0].Verify(optsCopy)\n\tif err != nil {\n\t\treturn nil, false, err\n\t}\n\n\tvar errlist []error\n\tfor _, chain := range chains {\n\t\tuser, ok, err := a.user.User(chain)\n\t\tif err != nil {\n\t\t\terrlist = append(errlist, err)\n\t\t\tcontinue\n\t\t}\n\n\t\tif ok {\n\t\t\treturn user, ok, err\n\t\t}\n\t}\n\treturn nil, false, utilerrors.NewAggregate(errlist)\n}\n```\n\n在进行ClientCA认证时，通过req.TLS.PeerCertifcates[0].Verify验证证书，如果是CA签名过的证书，都可以通过验证，认证失败会返回false，而认证成功会返回true。\n\n<br>\n\n#### 4.3 TokenAuth认证\n\nToken也被称为令牌，服务端为了验证客户端的身份，需要客户端向服务端提供一个可靠的验证信息，这个验证信息就是Token。TokenAuth是基于Token的认证，Token一般是一个字符串。\n\n**启用TokenAuth认证**\n\nkube-apiserver通过指定--token-auth-file参数启用TokenAuth认证。TOKEN_FILE是一个CSV文件，每个用户在CSV中的表现形式为token、user、userid、group，代码示例如下：\n\n```\na0d73844190894384102943,kubelet-bootstrap.1001,\"system:kubelet-bootstrap\"\n```\n\nToken认证接口定义了AuthenticateToken方法，该方法接收token字符串。若验证失败，bool值会为false；若验证成功，bool值会为true，并返回*authenticator.Response，*authenticator.Response中携带了身份验证用户的信息，例如Name、UID、Groups、Extra等信息。\n\n```\nfunc (a *TokenAuthenticator) AuthenticateToken(ctx context.Context, value string) (*authenticator.Response, bool, error) {\n\tuser, ok := a.tokens[value]\n\tif !ok {\n\t\treturn nil, 
false, nil\n\t}\n\treturn &authenticator.Response{User: user}, true, nil\n}\n```\n\n<br>\n\n#### 4.4 BootstrapToken认证\n\n当Kubernetes集群中有非常多的节点时，手动为每个节点配置TLS认证比较烦琐，为此Kubernetes提供了BootstrapToken认证，其也被称为引导Token。客户端的Token信息与服务端的Token相匹配，则认证通过，自动为节点颁发证书，这是一种引导Token的机制。客户端发送的请求头示例如下：\n\n```\nAuthorization: Bearer 07410b.f2355rejewrql\n```\n\n请求头的key为Authorization，value为Bearer<TOKENS>，其中TOKENS的表现形式为[a-z0-9]{6}.[a-z0-9]{16}。第一个组是Token ID，第二个组是TokenSecret。\n\n**启用BootstrapToken认证**\n\nkube-apiserver通过指定--enable-bootstrap-token-auth参数启用BootstrapToken认证。\n\n这个在安装kubelet的时候使用过。\n\nBootstrapToken认证接口定义了AuthenticateToken方法，该方法接收token字符串。若验证失败，bool值会为false；若验证成功，bool值会为true，并返回*authenticator.Response，*authenticator.Response中携带了身份验证用户的信息，例如Name、UID、Groups、Extra等信息。\n\nplugin/pkg/auth/authenticator/token/bootstrap/bootstrap.go\n\n```\nfunc (t *TokenAuthenticator) AuthenticateToken(ctx context.Context, token string) (*authenticator.Response, bool, error) {\n\ttokenID, tokenSecret, err := bootstraptokenutil.ParseToken(token)\n\tif err != nil {\n\t\t// Token isn't of the correct form, ignore it.\n\t\treturn nil, false, nil\n\t}\n\n\tsecretName := bootstrapapi.BootstrapTokenSecretPrefix + tokenID\n\tsecret, err := t.lister.Get(secretName)\n\tif err != nil {\n\t\tif errors.IsNotFound(err) {\n\t\t\tklog.V(3).Infof(\"No secret of name %s to match bootstrap bearer token\", secretName)\n\t\t\treturn nil, false, nil\n\t\t}\n\t\treturn nil, false, err\n\t}\n\n\tif secret.DeletionTimestamp != nil {\n\t\ttokenErrorf(secret, \"is deleted and awaiting removal\")\n\t\treturn nil, false, nil\n\t}\n\n\tif string(secret.Type) != string(bootstrapapi.SecretTypeBootstrapToken) || secret.Data == nil {\n\t\ttokenErrorf(secret, \"has invalid type, expected %s.\", bootstrapapi.SecretTypeBootstrapToken)\n\t\treturn nil, false, nil\n\t}\n\n\tts := bootstrapsecretutil.GetData(secret, bootstrapapi.BootstrapTokenSecretKey)\n\tif subtle.ConstantTimeCompare([]byte(ts), []byte(tokenSecret)) != 1 
{\n\t\ttokenErrorf(secret, \"has invalid value for key %s, expected %s.\", bootstrapapi.BootstrapTokenSecretKey, tokenSecret)\n\t\treturn nil, false, nil\n\t}\n\n\tid := bootstrapsecretutil.GetData(secret, bootstrapapi.BootstrapTokenIDKey)\n\tif id != tokenID {\n\t\ttokenErrorf(secret, \"has invalid value for key %s, expected %s.\", bootstrapapi.BootstrapTokenIDKey, tokenID)\n\t\treturn nil, false, nil\n\t}\n\n\tif bootstrapsecretutil.HasExpired(secret, time.Now()) {\n\t\t// logging done in isSecretExpired method.\n\t\treturn nil, false, nil\n\t}\n\n\tif bootstrapsecretutil.GetData(secret, bootstrapapi.BootstrapTokenUsageAuthentication) != \"true\" {\n\t\ttokenErrorf(secret, \"not marked %s=true.\", bootstrapapi.BootstrapTokenUsageAuthentication)\n\t\treturn nil, false, nil\n\t}\n\n\tgroups, err := bootstrapsecretutil.GetGroups(secret)\n\tif err != nil {\n\t\ttokenErrorf(secret, \"has invalid value for key %s: %v.\", bootstrapapi.BootstrapTokenExtraGroupsKey, err)\n\t\treturn nil, false, nil\n\t}\n\n\treturn &authenticator.Response{\n\t\tUser: &user.DefaultInfo{\n\t\t\tName:   bootstrapapi.BootstrapUserPrefix + string(id),\n\t\t\tGroups: groups,\n\t\t},\n\t}, true, nil\n}\n\n```\n\n在进行BootstrapToken认证时，通过paseToken函数解析出Token ID和TokenSecret，验证Token Secret中的Expire（过期）、Data、Type等，认证失败会返回false，而认证成功会返回true。\n\n#### 4.5  RequestHeader认证\n\nKubernetes可以设置一个认证代理，客户端发送的认证请求可以通过认证代理将验证信息发送给kube-apiserver组件。RequestHeader认证使用的就是这种代理方式，它使用请求头将用户名和组信息发送给kube-apiserver。\n\nRequestHeader认证有几个列表，分别介绍如下。\n\n● 用户名列表。建议使用X-Remote-User，如果启用RequestHeader认证，该参数必选。\n\n● 组列表。建议使用X-Remote-Group，如果启用RequestHeader认证，该参数可选。\n\n● 额外列表。建议使用X-Remote-Extra-，如果启用RequestHeader认证，该参数可选。\n\n当客户端发送认证请求时，kube-apiserver根据Header 
Values中的用户名列表来识别用户，例如返回X-Remote-User：Bob则表示验证成功。\n\n**启用RequestHeader认证**\n\nkube-apiserver通过指定如下参数启用RequestHeader认证。\n\n●--requestheader-client-ca-file：指定有效的客户端CA证书。\n\n●--requestheader-allowed-names：指定通用名称（CommonName）。\n\n●--requestheader-extra-headers-prefix：指定额外列表。\n\n●--requestheader-group-headers：指定组列表。\n\n●--requestheader-username-headers：指定用户名列表。\n\nkube-apiserver收到客户端验证请求后，会先通过--requestheader-client-ca-file参数对客户端证书进行验证。\n\n--requestheader-username-headers参数指定了Header中包含的用户名，这一参数中的列表确定了有效的用户名列表，如果该列表为空，则所有通过--requestheader-client-ca-file参数校验的请求都允许通过。\n\n```\nfunc (a *requestHeaderAuthRequestHandler) AuthenticateRequest(req *http.Request) (*authenticator.Response, bool, error) {\n\tname := headerValue(req.Header, a.nameHeaders.Value())\n\tif len(name) == 0 {\n\t\treturn nil, false, nil\n\t}\n\tgroups := allHeaderValues(req.Header, a.groupHeaders.Value())\n\textra := newExtra(req.Header, a.extraHeaderPrefixes.Value())\n\n\t// clear headers used for authentication\n\tfor _, headerName := range a.nameHeaders.Value() {\n\t\treq.Header.Del(headerName)\n\t}\n\tfor _, headerName := range a.groupHeaders.Value() {\n\t\treq.Header.Del(headerName)\n\t}\n\tfor k := range extra {\n\t\tfor _, prefix := range a.extraHeaderPrefixes.Value() {\n\t\t\treq.Header.Del(prefix + k)\n\t\t}\n\t}\n\n\treturn &authenticator.Response{\n\t\tUser: &user.DefaultInfo{\n\t\t\tName:   name,\n\t\t\tGroups: groups,\n\t\t\tExtra:  extra,\n\t\t},\n\t}, true, nil\n}\n```\n\n在进行RequestHeader认证时，通过headerValue函数从请求头中读取所有的用户信息，通过allHeaderValues函数读取所有组的信息，通过newExtra函数读取所有额外的信息。当用户名无法匹配时，则认证失败返回false，反之则认证成功返回true。\n\n<br>\n\n#### 4.6 
WebhookTokenAuth认证\n\nWebhook也被称为钩子，是一种基于HTTP协议的回调机制：当客户端发送的认证请求到达kube-apiserver时，kube-apiserver回调钩子方法，将验证信息发送给远程的Webhook服务器进行认证，然后根据Webhook服务器返回的状态码来判断是否认证成功。\n\n**启用WebhookTokenAuth认证**\n\nkube-apiserver通过指定如下参数启用WebhookTokenAuth认证。\n\n●--authentication-token-webhook-config-file：Webhook配置文件，描述了如何访问远程Webhook服务。\n\n●--authentication-token-webhook-cache-ttl：缓存认证结果的时间，默认值为2分钟。\n\n<br>\n\nWebhookTokenAuth认证接口定义了AuthenticateToken方法，该方法接收token字符串。若验证失败，bool值会为false；若验证成功，bool值会为true，并返回*authenticator.Response，*authenticator.Response中携带了身份验证用户的信息，例如Name、UID、Groups、Extra等信息。\n\n<br>\n\n#### 4.7 Anonymous认证\n\nAnonymous认证就是匿名认证，未被其他认证器拒绝的请求都可视为匿名请求。kube-apiserver默认开启Anonymous（匿名）认证。\n\n**启用Anonymous认证**\n\nkube-apiserver通过指定--anonymous-auth参数启用Anonymous认证，该参数默认值为true。\n\nAnonymous认证接口定义了AuthenticateRequest方法，该方法接收客户端请求。若验证失败，bool值会为false；若验证成功，bool值会为true，并返回*authenticator.Response，*authenticator.Response中携带了身份验证用户的信息，例如Name、UID、Groups、Extra等信息。\n\n在进行Anonymous认证时，直接验证成功，返回true。\n\n#### 4.8 OIDC认证\n\nOIDC（OpenID Connect）是一套基于OAuth 2.0协议的轻量级认证规范，其提供了通过API进行身份交互的框架。OIDC认证除了认证请求外，还会标明请求的用户身份（ID Token）。该Token被称为ID Token，是一个JSON Web Token（JWT），带有由服务器签名的相关字段。\n\nOIDC认证流程介绍如下：\n\n（1）Kubernetes用户先通过认证服务（Auth Server，例如Google Accounts服务）认证自己，得到access_token、id_token和refresh_token。\n\n（2）用户把access_token、id_token和refresh_token配置到客户端应用程序（如kubectl或dashboard工具等）中。\n\n（3）Kubernetes客户端使用Token以用户的身份访问Kubernetes API Server。Kubernetes API Server和Auth Server并没有直接交互，只是鉴定客户端发送的Token是否为合法Token。\n\n**启用OIDC认证**\n\nkube-apiserver通过指定如下参数启用OIDC认证。\n\n●--oidc-ca-file：签署身份提供商的CA证书的路径，默认值为主机的根CA证书的路径（即/etc/kubernetes/ssl/kc-ca.pem）。\n\n●--oidc-client-id：颁发所有Token的Client ID。\n\n●--oidc-groups-claim：JWT（JSON Web Token）声明的用户组名称。\n\n●--oidc-groups-prefix：组名前缀，所有组都将以此值为前缀，以避免与其他身份验证策略发生冲突。\n\n●--oidc-issuer-url：Auth Server服务的URL地址，例如Google Accounts服务。\n\n●--oidc-required-claim：该参数是键值对，用于描述ID Token中的必要声明。如果设置该参数，则验证声明是否以匹配值存在于ID 
Token中。重复指定该参数可以设置多个声明。\n\n●--oidc-signing-algs：JOSE非对称签名算法列表，算法以逗号分隔。如果JWT头部的alg不在此列表中，请求会被拒绝（默认值为[RS256]）。\n\n●--oidc-username-claim：JWT（JSON Web Token）声明的用户名称（默认值为sub）。\n\n●--oidc-username-prefix：用户名前缀，所有用户名都将以此值为前缀，以避免与其他身份验证策略发生冲突。如果要跳过任何前缀，请设置该参数值为-。\n\n#### 4.9 ServiceAccountAuth认证\n\nServiceAccountAuth是一种特殊的认证机制：其他认证机制面向从Kubernetes集群外部访问kube-apiserver的客户端，而ServiceAccountAuth面向从Pod内部访问kube-apiserver的进程，为运行在Pod中的进程提供必要的身份证明，从而获取集群的信息。它通过Kubernetes的Service Account资源实现。\n\n具体使用就是在创建Pod的时候，指定其使用的ServiceAccount。\n\n#### 4.10 总结\n\n这一部分基本摘抄自《Kubernetes源码剖析》一书，目的是先了解具体有哪些认证方式，以后有需要再深入。\n\n<br>\n\n### 5. 参考链接\n\nhttps://www.jianshu.com/p/daa4ff387a78\n\n书籍：《Kubernetes源码剖析》，郑东旭\n"
  },
  {
    "path": "k8s/kube-apiserver/13-k8s之Authorization.md",
    "content": "Table of Contents\n=================\n\n  * [1. Authorization简介](#1-authorization简介)\n  * [2. 6种授权机制](#2-6种授权机制)\n     * [2.1 AlwaysAllow](#21-alwaysallow)\n     * [2.2 AlwaysDeny授权](#22-alwaysdeny授权)\n     * [2.3 ABAC授权](#23-abac授权)\n     * [2.4 Webhook授权](#24-webhook授权)\n     * [2.5 RBAC授权](#25-rbac授权)\n     * [2.6 Node授权](#26-node授权)\n  * [3. 总结](#3-总结)\n  * [4. 参考](#4-参考)\n\nkube-apiserver中与权限相关的主要有三种机制，即认证、鉴权和准入控制。这里主要记录鉴权相关的笔记。\n\n### 1. Authorization简介\n\n客户端请求到了apiserver端后，首先是认证，然后就是授权。apiserver同样也支持多种授权机制，并支持同时开启多个授权功能，如果开启多个授权功能，则按照顺序执行授权器，在前面的授权器具有更高的优先级来允许或拒绝请求。客户端发起一个请求，在经过授权阶段后，**只要有一个授权器通过则授权成功**。\n\n<br>\n\nkube-apiserver目前提供了6种授权机制，分别是AlwaysAllow、AlwaysDeny、ABAC、Webhook、RBAC、Node。\n\n可通过kube-apiserver启动参数--authorization-mode参数设置授权机制。\n\n目前比较常用的就是RBAC和webhook。    --authorization-mode=RBAC,Webhook\n\n<br>\n\n### 2. 6种授权机制\n\n#### 2.1 AlwaysAllow\n\n在进行AlwaysAllow授权时，直接授权成功，返回DecisionAllow决策状态。另外，AlwaysAllow的规则解析器会将资源类型的规则列表（ResourceRuleInfo）和非资源类型的规则列表（NonResourceRuleInfo）都设置为通配符（*）匹配所有资源版本、资源及资源操作方法。代码示例如下：\n\n```\nstaging/src/k8s.io/apiserver/pkg/authorization/authorizerfactory/builtin.go\nfunc (alwaysAllowAuthorizer) RulesFor(user user.Info, namespace string) ([]authorizer.ResourceRuleInfo, []authorizer.NonResourceRuleInfo, bool, error) {\n\treturn []authorizer.ResourceRuleInfo{\n\t\t\t&authorizer.DefaultResourceRuleInfo{\n\t\t\t\tVerbs:     []string{\"*\"},\n\t\t\t\tAPIGroups: []string{\"*\"},\n\t\t\t\tResources: []string{\"*\"},\n\t\t\t},\n\t\t}, []authorizer.NonResourceRuleInfo{\n\t\t\t&authorizer.DefaultNonResourceRuleInfo{\n\t\t\t\tVerbs:           []string{\"*\"},\n\t\t\t\tNonResourceURLs: []string{\"*\"},\n\t\t\t},\n\t\t}, false, nil\n}\n```\n\n#### 2.2 AlwaysDeny授权\n\nAlwaysDeny授权器会阻止所有请求，该授权器很少单独使用，一般会结合其他授权器一起使用。它的应用场景是先拒绝所有请求，再允许授权过的用户请求。\n\n--authorization-mode=AlwaysDeny,Webhook 
。这样就做到了只允许通过Webhook授权的用户访问。\n\n<br>\n\n在进行AlwaysDeny授权时，直接返回DecisionNoOpinion决策状态。如果存在下一个授权器，会继续执行下一个授权器；如果不存在下一个授权器，则会拒绝所有请求。这就是kube-apiserver使用AlwaysDeny的应用场景。另外，AlwaysDeny的规则解析器会将资源类型的规则列表（ResourceRuleInfo）和非资源类型的规则列表（NonResourceRuleInfo）都设置为空，代码示例如下：\n\n```\nstaging/src/k8s.io/apiserver/pkg/authorization/authorizerfactory/builtin.go\nfunc (alwaysDenyAuthorizer) Authorize(ctx context.Context, a authorizer.Attributes) (decision authorizer.Decision, reason string, err error) {\n\treturn authorizer.DecisionNoOpinion, \"Everything is forbidden.\", nil\n}\n\n\nfunc (alwaysDenyAuthorizer) RulesFor(user user.Info, namespace string) ([]authorizer.ResourceRuleInfo, []authorizer.NonResourceRuleInfo, bool, error) {\n\treturn []authorizer.ResourceRuleInfo{}, []authorizer.NonResourceRuleInfo{}, false, nil\n}\n```\n\n#### 2.3 ABAC授权\n\nABAC（Attribute-Based Access Control，基于属性的访问控制）通过把多个属性组合成策略，向用户授予操作权限。\n\nkube-apiserver通过指定如下参数启用ABAC授权。\n\n●--authorization-mode=ABAC：启用ABAC授权器。\n\n●--authorization-policy-file：基于ABAC模式，指定策略文件。该文件使用JSON格式描述，每一行都是一个策略对象。例如，Alice可以对somenamespace下的所有资源做任何事情：\n\n```\n{\"apiVersion\": \"abac.authorization.kubernetes.io/v1beta1\", \"kind\": \"Policy\", \"spec\": {\"user\": \"alice\", \"namespace\": \"somenamespace\", \"resource\": \"*\", \"apiGroup\": \"*\"}}\n```\n\n**参考**：https://kubernetes.io/zh/docs/reference/access-authn-authz/abac/\n\n在进行ABAC授权时，遍历所有的策略，通过matches函数进行匹配，如果匹配成功，返回DecisionAllow决策状态。另外，ABAC的规则解析器会根据每一个策略，将资源类型的规则列表（ResourceRuleInfo）和非资源类型的规则列表（NonResourceRuleInfo）都设置为该用户有权限操作的资源版本、资源及资源操作方法。代码示例如下：\n\n```\npkg/auth/authorizer/abac/abac.go\n// Authorize implements authorizer.Authorize\nfunc (pl PolicyList) Authorize(ctx context.Context, a authorizer.Attributes) (authorizer.Decision, string, error) {\n\tfor _, p := range pl {\n\t\tif matches(*p, a) {\n\t\t\treturn authorizer.DecisionAllow, \"\", nil\n\t\t}\n\t}\n\treturn authorizer.DecisionNoOpinion, \"No policy matched.\", nil\n\t// TODO: 
Benchmark how much time policy matching takes with a medium size\n\t// policy file, compared to other steps such as encoding/decoding.\n\t// Then, add Caching only if needed.\n}\n\n// RulesFor returns rules for the given user and namespace.\nfunc (pl PolicyList) RulesFor(user user.Info, namespace string) ([]authorizer.ResourceRuleInfo, []authorizer.NonResourceRuleInfo, bool, error) {\n\tvar (\n\t\tresourceRules    []authorizer.ResourceRuleInfo\n\t\tnonResourceRules []authorizer.NonResourceRuleInfo\n\t)\n\n\tfor _, p := range pl {\n\t\tif subjectMatches(*p, user) {\n\t\t\tif p.Spec.Namespace == \"*\" || p.Spec.Namespace == namespace {\n\t\t\t\tif len(p.Spec.Resource) > 0 {\n\t\t\t\t\tr := authorizer.DefaultResourceRuleInfo{\n\t\t\t\t\t\tVerbs:     getVerbs(p.Spec.Readonly),\n\t\t\t\t\t\tAPIGroups: []string{p.Spec.APIGroup},\n\t\t\t\t\t\tResources: []string{p.Spec.Resource},\n\t\t\t\t\t}\n\t\t\t\t\tvar resourceRule authorizer.ResourceRuleInfo = &r\n\t\t\t\t\tresourceRules = append(resourceRules, resourceRule)\n\t\t\t\t}\n\t\t\t\tif len(p.Spec.NonResourcePath) > 0 {\n\t\t\t\t\tr := authorizer.DefaultNonResourceRuleInfo{\n\t\t\t\t\t\tVerbs:           getVerbs(p.Spec.Readonly),\n\t\t\t\t\t\tNonResourceURLs: []string{p.Spec.NonResourcePath},\n\t\t\t\t\t}\n\t\t\t\t\tvar nonResourceRule authorizer.NonResourceRuleInfo = &r\n\t\t\t\t\tnonResourceRules = append(nonResourceRules, nonResourceRule)\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\treturn resourceRules, nonResourceRules, false, nil\n}\n```\n\n**缺点：** 每次更新的时候，需要更新文件，并且重启kube-apiserver。\n\n#### 2.4 Webhook授权\n\nWebhook授权器拥有基于HTTP协议回调的机制，当用户授权时，kube-apiserver组件会查询外部的Webhook服务。该过程与WebhookTokenAuth认证相似，但其中确认用户身份的机制不一样。当客户端发送的认证请求到达kube-apiserver时，kube-apiserver回调钩子方法，将授权信息发送给远程的Webhook服务器进行认证，根据Webhook服务器返回的状态来判断是否授权成功。\n\n<br>\n\nkube-apiserver通过指定如下参数启用Webhook授权。\n\n●--authorization-mode=Webhook：启用Webhook授权器。\n\n●--authorization-webhook-config-file：使用kubeconfig格式的Webhook配置文件。Webhook授权器配置文件定义如下：\n\n```\n# Kubernetes API 
版本\napiVersion: v1\n# API 对象种类\nkind: Config\n# clusters 代表远程服务。\nclusters:\n  - name: name-of-remote-authz-service\n    cluster:\n      # 对远程服务进行身份认证的 CA。\n      certificate-authority: /path/to/ca.pem\n      # 远程服务的查询 URL。必须使用 'https'。\n      server: https://authz.example.com/authorize\n\n# users 代表 API 服务器的 webhook 配置\nusers:\n  - name: name-of-api-server\n    user:\n      client-certificate: /path/to/cert.pem # webhook plugin 使用 cert\n      client-key: /path/to/key.pem          # cert 所对应的 key\n\n# kubeconfig 文件必须有 context。需要提供一个给 API 服务器。\ncurrent-context: webhook\ncontexts:\n- context:\n    cluster: name-of-remote-authz-service\n    user: name-of-api-server\n  name: webhook\n```\n\n如上配置，文件使用kubeconfig格式。在该配置文件中，users指的是kube-apiserver本身，clusters指的是远程Webhook服务。\n\n**参考：**https://kubernetes.io/zh/docs/reference/access-authn-authz/webhook/\n\n<br>\n\n在进行Webhook授权时，首先通过w.responseCache.Get函数从缓存中查找是否已有缓存的授权，如果有则直接使用该状态\n\n（Status），如果没有则通过w.subjectAccessReview.Create（RESTClient）从远程的Webhook服务器获取授权验证，该函数发送Post\n\n请求，并在请求体（Body）中携带授权信息。在验证Webhook服务器授权之后，返回的Status.Allowed字段为true，表示授权成功并返\n\n回DecisionAllow决策状态。另外，Webhook的规则解析器不支持规则列表解析，因为规则是由远程的Webhook服务端进行授权的。所\n\n以Webhook的规则解析器的资源类型的规则列表（ResourceRuleInfo）和非资源类型的规则列表（NonResourceRuleInfo）都会被设置\n\n为空。代码示例如下：\n\n```\nstaging/src/k8s.io/apiserver/plugin/pkg/authorizer/webhook/webhook.go\n// Authorize makes a REST request to the remote service describing the attempted action as a JSON\n// serialized api.authorization.v1beta1.SubjectAccessReview object. 
An example request body is\n// provided below.\n//\n//     {\n//       \"apiVersion\": \"authorization.k8s.io/v1beta1\",\n//       \"kind\": \"SubjectAccessReview\",\n//       \"spec\": {\n//         \"resourceAttributes\": {\n//           \"namespace\": \"kittensandponies\",\n//           \"verb\": \"GET\",\n//           \"group\": \"group3\",\n//           \"resource\": \"pods\"\n//         },\n//         \"user\": \"jane\",\n//         \"group\": [\n//           \"group1\",\n//           \"group2\"\n//         ]\n//       }\n//     }\n//\n// The remote service is expected to fill the SubjectAccessReviewStatus field to either allow or\n// disallow access. A permissive response would return:\n//\n//     {\n//       \"apiVersion\": \"authorization.k8s.io/v1beta1\",\n//       \"kind\": \"SubjectAccessReview\",\n//       \"status\": {\n//         \"allowed\": true\n//       }\n//     }\n//\n// To disallow access, the remote service would return:\n//\n//     {\n//       \"apiVersion\": \"authorization.k8s.io/v1beta1\",\n//       \"kind\": \"SubjectAccessReview\",\n//       \"status\": {\n//         \"allowed\": false,\n//         \"reason\": \"user does not have read access to the namespace\"\n//       }\n//     }\n//\n// TODO(mikedanese): We should eventually support failing closed when we\n// encounter an error. 
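\n// （笔记）fail open指：鉴权请求出错时不直接拒绝，而是返回w.decisionOnError交由授权链上的下一个授权器决定；fail closed则相反。\n// 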
We are failing open now to preserve backwards compatible\n// behavior.\nfunc (w *WebhookAuthorizer) Authorize(ctx context.Context, attr authorizer.Attributes) (decision authorizer.Decision, reason string, err error) {\n\tr := &authorizationv1.SubjectAccessReview{}\n\tif user := attr.GetUser(); user != nil {\n\t\tr.Spec = authorizationv1.SubjectAccessReviewSpec{\n\t\t\tUser:   user.GetName(),\n\t\t\tUID:    user.GetUID(),\n\t\t\tGroups: user.GetGroups(),\n\t\t\tExtra:  convertToSARExtra(user.GetExtra()),\n\t\t}\n\t}\n\n\tif attr.IsResourceRequest() {\n\t\tr.Spec.ResourceAttributes = &authorizationv1.ResourceAttributes{\n\t\t\tNamespace:   attr.GetNamespace(),\n\t\t\tVerb:        attr.GetVerb(),\n\t\t\tGroup:       attr.GetAPIGroup(),\n\t\t\tVersion:     attr.GetAPIVersion(),\n\t\t\tResource:    attr.GetResource(),\n\t\t\tSubresource: attr.GetSubresource(),\n\t\t\tName:        attr.GetName(),\n\t\t}\n\t} else {\n\t\tr.Spec.NonResourceAttributes = &authorizationv1.NonResourceAttributes{\n\t\t\tPath: attr.GetPath(),\n\t\t\tVerb: attr.GetVerb(),\n\t\t}\n\t}\n\tkey, err := json.Marshal(r.Spec)\n\tif err != nil {\n\t\treturn w.decisionOnError, \"\", err\n\t}\n\t// 先使用缓存\n\tif entry, ok := w.responseCache.Get(string(key)); ok {\n\t\tr.Status = entry.(authorizationv1.SubjectAccessReviewStatus)\n\t} else {\n\t\tvar (\n\t\t\tresult *authorizationv1.SubjectAccessReview\n\t\t\terr    error\n\t\t)\n\t\twebhook.WithExponentialBackoff(ctx, w.initialBackoff, func() error {\n\t\t\tresult, err = w.subjectAccessReview.CreateContext(ctx, r)\n\t\t\treturn err\n\t\t}, webhook.DefaultShouldRetry)\n\t\tif err != nil {\n\t\t\t// An error here indicates bad configuration or an outage. 
Log for debugging.\n\t\t\tklog.Errorf(\"Failed to make webhook authorizer request: %v\", err)\n\t\t\treturn w.decisionOnError, \"\", err\n\t\t}\n\t\tr.Status = result.Status\n\t\tif shouldCache(attr) {\n\t\t\tif r.Status.Allowed {\n\t\t\t\tw.responseCache.Add(string(key), r.Status, w.authorizedTTL)\n\t\t\t} else {\n\t\t\t\tw.responseCache.Add(string(key), r.Status, w.unauthorizedTTL)\n\t\t\t}\n\t\t}\n\t}\n\tswitch {\n\tcase r.Status.Denied && r.Status.Allowed:\n\t\treturn authorizer.DecisionDeny, r.Status.Reason, fmt.Errorf(\"webhook subject access review returned both allow and deny response\")\n\tcase r.Status.Denied:\n\t\treturn authorizer.DecisionDeny, r.Status.Reason, nil\n\tcase r.Status.Allowed:\n\t\treturn authorizer.DecisionAllow, r.Status.Reason, nil\n\tdefault:\n\t\treturn authorizer.DecisionNoOpinion, r.Status.Reason, nil\n\t}\n\n}\n\n//TODO: need to finish the method to get the rules when using webhook mode\nfunc (w *WebhookAuthorizer) RulesFor(user user.Info, namespace string) ([]authorizer.ResourceRuleInfo, []authorizer.NonResourceRuleInfo, bool, error) {\n\tvar (\n\t\tresourceRules    []authorizer.ResourceRuleInfo\n\t\tnonResourceRules []authorizer.NonResourceRuleInfo\n\t)\n\tincomplete := true\n\treturn resourceRules, nonResourceRules, incomplete, fmt.Errorf(\"webhook authorizer does not support user rule resolution\")\n}\n```\n\n<br>\n\n#### 2.5 RBAC授权\n\nRBAC授权器实现了基于角色的权限访问控制（Role-Based Access Control），也是目前使用最为广泛的授权模型。在RBAC授权器中，权限与角色相关联，形成了用户—角色—权限的授权模型，用户通过加入某些角色从而得到这些角色的操作权限，这极大地简化了权限管理。\n\n在kube-apiserver设计的RBAC授权器中，新增了角色与集群绑定的概念。kube-apiserver提供4种数据类型来表达基于角色的授权：角色（Role)、集群角色（ClusterRole）、角色绑定（RoleBinding）及集群角色绑定（ClusterRoleBinding），这4种数据类型定义在vendor/k8s.io/api/rbac/v1/types.go中。\n\nRole <-> RoleBinding：角色是一组规则（PolicyRule）的集合，只能被授予某一个命名空间内的权限。\n\nClusterRole <-> ClusterRoleBinding：集群角色同样是一组规则的集合，通过集群角色绑定授予一组用户。集群角色能够被授予集群范围的权限，例如节点、非资源类型的端点（如/healthz）、跨所有命名空间的资源等。\n\n<br>\n\n```\n// Role is a namespaced, logical grouping of PolicyRules that can be 
referenced as a unit by a RoleBinding.\ntype Role struct {\n\tmetav1.TypeMeta `json:\",inline\"`\n\t// Standard object's metadata.\n\t// +optional\n\tmetav1.ObjectMeta `json:\"metadata,omitempty\" protobuf:\"bytes,1,opt,name=metadata\"`\n\n\t// Rules holds all the PolicyRules for this Role\n\t// +optional\n\tRules []PolicyRule `json:\"rules\" protobuf:\"bytes,2,rep,name=rules\"`\n}\n\n// +genclient\n// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object\n\n// RoleBinding references a role, but does not contain it.  It can reference a Role in the same namespace or a ClusterRole in the global namespace.\n// It adds who information via Subjects and namespace information by which namespace it exists in.  RoleBindings in a given\n// namespace only have effect in that namespace.\ntype RoleBinding struct {\n\tmetav1.TypeMeta `json:\",inline\"`\n\t// Standard object's metadata.\n\t// +optional\n\tmetav1.ObjectMeta `json:\"metadata,omitempty\" protobuf:\"bytes,1,opt,name=metadata\"`\n\n\t// Subjects holds references to the objects the role applies to.\n\t// +optional\n\tSubjects []Subject `json:\"subjects,omitempty\" protobuf:\"bytes,2,rep,name=subjects\"`\n\n\t// RoleRef can reference a Role in the current namespace or a ClusterRole in the global namespace.\n\t// If the RoleRef cannot be resolved, the Authorizer must return an error.\n\tRoleRef RoleRef `json:\"roleRef\" protobuf:\"bytes,3,opt,name=roleRef\"`\n}\n```\n\n**Role**相当于定义了一组规则列表。举个例子，下面定义了一个名为 haimaxy-role 的角色，它可以对 \"\"（core）、extensions、apps 这几个API组下面的deployments、replicasets、pods执行 \"get\", \"list\", \"watch\", \"create\", \"update\", \"patch\", \"delete\" 操作。\n\n```\napiVersion: rbac.authorization.k8s.io/v1\nkind: Role\nmetadata:\n  name: haimaxy-role\n  namespace: kube-system\nrules:\n- apiGroups: [\"\", \"extensions\", \"apps\"]\n  resources: [\"deployments\", \"replicasets\", \"pods\"]\n  verbs: [\"get\", \"list\", \"watch\", \"create\", \"update\", \"patch\", \"delete\"] # 
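\n# （笔记）apiGroups中的\"\"表示core API组；verbs、resources、apiGroups\n# 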
也可以使用['*']\n```\n\n<br>\n\n而RoleBinding就是将某个用户和角色进行绑定。\n\n这里就是将 User haimaxy 绑定到上面的 haimaxy-role。这样haimaxy这个用户，就可以对 \"\"（core）、extensions、apps 这几个组下面的 deployments、replicasets、pods 执行 \"get\", \"list\", \"watch\", \"create\", \"update\", \"patch\", \"delete\" 操作。\n\n```\napiVersion: rbac.authorization.k8s.io/v1\nkind: RoleBinding\nmetadata:\n  name: haimaxy-rolebinding\n  namespace: kube-system\nsubjects:\n- kind: User\n  name: haimaxy\n  apiGroup: \"\"\nroleRef:\n  kind: Role\n  name: haimaxy-role\n  apiGroup: \"\"\n```\n\n更多的使用教程，可以参考这个博客，手动创建一波就清楚了：https://www.qikqiak.com/post/use-rbac-in-k8s/\n\n<br>\n\nrbac的鉴权流程如下:\n\n1. 通过`Request`获取`Attribute`，包括用户、资源和对应的操作\n2. `Authorize`调用`VisitRulesFor`进行具体的鉴权\n3. 获取所有的ClusterRoleBindings，并对其进行遍历操作\n4. 根据请求User信息，判断该用户是否被绑定在该ClusterRoleBinding中\n5. 若在，则通过函数`GetRoleReferenceRules()`获取绑定的Role所控制访问的资源\n6. 将Role所控制访问的资源，与从API请求中提取出的资源进行比对，若比对成功，即为API请求的调用者有权访问相关资源\n7. 若遍历完所有ClusterRoleBindings都没有鉴权成功，将会判断提取出的信息中是否包括namespace信息；若包括，则获取该namespace下的所有RoleBindings，进行与ClusterRoleBindings类似的比对\n8. 若在遍历了所有ClusterRoleBindings，及该namespace下的所有RoleBindings之后，仍没有对资源比对成功，则可判断该API请求的调用者没有权限访问相关资源，鉴权失败\n\n这里没有细看，参考了文章：https://qingwave.github.io/kube-apiserver-authorization-code/\n\n<br>\n\n#### 2.6 Node授权\n\nNode授权器也被称为节点授权，是一种特殊用途的授权机制，专门授权由kubelet组件发出的API请求。\n\nNode授权器基于RBAC授权机制实现，对kubelet组件进行基于system:node内置角色的权限控制。\n\nsystem:node内置角色的权限定义在NodeRules函数中，它拥有许多资源的操作权限，例如Configmap、Secret、Service、Pod等资源。例如下面的代码中，就针对Pod资源定义了get、list、watch、create、delete等操作权限。代码示例如下：\n\n```\nconst (\n\tlegacyGroup         = \"\"\n\tappsGroup           = \"apps\"\n\tauthenticationGroup = \"authentication.k8s.io\"\n\tauthorizationGroup  = \"authorization.k8s.io\"\n\tautoscalingGroup    = \"autoscaling\"\n\tbatchGroup          = \"batch\"\n\tcertificatesGroup   = \"certificates.k8s.io\"\n\tcoordinationGroup   = \"coordination.k8s.io\"\n\tdiscoveryGroup      = \"discovery.k8s.io\"\n\textensionsGroup     = \"extensions\"\n\tpolicyGroup         = \"policy\"\n\trbacGroup           = \"rbac.authorization.k8s.io\"\n\tstorageGroup        = \"storage.k8s.io\"\n\tresMetricsGroup     = \"metrics.k8s.io\"\n\tcustomMetricsGroup  = \"custom.metrics.k8s.io\"\n\tnetworkingGroup     = \"networking.k8s.io\"\n\teventsGroup         = \"events.k8s.io\"\n)\n\nfunc NodeRules() []rbacv1.PolicyRule {\n\tnodePolicyRules := []rbacv1.PolicyRule{\n\t\t// Needed to check API access.  
These creates are non-mutating\n\t\trbacv1helpers.NewRule(\"create\").Groups(authenticationGroup).Resources(\"tokenreviews\").RuleOrDie(),\n\t\trbacv1helpers.NewRule(\"create\").Groups(authorizationGroup).Resources(\"subjectaccessreviews\", \"localsubjectaccessreviews\").RuleOrDie(),\n\n\t\t// Needed to build serviceLister, to populate env vars for services\n\t\trbacv1helpers.NewRule(Read...).Groups(legacyGroup).Resources(\"services\").RuleOrDie(),\n\n\t\t// Nodes can register Node API objects and report status.\n\t\t// Use the NodeRestriction admission plugin to limit a node to creating/updating its own API object.\n\t\trbacv1helpers.NewRule(\"create\", \"get\", \"list\", \"watch\").Groups(legacyGroup).Resources(\"nodes\").RuleOrDie(),\n\t\trbacv1helpers.NewRule(\"update\", \"patch\").Groups(legacyGroup).Resources(\"nodes/status\").RuleOrDie(),\n\t\trbacv1helpers.NewRule(\"update\", \"patch\").Groups(legacyGroup).Resources(\"nodes\").RuleOrDie(),\n\n\t\t// TODO: restrict to the bound node as creator in the NodeRestrictions admission plugin\n\t\trbacv1helpers.NewRule(\"create\", \"update\", \"patch\").Groups(legacyGroup).Resources(\"events\").RuleOrDie(),\n\n\t\t// TODO: restrict to pods scheduled on the bound node once field selectors are supported by list/watch authorization\n\t\trbacv1helpers.NewRule(Read...).Groups(legacyGroup).Resources(\"pods\").RuleOrDie(),\n\n\t\t// Needed for the node to create/delete mirror pods.\n\t\t// Use the NodeRestriction admission plugin to limit a node to creating/deleting mirror pods bound to itself.\n\t\trbacv1helpers.NewRule(\"create\", \"delete\").Groups(legacyGroup).Resources(\"pods\").RuleOrDie(),\n\t\t// Needed for the node to report status of pods it is running.\n\t\t// Use the NodeRestriction admission plugin to limit a node to updating status of pods bound to itself.\n\t\trbacv1helpers.NewRule(\"update\", \"patch\").Groups(legacyGroup).Resources(\"pods/status\").RuleOrDie(),\n\t\t// Needed for the node to create pod 
evictions.\n\t\t// Use the NodeRestriction admission plugin to limit a node to creating evictions for pods bound to itself.\n\t\trbacv1helpers.NewRule(\"create\").Groups(legacyGroup).Resources(\"pods/eviction\").RuleOrDie(),\n\n\t\t// Needed for imagepullsecrets, rbd/ceph and secret volumes, and secrets in envs\n\t\t// Needed for configmap volume and envs\n\t\t// Use the Node authorization mode to limit a node to get secrets/configmaps referenced by pods bound to itself.\n\t\trbacv1helpers.NewRule(\"get\", \"list\", \"watch\").Groups(legacyGroup).Resources(\"secrets\", \"configmaps\").RuleOrDie(),\n\t\t// Needed for persistent volumes\n\t\t// Use the Node authorization mode to limit a node to get pv/pvc objects referenced by pods bound to itself.\n\t\trbacv1helpers.NewRule(\"get\").Groups(legacyGroup).Resources(\"persistentvolumeclaims\", \"persistentvolumes\").RuleOrDie(),\n\n\t\t// TODO: add to the Node authorizer and restrict to endpoints referenced by pods or PVs bound to the node\n\t\t// Needed for glusterfs volumes\n\t\trbacv1helpers.NewRule(\"get\").Groups(legacyGroup).Resources(\"endpoints\").RuleOrDie(),\n\t\t// Used to create a certificatesigningrequest for a node-specific client certificate, and watch\n\t\t// for it to be signed. 
This allows the kubelet to rotate it's own certificate.\n\t\trbacv1helpers.NewRule(\"create\", \"get\", \"list\", \"watch\").Groups(certificatesGroup).Resources(\"certificatesigningrequests\").RuleOrDie(),\n\n\t\t// Leases\n\t\trbacv1helpers.NewRule(\"get\", \"create\", \"update\", \"patch\", \"delete\").Groups(\"coordination.k8s.io\").Resources(\"leases\").RuleOrDie(),\n\n\t\t// CSI\n\t\trbacv1helpers.NewRule(\"get\").Groups(storageGroup).Resources(\"volumeattachments\").RuleOrDie(),\n\t}\n\n\tif utilfeature.DefaultFeatureGate.Enabled(features.ExpandPersistentVolumes) {\n\t\t// Use the Node authorization mode to limit a node to update status of pvc objects referenced by pods bound to itself.\n\t\t// Use the NodeRestriction admission plugin to limit a node to just update the status stanza.\n\t\tpvcStatusPolicyRule := rbacv1helpers.NewRule(\"get\", \"update\", \"patch\").Groups(legacyGroup).Resources(\"persistentvolumeclaims/status\").RuleOrDie()\n\t\tnodePolicyRules = append(nodePolicyRules, pvcStatusPolicyRule)\n\t}\n\n\tif utilfeature.DefaultFeatureGate.Enabled(features.TokenRequest) {\n\t\t// Use the Node authorization to limit a node to create tokens for service accounts running on that node\n\t\t// Use the NodeRestriction admission plugin to limit a node to create tokens bound to pods on that node\n\t\ttokenRequestRule := rbacv1helpers.NewRule(\"create\").Groups(legacyGroup).Resources(\"serviceaccounts/token\").RuleOrDie()\n\t\tnodePolicyRules = append(nodePolicyRules, tokenRequestRule)\n\t}\n\n\t// CSI\n\tif utilfeature.DefaultFeatureGate.Enabled(features.CSIDriverRegistry) {\n\t\tcsiDriverRule := rbacv1helpers.NewRule(\"get\", \"watch\", \"list\").Groups(\"storage.k8s.io\").Resources(\"csidrivers\").RuleOrDie()\n\t\tnodePolicyRules = append(nodePolicyRules, csiDriverRule)\n\t}\n\tif utilfeature.DefaultFeatureGate.Enabled(features.CSINodeInfo) {\n\t\tcsiNodeInfoRule := rbacv1helpers.NewRule(\"get\", \"create\", \"update\", \"patch\", 
\"delete\").Groups(\"storage.k8s.io\").Resources(\"csinodes\").RuleOrDie()\n\t\tnodePolicyRules = append(nodePolicyRules, csiNodeInfoRule)\n\t}\n\n\t// RuntimeClass\n\tif utilfeature.DefaultFeatureGate.Enabled(features.RuntimeClass) {\n\t\tnodePolicyRules = append(nodePolicyRules, rbacv1helpers.NewRule(\"get\", \"list\", \"watch\").Groups(\"node.k8s.io\").Resources(\"runtimeclasses\").RuleOrDie())\n\t}\n\treturn nodePolicyRules\n}\n```\n\n<br>\n\n在进行Node授权时，通过r.identifier.NodeIdentity函数获取角色信息，并验证其是否为system：node内置角色，nodeName的表现形式为system：node：<nodeName>。通过rbac.RulesAllow函数进行RBAC授权，如果授权成功，返回DecisionAllow决策状态。\n\n```\nfunc (r *NodeAuthorizer) Authorize(ctx context.Context, attrs authorizer.Attributes) (authorizer.Decision, string, error) {\n   nodeName, isNode := r.identifier.NodeIdentity(attrs.GetUser())\n   if !isNode {\n      // reject requests from non-nodes\n      return authorizer.DecisionNoOpinion, \"\", nil\n   }\n   if len(nodeName) == 0 {\n      // reject requests from unidentifiable nodes\n      klog.V(2).Infof(\"NODE DENY: unknown node for user %q\", attrs.GetUser().GetName())\n      return authorizer.DecisionNoOpinion, fmt.Sprintf(\"unknown node for user %q\", attrs.GetUser().GetName()), nil\n   }\n\n   // subdivide access to specific resources\n   if attrs.IsResourceRequest() {\n      requestResource := schema.GroupResource{Group: attrs.GetAPIGroup(), Resource: attrs.GetResource()}\n      switch requestResource {\n      case secretResource:\n         return r.authorizeReadNamespacedObject(nodeName, secretVertexType, attrs)\n      case configMapResource:\n         return r.authorizeReadNamespacedObject(nodeName, configMapVertexType, attrs)\n      case pvcResource:\n         if r.features.Enabled(features.ExpandPersistentVolumes) {\n            if attrs.GetSubresource() == \"status\" {\n               return r.authorizeStatusUpdate(nodeName, pvcVertexType, attrs)\n            }\n         }\n         return r.authorizeGet(nodeName, pvcVertexType, attrs)\n    
  case pvResource:\n         return r.authorizeGet(nodeName, pvVertexType, attrs)\n      case vaResource:\n         return r.authorizeGet(nodeName, vaVertexType, attrs)\n      case svcAcctResource:\n         if r.features.Enabled(features.TokenRequest) {\n            return r.authorizeCreateToken(nodeName, serviceAccountVertexType, attrs)\n         }\n         return authorizer.DecisionNoOpinion, fmt.Sprintf(\"disabled by feature gate %s\", features.TokenRequest), nil\n      case leaseResource:\n         return r.authorizeLease(nodeName, attrs)\n      case csiNodeResource:\n         if r.features.Enabled(features.CSINodeInfo) {\n            return r.authorizeCSINode(nodeName, attrs)\n         }\n         return authorizer.DecisionNoOpinion, fmt.Sprintf(\"disabled by feature gates %s\", features.CSINodeInfo), nil\n      }\n\n   }\n\n   // Access to other resources is not subdivided, so just evaluate against the statically defined node rules\n   if rbac.RulesAllow(attrs, r.nodeRules...) {\n      return authorizer.DecisionAllow, \"\", nil\n   }\n   return authorizer.DecisionNoOpinion, \"\", nil\n}\n```\n\n<br>\n\n### 3. 
总结\n\n（1）本文主要参考《kubernetes源码解剖》这本书籍，并结合一些自身使用经验，记录一下k8s授权方面的知识，日后需要相关开发或者更深入了解时，有一定的知识基础。\n\n（2）针对这6种授权模式，个人认为的优缺点如下：\n\n| 模式        | 优点                                                         | 缺点                                                         |\n| ----------- | ------------------------------------------------------------ | ------------------------------------------------------------ |\n| AlwaysAllow | 简单，适用于自己搭建集群做实践，可以省去一些配置权限的事情   | 非常不安全                                                   |\n| AlwaysDeny  | 仅用于测试场景                                               | 会拒绝所有请求，无实际授权能力                               |\n| ABAC        |                                                              | 每次修改策略都需要修改文件并重启apiserver，而正式集群中，重启apiserver的风险是非常大的 |\n| Webhook     | 可以通过webhook使用一套定制化的权限管理系统                  |                                                              |\n| RBAC        | 是目前比较流行的授权思路。有这么几个优点：<br>（1）对集群中的资源和非资源均拥有完整的覆盖<br>（2）整个RBAC完全由几个API对象完成，同其他API对象一样，可以用kubectl或API进行操作<br>（3）可以在运行时进行操作，无需重启API Server |                                                              |\n| Node        | 专门针对kubelet的授权，事先专门为kubelet定义好了一组权限     |                                                              |\n\n个人感觉目前常用的就是：RBAC、Webhook、Node这三种。\n\n### 4. 参考\n\n书籍：kubernetes源码解剖，郑东\n\nhttps://kubernetes.io/zh/docs/reference/access-authn-authz/abac/\n\nhttps://kubernetes.io/zh/docs/reference/access-authn-authz/webhook/\n\nhttps://www.qikqiak.com/post/use-rbac-in-k8s/\n\nhttps://qingwave.github.io/kube-apiserver-authorization-code/\n\n<br>"
  },
  {
    "path": "k8s/kube-apiserver/14-k8s之admission分析.md",
    "content": "Table of Contents\n=================\n\n  * [1. 背景](#1-背景)\n  * [2. 分析流程](#2-分析流程)\n     * [2.1 Admission的注册](#21-admission的注册)\n     * [2.2 admission的调用](#22-admission的调用)\n     * [2.3  validatingwebhook, mutatingwebhook的调用](#23--validatingwebhook-mutatingwebhook的调用)\n        * [2.3.1 ValidatingAdmissionWebhook调用](#231-validatingadmissionwebhook调用)\n        * [2.3.2 MutatingAdmissionWebhook调用](#232-mutatingadmissionwebhook调用)\n     * [2.4 动态更新webhook的原理](#24-动态更新webhook的原理)\n  * [3. 总结](#3-总结)\n  * [4.参考链接：](#4参考链接)\n\n### 1. 背景\n\nAPI Request -> 认证 -> 授权 -> admission -> etcd\n\n和initializer不同，webhook是在对象保存到etcd之前工作的。\n\n经过了认证、授权之后，接下来就到了webhook这个环节了。\n\n这篇笔记主要就是分析 `MutatingAdmissionWebhook` 和 `ValidatingAdmissionWebhook` 是如何工作的。\n\n<br>\n\n### 2. 分析流程\n\n#### 2.1 Admission的注册\n\nkube-apiserver在调用NewServerRunOptions函数初始化options的时候，调用了NewAdmissionOptions去初始化AdmissionOptions，并注册了内置的admission插件和webhook admission插件。\n\n```\n// NewServerRunOptions creates a new ServerRunOptions object with default parameters\nfunc NewServerRunOptions() *ServerRunOptions {\n   s := ServerRunOptions{\n      // 省略...\n      // 初始化AdmissionOptions\n      Admission:               kubeoptions.NewAdmissionOptions(), \n      Authentication:          kubeoptions.NewBuiltInAuthenticationOptions().WithAll(),\n      Authorization:           kubeoptions.NewBuiltInAuthorizationOptions(),\n      // 省略...\n   }\n   // ...\n   return &s\n}\n```\n\n<br>\n\n**AdmissionOptions的一些基础概念**\n\n```\noptions.AdmissionOptions\n// AdmissionOptions holds the admission options.\n// It is a wrap of generic AdmissionOptions.\ntype AdmissionOptions struct {\n\t// GenericAdmission holds the generic admission options.\n\tGenericAdmission *genericoptions.AdmissionOptions\n\t// DEPRECATED flag, should use EnabledAdmissionPlugins and DisabledAdmissionPlugins.\n\t// They are mutually exclusive, specify both will lead to an error.\n\tPluginNames []string\n}\n\ngenericoptions.AdmissionOptions\n// 
AdmissionOptions holds the admission options\ntype AdmissionOptions struct {\n    // 有序的推荐插件列表集合  \n\tRecommendedPluginOrder []string\n\t// 默认禁止的插件  \n\tDefaultOffPlugins sets.String\n\t// 开启的插件列表，通过kube-apiserver 启动参数设置--enable-admission-plugins 选项  \n\tEnablePlugins []string\n\t// 禁止的插件列表，通过kube-apiserver 启动参数设置 --disable-admission-plugins 选项  \n\tDisablePlugins []string\n\t// ConfigFile is the file path with admission control configuration.\n\tConfigFile string\n\t// 代表了所有已经注册的插件 \n\tPlugins *admission.Plugins\n}\n```\n\n<br>\n\n**options. NewAdmissionOptions()**\n\nNewAdmissionOptions里面先是调用genericoptions.NewAdmissionOptions创建一个AdmissionOptions，NewAdmissionOptions同时也注册了lifecycle、validatingwebhook、mutatingwebhook这三个插件。然后再调用RegisterAllAdmissionPlugins注册内置的其他admission。\n\n```\noptions. NewAdmissionOptions()\n// NewAdmissionOptions creates a new instance of AdmissionOptions\n// Note:\n//  In addition it calls RegisterAllAdmissionPlugins to register\n//  all kube-apiserver admission plugins.\n//\n//  Provides the list of RecommendedPluginOrder that holds sane values\n//  that can be used by servers that don't care about admission chain.\n//  Servers that do care can overwrite/append that field after creation.\nfunc NewAdmissionOptions() *AdmissionOptions {\n    // 这里注册了 lifecycle, initialization,validatingwebhook,mutatingwebhook 四个admission。（2.2.2 中mutating的注册函数就是这个时候调用的）\n\toptions := genericoptions.NewAdmissionOptions()\n\t// 这里注册了所有的 admission, 没有上面四个 admission\n\t// register all admission plugins\n\tRegisterAllAdmissionPlugins(options.Plugins)\n\t// set RecommendedPluginOrder\n\toptions.RecommendedPluginOrder = AllOrderedPlugins           // 确定了admission-plugin的相对顺序。\n\t// set DefaultOffPlugins\n\t\n\t// 设置默认的停用插件\n\toptions.DefaultOffPlugins = DefaultOffAdmissionPlugins()\n\n\treturn &AdmissionOptions{\n\t\tGenericAdmission: options,\n\t}\n}\n\n\ngenericoptions.NewAdmissionOptions()\n// NewAdmissionOptions creates a new instance of AdmissionOptions\n// Note:\n// 
 In addition it calls RegisterAllAdmissionPlugins to register\n//  all generic admission plugins.\n//\n//  Provides the list of RecommendedPluginOrder that holds sane values\n//  that can be used by servers that don't care about admission chain.\n//  Servers that do care can overwrite/append that field after creation.\nfunc NewAdmissionOptions() *AdmissionOptions {\n\toptions := &AdmissionOptions{\n\t\tPlugins: admission.NewPlugins(),\n\t\t// This list is mix of mutating admission plugins and validating\n\t\t// admission plugins. The apiserver always runs the validating ones\n\t\t// after all the mutating ones, so their relative order in this list\n\t\t// doesn't matter.\n\t\tRecommendedPluginOrder: []string{lifecycle.PluginName, initialization.PluginName, mutatingwebhook.PluginName, validatingwebhook.PluginName},\n\t\tDefaultOffPlugins:      sets.NewString(initialization.PluginName),\n\t}\n\t// 注册了lifecycle、validatingwebhook、mutatingwebhook\n\tserver.RegisterAllAdmissionPlugins(options.Plugins)\n\treturn options\n}\n\n// validatingwebhook, mutatingwebhook 是动态的，这里应该就是注册一个总体的概念，而不是一个一个的实体。\n// RegisterAllAdmissionPlugins registers all admission plugins\nfunc RegisterAllAdmissionPlugins(plugins *admission.Plugins) {\n\tlifecycle.Register(plugins)\n\tinitialization.Register(plugins)\n\tvalidatingwebhook.Register(plugins)\n\tmutatingwebhook.Register(plugins)\n}\n```\n\n**AllOrderedPlugins**\n\n```\n// AllOrderedPlugins is the list of all the plugins in order.\nvar AllOrderedPlugins = []string{\n\tadmit.PluginName,                        // AlwaysAdmit\n\tautoprovision.PluginName,                // NamespaceAutoProvision\n\tlifecycle.PluginName,                    // NamespaceLifecycle\n\texists.PluginName,                       // NamespaceExists\n\tscdeny.PluginName,                       // SecurityContextDeny\n\tantiaffinity.PluginName,                 // LimitPodHardAntiAffinityTopology\n\tpodpreset.PluginName,                    // 
PodPreset\n\tlimitranger.PluginName,                  // LimitRanger\n\tserviceaccount.PluginName,               // ServiceAccount\n\tnoderestriction.PluginName,              // NodeRestriction\n\talwayspullimages.PluginName,             // AlwaysPullImages\n\timagepolicy.PluginName,                  // ImagePolicyWebhook\n\tpodsecuritypolicy.PluginName,            // PodSecurityPolicy\n\tpodnodeselector.PluginName,              // PodNodeSelector\n\tpodpriority.PluginName,                  // Priority\n\tdefaulttolerationseconds.PluginName,     // DefaultTolerationSeconds\n\tpodtolerationrestriction.PluginName,     // PodTolerationRestriction\n\texec.DenyEscalatingExec,                 // DenyEscalatingExec\n\texec.DenyExecOnPrivileged,               // DenyExecOnPrivileged\n\teventratelimit.PluginName,               // EventRateLimit\n\textendedresourcetoleration.PluginName,   // ExtendedResourceToleration\n\tlabel.PluginName,                        // PersistentVolumeLabel\n\tsetdefault.PluginName,                   // DefaultStorageClass\n\tstorageobjectinuseprotection.PluginName, // StorageObjectInUseProtection\n\tgc.PluginName,                           // OwnerReferencesPermissionEnforcement\n\tresize.PluginName,                       // PersistentVolumeClaimResize\n\tmutatingwebhook.PluginName,              // MutatingAdmissionWebhook\n\tinitialization.PluginName,               // Initializers\n\tvalidatingwebhook.PluginName,            // ValidatingAdmissionWebhook\n\tresourcequota.PluginName,                // ResourceQuota\n\tdeny.PluginName,                         // AlwaysDeny\n}\n```\n\n\n<br>\n\n#### 2.2 admission的调用\n前面已经分析AdmissionPlugin注册到ServerRunOptions的过程， buildGenericConfig中会调用ServerRunOptions.Admission.ApplyTo生成admission chain设置到GenericConfig里面。把所有的admission plugin生成chainAdmissionHandler对象，其实就是plugin数组，这个类的Admit、Validate等方法会遍历调用每个plugin的Admit、Validate方法\n\n\n```\nbuildGenericConfig(){\n\terr = 
s.Admission.ApplyTo(\n\t\tgenericConfig,\n\t\tversionedInformers,\n\t\tkubeClientConfig,\n\t\tfeature.DefaultFeatureGate,\n\t\tpluginInitializers...)\n}\n```\n\nGenericConfig.AdmissionControl 又会赋值给GenericAPIServer.admissionControl。\n\n```\nfunc (a *AdmissionOptions) ApplyTo(\n   c *server.Config,\n   informers informers.SharedInformerFactory,\n   kubeAPIServerClientConfig *rest.Config,\n   features featuregate.FeatureGate,\n   pluginInitializers ...admission.PluginInitializer,\n) error {\n      // 省略 ...\n    // 找到所有启用的plugin\n   pluginNames := a.enabledPluginNames()\n \n   pluginsConfigProvider, err := admission.ReadAdmissionConfiguration(pluginNames, a.ConfigFile, configScheme)\n   if err != nil {\n      return fmt.Errorf(\"failed to read plugin config: %v\", err)\n   }\n \n   clientset, err := kubernetes.NewForConfig(kubeAPIServerClientConfig)\n   if err != nil {\n      return err\n   }\n   genericInitializer := initializer.New(clientset, informers, c.Authorization.Authorizer, features)\n   initializersChain := admission.PluginInitializers{}\n   pluginInitializers = append(pluginInitializers, genericInitializer)\n   initializersChain = append(initializersChain, pluginInitializers...)\n    // 把所有的admission plugin生成admissionChain，实际是个plugin数组\n   admissionChain, err := a.Plugins.NewFromPlugins(pluginNames, pluginsConfigProvider, initializersChain, a.Decorators)\n   if err != nil {\n      return err\n   }\n    // 把admissionChain设置给GenericConfig.AdmissionControl \n   c.AdmissionControl = admissionmetrics.WithStepMetrics(admissionChain)\n   return nil\n}\n```\n\nAdmission Plugin是在kube-apiserver处理完前面的handler之后、调用RESTStorage的Get、Create、Update、Delete等函数之前被调用的。\n\nkube-apiserver由很多handler组成了handler链，这些handler链的最内层，是使用go-restful框架注册的WebService。每个WebService都对应一种资源的RESTStorage，比如NodeStorage（pkg/registry/core/node/storage/storage.go），installAPIResources初始化WebService时，会把RESTStorage的Get、Create、Update等函数分别封装成GET、POST、PUT等http方法的handler注册到WebService中。 
\n\n比如把Update函数封装成http handler 作为PUT方法的handler，而在这个handler调用Update函数之前，会先调用Admission Plugin的Admit、Validate等函数。下面看个PUT方法的例子。\n\n<br>\n\na.group.Admit是从GenericAPIServer.admissionControl取的值，也就是前面ApplyTo函数生成的admissionChain。admit、updater作为参数传入restfulUpdateResource函数生成handler。\n\n```\n// staging/src/k8s.io/apiserver/pkg/endpoints/installer.go\nfunc (a *APIInstaller) registerResourceHandlers(path string, storage rest.Storage, ws *restful.WebService) (*metav1.APIResource, error) {\n   admit := a.group.Admit\n   // 省略 ...\n   updater, isUpdater := storage.(rest.Updater)\n   // 省略 ...\n   switch action.Verb {\n     case \"GET\": ...\n    case \"PUT\": // Update a resource.\n       doc := \"replace the specified \" + kind\n       if isSubresource {\n          doc = \"replace \" + subresource + \" of the specified \" + kind\n       }\n       // admit、updater作为参数调用restfulUpdateResource函数生成的handler\n       handler := metrics.InstrumentRouteFunc(action.Verb, group, version, resource, subresource, requestScope, metrics.APIServerComponent, restfulUpdateResource(updater, reqScope, admit))\n       route := ws.PUT(action.Path).To(handler).\n          Doc(doc).\n          Param(ws.QueryParameter(\"pretty\", \"If 'true', then the output is pretty printed.\")).\n          Operation(\"replace\"+namespaced+kind+strings.Title(subresource)+operationSuffix).\n          Produces(append(storageMeta.ProducesMIMETypes(action.Verb), mediaTypes...)...).\n          Returns(http.StatusOK, \"OK\", producedObject).\n          // TODO: in some cases, the API may return a v1.Status instead of the versioned object\n          // but currently go-restful can't handle multiple different objects being returned.\n          Returns(http.StatusCreated, \"Created\", producedObject).\n          Reads(defaultVersionedObject).\n          Writes(producedObject)\n       
if err := AddObjectParams(ws, route, versionedUpdateOptions); err != nil {\n          return nil, err\n       }\n       addParams(route, action.Params)\n       routes = append(routes, route)     \n     case \"PATCH\": ...  \n     // 省略 ....\n   }     \n}\n```\n\nrestfulUpdateResource调用了 handlers.UpdateResource。\n\n```\nfunc restfulUpdateResource(r rest.Updater, scope handlers.RequestScope, admit admission.Interface) restful.RouteFunction {\n\treturn func(req *restful.Request, res *restful.Response) {\n\t\thandlers.UpdateResource(r, &scope, admit)(res.ResponseWriter, req.Request)\n\t}\n}\n```\n\n看handlers.UpdateResource的代码实现，会先判断如果传入的admission.Interface参数是MutationInterface类型，就调用Admit，也就是调用admissionChain的Admit，最终会遍历调用每个Admission Plugin的Admit方法。而Webhook Admission是众多admission中的一个。\n\n执行完Admission，后面的requestFunc才会调用RESTStorage的Update函数。每个资源的RESTStorage最终都是要调用ETCD3Storage的Get、Update等函数。\n\n```\n// staging/src/k8s.io/apiserver/pkg/endpoints/handlers/update.go\nfunc UpdateResource(r rest.Updater, scope *RequestScope, admit admission.Interface) http.HandlerFunc {\n   return func(w http.ResponseWriter, req *http.Request) {\n      // 省略 ...\n      ae := request.AuditEventFrom(ctx)\n      audit.LogRequestObject(ae, obj, scope.Resource, scope.Subresource, scope.Serializer)\n      admit = admission.WithAudit(admit, ae)\n    // 如果admit是MutationInterface类型的，就调用其Admit函数，也就是admissionChain的Admit\n      if mutatingAdmission, ok := admit.(admission.MutationInterface); ok {\n         transformers = append(transformers, func(ctx context.Context, newObj, oldObj runtime.Object) (runtime.Object, error) {\n            isNotZeroObject, err := hasUID(oldObj)\n            if err != nil {\n               return nil, fmt.Errorf(\"unexpected error when extracting UID from oldObj: %v\", err.Error())\n            } else if !isNotZeroObject {\n               if mutatingAdmission.Handles(admission.Create) {\n                  return newObj, mutatingAdmission.Admit(ctx, admission.NewAttributesRecord(newObj, 
nil, scope.Kind, namespace, name, scope.Resource, scope.Subresource, admission.Create, updateToCreateOptions(options), dryrun.IsDryRun(options.DryRun), userInfo), scope)\n               }\n            } else {\n               if mutatingAdmission.Handles(admission.Update) {\n                  return newObj, mutatingAdmission.Admit(ctx, admission.NewAttributesRecord(newObj, oldObj, scope.Kind, namespace, name, scope.Resource, scope.Subresource, admission.Update, options, dryrun.IsDryRun(options.DryRun), userInfo), scope)\n               }\n            }\n            return newObj, nil\n         })\n      }\n      // 省略 ...\n      // 执行完MutationInterface类型的admission，这里先会执行validatingAdmission，然后才调用RESTStorage的Update函数 \n      requestFunc := func() (runtime.Object, error) {\n         obj, created, err := r.Update(\n            ctx,\n            name,\n            rest.DefaultUpdatedObjectInfo(obj, transformers...),\n            withAuthorization(rest.AdmissionToValidateObjectFunc(\n               admit,\n               admission.NewAttributesRecord(nil, nil, scope.Kind, namespace, name, scope.Resource, scope.Subresource, admission.Create, updateToCreateOptions(options), dryrun.IsDryRun(options.DryRun), userInfo), scope),\n               scope.Authorizer, createAuthorizerAttributes),\n           // 这里调用了validatingAdmission.Validate函数\n            rest.AdmissionToValidateObjectUpdateFunc(\n               admit,\n               admission.NewAttributesRecord(nil, nil, scope.Kind, namespace, name, scope.Resource, scope.Subresource, admission.Update, options, dryrun.IsDryRun(options.DryRun), userInfo), scope),\n            false,\n            options,\n         )\n         wasCreated = created\n         return obj, err\n      }\n      result, err := finishRequest(timeout, func() (runtime.Object, error) {\n         result, err := requestFunc()\n         // 省略 ...\n         return result, err\n      })\n      // ...\n      transformResponseObject(ctx, scope, trace, req, w, 
status, outputMediaType, result)\n   }\n}\n\n\n// 这里调用了validatingAdmission.Validate函数\n// AdmissionToValidateObjectUpdateFunc converts validating admission to a rest validate object update func\nfunc AdmissionToValidateObjectUpdateFunc(admit admission.Interface, staticAttributes admission.Attributes, o admission.ObjectInterfaces) ValidateObjectUpdateFunc {\n\tvalidatingAdmission, ok := admit.(admission.ValidationInterface)\n\tif !ok {\n\t\treturn func(ctx context.Context, obj, old runtime.Object) error { return nil }\n\t}\n\treturn func(ctx context.Context, obj, old runtime.Object) error {\n\t\tfinalAttributes := admission.NewAttributesRecord(\n\t\t\tobj,\n\t\t\told,\n\t\t\tstaticAttributes.GetKind(),\n\t\t\tstaticAttributes.GetNamespace(),\n\t\t\tstaticAttributes.GetName(),\n\t\t\tstaticAttributes.GetResource(),\n\t\t\tstaticAttributes.GetSubresource(),\n\t\t\tstaticAttributes.GetOperation(),\n\t\t\tstaticAttributes.GetOperationOptions(),\n\t\t\tstaticAttributes.IsDryRun(),\n\t\t\tstaticAttributes.GetUserInfo(),\n\t\t)\n\t\tif !validatingAdmission.Handles(finalAttributes.GetOperation()) {\n\t\t\treturn nil\n\t\t}\n\t\treturn validatingAdmission.Validate(ctx, finalAttributes, o)\n\t}\n}\n```\n\n以上是PUT方法的例子，里面调用了MutationInterface和ValidationInterface。其他的方法比如POST、DELETE等也是类似。但是GET方法不会调用Admission Plugin。\n\n\n\n#### 2.3  validatingwebhook, mutatingwebhook的调用\n\nvalidatingwebhook和mutatingwebhook分别位于staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/validating/plugin.go，staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/mutating/plugin.go两个文件中。\n\n##### 2.3.1 ValidatingAdmissionWebhook调用\n\n（1） ValidatingAdmissionWebhook的Validate()函数实现了ValidationInterface接口，有请求到来时kube-apiserver会调用所有admission 
的Validate()方法。ValidatingAdmissionWebhook持有了一个Webhook对象，Validate()会调用Webhook.Dispatch()。\n\n（2）Webhook.Dispatch()又调用了其持有的dispatcher的Dispatch()方法。dispatcher是通过dispatcherFactory创建的，dispatcherFactory是ValidatingAdmissionWebhook创建generic.Webhook时候传入的newValidatingDispatcher函数。调用dispatcherFactory函数创建的实际上是validatingDispatcher对象，也就是说Webhook.Dispatch()调用的是validatingDispatcher.Dispatch()。\n\n（3）validatingDispatcher.Dispatch()会逐个远程调用注册的webhook plugin。\n\nNewValidatingAdmissionWebhook初始化了ValidatingAdmissionWebhook对象，内部持有了一个generic.Webhook对象，generic.Webhook是一个Validate和mutate共用的框架，创建generic.Webhook时需要一个dispatcherFactory函数，用这个函数生成dispatcher对象。\n\n```\n// staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/validating/plugin.go\n// NewValidatingAdmissionWebhook returns a generic admission webhook plugin.\nfunc NewValidatingAdmissionWebhook(configFile io.Reader) (*Plugin, error) {\n   handler := admission.NewHandler(admission.Connect, admission.Create, admission.Delete, admission.Update)\n   p := &Plugin{}\n   var err error\n   p.Webhook, err = generic.NewWebhook(handler, configFile, configuration.NewValidatingWebhookConfigurationManager, newValidatingDispatcher(p))\n   if err != nil {\n      return nil, err\n   }\n   return p, nil\n}\n \n// Validate makes an admission decision based on the request attributes.\nfunc (a *Plugin) Validate(ctx context.Context, attr admission.Attributes, o admission.ObjectInterfaces) error {\n   return a.Webhook.Dispatch(ctx, attr, o)\n}\n```\n\n调用generic.Webhook.Dispatch()时会调用dispatcher对象的Dispatch。\n\n```\n// Dispatch is called by the downstream Validate or Admit methods.\nfunc (a *Webhook) Dispatch(ctx context.Context, attr admission.Attributes, o admission.ObjectInterfaces) error {\n   if rules.IsWebhookConfigurationResource(attr) {\n      return nil\n   }\n   if !a.WaitForReady() {\n      return admission.NewForbidden(attr, fmt.Errorf(\"not yet ready to handle request\"))\n   }\n   hooks := a.hookSource.Webhooks()\n   return 
a.dispatcher.Dispatch(ctx, attr, o, hooks)\n}\n```\n\nvalidatingDispatcher.Dispatch遍历所有的hooks，找到相关的webhooks，然后执行callHook逐个远程调用外部注册进来的webhook plugin。\n\n```go\nfunc (d *validatingDispatcher) Dispatch(ctx context.Context, attr admission.Attributes, o admission.ObjectInterfaces, hooks []webhook.WebhookAccessor) error {\n   var relevantHooks []*generic.WebhookInvocation\n   // Construct all the versions we need to call our webhooks\n   versionedAttrs := map[schema.GroupVersionKind]*generic.VersionedAttributes{}\n   for _, hook := range hooks {\n       // 遍历所有的webhooks，根据ValidatingWebhookConfiguration中的rules是否匹配找到所有相关的hooks\n      invocation, statusError := d.plugin.ShouldCallHook(hook, attr, o)\n      if statusError != nil {\n         return statusError\n      }\n      if invocation == nil {\n         continue\n      }\n      relevantHooks = append(relevantHooks, invocation)\n      // If we already have this version, continue\n      if _, ok := versionedAttrs[invocation.Kind]; ok {\n         continue\n      }\n      versionedAttr, err := generic.NewVersionedAttributes(attr, invocation.Kind, o)\n      if err != nil {\n         return apierrors.NewInternalError(err)\n      }\n      versionedAttrs[invocation.Kind] = versionedAttr\n   }\n \n   if len(relevantHooks) == 0 {\n      // no matching hooks\n      return nil\n   }\n \n   // Check if the request has already timed out before spawning remote calls\n   select {\n   case <-ctx.Done():\n      // parent context is canceled or timed out, no point in continuing\n      return apierrors.NewTimeoutError(\"request did not complete within requested timeout\", 0)\n   default:\n   }\n \n   wg := sync.WaitGroup{}\n   errCh := make(chan error, len(relevantHooks))\n   wg.Add(len(relevantHooks))\n   for i := range relevantHooks {\n      go func(invocation *generic.WebhookInvocation) {\n         defer wg.Done()\n         hook, ok := invocation.Webhook.GetValidatingWebhook()\n         if !ok {\n            
utilruntime.HandleError(fmt.Errorf(\"validating webhook dispatch requires v1.ValidatingWebhook, but got %T\", hook))\n            return\n         }\n         versionedAttr := versionedAttrs[invocation.Kind]\n         t := time.Now()\n         // 启动多个go routine 并行调用注册进来的webhook plugin\n         err := d.callHook(ctx, hook, invocation, versionedAttr)\n         ignoreClientCallFailures := hook.FailurePolicy != nil && *hook.FailurePolicy == v1.Ignore\n         rejected := false\n         if err != nil {\n            switch err := err.(type) {\n            case *webhookutil.ErrCallingWebhook:\n               if !ignoreClientCallFailures {\n                  rejected = true\n                  admissionmetrics.Metrics.ObserveWebhookRejection(hook.Name, \"validating\", string(versionedAttr.Attributes.GetOperation()), admissionmetrics.WebhookRejectionCallingWebhookError, 0)\n               }\n            case *webhookutil.ErrWebhookRejection:\n               rejected = true\n               admissionmetrics.Metrics.ObserveWebhookRejection(hook.Name, \"validating\", string(versionedAttr.Attributes.GetOperation()), admissionmetrics.WebhookRejectionNoError, int(err.Status.ErrStatus.Code))\n            default:\n               rejected = true\n               admissionmetrics.Metrics.ObserveWebhookRejection(hook.Name, \"validating\", string(versionedAttr.Attributes.GetOperation()), admissionmetrics.WebhookRejectionAPIServerInternalError, 0)\n            }\n         }\n         admissionmetrics.Metrics.ObserveWebhook(time.Since(t), rejected, versionedAttr.Attributes, \"validating\", hook.Name)\n         if err == nil {\n            return\n         }\n \n         if callErr, ok := err.(*webhookutil.ErrCallingWebhook); ok {\n            if ignoreClientCallFailures {\n               klog.Warningf(\"Failed calling webhook, failing open %v: %v\", hook.Name, callErr)\n               utilruntime.HandleError(callErr)\n               return\n            }\n \n            
klog.Warningf(\"Failed calling webhook, failing closed %v: %v\", hook.Name, err)\n            errCh <- apierrors.NewInternalError(err)\n            return\n         }\n \n         if rejectionErr, ok := err.(*webhookutil.ErrWebhookRejection); ok {\n            err = rejectionErr.Status\n         }\n         klog.Warningf(\"rejected by webhook %q: %#v\", hook.Name, err)\n         errCh <- err\n      }(relevantHooks[i])\n   }\n   // 等待多个goroutine 执行完成\n   wg.Wait()\n   close(errCh)\n \n   var errs []error\n   for e := range errCh {\n      errs = append(errs, e)\n   }\n   if len(errs) == 0 {\n      return nil\n   }\n   if len(errs) > 1 {\n      for i := 1; i < len(errs); i++ {\n         // TODO: merge status errors; until then, just return the first one.\n         utilruntime.HandleError(errs[i])\n      }\n   }\n   return errs[0]\n}\n```\n\n\n\n##### 2.3.2 MutatingAdmissionWebhook调用\n\n看MutatingWebhook的构造函数就可以看到，MutatingWebhook和ValidatingWebhook的代码架构是一样的，只不过在创建generic.Webhook的时候传入的dispatcherFactory函数是newMutatingDispatcher，所以Webhook.Dispatch()最终调用的就是mutatingDispatcher.Dispatch(),这个和validatingDispatcher.Dispatch的实现逻辑基本是一样的，也是根据WebhookConfiguration中的rules是否匹配找到相关的webhooks，然后逐个调用。\n\n\n```\n// staging/src/k8s.io/apiserver/pkg/admission/plugin/webhook/mutating/plugin.go\n// NewMutatingWebhook returns a generic admission webhook plugin.\nfunc NewMutatingWebhook(configFile io.Reader) (*Plugin, error) {\n   handler := admission.NewHandler(admission.Connect, admission.Create, admission.Delete, admission.Update)\n   p := &Plugin{}\n   var err error\n   p.Webhook, err = generic.NewWebhook(handler, configFile, configuration.NewMutatingWebhookConfigurationManager, newMutatingDispatcher(p))\n   if err != nil {\n      return nil, err\n   }\n \n   return p, nil\n}\n \n// ValidateInitialization implements the InitializationValidator interface.\nfunc (a *Plugin) ValidateInitialization() error {\n   if err := a.Webhook.ValidateInitialization(); err != nil {\n      return err\n   }\n   
return nil\n}\n \n// Admit makes an admission decision based on the request attributes.\nfunc (a *Plugin) Admit(ctx context.Context, attr admission.Attributes, o admission.ObjectInterfaces) error {\n   return a.Webhook.Dispatch(ctx, attr, o)\n}\n\n\nfunc (a *mutatingDispatcher) Dispatch(ctx context.Context, attr admission.Attributes, o admission.ObjectInterfaces, hooks []webhook.WebhookAccessor) error {\n\treinvokeCtx := attr.GetReinvocationContext()\n\tvar webhookReinvokeCtx *webhookReinvokeContext\n\tif v := reinvokeCtx.Value(PluginName); v != nil {\n\t\twebhookReinvokeCtx = v.(*webhookReinvokeContext)\n\t} else {\n\t\twebhookReinvokeCtx = &webhookReinvokeContext{}\n\t\treinvokeCtx.SetValue(PluginName, webhookReinvokeCtx)\n\t}\n\n\tif reinvokeCtx.IsReinvoke() && webhookReinvokeCtx.IsOutputChangedSinceLastWebhookInvocation(attr.GetObject()) {\n\t\t// If the object has changed, we know the in-tree plugin re-invocations have mutated the object,\n\t\t// and we need to reinvoke all eligible webhooks.\n\t\twebhookReinvokeCtx.RequireReinvokingPreviouslyInvokedPlugins()\n\t}\n\tdefer func() {\n\t\twebhookReinvokeCtx.SetLastWebhookInvocationOutput(attr.GetObject())\n\t}()\n\tvar versionedAttr *generic.VersionedAttributes\n\t//是一个一个执行的\n\tfor i, hook := range hooks {\n\t\tattrForCheck := attr\n\t\tif versionedAttr != nil {\n\t\t\tattrForCheck = versionedAttr\n\t\t}\n\t\tinvocation, statusErr := a.plugin.ShouldCallHook(hook, attrForCheck, o)\n\t\tif statusErr != nil {\n\t\t\treturn statusErr\n\t\t}\n\t\tif invocation == nil {\n\t\t\tcontinue\n\t\t}\n\t\thook, ok := invocation.Webhook.GetMutatingWebhook()\n\t\tif !ok {\n\t\t\treturn fmt.Errorf(\"mutating webhook dispatch requires v1.MutatingWebhook, but got %T\", hook)\n\t\t}\n\t\t// This means that during reinvocation, a webhook will not be\n\t\t// called for the first time. 
For example, if the webhook is\n\t\t// skipped in the first round because of mismatching labels,\n\t\t// even if the labels become matching, the webhook does not\n\t\t// get called during reinvocation.\n\t\tif reinvokeCtx.IsReinvoke() && !webhookReinvokeCtx.ShouldReinvokeWebhook(invocation.Webhook.GetUID()) {\n\t\t\tcontinue\n\t\t}\n\n\t\t// 。。。省略：后续为该webhook构造versionedAttr并逐个远程调用，将返回的patch应用到对象上。。。\n\t}\n\n\treturn nil\n}\n```\n\n#### 2.4 动态更新webhook的原理\n\n我们使用的时候都是通过创建类似下面这样的ValidatingWebhookConfiguration来增加一个webhook。那如果增加了一个这样的webhook，它是如何生效的呢？\n\n```\napiVersion: admissionregistration.k8s.io/v1beta1\nkind: ValidatingWebhookConfiguration\nmetadata:\n  name: validation-webhook-example-cfg\n  labels:\n    app: admission-webhook-example\nwebhooks:\n  - name: required-labels.banzaicloud.com\n    clientConfig:\n      service:\n        name: admission-webhook-example-webhook-svc\n        namespace: default\n        path: \"/validate\"\n      caBundle: ${CA_BUNDLE}\n    rules:\n      - operations: [ \"CREATE\" ]\n        apiGroups: [\"apps\", \"\"]\n        apiVersions: [\"v1\"]\n        resources: [\"deployments\",\"services\"]\n    namespaceSelector:\n      matchLabels:\n        admission-webhook-example: enabled\n```\n\n<br>\n\n```\n// Dispatch is called by the downstream Validate or Admit methods.\nfunc (a *Webhook) Dispatch(ctx context.Context, attr admission.Attributes, o admission.ObjectInterfaces) error {\n\tif rules.IsWebhookConfigurationResource(attr) {\n\t\treturn nil\n\t}\n\tif !a.WaitForReady() {\n\t\treturn admission.NewForbidden(attr, fmt.Errorf(\"not yet ready to handle request\"))\n\t}\n\t// 这里获取了所有的webhook，然后再调用dispatcher的Dispatch函数\n\thooks := a.hookSource.Webhooks()\n\treturn a.dispatcher.Dispatch(ctx, attr, o, hooks)\n}\n\n\n// 获得所有的validatingWebhookConfiguration\n// Webhooks returns the merged ValidatingWebhookConfiguration.\nfunc (v *validatingWebhookConfigurationManager) Webhooks() []webhook.WebhookAccessor {\n\treturn 
v.configuration.Load().([]webhook.WebhookAccessor)\n}\n\n```\n\n\n\nValidatingWebhookConfigurationManager会维护所有的validatingWebhookConfiguration，一旦有ValidatingWebhookConfiguration的add/update/delete事件，就会调用updateConfiguration更新配置\n\n```\npkg/admission/configuration/validating_webhook_manager.go\nfunc NewValidatingWebhookConfigurationManager(f informers.SharedInformerFactory) generic.Source {\n\tinformer := f.Admissionregistration().V1().ValidatingWebhookConfigurations()\n\tmanager := &validatingWebhookConfigurationManager{\n\t\tconfiguration: &atomic.Value{},\n\t\tlister:        informer.Lister(),\n\t\thasSynced:     informer.Informer().HasSynced,\n\t}\n\n\t// Start with an empty list\n\tmanager.configuration.Store([]webhook.WebhookAccessor{})\n\n\t// On any change, rebuild the config\n\tinformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{\n\t\tAddFunc:    func(_ interface{}) { manager.updateConfiguration() },\n\t\tUpdateFunc: func(_, _ interface{}) { manager.updateConfiguration() },\n\t\tDeleteFunc: func(_ interface{}) { manager.updateConfiguration() },\n\t})\n\n\treturn manager\n}\n\n// updateConfiguration会用merge后的结果重建列表，上面Load的时候就会获得最新的webhook\nfunc mergeValidatingWebhookConfigurations(configurations []*v1.ValidatingWebhookConfiguration) []webhook.WebhookAccessor {\n\tsort.SliceStable(configurations, ValidatingWebhookConfigurationSorter(configurations).ByName)\n\taccessors := []webhook.WebhookAccessor{}\n\tfor _, c := range configurations {\n\t\t// webhook names are not validated for uniqueness, so we check for duplicates and\n\t\t// add a int suffix to distinguish between them\n\t\tnames := map[string]int{}\n\t\tfor i := range c.Webhooks {\n\t\t\tn := c.Webhooks[i].Name\n\t\t\tuid := fmt.Sprintf(\"%s/%s/%d\", c.Name, n, names[n])\n\t\t\tnames[n]++\n\t\t\taccessors = append(accessors, webhook.NewValidatingWebhookAccessor(uid, c.Name, &c.Webhooks[i]))\n\t\t}\n\t}\n\treturn accessors\n}\n```\n\n\n\n### 3. 
总结\n\n（1）webhook是插入在apiserver的处理链条中、在对象存入etcd之前生效的\n\n（2）MutatingWebhook和ValidatingWebhook的配置都有对应的manager来实时更新\n\n（3）ValidatingWebhook与MutatingWebhook的不同在于\n\n* 所有的请求先经过MutatingWebhook，再经过ValidatingWebhook。这也很好理解，因为ValidatingWebhook不会修改对象\n* ValidatingWebhook是并行调用的，全部通过后才放行（可以设置在超时/调用失败时忽略该webhook的策略）\n* MutatingWebhook是一个一个串行执行的\n\n### 4.参考链接：\n\nhttps://blog.csdn.net/u014152978/article/details/107170600\n"
  },
  {
    "path": "k8s/kube-apiserver/15-k8s之etcd存储实现.md",
    "content": "Table of Contents\n=================\n\n  * [1. etcd 配置](#1-etcd-配置)\n  * [2. Apiserver定义etcd的config](#2-apiserver定义etcd的config)\n     * [2.1 DefaultAPIResourceConfigSource](#21-defaultapiresourceconfigsource)\n     * [2.2 初始化 storageFactory](#22-初始化-storagefactory)\n  * [3. 以pod为例, apiserver是如何add/del/update etcd资源的](#3-以pod为例-apiserver是如何adddelupdate-etcd资源的)\n     * [3.1 NewStorage](#31-newstorage)\n     * [3.2 pod.Strategy](#32-podstrategy)\n     * [3.3 CompleteWithOptions](#33-completewithoptions)\n     * [3.4 总结](#34-总结)\n  * [4. 总结](#4-总结)\n  * [5.参考链接：](#5参考链接)\n\n本节介绍apiserver是如何使用etcd进行存储的。在apiserver的启动流程下中分析到了，不同资源的url注册最终依赖于一个Storage的东西。接下来就分析Storage到底是什么。\n\n### 1. etcd 配置\n\n这里就是指定一些etcd的参数，比如EnableWatchCache，EtcdPathPrefix，数据格式等等。\n\n```\n// NewServerRunOptions creates a new ServerRunOptions object with default parameters\nfunc NewServerRunOptions() *ServerRunOptions {\n\ts := ServerRunOptions{\n   ...\n   // 资源信息存储路径前缀缺省为：DefaultEtcdPathPrefix = \"registry\"。但是这个参数我们可以在运行时指定参数覆盖，具体的参数配置为：etcd-prefix\n\t\tEtcd:                    genericoptions.NewEtcdOptions(storagebackend.NewDefaultConfig(kubeoptions.DefaultEtcdPathPrefix, nil)),\n\t...\n\t\t}\n\t\t// 指定etcd的数据格式为protobuf\n\t\ts.Etcd.DefaultStorageMediaType = \"application/vnd.kubernetes.protobuf\"\n\t}\n  \n  \nfunc NewEtcdOptions(backendConfig *storagebackend.Config) *EtcdOptions {\n\t options := &EtcdOptions{\n\t\tStorageConfig:           *backendConfig,\n\t\tDefaultStorageMediaType: \"application/json\",\n\t\tDeleteCollectionWorkers: 1,\n\t\tEnableGarbageCollection: true,\n\t\tEnableWatchCache:        true,\n\t\tDefaultWatchCacheSize:   100,\n\t}\n\toptions.StorageConfig.CountMetricPollPeriod = time.Minute\n\treturn options\n}\n\n\nfunc NewDefaultConfig(prefix string, codec runtime.Codec) *Config {\n\treturn &Config{\n\t\tPaging:             true,\n\t\tPrefix:             prefix,\n\t\tCodec:              codec,\n\t\tCompactionInterval: DefaultCompactInterval,\n\t}\n}\n```\n\n### 
2. Apiserver定义etcd的config\n\ncmd/kube-apiserver/app/server.go 的buildGenericConfig函数会生成很多config，其中就有etcd的config。\n\nbuildGenericConfig关于etcd做的事情如下：\n\n- 1、调用 `master.DefaultAPIResourceConfigSource` 加载需要启用的 API Resource\n- 2、初始化并补全StorageFactory的配置。s.Etcd就是上面定义的etcd配置\n\n```\n  ...\n  // 1.加载默认支持的资源\n  genericConfig.MergedResourceConfig = master.DefaultAPIResourceConfigSource()\n  ... \n  storageFactoryConfig := kubeapiserver.NewStorageFactoryConfig()\n\tstorageFactoryConfig.APIResourceConfig = genericConfig.MergedResourceConfig\n\tcompletedStorageFactoryConfig, err := storageFactoryConfig.Complete(s.Etcd)\n\tif err != nil {\n\t\tlastErr = err\n\t\treturn\n\t}\n\tstorageFactory, lastErr = completedStorageFactoryConfig.New()\n\tif lastErr != nil {\n\t\treturn\n\t}\n\tif genericConfig.EgressSelector != nil {\n\t\tstorageFactory.StorageConfig.Transport.EgressLookup = genericConfig.EgressSelector.Lookup\n\t}\n\tif lastErr = s.Etcd.ApplyWithStorageFactoryTo(storageFactory, genericConfig); lastErr != nil {\n\t\treturn\n\t}\n```\n\n<br>\n\n#### 2.1 DefaultAPIResourceConfigSource\n\n可以看出来DefaultAPIResourceConfigSource函数就是返回当前集群默认支持哪些版本和资源。\n\n```\n// DefaultAPIResourceConfigSource returns default configuration for an APIResource.\nfunc DefaultAPIResourceConfigSource() *serverstorage.ResourceConfig {\n\tret := serverstorage.NewResourceConfig()\n\t// NOTE: GroupVersions listed here will be enabled by default. 
Don't put alpha versions in the list.\n\tret.EnableVersions(\n\t\tadmissionregistrationv1.SchemeGroupVersion,\n\t\tadmissionregistrationv1beta1.SchemeGroupVersion,\n\t\tapiv1.SchemeGroupVersion,\n\t\tappsv1.SchemeGroupVersion,\n\t\tauthenticationv1.SchemeGroupVersion,\n\t\tauthenticationv1beta1.SchemeGroupVersion,\n\t\tauthorizationapiv1.SchemeGroupVersion,\n\t\tauthorizationapiv1beta1.SchemeGroupVersion,\n\t\tautoscalingapiv1.SchemeGroupVersion,\n\t\tautoscalingapiv2beta1.SchemeGroupVersion,\n\t\tautoscalingapiv2beta2.SchemeGroupVersion,\n\t\tbatchapiv1.SchemeGroupVersion,\n\t\tbatchapiv1beta1.SchemeGroupVersion,\n\t\tcertificatesapiv1beta1.SchemeGroupVersion,\n\t\tcoordinationapiv1.SchemeGroupVersion,\n\t\tcoordinationapiv1beta1.SchemeGroupVersion,\n\t\tdiscoveryv1beta1.SchemeGroupVersion,\n\t\teventsv1beta1.SchemeGroupVersion,\n\t\textensionsapiv1beta1.SchemeGroupVersion,\n\t\tnetworkingapiv1.SchemeGroupVersion,\n\t\tnetworkingapiv1beta1.SchemeGroupVersion,\n\t\tnodev1beta1.SchemeGroupVersion,\n\t\tpolicyapiv1beta1.SchemeGroupVersion,\n\t\trbacv1.SchemeGroupVersion,\n\t\trbacv1beta1.SchemeGroupVersion,\n\t\tstorageapiv1.SchemeGroupVersion,\n\t\tstorageapiv1beta1.SchemeGroupVersion,\n\t\tschedulingapiv1beta1.SchemeGroupVersion,\n\t\tschedulingapiv1.SchemeGroupVersion,\n\t)\n\t// enable non-deprecated beta resources in extensions/v1beta1 explicitly so we have a full list of what's possible to serve\n\tret.EnableResources(\n\t\textensionsapiv1beta1.SchemeGroupVersion.WithResource(\"ingresses\"),\n\t)\n\t// disable deprecated beta resources in extensions/v1beta1 explicitly so we have a full list of what's possible to 
serve\n\tret.DisableResources(\n\t\textensionsapiv1beta1.SchemeGroupVersion.WithResource(\"daemonsets\"),\n\t\textensionsapiv1beta1.SchemeGroupVersion.WithResource(\"deployments\"),\n\t\textensionsapiv1beta1.SchemeGroupVersion.WithResource(\"networkpolicies\"),\n\t\textensionsapiv1beta1.SchemeGroupVersion.WithResource(\"podsecuritypolicies\"),\n\t\textensionsapiv1beta1.SchemeGroupVersion.WithResource(\"replicasets\"),\n\t\textensionsapiv1beta1.SchemeGroupVersion.WithResource(\"replicationcontrollers\"),\n\t)\n\t// disable deprecated beta versions explicitly so we have a full list of what's possible to serve\n\tret.DisableVersions(\n\t\tappsv1beta1.SchemeGroupVersion,\n\t\tappsv1beta2.SchemeGroupVersion,\n\t)\n\t// disable alpha versions explicitly so we have a full list of what's possible to serve\n\tret.DisableVersions(\n\t\tauditregistrationv1alpha1.SchemeGroupVersion,\n\t\tbatchapiv2alpha1.SchemeGroupVersion,\n\t\tnodev1alpha1.SchemeGroupVersion,\n\t\trbacv1alpha1.SchemeGroupVersion,\n\t\tschedulingv1alpha1.SchemeGroupVersion,\n\t\tsettingsv1alpha1.SchemeGroupVersion,\n\t\tstorageapiv1alpha1.SchemeGroupVersion,\n\t\tflowcontrolv1alpha1.SchemeGroupVersion,\n\t)\n\n\treturn ret\n}\n```\n\n#### 2.2 初始化 storageFactory\n\n这里分为了三步。\n\n第一步：NewStorageFactoryConfig。\n\n第二步：storageFactory, lastErr = completedStorageFactoryConfig.New()\n\n第三步：s.Etcd.ApplyWithStorageFactoryTo(storageFactory, genericConfig)\n\n<br>\n\n**第一步**，NewStorageFactoryConfig就是定义了一些编码解码方式，以及需要覆盖的资源\n\n```\n// NewStorageFactoryConfig returns a new StorageFactoryConfig set up with necessary resource overrides.\nfunc NewStorageFactoryConfig() *StorageFactoryConfig {\n\n\tresources := []schema.GroupVersionResource{\n\t\tbatch.Resource(\"cronjobs\").WithVersion(\"v1beta1\"),\n\t\tnetworking.Resource(\"ingresses\").WithVersion(\"v1beta1\"),\n\t\t// TODO #83513 csinodes override can be removed in 
1.18\n\t\tapisstorage.Resource(\"csinodes\").WithVersion(\"v1beta1\"),\n\t\tapisstorage.Resource(\"csidrivers\").WithVersion(\"v1beta1\"),\n\t}\n\n\treturn &StorageFactoryConfig{\n\t\tSerializer:                legacyscheme.Codecs,        //传统的编码解码\n\t\tDefaultResourceEncoding:   serverstorage.NewDefaultResourceEncodingConfig(legacyscheme.Scheme),\n\t\tResourceEncodingOverrides: resources,\n\t}\n}\n```\n\n<br>\n\n**第二步**，New 就是初始化了一个NewDefaultStorageFactory结构体。\n\n* 描述了如何创建到底层存储的连接，包含了各种存储接口storage.Interface实现的认证信息。\n\n例如默认使用etcd3，编码转换等方式。\n\n```\n// Config is configuration for creating a storage backend.\ntype Config struct {\n\t// Type defines the type of storage backend. Default (\"\") is \"etcd3\".\n\tType string\n\t// Prefix is the prefix to all keys passed to storage.Interface methods.\n\tPrefix string\n\t// Transport holds all connection related info, i.e. equal TransportConfig means equal servers we talk to.\n\tTransport TransportConfig\n\t// Paging indicates whether the server implementation should allow paging (if it is\n\t// supported). This is generally configured by feature gating, or by a specific\n\t// resource type not wishing to allow paging, and is not intended for end users to\n\t// set.\n\tPaging bool\n\n\tCodec runtime.Codec\n\t// EncodeVersioner is the same groupVersioner used to build the\n\t// storage encoder. 
Given a list of kinds the input object might belong\n\t// to, the EncodeVersioner outputs the gvk the object will be\n\t// converted to before persisted in etcd.\n\tEncodeVersioner runtime.GroupVersioner\n\t// Transformer allows the value to be transformed prior to persisting into etcd.\n\tTransformer value.Transformer\n\n\t// CompactionInterval is an interval of requesting compaction from apiserver.\n\t// If the value is 0, no compaction will be issued.\n\tCompactionInterval time.Duration\n\t// CountMetricPollPeriod specifies how often should count metric be updated\n\tCountMetricPollPeriod time.Duration\n}\n```\n\n以及其他的参数如下：\n\n```\n// New returns a new storage factory created from the completed storage factory configuration.\nfunc (c *completedStorageFactoryConfig) New() (*serverstorage.DefaultStorageFactory, error) {\n   resourceEncodingConfig := resourceconfig.MergeResourceEncodingConfigs(c.DefaultResourceEncoding, c.ResourceEncodingOverrides)\n   storageFactory := serverstorage.NewDefaultStorageFactory(\n      c.StorageConfig,      //描述了如何创建到底层存储的连接，包含了各种存储接口storage.Interface实现的认证信息。\n      c.DefaultStorageMediaType,  //数据格式，缺省存储媒介类型，application/json\n      c.Serializer,              //缺省序列化实例，legacyscheme.Codecs\n      resourceEncodingConfig,    // 资源编码配置\n      c.APIResourceConfig,    //API启用的资源版本\n      SpecialDefaultResourcePrefixes)  //前缀\n\n   // 同居资源绑定，约定了同居资源的查找顺序\n   storageFactory.AddCohabitatingResources(networking.Resource(\"networkpolicies\"), extensions.Resource(\"networkpolicies\"))\n   storageFactory.AddCohabitatingResources(apps.Resource(\"deployments\"), extensions.Resource(\"deployments\"))\n   storageFactory.AddCohabitatingResources(apps.Resource(\"daemonsets\"), extensions.Resource(\"daemonsets\"))\n   storageFactory.AddCohabitatingResources(apps.Resource(\"replicasets\"), extensions.Resource(\"replicasets\"))\n   storageFactory.AddCohabitatingResources(api.Resource(\"events\"), events.Resource(\"events\"))\n   
storageFactory.AddCohabitatingResources(api.Resource(\"replicationcontrollers\"), extensions.Resource(\"replicationcontrollers\")) // to make scale subresources equivalent\n   storageFactory.AddCohabitatingResources(policy.Resource(\"podsecuritypolicies\"), extensions.Resource(\"podsecuritypolicies\"))\n   storageFactory.AddCohabitatingResources(networking.Resource(\"ingresses\"), extensions.Resource(\"ingresses\"))\n\n  \n   for _, override := range c.EtcdServersOverrides {\n      tokens := strings.Split(override, \"#\")\n      apiresource := strings.Split(tokens[0], \"/\")\n\n      group := apiresource[0]\n      resource := apiresource[1]\n      groupResource := schema.GroupResource{Group: group, Resource: resource}\n\n      servers := strings.Split(tokens[1], \";\")\n      storageFactory.SetEtcdLocation(groupResource, servers)\n   }\n   if len(c.EncryptionProviderConfigFilepath) != 0 {\n      transformerOverrides, err := encryptionconfig.GetTransformerOverrides(c.EncryptionProviderConfigFilepath)\n      if err != nil {\n         return nil, err\n      }\n      for groupResource, transformer := range transformerOverrides {\n         storageFactory.SetTransformer(groupResource, transformer)\n      }\n   }\n   return storageFactory, nil\n}\n\n\n\nfunc NewDefaultStorageFactory(\n    config storagebackend.Config,\n    defaultMediaType string,                                // 从EtcdOptions参数中传入的，缺省为 application/json，见NewEtcdOptions方法\n    defaultSerializer runtime.StorageSerializer,    // 具体的值：legacyscheme.Codecs\n    resourceEncodingConfig ResourceEncodingConfig,  // 资源编码配置情况，并不是所有的资源都按照指定的Group来存放，有些特例。另外也可以指定存储在不同etcd、不同的prefix、甚至于不同的编码存储。\n    resourceConfig APIResourceConfigSource,  //  启用的资源版本的API情况\n    specialDefaultResourcePrefixes map[schema.GroupResource]string,  // 见：SpecialDefaultResourcePrefixes\n) *DefaultStorageFactory {\n    config.Paging = utilfeature.DefaultFeatureGate.Enabled(features.APIListChunking)\n    if len(defaultMediaType) == 0 {\n        
defaultMediaType = runtime.ContentTypeJSON\n    }\n    return &DefaultStorageFactory{\n        StorageConfig:           config, // 描述了如何创建到底层存储的连接，包含了各种存储接口storage.Interface实现的认证信息。\n        Overrides:               map[schema.GroupResource]groupResourceOverrides{}, // 特殊资源处理\n        DefaultMediaType:        defaultMediaType,  // 缺省存储媒介类型，application/json\n        DefaultSerializer:       defaultSerializer,        // 缺省序列化实例，legacyscheme.Codecs\n        ResourceEncodingConfig:  resourceEncodingConfig, // 资源编码配置\n        APIResourceConfigSource: resourceConfig,           // API启用的资源版本\n        DefaultResourcePrefixes: specialDefaultResourcePrefixes, // 特殊资源prefix\n\n        newStorageCodecFn: NewStorageCodec, // 为提供的存储媒介类型、序列化和请求的存储与内存版本组装一个存储codec\n    }\n}\n```\n\n<br>\n\n**第三步，**初始化 RESTOptionsGetter，后期根据其获取操作 Etcd 的句柄，同时添加 etcd 的健康检查方法\n\n```\nfunc (s *EtcdOptions) ApplyWithStorageFactoryTo(factory serverstorage.StorageFactory, c *server.Config) error {\n\tif err := s.addEtcdHealthEndpoint(c); err != nil {\n\t\treturn err\n\t}\n\tc.RESTOptionsGetter = &StorageFactoryRestOptionsFactory{Options: *s, StorageFactory: factory}\n\treturn nil\n}\n```\n\n<br>\n\n最终构建好的DefaultStorageFactory，会被存储在genericapiserver.Config的RESTOptionsGetter成员中，代码如上所示。\n\n<br>\n\n### 3. 
以pod为例, apiserver是如何add/del/update etcd资源的\n\n在创建KubeAPIServer的过程中，会调用InstallLegacyAPI注册api资源。其中就有一个NewLegacyRESTStorage的函数\n\n```\nfunc (c LegacyRESTStorageProvider) NewLegacyRESTStorage(restOptionsGetter generic.RESTOptionsGetter) (LegacyRESTStorage, genericapiserver.APIGroupInfo, error) {\n\n 。。。\n\tpodStorage, err := podstore.NewStorage(\n\t\trestOptionsGetter,              //这个就是之前的restOptionsGetter，里面有etcd的各种配置\n\t\tnodeStorage.KubeletConnectionInfo,\n\t\tc.ProxyTransport,\n\t\tpodDisruptionClient,\n\t)\n\n\tserviceRest, serviceRestProxy := servicestore.NewREST(serviceRESTStorage,\n\t\tendpointsStorage,\n\t\tpodStorage.Pod,\n\t\tserviceClusterIPAllocator,\n\t\tsecondaryServiceClusterIPAllocator,\n\t\tserviceNodePortAllocator,\n\t\tc.ProxyTransport)\n\n\trestStorageMap := map[string]rest.Storage{\n\t\t\"pods\":             podStorage.Pod,\n\t\t\"pods/attach\":      podStorage.Attach,\n\t\t\"pods/status\":      podStorage.Status,\n\t\t\"pods/log\":         podStorage.Log,\n\t\t\"pods/exec\":        podStorage.Exec,\n\t\t\"pods/portforward\": podStorage.PortForward,\n\t\t\"pods/proxy\":       podStorage.Proxy,\n\t\t\"pods/binding\":     podStorage.Binding,\n\t\t\"bindings\":         podStorage.LegacyBinding,\n\n\t\t\"podTemplates\": podTemplateStorage,\n\n\t\t\"replicationControllers\":        controllerStorage.Controller,\n\t\t\"replicationControllers/status\": controllerStorage.Status,\n\n\t\t\"services\":        serviceRest,\n\t\t\"services/proxy\":  serviceRestProxy,\n\t\t\"services/status\": serviceStatusStorage,\n\n\t\t\"endpoints\": endpointsStorage,\n\n\t\t\"nodes\":        nodeStorage.Node,\n\t\t\"nodes/status\": nodeStorage.Status,\n\t\t\"nodes/proxy\":  nodeStorage.Proxy,\n\n\t\t\"events\": eventStorage,\n\n\t\t\"limitRanges\":                   limitRangeStorage,\n\t\t\"resourceQuotas\":                resourceQuotaStorage,\n\t\t\"resourceQuotas/status\":         resourceQuotaStatusStorage,\n\t\t\"namespaces\":                    
namespaceStorage,\n\t\t\"namespaces/status\":             namespaceStatusStorage,\n\t\t\"namespaces/finalize\":           namespaceFinalizeStorage,\n\t\t\"secrets\":                       secretStorage,\n\t\t\"serviceAccounts\":               serviceAccountStorage,\n\t\t\"persistentVolumes\":             persistentVolumeStorage,\n\t\t\"persistentVolumes/status\":      persistentVolumeStatusStorage,\n\t\t\"persistentVolumeClaims\":        persistentVolumeClaimStorage,\n\t\t\"persistentVolumeClaims/status\": persistentVolumeClaimStatusStorage,\n\t\t\"configMaps\":                    configMapStorage,\n\n\t\t\"componentStatuses\": componentstatus.NewStorage(componentStatusStorage{c.StorageFactory}.serversToValidate),\n\t}\n\tif legacyscheme.Scheme.IsVersionRegistered(schema.GroupVersion{Group: \"autoscaling\", Version: \"v1\"}) {\n\t\trestStorageMap[\"replicationControllers/scale\"] = controllerStorage.Scale\n\t}\n\tif legacyscheme.Scheme.IsVersionRegistered(schema.GroupVersion{Group: \"policy\", Version: \"v1beta1\"}) {\n\t\trestStorageMap[\"pods/eviction\"] = podStorage.Eviction\n\t}\n\tif serviceAccountStorage.Token != nil {\n\t\trestStorageMap[\"serviceaccounts/token\"] = serviceAccountStorage.Token\n\t}\n\tif utilfeature.DefaultFeatureGate.Enabled(features.EphemeralContainers) {\n\t\trestStorageMap[\"pods/ephemeralcontainers\"] = podStorage.EphemeralContainers\n\t}\n\tapiGroupInfo.VersionedResourcesStorageMap[\"v1\"] = restStorageMap\n\n\treturn restStorage, apiGroupInfo, nil\n}\n```\n\n<br>\n\n#### 3.1 NewStorage\n\n```\n// NewStorage returns a RESTStorage object that will work against pods.\nfunc NewStorage(optsGetter generic.RESTOptionsGetter, k client.ConnectionInfoGetter, proxyTransport http.RoundTripper, podDisruptionBudgetClient policyclient.PodDisruptionBudgetsGetter) (PodStorage, error) {\n\n\tstore := &genericregistry.Store{\n\t\tNewFunc:                  func() runtime.Object { return &api.Pod{} },    //NewFunc用于构建一个Pod实例\n\t\tNewListFunc:              
func() runtime.Object { return &api.PodList{} },\n\t\tPredicateFunc:            pod.MatchPod,\n\t\tDefaultQualifiedResource: api.Resource(\"pods\"),\n    \n    // 关键点1，pod.Strategy\n\t\tCreateStrategy:      pod.Strategy,  \n\t\tUpdateStrategy:      pod.Strategy,\n\t\tDeleteStrategy:      pod.Strategy,\n\t\tReturnDeletedObject: true,\n\n\t\tTableConvertor: printerstorage.TableConvertor{TableGenerator: printers.NewTableGenerator().With(printersinternal.AddHandlers)},\n\t}\n\toptions := &generic.StoreOptions{\n\t\tRESTOptions: optsGetter,\n\t\tAttrFunc:    pod.GetAttrs,\n\t\tTriggerFunc: map[string]storage.IndexerFunc{\"spec.nodeName\": pod.NodeNameTriggerFunc},\n\t}\n\t// 关键点2，CompleteWithOptions\n\tif err := store.CompleteWithOptions(options); err != nil {\n\t\treturn PodStorage{}, err\n\t}\n\n\tstatusStore := *store\n\tstatusStore.UpdateStrategy = pod.StatusStrategy\n\tephemeralContainersStore := *store\n\tephemeralContainersStore.UpdateStrategy = pod.EphemeralContainersStrategy\n\n\tbindingREST := &BindingREST{store: store}\n\treturn PodStorage{\n\t\tPod:                 &REST{store, proxyTransport},\n\t\tBinding:             &BindingREST{store: store},\n\t\tLegacyBinding:       &LegacyBindingREST{bindingREST},\n\t\tEviction:            newEvictionStorage(store, podDisruptionBudgetClient),\n\t\tStatus:              &StatusREST{store: &statusStore},\n\t\tEphemeralContainers: &EphemeralContainersREST{store: &ephemeralContainersStore},\n\t\tLog:                 &podrest.LogREST{Store: store, KubeletConn: k},\n\t\tProxy:               &podrest.ProxyREST{Store: store, ProxyTransport: proxyTransport},\n\t\tExec:                &podrest.ExecREST{Store: store, KubeletConn: k},\n\t\tAttach:              &podrest.AttachREST{Store: store, KubeletConn: k},\n\t\tPortForward:         &podrest.PortForwardREST{Store: store, KubeletConn: k},\n\t}, nil\n}\n\n```\n\n上面代码的关键，就是store对象的创建，store.Storage的类型为：storage.Interface接口。\n\n这里有两个关键点，pod.Strategy和CompleteWithOptions函数。\n\n#### 
3.2 pod.Strategy\n\n可以看出来，podStrategy承载了每一种资源storage各自独特的逻辑。比如NamespaceScoped函数表示了这个资源是否有namespace的概念，这决定了url中是否有namespace前缀。\n\nPrepareForCreate对接收到的pod进行了status的初始化：通过kubectl create创建pod时，obj只包含用户提交的信息（如spec），PrepareForCreate函数负责初始化status。\n\n```\n// podStrategy implements behavior for Pods\ntype podStrategy struct {\n\truntime.ObjectTyper\n\tnames.NameGenerator\n}\n\n// Strategy is the default logic that applies when creating and updating Pod\n// objects via the REST API.\nvar Strategy = podStrategy{legacyscheme.Scheme, names.SimpleNameGenerator}\n\n// NamespaceScoped is true for pods.\nfunc (podStrategy) NamespaceScoped() bool {\n\treturn true\n}\n\n// PrepareForCreate clears fields that are not allowed to be set by end users on creation.\nfunc (podStrategy) PrepareForCreate(ctx context.Context, obj runtime.Object) {\n\tpod := obj.(*api.Pod)\n\tpod.Status = api.PodStatus{\n\t\tPhase:    api.PodPending,\n\t\tQOSClass: qos.GetPodQOS(pod),\n\t}\n\n\tpodutil.DropDisabledPodFields(pod, nil)\n}\n\n// PrepareForUpdate clears fields that are not allowed to be set by end users on update.\nfunc (podStrategy) PrepareForUpdate(ctx context.Context, obj, old runtime.Object) {\n\tnewPod := obj.(*api.Pod)\n\toldPod := old.(*api.Pod)\n\tnewPod.Status = oldPod.Status\n\n\tpodutil.DropDisabledPodFields(newPod, oldPod)\n}\n\n// Validate validates a new pod.\nfunc (podStrategy) Validate(ctx context.Context, obj runtime.Object) field.ErrorList {\n\tpod := obj.(*api.Pod)\n\tallErrs := validation.ValidatePodCreate(pod)\n\tallErrs = append(allErrs, validation.ValidateConditionalPod(pod, nil, field.NewPath(\"\"))...)\n\treturn allErrs\n}\n\n// Canonicalize normalizes the object after validation.\nfunc (podStrategy) Canonicalize(obj runtime.Object) {\n}\n。。。\n```\n\n<br>\n\n#### 3.3 CompleteWithOptions\n\n`store.CompleteWithOptions` 主要功能是为 store 中的配置设置一些默认的值以及根据提供的 options 更新 store，其中最主要的就是初始化 store 的后端存储实例。CompleteWithOptions函数如下：\n\n```\n// CompleteWithOptions updates the store with 
the provided options and\n// defaults common fields.\nfunc (e *Store) CompleteWithOptions(options *generic.StoreOptions) error {\n\t。。。省略了一些检查代码。。。\n\t\n\tattrFunc := options.AttrFunc\n\tif attrFunc == nil {\n\t\tif isNamespaced {\n\t\t\tattrFunc = storage.DefaultNamespaceScopedAttr\n\t\t} else {\n\t\t\tattrFunc = storage.DefaultClusterScopedAttr\n\t\t}\n\t}\n\tif e.PredicateFunc == nil {\n\t\te.PredicateFunc = func(label labels.Selector, field fields.Selector) storage.SelectionPredicate {\n\t\t\treturn storage.SelectionPredicate{\n\t\t\t\tLabel:    label,\n\t\t\t\tField:    field,\n\t\t\t\tGetAttrs: attrFunc,\n\t\t\t}\n\t\t}\n\t}\n\n  // GetRESTOptions对etcd进行了初始化。\n\topts, err := options.RESTOptions.GetRESTOptions(e.DefaultQualifiedResource)\n\tif err != nil {\n\t\treturn err\n\t}\n\n\t// ResourcePrefix must come from the underlying factory\n\tprefix := opts.ResourcePrefix\n\tif !strings.HasPrefix(prefix, \"/\") {\n\t\tprefix = \"/\" + prefix\n\t}\n\tif prefix == \"/\" {\n\t\treturn fmt.Errorf(\"store for %s has an invalid prefix %q\", e.DefaultQualifiedResource.String(), opts.ResourcePrefix)\n\t}\n\n\t// Set the default behavior for storage key generation\n\tif e.KeyRootFunc == nil && e.KeyFunc == nil {\n\t\tif isNamespaced {\n\t\t\te.KeyRootFunc = func(ctx context.Context) string {\n\t\t\t\treturn NamespaceKeyRootFunc(ctx, prefix)\n\t\t\t}\n\t\t\te.KeyFunc = func(ctx context.Context, name string) (string, error) {\n\t\t\t\treturn NamespaceKeyFunc(ctx, prefix, name)\n\t\t\t}\n\t\t} else {\n\t\t\te.KeyRootFunc = func(ctx context.Context) string {\n\t\t\t\treturn prefix\n\t\t\t}\n\t\t\te.KeyFunc = func(ctx context.Context, name string) (string, error) {\n\t\t\t\treturn NoNamespaceKeyFunc(ctx, prefix, name)\n\t\t\t}\n\t\t}\n\t}\n\n\t// We adapt the store's keyFunc so that we can use it with the StorageDecorator\n\t// without making any assumptions about where objects are stored in etcd\n\tkeyFunc := func(obj runtime.Object) (string, error) {\n\t\taccessor, err := 
meta.Accessor(obj)\n\t\tif err != nil {\n\t\t\treturn \"\", err\n\t\t}\n\n\t\tif isNamespaced {\n\t\t\treturn e.KeyFunc(genericapirequest.WithNamespace(genericapirequest.NewContext(), accessor.GetNamespace()), accessor.GetName())\n\t\t}\n\n\t\treturn e.KeyFunc(genericapirequest.NewContext(), accessor.GetName())\n\t}\n\n\tif e.DeleteCollectionWorkers == 0 {\n\t\te.DeleteCollectionWorkers = opts.DeleteCollectionWorkers\n\t}\n\n\te.EnableGarbageCollection = opts.EnableGarbageCollection\n\n\tif e.ObjectNameFunc == nil {\n\t\te.ObjectNameFunc = func(obj runtime.Object) (string, error) {\n\t\t\taccessor, err := meta.Accessor(obj)\n\t\t\tif err != nil {\n\t\t\t\treturn \"\", err\n\t\t\t}\n\t\t\treturn accessor.GetName(), nil\n\t\t}\n\t}\n\n\tif e.Storage.Storage == nil {\n\t\te.Storage.Codec = opts.StorageConfig.Codec\n\t\tvar err error\n\t\te.Storage.Storage, e.DestroyFunc, err = opts.Decorator(\n\t\t\topts.StorageConfig,\n\t\t\tprefix,\n\t\t\tkeyFunc,\n\t\t\te.NewFunc,\n\t\t\te.NewListFunc,\n\t\t\tattrFunc,\n\t\t\toptions.TriggerFunc,\n\t\t)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\te.StorageVersioner = opts.StorageConfig.EncodeVersioner\n\n\t\tif opts.CountMetricPollPeriod > 0 {\n\t\t\tstopFunc := e.startObservingCount(opts.CountMetricPollPeriod)\n\t\t\tpreviousDestroy := e.DestroyFunc\n\t\t\te.DestroyFunc = func() {\n\t\t\t\tstopFunc()\n\t\t\t\tif previousDestroy != nil {\n\t\t\t\t\tpreviousDestroy()\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\n\treturn nil\n}\n```\n\n在`CompleteWithOptions`方法内，调用了`options.RESTOptions.GetRESTOptions` 方法，其最终返回`generic.RESTOptions` 对象，`generic.RESTOptions` 对象中包含对 etcd 初始化的一些配置、数据序列化方法以及对 etcd 操作的 storage.Interface 对象。其会依次调用`StorageWithCacher-->NewRawStorage-->Create`方法创建最终依赖的后端存储。\n\n```\nfunc (f *StorageFactoryRestOptionsFactory) GetRESTOptions(resource schema.GroupResource) (generic.RESTOptions, error) {\n\tstorageConfig, err := f.StorageFactory.NewConfig(resource)\n\tif err != nil {\n\t\treturn generic.RESTOptions{}, 
fmt.Errorf(\"unable to find storage destination for %v, due to %v\", resource, err.Error())\n\t}\n\n\tret := generic.RESTOptions{\n\t\tStorageConfig:           storageConfig,\n\t\tDecorator:               generic.UndecoratedStorage,\n\t\tDeleteCollectionWorkers: f.Options.DeleteCollectionWorkers,\n\t\tEnableGarbageCollection: f.Options.EnableGarbageCollection,\n\t\tResourcePrefix:          f.StorageFactory.ResourcePrefix(resource),\n\t\tCountMetricPollPeriod:   f.Options.StorageConfig.CountMetricPollPeriod,\n\t}\n\tif f.Options.EnableWatchCache {\n\t\tsizes, err := ParseWatchCacheSizes(f.Options.WatchCacheSizes)\n\t\tif err != nil {\n\t\t\treturn generic.RESTOptions{}, err\n\t\t}\n\t\tcacheSize, ok := sizes[resource]\n\t\tif !ok {\n\t\t\tcacheSize = f.Options.DefaultWatchCacheSize\n\t\t}\n\t\t// depending on cache size this might return an undecorated storage\n\t\tret.Decorator = genericregistry.StorageWithCacher(cacheSize)\n\t}\n\n\treturn ret, nil\n}\n\n// NewRawStorage creates the low level kv storage. 
This is a work-around for current\n// two layer of same storage interface.\n// TODO: Once cacher is enabled on all registries (event registry is special), we will remove this method.\nfunc NewRawStorage(config *storagebackend.Config) (storage.Interface, factory.DestroyFunc, error) {\n\treturn factory.Create(*config)\n}\n\n\n// Create creates a storage backend based on given config.\nfunc Create(c storagebackend.Config) (storage.Interface, DestroyFunc, error) {\n\tswitch c.Type {\n\tcase \"etcd2\":\n\t\treturn nil, nil, fmt.Errorf(\"%v is no longer a supported storage backend\", c.Type)\n\tcase storagebackend.StorageTypeUnset, storagebackend.StorageTypeETCD3:\n\t\treturn newETCD3Storage(c)\n\tdefault:\n\t\treturn nil, nil, fmt.Errorf(\"unknown storage type: %s\", c.Type)\n\t}\n}\n```\n\n<br>\n\n#### 3.4 总结\n\nPod对象需要定义好NewStorage，并在NewStorage中定义Strategy，其中包含了创建、更新前的各种操作。这样的好处是，Storage这一层封装好了每个资源的差异性，下层的etcd对到来的数据只管做增删改查即可。\n\n同时通过调用CompleteWithOptions函数和etcd进行了打通。\n\n<br>\n\n### 4. 总结\n\nk8s中完整的etcd框架如下图所示。这里先分析到storage.Interface，了解一个大体流程，后面有需要再深入。\n\n\n\n![etcd struct](../images/etcd struct.png)\n\n\n\n### 5.参考链接：\n\nhttps://www.jianshu.com/p/daa4ff387a78\n\n书籍：《Kubernetes源码剖析》，郑东旭"
  },
  {
    "path": "k8s/kube-apiserver/16. 创建更新删除资源时apiserver做了什么工作.md",
    "content": "* [1\\. 简介](#1-简介)\n* [2\\. 流程介绍](#2-流程介绍)\n* [3\\. pod创建](#3-pod创建)\n  * [3\\.1 pod create 前端逻辑](#31-pod-create-前端逻辑)\n  * [3\\.2 pod创建\\-后端逻辑](#32-pod创建-后端逻辑)\n    * [3\\.2\\.1 BeforeCreate函数](#321-beforecreate函数)\n    * [3\\.2\\.2 Create函数](#322-create函数)\n  * [3\\.3 总结](#33-总结)\n* [4\\. Pod 删除](#4-pod-删除)\n  * [4\\.1 Delete](#41-delete)\n  * [4\\.2 BeforeDelete](#42-beforedelete)\n  * [4\\.3 updateForGracefulDeletionAndFinalizers](#43-updateforgracefuldeletionandfinalizers)\n  * [4\\.4 总结](#44-总结)\n* [5\\.参考](#5参考)\n\n### 1. 简介\n\n目前只剩下请求经过kube-apiserver的webhook之后、存入etcd之前的这段操作没有分析了。这里以pod为例介绍一下。\n\n同时本文也可用于快速定位创建、删除、更新、get某个资源时apiserver做了什么操作。\n\n<br>\n\n首先再次回顾之前文章分析过的apiserver初始化逻辑：\n\nInstallLegacyAPI函数的执行过程分为两步:\n\n**第一步：**通过legacyRESTStorageProvider.NewLegacyRESTStorage函数实例化APIGroupInfo，APIGroupInfo对象用于描述资源组信息，该对象的VersionedResourcesStorageMap字段用于存储资源与资源存储对象的映射关系，其表现形式为map[string]map[string]rest.Storage （即<资源版本>/<资源>/<资源存储对象>），例如Pod资源与资源存储对象的映射关系是v1/pods/PodStorage。该函数使Core Groups/v1下的资源与资源存储对象相互映射，代码路径：pkg/registry/core/rest/storage_core.go\n\n```\n\n  // storage就是将url和处理函数进行了绑定\n\trestStorageMap := map[string]rest.Storage{\n\t\t\"pods\":             podStorage.Pod,\n\t\t\"pods/attach\":      podStorage.Attach,\n\t\t\"pods/status\":      podStorage.Status,\n\t\t\"pods/log\":         podStorage.Log,\n\t\t\"pods/exec\":        podStorage.Exec,\n\t\t\"pods/portforward\": podStorage.PortForward,\n\t\t\"pods/proxy\":       podStorage.Proxy,\n\t\t\"pods/binding\":     podStorage.Binding,\n\t\t\"bindings\":         podStorage.LegacyBinding,\n\n\t\t\"podTemplates\": podTemplateStorage,\n\n\t\t\"replicationControllers\":        controllerStorage.Controller,\n\t\t\"replicationControllers/status\": controllerStorage.Status,\n\n\t\t\"services\":        serviceRest,\n\t\t\"services/proxy\":  serviceRestProxy,\n\t\t\"services/status\": serviceStatusStorage,\n\n\t\t\"endpoints\": endpointsStorage,\n\n\t\t\"nodes\":        
nodeStorage.Node,\n\t\t\"nodes/status\": nodeStorage.Status,\n\t\t\"nodes/proxy\":  nodeStorage.Proxy,\n\n\t\t\"events\": eventStorage,\n\n\t\t\"limitRanges\":                   limitRangeStorage,\n\t\t\"resourceQuotas\":                resourceQuotaStorage,\n\t\t\"resourceQuotas/status\":         resourceQuotaStatusStorage,\n\t\t\"namespaces\":                    namespaceStorage,\n\t\t\"namespaces/status\":             namespaceStatusStorage,\n\t\t\"namespaces/finalize\":           namespaceFinalizeStorage,\n\t\t\"secrets\":                       secretStorage,\n\t\t\"serviceAccounts\":               serviceAccountStorage,\n\t\t\"persistentVolumes\":             persistentVolumeStorage,\n\t\t\"persistentVolumes/status\":      persistentVolumeStatusStorage,\n\t\t\"persistentVolumeClaims\":        persistentVolumeClaimStorage,\n\t\t\"persistentVolumeClaims/status\": persistentVolumeClaimStatusStorage,\n\t\t\"configMaps\":                    configMapStorage,\n\n\t\t\"componentStatuses\": componentstatus.NewStorage(componentStatusStorage{c.StorageFactory}.serversToValidate),\n\t}\n\tif legacyscheme.Scheme.IsVersionRegistered(schema.GroupVersion{Group: \"autoscaling\", Version: \"v1\"}) {\n\t\trestStorageMap[\"replicationControllers/scale\"] = controllerStorage.Scale\n\t}\n\tif legacyscheme.Scheme.IsVersionRegistered(schema.GroupVersion{Group: \"policy\", Version: \"v1beta1\"}) {\n\t\trestStorageMap[\"pods/eviction\"] = podStorage.Eviction\n\t}\n\tif serviceAccountStorage.Token != nil {\n\t\trestStorageMap[\"serviceaccounts/token\"] = serviceAccountStorage.Token\n\t}\n\tif utilfeature.DefaultFeatureGate.Enabled(features.EphemeralContainers) {\n\t\trestStorageMap[\"pods/ephemeralcontainers\"] = podStorage.EphemeralContainers\n\t}\n```\n\n<br>\n\n**第二步：** 创建完上面的路由之后，则开始进行路由的安装，执行`InstallLegacyAPIGroup`方法，主要调用链为`InstallLegacyAPIGroup-->installAPIResources-->InstallREST-->Install-->registerResourceHandlers`，最终核心的路由构造在`registerResourceHandlers`方法内。\n\n```\n\n// Install 
handlers for API resources.\nfunc (a *APIInstaller) Install() ([]metav1.APIResource, *restful.WebService, []error) {\n\tvar apiResources []metav1.APIResource\n\tvar errors []error\n\tws := a.newWebService()\n\n\t// Register the paths in a deterministic (sorted) order to get a deterministic swagger spec.\n\tpaths := make([]string, len(a.group.Storage))\n\tvar i int = 0\n\tfor path := range a.group.Storage {\n\t\tpaths[i] = path\n\t\ti++\n\t}\n\tsort.Strings(paths)\n\tfor _, path := range paths {\n\t\tapiResource, err := a.registerResourceHandlers(path, a.group.Storage[path], ws)\n\t\tif err != nil {\n\t\t\terrors = append(errors, fmt.Errorf(\"error in registering resource: %s, %v\", path, err))\n\t\t}\n\t\tif apiResource != nil {\n\t\t\tapiResources = append(apiResources, *apiResource)\n\t\t}\n\t}\n\treturn apiResources, ws, errors\n}\n```\n\nInstall方法先创建了一个WebService，然后将所有的api路径都存入数组paths，并对该数组排序（sort）。接着利用for range遍历数组的所有元素，调用registerResourceHandlers方法来对每个api路径注册，也就是和对应的storage以及WebService绑定。\n\n这里的storage指的是后端etcd的存储。storage变量是个map，Key是REST API的path，Value是rest.Storage接口，该接口就是一个通用的符合Restful要求的资源存储接口。\n<br>\n\n注意每个api路径的注册都会调用registerResourceHandlers。\n\nregisterResourceHandlers 函数很长，定义在：staging/src/k8s.io/apiserver/pkg/endpoints/installer.go\n\n代码不贴出来了。具体逻辑为：\n\n（1）首先对资源的后端存储storage（etcd的存储）进行验证，判断哪些方法是storage所支持的，然后将所有支持的方法存入actions数组中。比如判断是否支持create, get, list, watch, patch等动作。\n\n```\n\tcreater, isCreater := storage.(rest.Creater)\n\tnamedCreater, isNamedCreater := storage.(rest.NamedCreater)\n\tlister, isLister := storage.(rest.Lister)\n\tgetter, isGetter := storage.(rest.Getter)\n\tgetterWithOptions, isGetterWithOptions := storage.(rest.GetterWithOptions)\n\tgracefulDeleter, isGracefulDeleter := storage.(rest.GracefulDeleter)\n\tcollectionDeleter, isCollectionDeleter := storage.(rest.CollectionDeleter)\n\tupdater, isUpdater := storage.(rest.Updater)\n\tpatcher, isPatcher := storage.(rest.Patcher)\n\twatcher, isWatcher := 
storage.(rest.Watcher)\n\tconnecter, isConnecter := storage.(rest.Connecter)\n\tstorageMeta, isMetadata := storage.(rest.StorageMetadata)\n```\n\n（2）然后，遍历actions数组，在一个switch语句中，为所有元素定义路由。如贴出的case \"GET\"这一块，首先创建并包装一个handler对象，然后调用WebService的一系列方法，创建一个route对象，将handler绑定到这个route上。后面还有case \"PUT\"、case \"DELETE\"等一系列case，不一一贴出。最后，将route加入routes数组中。\n\n```\n{\n\t\tcase \"GET\": // Get a resource.\n\t\t...\n\t\tcase \"LIST\": // List all resources of a kind.\n\t\t...\n\t\t}\n```\n\n### 2. 流程介绍\n\n上面是先注册了Storage，然后再实例化路由。这样每个资源的增删改查，就和路径对应上了。\n\n然后通过registerResourceHandlers函数为每个资源的增删改查绑定后端处理函数。\n\n注意上面的case \"GET\", case \"LIST\"等都是通用的rest入口，最终会调用每个对象storage的处理函数。具体某个对象的storage处理逻辑如下：\n\n![image-20220517170229750](../images/apiserver-14.png)\n\n接下来以pod为例来说明。\n\n### 3. pod创建\n\n#### 3.1 pod create 前端逻辑\n\ncreate 对应的是post方法，可以看到核心函数就是createHandler（staging/src/k8s.io/apiserver/pkg/endpoints/handlers/create.go）。函数逻辑如下：\n\n（1）如果是dryRun，并且不支持dryRun就退出\n\n（2）经历decode，admission，validation以及encode的流程\n\n（3）调用 r.Create 完成某个资源对象的storage处理，这一步进入后端与etcd交互的处理；前面的1、2都是apiserver自身的逻辑处理\n\n```\n\tcase \"POST\": // Create a resource.\n\t\t\tvar handler restful.RouteFunction\n\t\t\tif isNamedCreater {\n\t\t\t\thandler = restfulCreateNamedResource(namedCreater, reqScope, admit)\n\t\t\t} else {\n\t\t\t\thandler = restfulCreateResource(creater, reqScope, admit)\n\t\t\t}\n\t\t\thandler = metrics.InstrumentRouteFunc(action.Verb, group, version, resource, subresource, requestScope, metrics.APIServerComponent, handler)\n\t\t\tarticle := GetArticleForNoun(kind, \" \")\n\t\t\tdoc := \"create\" + article + kind\n\t\t\tif isSubresource {\n\t\t\t\tdoc = \"create \" + subresource + \" of\" + article + kind\n\t\t\t}\n\t\t\troute := ws.POST(action.Path).To(handler).\n\t\t\t\tDoc(doc).\n\t\t\t\tParam(ws.QueryParameter(\"pretty\", \"If 'true', then the output is pretty 
printed.\")).\n\t\t\t\tOperation(\"create\"+namespaced+kind+strings.Title(subresource)+operationSuffix).\n\t\t\t\tProduces(append(storageMeta.ProducesMIMETypes(action.Verb), mediaTypes...)...).\n\t\t\t\tReturns(http.StatusOK, \"OK\", producedObject).\n\t\t\t\t// TODO: in some cases, the API may return a v1.Status instead of the versioned object\n\t\t\t\t// but currently go-restful can't handle multiple different objects being returned.\n\t\t\t\tReturns(http.StatusCreated, \"Created\", producedObject).\n\t\t\t\tReturns(http.StatusAccepted, \"Accepted\", producedObject).\n\t\t\t\tReads(defaultVersionedObject).\n\t\t\t\tWrites(producedObject)\n\t\t\tif err := AddObjectParams(ws, route, versionedCreateOptions); err != nil {\n\t\t\t\treturn nil, err\n\t\t\t}\n\t\t\taddParams(route, action.Params)\n\t\t\troutes = append(routes, route)\n\t\t\t\n\t\t\t\n\t\t\t\nfunc restfulCreateNamedResource(r rest.NamedCreater, scope handlers.RequestScope, admit admission.Interface) restful.RouteFunction {\n\treturn func(req *restful.Request, res *restful.Response) {\n\t\thandlers.CreateNamedResource(r, &scope, admit)(res.ResponseWriter, req.Request)\n\t}\n}\n\n\n// CreateNamedResource returns a function that will handle a resource creation with name.\nfunc CreateNamedResource(r rest.NamedCreater, scope *RequestScope, admission admission.Interface) http.HandlerFunc {\n\treturn createHandler(r, scope, admission, true)\n}\n\n\n// 核心函数createHandler\nfunc createHandler(r rest.NamedCreater, scope *RequestScope, admit admission.Interface, includeName bool) http.HandlerFunc {\n\treturn func(w http.ResponseWriter, req *http.Request) {\n\t\t// For performance tracking purposes.\n\t\ttrace := utiltrace.New(\"Create\", utiltrace.Field{Key: \"url\", Value: req.URL.Path}, utiltrace.Field{Key: \"user-agent\", Value: &lazyTruncatedUserAgent{req}}, utiltrace.Field{Key: \"client\", Value: &lazyClientIP{req}})\n\t\tdefer trace.LogIfLong(500 * time.Millisecond)\n\n\t\tif isDryRun(req.URL) && 
!utilfeature.DefaultFeatureGate.Enabled(features.DryRun) {\n\t\t\tscope.err(errors.NewBadRequest(\"the dryRun alpha feature is disabled\"), w, req)\n\t\t\treturn\n\t\t}\n\n\t\t// TODO: we either want to remove timeout or document it (if we document, move timeout out of this function and declare it in api_installer)\n\t\ttimeout := parseTimeout(req.URL.Query().Get(\"timeout\"))\n\n\t\tnamespace, name, err := scope.Namer.Name(req)\n\t\tif err != nil {\n\t\t\tif includeName {\n\t\t\t\t// name was required, return\n\t\t\t\tscope.err(err, w, req)\n\t\t\t\treturn\n\t\t\t}\n\n\t\t\t// otherwise attempt to look up the namespace\n\t\t\tnamespace, err = scope.Namer.Namespace(req)\n\t\t\tif err != nil {\n\t\t\t\tscope.err(err, w, req)\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\n\t\tctx, cancel := context.WithTimeout(req.Context(), timeout)\n\t\tdefer cancel()\n\t\tctx = request.WithNamespace(ctx, namespace)\n\t\toutputMediaType, _, err := negotiation.NegotiateOutputMediaType(req, scope.Serializer, scope)\n\t\tif err != nil {\n\t\t\tscope.err(err, w, req)\n\t\t\treturn\n\t\t}\n\n\t\tgv := scope.Kind.GroupVersion()\n\t\ts, err := negotiation.NegotiateInputSerializer(req, false, scope.Serializer)\n\t\tif err != nil {\n\t\t\tscope.err(err, w, req)\n\t\t\treturn\n\t\t}\n\n\t\tdecoder := scope.Serializer.DecoderToVersion(s.Serializer, scope.HubGroupVersion)\n\n\t\tbody, err := limitedReadBody(req, scope.MaxRequestBodyBytes)\n\t\tif err != nil {\n\t\t\tscope.err(err, w, req)\n\t\t\treturn\n\t\t}\n\n\t\toptions := &metav1.CreateOptions{}\n\t\tvalues := req.URL.Query()\n\t\tif err := metainternalversionscheme.ParameterCodec.DecodeParameters(values, scope.MetaGroupVersion, options); err != nil {\n\t\t\terr = errors.NewBadRequest(err.Error())\n\t\t\tscope.err(err, w, req)\n\t\t\treturn\n\t\t}\n\t\tif errs := validation.ValidateCreateOptions(options); len(errs) > 0 {\n\t\t\terr := errors.NewInvalid(schema.GroupKind{Group: metav1.GroupName, Kind: \"CreateOptions\"}, \"\", 
errs)\n\t\t\tscope.err(err, w, req)\n\t\t\treturn\n\t\t}\n\t\toptions.TypeMeta.SetGroupVersionKind(metav1.SchemeGroupVersion.WithKind(\"CreateOptions\"))\n\n\t\tdefaultGVK := scope.Kind\n\t\toriginal := r.New()\n\t\ttrace.Step(\"About to convert to expected version\")\n\t\tobj, gvk, err := decoder.Decode(body, &defaultGVK, original)\n\t\tif err != nil {\n\t\t\terr = transformDecodeError(scope.Typer, err, original, gvk, body)\n\t\t\tscope.err(err, w, req)\n\t\t\treturn\n\t\t}\n\t\tif gvk.GroupVersion() != gv {\n\t\t\terr = errors.NewBadRequest(fmt.Sprintf(\"the API version in the data (%s) does not match the expected API version (%v)\", gvk.GroupVersion().String(), gv.String()))\n\t\t\tscope.err(err, w, req)\n\t\t\treturn\n\t\t}\n\t\ttrace.Step(\"Conversion done\")\n\n\t\tae := request.AuditEventFrom(ctx)\n\t\tadmit = admission.WithAudit(admit, ae)\n\t\taudit.LogRequestObject(ae, obj, scope.Resource, scope.Subresource, scope.Serializer)\n\n\t\tuserInfo, _ := request.UserFrom(ctx)\n\n\t\t// On create, get name from new object if unset\n\t\tif len(name) == 0 {\n\t\t\t_, name, _ = scope.Namer.ObjectName(obj)\n\t\t}\n\t\tadmissionAttributes := admission.NewAttributesRecord(obj, nil, scope.Kind, namespace, name, scope.Resource, scope.Subresource, admission.Create, options, dryrun.IsDryRun(options.DryRun), userInfo)\n\t\tif mutatingAdmission, ok := admit.(admission.MutationInterface); ok && mutatingAdmission.Handles(admission.Create) {\n\t\t\terr = mutatingAdmission.Admit(ctx, admissionAttributes, scope)\n\t\t\tif err != nil {\n\t\t\t\tscope.err(err, w, req)\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\n\t\tif scope.FieldManager != nil {\n\t\t\tliveObj, err := scope.Creater.New(scope.Kind)\n\t\t\tif err != nil {\n\t\t\t\tscope.err(fmt.Errorf(\"failed to create new object (Create for %v): %v\", scope.Kind, err), w, req)\n\t\t\t\treturn\n\t\t\t}\n\n\t\t\tobj, err = scope.FieldManager.Update(liveObj, obj, managerOrUserAgent(options.FieldManager, req.UserAgent()))\n\t\t\tif err != nil 
{\n\t\t\t\tscope.err(fmt.Errorf(\"failed to update object (Create for %v) managed fields: %v\", scope.Kind, err), w, req)\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\n\t\ttrace.Step(\"About to store object in database\")\n\t\tresult, err := finishRequest(timeout, func() (runtime.Object, error) {\n\t\t\treturn r.Create(\n\t\t\t\tctx,\n\t\t\t\tname,\n\t\t\t\tobj,\n\t\t\t\trest.AdmissionToValidateObjectFunc(admit, admissionAttributes, scope),\n\t\t\t\toptions,\n\t\t\t)\n\t\t})\n\t\tif err != nil {\n\t\t\tscope.err(err, w, req)\n\t\t\treturn\n\t\t}\n\t\ttrace.Step(\"Object stored in database\")\n\n\t\tcode := http.StatusCreated\n\t\tstatus, ok := result.(*metav1.Status)\n\t\tif ok && err == nil && status.Code == 0 {\n\t\t\tstatus.Code = int32(code)\n\t\t}\n\n\t\ttransformResponseObject(ctx, scope, trace, req, w, code, outputMediaType, result)\n\t}\n}\n```\n\n<br>\n\n#### 3.2 pod创建-后端逻辑\n\n**创建pod特有逻辑**: r.Create。从之前的调用链可以看出来，当资源为pod时，e.CreateStrategy=podStrategy。\n\nCreate 这里的逻辑是：\n\n（1）调用BeforeCreate做创建之前的工作，详见3.2.1\n\n（2）得到对象的名字以及key，这个也是对象特有的\n\n（3）调用Storage.Create开始创建对象\n\n（4）创建对象后，如果该对象实现了AfterCreate, 再走AfterCreate逻辑，pod没有实现\n\n（5）创建对象后，如果该对象实现了Decorator装饰, 再走Decorator逻辑，pod没有实现\n\n```\n// Create inserts a new item according to the unique key from the object.\nfunc (e *Store) Create(ctx context.Context, obj runtime.Object, createValidation rest.ValidateObjectFunc, options *metav1.CreateOptions) (runtime.Object, error) {\n  // 1. 调用BeforeCreate做创建之前的工作，详见3.2.1\n\tif err := rest.BeforeCreate(e.CreateStrategy, ctx, obj); err != nil {\n\t\treturn nil, err\n\t}\n\t// at this point we have a fully formed object.  
It is time to call the validators that the apiserver\n\t// handling chain wants to enforce.\n\tif createValidation != nil {\n\t\tif err := createValidation(ctx, obj.DeepCopyObject()); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n  \n  // 2.得到对象的名字以及key，这个也是对象特有的\n\tname, err := e.ObjectNameFunc(obj)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tkey, err := e.KeyFunc(ctx, name)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tqualifiedResource := e.qualifiedResourceFromContext(ctx)\n\tttl, err := e.calculateTTL(obj, 0, false)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\t\n\t// 3. 调用Storage.Create开始创建对象\n\tout := e.NewFunc()\n\tif err := e.Storage.Create(ctx, key, obj, out, ttl, dryrun.IsDryRun(options.DryRun)); err != nil {\n\t\terr = storeerr.InterpretCreateError(err, qualifiedResource, name)\n\t\terr = rest.CheckGeneratedNameError(e.CreateStrategy, err, obj)\n\t\tif !kubeerr.IsAlreadyExists(err) {\n\t\t\treturn nil, err\n\t\t}\n\t\tif errGet := e.Storage.Get(ctx, key, \"\", out, false); errGet != nil {\n\t\t\treturn nil, err\n\t\t}\n\t\taccessor, errGetAcc := meta.Accessor(out)\n\t\tif errGetAcc != nil {\n\t\t\treturn nil, err\n\t\t}\n\t\tif accessor.GetDeletionTimestamp() != nil {\n\t\t\tmsg := &err.(*kubeerr.StatusError).ErrStatus.Message\n\t\t\t*msg = fmt.Sprintf(\"object is being deleted: %s\", *msg)\n\t\t}\n\t\treturn nil, err\n\t}\n\t\n  // 4.创建对象后，如果该对象实现了AfterCreate, 再走AfterCreate逻辑，pod没有实现\n\tif e.AfterCreate != nil {\n\t\tif err := e.AfterCreate(out); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n\t\n\t// 5.创建对象后，如果该对象实现了Decorator装饰, 再走Decorator逻辑，pod没有实现\n\tif e.Decorator != nil {\n\t\tif err := e.Decorator(out); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n\treturn out, nil\n}\n```\n\n##### 3.2.1 BeforeCreate函数\n\n注意当资源为pod时，这里的strategy就是podStrategy。\n\n函数逻辑如下：\n\n（1）获取objectMeta, kind, namespaces等信息\n\n（2）设置DeletionTimestamp，DeletionGracePeriodSeconds等所有对象通用的字段\n\n（3）设置pod资源特有的字段，这里是podStrategy\n\n（4）做一下验证，以及Canonicalize，这个也是不同对象特有的\n\n```\n// 
BeforeCreate ensures that common operations for all resources are performed on creation. It only returns\n// errors that can be converted to api.Status. It invokes PrepareForCreate, then GenerateName, then Validate.\n// It returns nil if the object should be created.\nfunc BeforeCreate(strategy RESTCreateStrategy, ctx context.Context, obj runtime.Object) error {\n  // 1.获取objectMeta, kind, namespaces等信息\n\tobjectMeta, kind, kerr := objectMetaAndKind(strategy, obj)\n\tif kerr != nil {\n\t\treturn kerr\n\t}\n\n\tif strategy.NamespaceScoped() {\n\t\tif !ValidNamespace(ctx, objectMeta) {\n\t\t\treturn errors.NewBadRequest(\"the namespace of the provided object does not match the namespace sent on the request\")\n\t\t}\n\t} else if len(objectMeta.GetNamespace()) > 0 {\n\t\tobjectMeta.SetNamespace(metav1.NamespaceNone)\n\t}\n\t\n\t// 2. 设置DeletionTimestamp，DeletionGracePeriodSeconds等所有对象通用的字段\n\tobjectMeta.SetDeletionTimestamp(nil)\n\tobjectMeta.SetDeletionGracePeriodSeconds(nil)\n\t\n\t// 3. 
设置pod资源特有的字段，这里是podStrategy\n\tstrategy.PrepareForCreate(ctx, obj)\n\tFillObjectMetaSystemFields(objectMeta)\n\tif len(objectMeta.GetGenerateName()) > 0 && len(objectMeta.GetName()) == 0 {\n\t\tobjectMeta.SetName(strategy.GenerateName(objectMeta.GetGenerateName()))\n\t}\n\n\t// Ensure managedFields is not set unless the feature is enabled\n\tif !utilfeature.DefaultFeatureGate.Enabled(features.ServerSideApply) {\n\t\tobjectMeta.SetManagedFields(nil)\n\t}\n\n\t// ClusterName is ignored and should not be saved\n\tif len(objectMeta.GetClusterName()) > 0 {\n\t\tobjectMeta.SetClusterName(\"\")\n\t}\n  \n  // 4.做一下验证，以及Canonicalize，这个也是不同对象特有的\n\tif errs := strategy.Validate(ctx, obj); len(errs) > 0 {\n\t\treturn errors.NewInvalid(kind.GroupKind(), objectMeta.GetName(), errs)\n\t}\n\n\t// Custom validation (including name validation) passed\n\t// Now run common validation on object meta\n\t// Do this *after* custom validation so that specific error messages are shown whenever possible\n\tif errs := genericvalidation.ValidateObjectMetaAccessor(objectMeta, strategy.NamespaceScoped(), path.ValidatePathSegmentName, field.NewPath(\"metadata\")); len(errs) > 0 {\n\t\treturn errors.NewInvalid(kind.GroupKind(), objectMeta.GetName(), errs)\n\t}\n\n\tstrategy.Canonicalize(obj)\n\n\treturn nil\n}\n```\n\n以pod为例。podStrategy定义在： pkg/registry/core/pod/strategy.go, pod和其他资源对象一样实现了这样的接口。\n\n```\ntype RESTCreateStrategy interface {\n\truntime.ObjectTyper\n\t// The name generator is used when the standard GenerateName field is set.\n\t// The NameGenerator will be invoked prior to validation.\n\tnames.NameGenerator\n\n\t// NamespaceScoped returns true if the object must be within a namespace.\n\tNamespaceScoped() bool\n\t// PrepareForCreate is invoked on create before validation to normalize\n\t// the object.  For example: remove fields that are not to be persisted,\n\t// sort order-insensitive list fields, etc.  
This should not remove fields\n\t// whose presence would be considered a validation error.\n\t//\n\t// Often implemented as a type check and an initailization or clearing of\n\t// status. Clear the status because status changes are internal. External\n\t// callers of an api (users) should not be setting an initial status on\n\t// newly created objects.\n\tPrepareForCreate(ctx context.Context, obj runtime.Object)\n\t// Validate returns an ErrorList with validation errors or nil.  Validate\n\t// is invoked after default fields in the object have been filled in\n\t// before the object is persisted.  This method should not mutate the\n\t// object.\n\tValidate(ctx context.Context, obj runtime.Object) field.ErrorList\n\t// Canonicalize allows an object to be mutated into a canonical form. This\n\t// ensures that code that operates on these objects can rely on the common\n\t// form for things like comparison.  Canonicalize is invoked after\n\t// validation has succeeded but before the object has been persisted.\n\t// This method may mutate the object. 
Often implemented as a type check or\n\t// empty method.\n\tCanonicalize(obj runtime.Object)\n}\n```\n\n这里就看看PrepareForCreate。可以看出来这里就是初始化了pod.Status，将Phase置为Pending\n\n```\n// PrepareForCreate clears fields that are not allowed to be set by end users on creation.\nfunc (podStrategy) PrepareForCreate(ctx context.Context, obj runtime.Object) {\n\tpod := obj.(*api.Pod)\n\tpod.Status = api.PodStatus{\n\t\tPhase:    api.PodPending,\n\t\tQOSClass: qos.GetPodQOS(pod),\n\t}\n\n\tpodutil.DropDisabledPodFields(pod, nil)\n}\n```\n\n##### 3.2.2 Create函数\n\n可以看出来Create是通用的，不需要每个对象都实现，就是调用etcd3接口操作数据库了。\n\nstaging/src/k8s.io/apiserver/pkg/registry/generic/registry/dryrun.go\n\n```\nfunc (s *DryRunnableStorage) Create(ctx context.Context, key string, obj, out runtime.Object, ttl uint64, dryRun bool) error {\n\tif dryRun {\n\t\tif err := s.Storage.Get(ctx, key, \"\", out, false); err == nil {\n\t\t\treturn storage.NewKeyExistsError(key, 0)\n\t\t}\n\t\ts.copyInto(obj, out)\n\t\treturn nil\n\t}\n\treturn s.Storage.Create(ctx, key, obj, out, ttl)\n}\n```\n\ns.Storage.Create 实现在k8s.io/apiserver/pkg/storage/etcd3/store.go\n\n#### 3.3 总结\n\npod创建的逻辑如下：\n\n（1）经过apiserver通用的前端流程：判断是post接口，就走Create流程\n\n（2）然后走通用逻辑，BeforeCreate -> Create -> AfterCreate等等\n\n比如：Create流程会先执行通用的部分，比如设置DeletionTimestamp等字段；然后再执行对象特有的部分，比如当一个pod对象创建时，需要设置pod.Status为Pending。\n\n### 4. 
Pod 删除\n\n同样还是回到前端逻辑，这里是case \"DELETE\"的路由注册代码：\n\n```\ncase \"DELETE\": // Delete a resource.\n   article := GetArticleForNoun(kind, \" \")\n   doc := \"delete\" + article + kind\n   if isSubresource {\n      doc = \"delete \" + subresource + \" of\" + article + kind\n   }\n   handler := metrics.InstrumentRouteFunc(action.Verb, group, version, resource, subresource, requestScope, metrics.APIServerComponent, restfulDeleteResource(gracefulDeleter, isGracefulDeleter, reqScope, admit))\n   route := ws.DELETE(action.Path).To(handler).\n      Doc(doc).\n      Param(ws.QueryParameter(\"pretty\", \"If 'true', then the output is pretty printed.\")).\n      Operation(\"delete\"+namespaced+kind+strings.Title(subresource)+operationSuffix).\n      Produces(append(storageMeta.ProducesMIMETypes(action.Verb), mediaTypes...)...).\n      Writes(versionedStatus).\n      Returns(http.StatusOK, \"OK\", versionedStatus).\n      Returns(http.StatusAccepted, \"Accepted\", versionedStatus)\n   if isGracefulDeleter {\n      route.Reads(versionedDeleterObject)\n      route.ParameterNamed(\"body\").Required(false)\n      if err := AddObjectParams(ws, route, versionedDeleteOptions); err != nil {\n         return nil, err\n      }\n   }\n   addParams(route, action.Params)\n   routes = append(routes, route)\n```\n\n<br>\n\n前端逻辑和之前其实都一样，这里直接分析后端逻辑Delete。\n\n```\n// DeleteResource returns a function that will handle a resource deletion\n// TODO admission here becomes solely validating admission\nfunc DeleteResource(r rest.GracefulDeleter, allowsOptions bool, scope *RequestScope, admit admission.Interface) http.HandlerFunc {\n\treturn func(w http.ResponseWriter, req *http.Request) {\n   ...\n\t\ttrace.Step(\"About to delete object from database\")\n\t\twasDeleted := true\n\t\tuserInfo, _ := request.UserFrom(ctx)\n\t\tstaticAdmissionAttrs := admission.NewAttributesRecord(nil, nil, scope.Kind, namespace, name, scope.Resource, scope.Subresource, admission.Delete, options, dryrun.IsDryRun(options.DryRun), 
userInfo)\n\t\tresult, err := finishRequest(timeout, func() (runtime.Object, error) {\n\t\t\tobj, deleted, err := r.Delete(ctx, name, rest.AdmissionToValidateObjectDeleteFunc(admit, staticAdmissionAttrs, scope), options)\n\t\t\twasDeleted = deleted\n\t\t\treturn obj, err\n\t\t})\n\t\tif err != nil {\n\t\t\tscope.err(err, w, req)\n\t\t\treturn\n\t\t}\n\t\ttrace.Step(\"Object deleted from database\")\n\n\n\t\ttransformResponseObject(ctx, scope, trace, req, w, status, outputMediaType, result)\n\t}\n}\n```\n\n#### 4.1 Delete\n\n核心逻辑如下：\n\n（1）如果delete options指定了UID、ResourceVersion，需要先与当前对象进行对比确认，防止在对象被反复创建删除时删错\n\n（2）调用BeforeDelete，判断是否要优雅删除，是否正在优雅删除，BeforeDelete的核心逻辑见4.2，注意只有pod在这里会判断为优雅删除\n\n（3）判断是否有finalizers\n\n（4）如果需要优雅删除，或者有finalizers，则执行updateForGracefulDeletionAndFinalizers函数。这个函数会返回当前对象是不是可以立马删除\n\n（5）如果不可以立马删除，返回\n\n（6）如果可以立马删除，删除etcd中的数据\n\n```\n// Delete removes the item from storage.\nfunc (e *Store) Delete(ctx context.Context, name string, deleteValidation rest.ValidateObjectFunc, options *metav1.DeleteOptions) (runtime.Object, bool, error) {\n\tkey, err := e.KeyFunc(ctx, name)\n\tif err != nil {\n\t\treturn nil, false, err\n\t}\n\tobj := e.NewFunc()\n\tqualifiedResource := e.qualifiedResourceFromContext(ctx)\n\tif err = e.Storage.Get(ctx, key, \"\", obj, false); err != nil {\n\t\treturn nil, false, storeerr.InterpretDeleteError(err, qualifiedResource, name)\n\t}\n\n\t// support older consumers of delete by treating \"nil\" as delete immediately\n\tif options == nil {\n\t\toptions = metav1.NewDeleteOptions(0)\n\t}\n\t// 1. 
If the delete options specify a UID or ResourceVersion, compare them against the object to avoid deleting the wrong one (possible when an object is repeatedly created and deleted)\n\tvar preconditions storage.Preconditions\n\tif options.Preconditions != nil {\n\t\tpreconditions.UID = options.Preconditions.UID\n\t\tpreconditions.ResourceVersion = options.Preconditions.ResourceVersion\n\t}\n\t// 2. Run BeforeDelete\n\tgraceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, obj, options)\n\tif err != nil {\n\t\treturn nil, false, err\n\t}\n\t// this means finalizers cannot be updated via DeleteOptions if a deletion is already pending\n\tif pendingGraceful {\n\t\tout, err := e.finalizeDelete(ctx, obj, false)\n\t\treturn out, false, err\n\t}\n\t// check if obj has pending finalizers\n\taccessor, err := meta.Accessor(obj)\n\tif err != nil {\n\t\treturn nil, false, kubeerr.NewInternalError(err)\n\t}\n\t\n\t// 3. Check whether there are finalizers\n\tpendingFinalizers := len(accessor.GetFinalizers()) != 0\n\tvar ignoreNotFound bool\n\tvar deleteImmediately bool = true\n\tvar lastExisting, out runtime.Object\n\n\t// Handle combinations of graceful deletion and finalization by issuing\n\t// the correct updates.\n\tshouldUpdateFinalizers, _ := deletionFinalizersForGarbageCollection(ctx, e, accessor, options)\n\t// TODO: remove the check, because we support no-op updates now.\n\t\n\t// 4. If graceful deletion is needed, or finalizers are present, run updateForGracefulDeletionAndFinalizers\n\tif graceful || pendingFinalizers || shouldUpdateFinalizers {\n\t\terr, ignoreNotFound, deleteImmediately, out, lastExisting = e.updateForGracefulDeletionAndFinalizers(ctx, name, key, options, preconditions, deleteValidation, obj)\n\t}\n\n\t// 5. If the object cannot be deleted immediately, return; the first DELETE of a pod ends here\n\t// !deleteImmediately covers all cases where err != nil. We keep both to be future-proof.\n\tif !deleteImmediately || err != nil {\n\t\treturn out, false, err\n\t}\n\n\t// Going further in this function is not useful when we are\n\t// performing a dry-run request. 
Worse, it will actually\n\t// override \"out\" with the version of the object in database\n\t// that doesn't have the finalizer and deletiontimestamp set\n\t// (because the update above was dry-run too). If we already\n\t// have that version available, let's just return it now,\n\t// otherwise, we can call dry-run delete that will get us the\n\t// latest version of the object.\n\tif dryrun.IsDryRun(options.DryRun) && out != nil {\n\t\treturn out, true, nil\n\t}\n\n\t// The second DELETE ends up here and removes the data from the database directly\n\t// delete immediately, or no graceful deletion supported\n\tklog.V(6).Infof(\"going to delete %s from registry: \", name)\n\tout = e.NewFunc()\n\tif err := e.Storage.Delete(ctx, key, out, &preconditions, storage.ValidateObjectFunc(deleteValidation), dryrun.IsDryRun(options.DryRun)); err != nil {\n\t\t// Please refer to the place where we set ignoreNotFound for the reason\n\t\t// why we ignore the NotFound error .\n\t\tif storage.IsNotFound(err) && ignoreNotFound && lastExisting != nil {\n\t\t\t// The lastExisting object may not be the last state of the object\n\t\t\t// before its deletion, but it's the best approximation.\n\t\t\tout, err := e.finalizeDelete(ctx, lastExisting, true)\n\t\t\treturn out, true, err\n\t\t}\n\t\treturn nil, false, storeerr.InterpretDeleteError(err, qualifiedResource, name)\n\t}\n\tout, err = e.finalizeDelete(ctx, out, true)\n\treturn out, true, err\n}\n```\n\n#### 4.2 BeforeDelete\n\nThe function logic is as follows:\n\n(1) Validate the DeleteOptions; if a UID is specified, check it as well\n\n(2) Check whether graceful deletion is supported; the key is whether the RESTGracefulDeleteStrategy interface is implemented. Only Pod implements it, so only Pods are deleted gracefully; for anything else, return directly\n\n(3) If deletionTimestamp is non-nil, graceful deletion is already in progress\n\n(4) Set the deletionTimestamp and GracePeriodSeconds\n\n```\n\n// BeforeDelete tests whether the object can be gracefully deleted.\n// If graceful is set, the object should be gracefully deleted.  If gracefulPending\n// is set, the object has already been gracefully deleted (and the provided grace\n// period is longer than the time to deletion). 
An error is returned if the\n// condition cannot be checked or the gracePeriodSeconds is invalid. The options\n// argument may be updated with default values if graceful is true. Second place\n// where we set deletionTimestamp is pkg/registry/generic/registry/store.go.\n// This function is responsible for setting deletionTimestamp during gracefulDeletion,\n// other one for cascading deletions.\nfunc BeforeDelete(strategy RESTDeleteStrategy, ctx context.Context, obj runtime.Object, options *metav1.DeleteOptions) (graceful, gracefulPending bool, err error) {\n\tobjectMeta, gvk, kerr := objectMetaAndKind(strategy, obj)\n\tif kerr != nil {\n\t\treturn false, false, kerr\n\t}\n\t// 1. Validate the DeleteOptions; if a UID is specified, check it as well\n\tif errs := validation.ValidateDeleteOptions(options); len(errs) > 0 {\n\t\treturn false, false, errors.NewInvalid(schema.GroupKind{Group: metav1.GroupName, Kind: \"DeleteOptions\"}, \"\", errs)\n\t}\n\t// Checking the Preconditions here to fail early. They'll be enforced later on when we actually do the deletion, too.\n\tif options.Preconditions != nil {\n\t\tif options.Preconditions.UID != nil && *options.Preconditions.UID != objectMeta.GetUID() {\n\t\t\treturn false, false, errors.NewConflict(schema.GroupResource{Group: gvk.Group, Resource: gvk.Kind}, objectMeta.GetName(), fmt.Errorf(\"the UID in the precondition (%s) does not match the UID in record (%s). The object might have been deleted and then recreated\", *options.Preconditions.UID, objectMeta.GetUID()))\n\t\t}\n\t\tif options.Preconditions.ResourceVersion != nil && *options.Preconditions.ResourceVersion != objectMeta.GetResourceVersion() {\n\t\t\treturn false, false, errors.NewConflict(schema.GroupResource{Group: gvk.Group, Resource: gvk.Kind}, objectMeta.GetName(), fmt.Errorf(\"the ResourceVersion in the precondition (%s) does not match the ResourceVersion in record (%s). 
The object might have been modified\", *options.Preconditions.ResourceVersion, objectMeta.GetResourceVersion()))\n\t\t}\n\t}\n\t\n\t// 2. Check whether graceful deletion is supported\n\tgracefulStrategy, ok := strategy.(RESTGracefulDeleteStrategy)\n\tif !ok {\n\t\t// If we're not deleting gracefully there's no point in updating Generation, as we won't update\n\t\t// the object before deleting it.\n\t\treturn false, false, nil\n\t}\n\t\n\t// 3. If deletionTimestamp is non-nil, graceful deletion is already in progress\n\t// if the object is already being deleted, no need to update generation.\n\tif objectMeta.GetDeletionTimestamp() != nil {\n\t\t// if we are already being deleted, we may only shorten the deletion grace period\n\t\t// this means the object was gracefully deleted previously but deletionGracePeriodSeconds was not set,\n\t\t// so we force deletion immediately\n\t\t// IMPORTANT:\n\t\t// The deletion operation happens in two phases.\n\t\t// 1. Update to set DeletionGracePeriodSeconds and DeletionTimestamp\n\t\t// 2. Delete the object from storage.\n\t\t// If the update succeeds, but the delete fails (network error, internal storage error, etc.),\n\t\t// a resource was previously left in a state that was non-recoverable.  
We\n\t\t// check if the existing stored resource has a grace period as 0 and if so\n\t\t// attempt to delete immediately in order to recover from this scenario.\n\t\tif objectMeta.GetDeletionGracePeriodSeconds() == nil || *objectMeta.GetDeletionGracePeriodSeconds() == 0 {\n\t\t\treturn false, false, nil\n\t\t}\n\t\t// only a shorter grace period may be provided by a user\n\t\tif options.GracePeriodSeconds != nil {\n\t\t\tperiod := int64(*options.GracePeriodSeconds)\n\t\t\tif period >= *objectMeta.GetDeletionGracePeriodSeconds() {\n\t\t\t\treturn false, true, nil\n\t\t\t}\n\t\t\tnewDeletionTimestamp := metav1.NewTime(\n\t\t\t\tobjectMeta.GetDeletionTimestamp().Add(-time.Second * time.Duration(*objectMeta.GetDeletionGracePeriodSeconds())).\n\t\t\t\t\tAdd(time.Second * time.Duration(*options.GracePeriodSeconds)))\n\t\t\tobjectMeta.SetDeletionTimestamp(&newDeletionTimestamp)\n\t\t\tobjectMeta.SetDeletionGracePeriodSeconds(&period)\n\t\t\treturn true, false, nil\n\t\t}\n\t\t// graceful deletion is pending, do nothing\n\t\toptions.GracePeriodSeconds = objectMeta.GetDeletionGracePeriodSeconds()\n\t\treturn false, true, nil\n\t}\n\n\tif !gracefulStrategy.CheckGracefulDelete(ctx, obj, options) {\n\t\treturn false, false, nil\n\t}\n\t\n\t// 4. Set the deletionTimestamp and GracePeriodSeconds\n\tnow := metav1.NewTime(metav1.Now().Add(time.Second * time.Duration(*options.GracePeriodSeconds)))\n\tobjectMeta.SetDeletionTimestamp(&now)\n\tobjectMeta.SetDeletionGracePeriodSeconds(options.GracePeriodSeconds)\n\t// If it's the first graceful deletion we are going to set the DeletionTimestamp to non-nil.\n\t// Controllers of the object that's being deleted shouldn't take any nontrivial actions, hence its behavior changes.\n\t// Thus we need to bump object's Generation (if set). 
This handles generation bump during graceful deletion.\n\t// The bump for objects that don't support graceful deletion is handled in pkg/registry/generic/registry/store.go.\n\tif objectMeta.GetGeneration() > 0 {\n\t\tobjectMeta.SetGeneration(objectMeta.GetGeneration() + 1)\n\t}\n\treturn true, false, nil\n}\n\n\n// This is the interface that decides whether a resource can be deleted gracefully; only pod implements it.\ntype RESTGracefulDeleteStrategy interface {\n\t// CheckGracefulDelete should return true if the object can be gracefully deleted and set\n\t// any default values on the DeleteOptions.\n\tCheckGracefulDelete(ctx context.Context, obj runtime.Object, options *metav1.DeleteOptions) bool\n}\n```\n\n#### 4.3 updateForGracefulDeletionAndFinalizers\n\nThe key point here is that if finalizers are present, the markAsDeleting function is called, which likewise sets deletionTimestamp and DeletionGracePeriodSeconds.\n\n```\n// updateForGracefulDeletionAndFinalizers updates the given object for\n// graceful deletion and finalization by setting the deletion timestamp and\n// grace period seconds (graceful deletion) and updating the list of\n// finalizers (finalization); it returns:\n//\n// 1. an error\n// 2. a boolean indicating that the object was not found, but it should be\n//    ignored\n// 3. a boolean indicating that the object's grace period is exhausted and it\n//    should be deleted immediately\n// 4. a new output object with the state that was updated\n// 5. 
a copy of the last existing state of the object\nfunc (e *Store) updateForGracefulDeletionAndFinalizers(ctx context.Context, name, key string, options *metav1.DeleteOptions, preconditions storage.Preconditions, deleteValidation rest.ValidateObjectFunc, in runtime.Object) (err error, ignoreNotFound, deleteImmediately bool, out, lastExisting runtime.Object) {\n\tlastGraceful := int64(0)\n\tvar pendingFinalizers bool\n\tout = e.NewFunc()\n\terr = e.Storage.GuaranteedUpdate(\n\t\tctx,\n\t\tkey,\n\t\tout,\n\t\tfalse, /* ignoreNotFound */\n\t\t&preconditions,\n\t\tstorage.SimpleUpdate(func(existing runtime.Object) (runtime.Object, error) {\n\t\t\tif err := deleteValidation(ctx, existing); err != nil {\n\t\t\t\treturn nil, err\n\t\t\t}\n\t\t\tgraceful, pendingGraceful, err := rest.BeforeDelete(e.DeleteStrategy, ctx, existing, options)\n\t\t\tif err != nil {\n\t\t\t\treturn nil, err\n\t\t\t}\n\t\t\tif pendingGraceful {\n\t\t\t\treturn nil, errAlreadyDeleting\n\t\t\t}\n\n\t\t\t// Add/remove the orphan finalizer as the options dictates.\n\t\t\t// Note that this occurs after checking pendingGraceufl, so\n\t\t\t// finalizers cannot be updated via DeleteOptions if deletion has\n\t\t\t// started.\n\t\t\texistingAccessor, err := meta.Accessor(existing)\n\t\t\tif err != nil {\n\t\t\t\treturn nil, err\n\t\t\t}\n\t\t\tneedsUpdate, newFinalizers := deletionFinalizersForGarbageCollection(ctx, e, existingAccessor, options)\n\t\t\tif needsUpdate {\n\t\t\t\texistingAccessor.SetFinalizers(newFinalizers)\n\t\t\t}\n\n\t\t\tpendingFinalizers = len(existingAccessor.GetFinalizers()) != 0\n\t\t\tif !graceful {\n\t\t\t\t// set the DeleteGracePeriods to 0 if the object has pendingFinalizers but not supporting graceful deletion\n\t\t\t\tif pendingFinalizers {\n\t\t\t\t\tklog.V(6).Infof(\"update the DeletionTimestamp to \\\"now\\\" and GracePeriodSeconds to 0 for object %s, because it has pending finalizers\", name)\n\t\t\t\t\terr = markAsDeleting(existing, time.Now())\n\t\t\t\t\tif err != nil 
{\n\t\t\t\t\t\treturn nil, err\n\t\t\t\t\t}\n\t\t\t\t\treturn existing, nil\n\t\t\t\t}\n\t\t\t\treturn nil, errDeleteNow\n\t\t\t}\n\t\t\tlastGraceful = *options.GracePeriodSeconds\n\t\t\tlastExisting = existing\n\t\t\treturn existing, nil\n\t\t}),\n\t\tdryrun.IsDryRun(options.DryRun),\n\t)\n```\n\nmarkAsDeleting likewise just sets deletionTimestamp and DeletionGracePeriodSeconds:\n\n```\n// markAsDeleting sets the obj's DeletionGracePeriodSeconds to 0, and sets the\n// DeletionTimestamp to \"now\" if there is no existing deletionTimestamp or if the existing\n// deletionTimestamp is further in future. Finalizers are watching for such updates and will\n// finalize the object if their IDs are present in the object's Finalizers list.\nfunc markAsDeleting(obj runtime.Object, now time.Time) (err error) {\n\tobjectMeta, kerr := meta.Accessor(obj)\n\tif kerr != nil {\n\t\treturn kerr\n\t}\n\t// This handles Generation bump for resources that don't support graceful\n\t// deletion. For resources that support graceful deletion is handle in\n\t// pkg/api/rest/delete.go\n\tif objectMeta.GetDeletionTimestamp() == nil && objectMeta.GetGeneration() > 0 {\n\t\tobjectMeta.SetGeneration(objectMeta.GetGeneration() + 1)\n\t}\n\texistingDeletionTimestamp := objectMeta.GetDeletionTimestamp()\n\tif existingDeletionTimestamp == nil || existingDeletionTimestamp.After(now) {\n\t\tmetaNow := metav1.NewTime(now)\n\t\tobjectMeta.SetDeletionTimestamp(&metaNow)\n\t}\n\tvar zero int64 = 0\n\tobjectMeta.SetDeletionGracePeriodSeconds(&zero)\n\treturn nil\n}\n```\n\n<br>\n\n#### 4.4 Summary\n\n(1) With this machinery, all you have to write is the object's strategy (BeforeCreate, AfterCreate, BeforeDelete, and so on); you never talk to the database directly. It is highly extensible.\n\n(2) The basic flow for deleting an object in k8s:\n\n- The client submits a delete request to the API Server\n  - a GracePeriodSeconds parameter may optionally be passed\n- The API Server performs the graceful-deletion check\n  - if the object implements the RESTGracefulDeleteStrategy interface, that implementation is called and decides whether graceful deletion is needed\n- The API Server checks the finalizers and, combined with the graceful-deletion result, decides whether to delete the object immediately\n  - if the object needs graceful deletion, the metadata.DeletionGracePeriodSeconds and metadata.DeletionTimestamp fields are updated and the object is not removed from storage\n  - if the object does not need graceful deletion:\n    - metadata.Finalizers is empty: delete immediately\n    - metadata.Finalizers is not empty: do not delete, only update metadata.DeletionTimestamp\n\nNote:\n\nAmong the built-in k8s resources, only the Pod object implements the [RESTGracefulDeleteStrategy](https://link.zhihu.com/?target=https%3A//github.com/kubernetes/kubernetes/blob/v1.18.0/staging/src/k8s.io/apiserver/pkg/registry/rest/delete.go%23L55-L61) interface. No other object ever enters the graceful-deletion state.\n\n<br>\n\nSo deleting a resource in k8s is really two steps: the first sets the metadata.DeletionTimestamp field; the second is the actual deletion.\n\nPod follows this logic because it implements the RESTGracefulDeleteStrategy interface.\n\nOther resources, such as Deployments, end up with the same two-step behavior because k8s deletes in the background by default (the alternatives being foreground and orphan deletion), which attaches a finalizer; having a finalizer effectively gives you graceful deletion as well.\n\n<br>\n\nWhen an object refuses to be deleted, you can:\n\n- remove its finalizers, so the associated logic no longer needs to run\n- run `kubectl delete --force --grace-period 0` to delete it directly\n\nThat wraps it up; get, list, patch, and the other verbs on Pod follow essentially the same pattern.\n\n### 5. References\n\nhttps://duyanghao.github.io/kubernetes-apiserver-overview/\n\nhttps://blog.csdn.net/hahachenchen789/article/details/113880166\n\nhttps://www.kubesre.com/archives/chuang-jian-yi-ge-pod-bei-hou-etcd-de-gu-shi\n\nhttps://zhuanlan.zhihu.com/p/161072336"
  },
  {
    "path": "k8s/kube-apiserver/17-k8s之serviceaccount.md",
    "content": "Table of Contents\n=================\n\n  * [1. 什么是serviceaccount](#1-什么是serviceaccount)\n  * [2、Service account与User account区别](#2service-account与user-account区别)\n  * [3、默认Service Account](#3默认service-account)\n     * [3.1 默认sa的权限测试](#31-默认sa的权限测试)\n     * [3.2 自定义sa的权限测试](#32-自定义sa的权限测试)\n  * [4. 如何通过client-go使用sa](#4-如何通过client-go使用sa)\n\n### 1. 什么是serviceaccount\n\nk8s中提供了良好的多租户认证管理机制，如RBAC、ServiceAccount还有各种Policy等。\n\n当用户访问集群（例如使用kubectl命令）时，apiserver 会将用户认证为一个特定的 User Account（目前通常是admin，除非系统管理员自定义了集群配置）。\n\nPod 容器中的进程也可以与 apiserver 联系。 当它们在联系 apiserver 的时候，它们会被认证为一个特定的 Service Account（例如default）。\n\n<br>\n\n**使用场景**\n\nService Account它并不是给kubernetes集群的用户使用的，而是给pod里面的进程使用的，它为pod提供必要的身份认证。\n\n<br>\n\nService Account包含3个主要内容，分别介绍如下：\n\n* NameSpace: 指定了Pod所在的命名空间\n* CA： kube-apiserver组件的CA公钥证书，是Pod中的进程对kube-apiserver进程验证的证书\n* Token：用作身份验证，通过kube-apiserver私钥签发经过Base64b编码的Bearer Token\n\n### 2、Service account与User account区别\n\n1. User account是为人设计的，而service account则是为Pod中的进程调用Kubernetes API或其他外部服务而设计的\n2. User account是跨namespace的，而service account则是仅局限它所在的namespace；\n3. 每个namespace都会自动创建一个default service account\n4. Token controller检测service account的创建，并为它们创建secret\n5. 
开启ServiceAccount Admission Controller后:\n\n 5.1 每个Pod在创建后都会自动设置spec.serviceAccount为default（除非指定了其他ServiceAccout）\n​ 5.2 验证Pod引用的service account已经存在，否则拒绝创建\n​ 5.3 如果Pod没有指定ImagePullSecrets，则把service account的ImagePullSecrets加到Pod中\n​ 5.4 每个container启动后都会挂载该service account的token和ca.crt到/var/run/secrets/kubernetes.io/serviceaccount/\n\n```bash\n# kubectl exec nginx-3137573019-md1u2 ls /run/secrets/kubernetes.io/serviceaccount\n ca.crt namespace token \n```\n\n**查看系统的config配置**\n\n这里用到的token就是被授权过的SeviceAccount账户的token,集群利用token来使用ServiceAccount账户\n\n```text\n[root@master yaml]#  cat /root/.kube/config\n```\n\n### 3、默认Service Account\n\n默认在 pod 中使用自动挂载的 service account 凭证来访问 API，如 Accessing the Cluster（[https://kubernetes.io/docs/tasks/access-application-cluster/access-cluster/#accessing-the-api-from-a-pod](https://link.zhihu.com/?target=https%3A//kubernetes.io/docs/tasks/access-application-cluster/access-cluster/%23accessing-the-api-from-a-pod)） 中所描述。\n\n当创建 pod 的时候，如果没有指定一个 service account，系统会自动在与该pod 相同的 namespace 下为其指派一个default service account，并且使用默认的 Service Account 访问 API server。\n\n例如：\n\n获取刚创建的 pod 的原始 json 或 yaml 信息，将看到spec.serviceAccountName字段已经被设置为 default。\n\n```\nroot@k8s-master:~# kubectl get sa\nNAME      SECRETS   AGE\ndefault   1         2d4h\nroot@k8s-master:~# kubectl get sa default -oyaml\napiVersion: v1\nkind: ServiceAccount\nmetadata:\n  creationTimestamp: \"2021-10-23T09:04:02Z\"\n  name: default\n  namespace: default\n  resourceVersion: \"231\"\n  selfLink: /api/v1/namespaces/default/serviceaccounts/default\n  uid: 5953ce17-9e38-4768-9d61-e7066f838b0d\nsecrets:\n- name: default-token-f8snr\nroot@k8s-master:~#\nroot@k8s-master:~#\nroot@k8s-master:~#\nroot@k8s-master:~# kubectl get secret default-token-f8snr -oyaml\napiVersion: v1\ndata:\n  ca.crt: 
LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUR2akNDQXFhZ0F3SUJBZ0lVZVNKWlB2SmZGangyOVBrU2NHdmw1eEFOQ2lZd0RRWUpLb1pJaHZjTkFRRUwKQlFBd1pURUxNQWtHQTFVRUJoTUNRMDR4RURBT0JnTlZCQWdUQjBKbGFXcHBibWN4RURBT0JnTlZCQWNUQjBKbAphV3BwYm1jeEREQUtCZ05WQkFvVEEyczRjekVQTUEwR0ExVUVDeE1HVTNsemRHVnRNUk13RVFZRFZRUURFd3ByCmRXSmxjbTVsZEdWek1CNFhEVEl4TVRBeU16QTRNakl3TUZvWERUSTJNVEF5TWpBNE1qSXdNRm93WlRFTE1Ba0cKQTFVRUJoTUNRMDR4RURBT0JnTlZCQWdUQjBKbGFXcHBibWN4RURBT0JnTlZCQWNUQjBKbGFXcHBibWN4RERBSwpCZ05WQkFvVEEyczRjekVQTUEwR0ExVUVDeE1HVTNsemRHVnRNUk13RVFZRFZRUURFd3ByZFdKbGNtNWxkR1Z6Ck1JSUJJakFOQmdrcWhraUc5dzBCQVFFRkFBT0NBUThBTUlJQkNnS0NBUUVBdDVPTVlLUG4xS3ZOY3FoaGxqdVQKei9pUDFiTGdWOUhFNGhZVmV0VDkralNTVTQzd20wWExqWlliT0oxZktDWkV5NU14ZUlXb1c2bFVhMDRLc2VZNAovSFdGM255VGVQVmx2citBbm9kNFZ2TWZxRXpBcmplcS85aElOcGxZdFFOMDBSanNpdHA3bDRRT1licEhTWUFNCnhXSmFPZG5lK2FNbmQrUkFaM1d0bGV1aXd5REZzVXI0NUhqeGJoeGR1YUNURUQwanNPYy9zbEQwRTFGZTRHOWoKOXpjK0xMb2ZTWHQ1N1B3Z1g5MVlwbnJUNmtTRUs0SGpMcjczMzRYTmRYbjBkektBc1A0RURzNkdibDEyZ1JiUQpuV3g2cHpSUmpkUXlua1Z0dkMzTXMrUVIrcUswb3RMMDVMTStPdy9VY2M4cXBFTUtWUVBRVFkyWGljLzZsa3IvCkV3SURBUUFCbzJZd1pEQU9CZ05WSFE4QkFmOEVCQU1DQVFZd0VnWURWUjBUQVFIL0JBZ3dCZ0VCL3dJQkFqQWQKQmdOVkhRNEVGZ1FVSjNiTDE3UGlVd0g5WDNhekp2VFVNbU1iUlgwd0h3WURWUjBqQkJnd0ZvQVVKM2JMMTdQaQpVd0g5WDNhekp2VFVNbU1iUlgwd0RRWUpLb1pJaHZjTkFRRUxCUUFEZ2dFQkFLSXpSSXpSMmp3UG0vU25LSXRBCjIyMUJFdnJTWEh4UE13VTJQbjgybmhQWjBaOFc0K0x3ZjBFcExlZ0xWaVgzMEJrTU5INkRkTkNUbEdrSnRSZW4KSHdMNVNnZkVnaTA0V0tXenVpT25jd2dnWkNOTXpyZGhGcFFqLzNOOWhqWUM0V050UXZlaWVmYjlZOGtpbUUvVAp6STh1MXpZTFRreG5FU3pHTE8yUGNtZXQ2TmtCb0NBTU1vc3R0ZC92RlN0b250TVk1OXBiMlpnejN1MXZuZkt5CmlpbzZVM1VtbWt2NGMzdnYwbzEwTVlMVElLR2ZiRVllSkROdjFhZ3NvSWlBQklNbEhGeUh1TUZIZmp5RExiamkKOHo0TTBmKzFkNXdqc2NHVFNsQng5anJXTzk3WFNFeU9BdDNkbkE0OU5sNUJjTDZXZWlhbGlQT0F4QWVPcUROZAp2elE9Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K\n  namespace: ZGVmYXVsdA==\n  token: 
ZXlKaGJHY2lPaUpTVXpJMU5pSXNJbXRwWkNJNklrTXhUV2hJVVdKWWJtNU9OemRoTW5GV1FYVnpURjlWTkdSbmIzcDZNVVUxUTBGTlVGOTFlVFJ2UW5jaWZRLmV5SnBjM01pT2lKcmRXSmxjbTVsZEdWekwzTmxjblpwWTJWaFkyTnZkVzUwSWl3aWEzVmlaWEp1WlhSbGN5NXBieTl6WlhKMmFXTmxZV05qYjNWdWRDOXVZVzFsYzNCaFkyVWlPaUprWldaaGRXeDBJaXdpYTNWaVpYSnVaWFJsY3k1cGJ5OXpaWEoyYVdObFlXTmpiM1Z1ZEM5elpXTnlaWFF1Ym1GdFpTSTZJbVJsWm1GMWJIUXRkRzlyWlc0dFpqaHpibklpTENKcmRXSmxjbTVsZEdWekxtbHZMM05sY25acFkyVmhZMk52ZFc1MEwzTmxjblpwWTJVdFlXTmpiM1Z1ZEM1dVlXMWxJam9pWkdWbVlYVnNkQ0lzSW10MVltVnlibVYwWlhNdWFXOHZjMlZ5ZG1salpXRmpZMjkxYm5RdmMyVnlkbWxqWlMxaFkyTnZkVzUwTG5WcFpDSTZJalU1TlROalpURTNMVGxsTXpndE5EYzJPQzA1WkRZeExXVTNNRFkyWmpnek9HSXdaQ0lzSW5OMVlpSTZJbk41YzNSbGJUcHpaWEoyYVdObFlXTmpiM1Z1ZERwa1pXWmhkV3gwT21SbFptRjFiSFFpZlEub09hNkZ6SDVhTEIzRnZYWW9ZTHNRWUNTOHl2ZTRXdWRBbGtjNjFwTVd0UEFBRy1URUJ5WjNvN3FzSU0yRmNkTW9VbXFCOEFoakx0QlZjeVhOMVFfd0dDNE9oLUdRQnpJZ3JPZTRDUm5QWkpGX2F0ZW15LXlsazI1aldJSG9VOWU1azAxMHExYjhMU0RJekVwSFd0UzZlZC01ZkQxdG5lSHdtU09LYTJtdTZ2QWVsUW9ydmFoeHU3UWxHSWFUcWRQaVk3ZWRyUFpKSUFGWUNMeFAtMklFV0ZRbFJMUkRxcVN0ckpBbTFDUFFoeFh4ZUgtSFJoTzhnQnB4bHV0VUdSOU5LNFdoMnRFYWIyaGV1YUZUQkp0dVIxeTlJbVZFQzFpaTFlT2NGeGJRRi1zWnRlZGEwWFBTbE1rZ1BHYmNUT3VPOEdvZHBZTzA5TnFZRW5WR29pQWtn\nkind: Secret\nmetadata:\n  annotations:\n    kubernetes.io/service-account.name: default\n    kubernetes.io/service-account.uid: 5953ce17-9e38-4768-9d61-e7066f838b0d\n  creationTimestamp: \"2021-10-23T09:04:02Z\"\n  name: default-token-f8snr\n  namespace: default\n  resourceVersion: \"229\"\n  selfLink: /api/v1/namespaces/default/secrets/default-token-f8snr\n  uid: 11cfe3f0-ad48-458f-8959-fcc3adccacd3\ntype: kubernetes.io/service-account-token\n```\n\n<br>\n\n**默认的Sa作用：** 目前看起来就是给pod塞了一个sa，没有任何的权限绑定。\n\n#### 3.1 默认sa的权限测试\n\n（1）kubectl get role没看见有role和 default绑定\n\n（2）进入一个pod后, 执行以下的命令发现这个sa没有权限\n\n```\n/ $ export CURL_CA_BUNDLE=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt\n/ $ TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)\n/ $\n/ $ curl -H \"Authorization: 
Bearer $TOKEN\" https://kubernetes\n\ncurl: (6) Could not resolve host: kubernetes\n\n// 这个就是没有权限\n/ $ curl -H \"Authorization: Bearer $TOKEN\" https://192.168.0.4:6443/api/v1/namespaces/default/pods\n{\n  \"kind\": \"Status\",\n  \"apiVersion\": \"v1\",\n  \"metadata\": {\n\n  },\n  \"status\": \"Failure\",\n  \"message\": \"forbidden: User \\\"system:serviceaccount:default:default\\\" cannot get path \\\"/\\\"\",\n  \"reason\": \"Forbidden\",\n  \"details\": {\n\n  },\n  \"code\": 403\n}/ $\n/ $\n\n// 这个就是没有权限\n/ $ curl -H \"Authorization: Bearer $TOKEN\" https://10.0.0.1:443/api/v1/namespaces/default/pods\n{\n  \"kind\": \"Status\",\n  \"apiVersion\": \"v1\",\n  \"metadata\": {\n\n  },\n  \"status\": \"Failure\",\n  \"message\": \"pods is forbidden: User \\\"system:serviceaccount:default:default\\\" cannot list resource \\\"pods\\\" in API group \\\"\\\" in the namespace \\\"default\\\"\",\n  \"reason\": \"Forbidden\",\n  \"details\": {\n    \"kind\": \"pods\"\n  },\n  \"code\": 403\n}/ $\n/ $\n/ $ exit\n```\n\n#### 3.2 自定义sa的权限测试\n\n（1）创建sa\n\n```\nroot@k8s-master:~# kubectl create serviceaccount sa-example\nserviceaccount/sa-example created\n\nroot@k8s-master:~# kubectl get sa sa-example -oyaml\napiVersion: v1\nkind: ServiceAccount\nmetadata:\n  creationTimestamp: \"2021-10-25T13:51:23Z\"\n  name: sa-example\n  namespace: default\n  resourceVersion: \"434232\"\n  selfLink: /api/v1/namespaces/default/serviceaccounts/sa-example\n  uid: 42654626-8b42-4c5e-83de-fb836acfc934\nsecrets:\n- name: sa-example-token-lchv2\n```\n\n(2) 创建role \n\n```\nkind: Role\napiVersion: rbac.authorization.k8s.io/v1\nmetadata:\n  namespace: default                          # 命名空间\n  name: role-example\nrules:\n- apiGroups: [\"\"]\n  resources: [\"pods\"]                         # 可以访问pod\n  verbs: [\"get\", \"list\"]                      # 可以执行GET、LIST操作\n```\n\n(3) 创建rolebinding\n\n```\nkind: RoleBinding\napiVersion: rbac.authorization.k8s.io/v1\nmetadata:\n  name: 
rolebinding-example\n  namespace: default\nsubjects:                                \n- kind: User                              \n  name: user-example\n  apiGroup: rbac.authorization.k8s.io\n- kind: ServiceAccount                    \n  name: sa-example\n  namespace: default\nroleRef:                                  \n  kind: Role\n  name: role-example\n  apiGroup: rbac.authorization.k8s.io\n\n```\n\n(4) 将pod设置自定义sa\n\n```\nroot@k8s-master:~# cat pod.yaml\napiVersion: v1\nkind: Pod\nmetadata:\n  name: nginx\nspec:\n  serviceAccountName: sa-example\n  nodeName: k8s-node\n  containers:\n  - name: nginx\n    image: curlimages/curl:7.75.0\n    command:\n      - sleep\n      - \"3600\"\n```\n\n(5) 执行上诉命令\n\n```\nroot@k8s-master:~# kubectl get svc\nNAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE\nkubernetes   ClusterIP   10.0.0.1     <none>        443/TCP   2d5h\n```\n\n\n\n```\nroot@k8s-master:~# kubectl exec -it nginx sh\n\n/ $  export CURL_CA_BUNDLE=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt\n/ $ TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)\n\n/ $ curl -H \"Authorization: Bearer $TOKEN\" https://192.168.0.4:6443\n{\n  \"kind\": \"Status\",\n  \"apiVersion\": \"v1\",\n  \"metadata\": {\n\n  },\n  \"status\": \"Failure\",\n  \"message\": \"forbidden: User \\\"system:serviceaccount:default:sa-example\\\" cannot get path \\\"/\\\"\",\n  \"reason\": \"Forbidden\",\n  \"details\": {\n\n  },\n  \"code\": 403\n}/ $\n\n//有get pod的权限\n/ $ curl -H \"Authorization: Bearer $TOKEN\" https://10.0.0.1:443/api/v1/namespace\ns/default/pods\n{\n  \"kind\": \"PodList\",\n  \"apiVersion\": \"v1\",\n  \"metadata\": {\n    \"selfLink\": \"/api/v1/namespaces/default/pods\",\n    \"resourceVersion\": \"435185\"\n  },\n  \"items\": [\n    {\n      \"metadata\": {\n        \"name\": \"nginx\",\n        \"namespace\": \"default\",\n        \"selfLink\": \"/api/v1/namespaces/default/pods/nginx\",\n        \"uid\": 
\"0ceadb16-588f-40ae-a8c1-4d3cfb34df20\",\n        \"resourceVersion\": \"435049\",\n        \"creationTimestamp\": \"2021-10-25T13:57:11Z\",\n        \"annotations\": {\n          \"kubectl.kubernetes.io/last-applied-configuration\": \"{\\\"apiVersion\\\":\\\"v1\\\",\\\"kind\\\":\\\"Pod\\\",\\\"metadata\\\":{\\\"annotations\\\":{},\\\"name\\\":\\\"nginx\\\",\\\"namespace\\\":\\\"default\\\"},\\\"spec\\\":{\\\"containers\\\":[{\\\"command\\\":[\\\"sleep\\\",\\\"3600\\\"],\\\"image\\\":\\\"curlimages/curl:7.75.0\\\",\\\"name\\\":\\\"nginx\\\"}],\\\"nodeName\\\":\\\"k8s-node\\\",\\\"serviceAccountName\\\":\\\"sa-example\\\"}}\\n\"\n        }\n      },\n      \"spec\": {\n        \"volumes\": [\n          {\n            \"name\": \"sa-example-token-lchv2\",\n            \"secret\": {\n              \"secretName\": \"sa-example-token-lchv2\",\n              \"defaultMode\": 420\n            }\n          }\n        ],\n        \"containers\": [\n          {\n            \"name\": \"nginx\",\n            \"image\": \"curlimages/curl:7.75.0\",\n            \"command\": [\n              \"sleep\",\n              \"3600\"\n            ],\n            \"resources\": {\n\n            },\n            \"volumeMounts\": [\n              {\n                \"name\": \"sa-example-token-lchv2\",\n                \"readOnly\": true,\n                \"mountPath\": \"/var/run/secrets/kubernetes.io/serviceaccount\"\n              }\n            ],\n            \"terminationMessagePath\": \"/dev/termination-log\",\n            \"terminationMessagePolicy\": \"File\",\n            \"imagePullPolicy\": \"IfNotPresent\"\n          }\n        ],\n        \"restartPolicy\": \"Always\",\n        \"terminationGracePeriodSeconds\": 30,\n        \"dnsPolicy\": \"ClusterFirst\",\n        \"serviceAccountName\": \"sa-example\",\n        \"serviceAccount\": \"sa-example\",\n        \"nodeName\": \"k8s-node\",\n        \"securityContext\": {\n\n        },\n        \"schedulerName\": 
\"default-scheduler\",\n        \"tolerations\": [\n          {\n            \"key\": \"node.kubernetes.io/not-ready\",\n            \"operator\": \"Exists\",\n            \"effect\": \"NoExecute\",\n            \"tolerationSeconds\": 300\n          },\n          {\n            \"key\": \"node.kubernetes.io/unreachable\",\n            \"operator\": \"Exists\",\n            \"effect\": \"NoExecute\",\n            \"tolerationSeconds\": 300\n          }\n        ],\n        \"priority\": 0,\n        \"enableServiceLinks\": true\n      },\n      \"status\": {\n        \"phase\": \"Running\",\n        \"conditions\": [\n          {\n            \"type\": \"Initialized\",\n            \"status\": \"True\",\n            \"lastProbeTime\": null,\n            \"lastTransitionTime\": \"2021-10-25T13:57:11Z\"\n          },\n          {\n            \"type\": \"Ready\",\n            \"status\": \"True\",\n            \"lastProbeTime\": null,\n            \"lastTransitionTime\": \"2021-10-25T13:57:13Z\"\n          },\n          {\n            \"type\": \"ContainersReady\",\n            \"status\": \"True\",\n            \"lastProbeTime\": null,\n            \"lastTransitionTime\": \"2021-10-25T13:57:13Z\"\n          },\n          {\n            \"type\": \"PodScheduled\",\n            \"status\": \"True\",\n            \"lastProbeTime\": null,\n            \"lastTransitionTime\": \"2021-10-25T13:57:11Z\"\n          }\n        ],\n        \"hostIP\": \"192.168.0.5\",\n        \"podIP\": \"10.244.1.7\",\n        \"podIPs\": [\n          {\n            \"ip\": \"10.244.1.7\"\n          }\n        ],\n        \"startTime\": \"2021-10-25T13:57:11Z\",\n        \"containerStatuses\": [\n          {\n            \"name\": \"nginx\",\n            \"state\": {\n              \"running\": {\n                \"startedAt\": \"2021-10-25T13:57:12Z\"\n              }\n            },\n            \"lastState\": {\n\n            },\n            \"ready\": true,\n            \"restartCount\": 
0,\n            \"image\": \"curlimages/curl:7.75.0\",\n            \"imageID\": \"docker-pullable://curlimages/curl@sha256:28ec2dae8001949f657dbb36141508d65572f382dbd587f868289e2ceb0d47dd\",\n            \"containerID\": \"docker://d6e4cc4acfa4b3093d3ee82286cf67da117f7f6ce23fd47254ee64a79d8ff29f\",\n            \"started\": true\n          }\n        ],\n        \"qosClass\": \"BestEffort\"\n      }\n    }\n  ]\n}/ $\n\n\n\n//使用 apiserver的ip:端口也是可以的\n/ $ curl -H \"Authorization: Bearer $TOKEN\" https://192.168.0.4:6443/api/v1/names\npaces/default/pods\n{\n  \"kind\": \"PodList\",\n  \"apiVersion\": \"v1\",\n  \"metadata\": {\n    \"selfLink\": \"/api/v1/namespaces/default/pods\",\n    \"resourceVersion\": \"435286\"\n  },\n  \"items\": [\n    {\n      \"metadata\": {\n        \"name\": \"nginx\",\n        \"namespace\": \"default\",\n        \"selfLink\": \"/api/v1/namespaces/default/pods/nginx\",\n        \"uid\": \"0ceadb16-588f-40ae-a8c1-4d3cfb34df20\",\n        \"resourceVersion\": \"435049\",\n        \"creationTimestamp\": \"2021-10-25T13:57:11Z\",\n        \"annotations\": {\n          \"kubectl.kubernetes.io/last-applied-configuration\": \"{\\\"apiVersion\\\":\\\"v1\\\",\\\"kind\\\":\\\"Pod\\\",\\\"metadata\\\":{\\\"annotations\\\":{},\\\"name\\\":\\\"nginx\\\",\\\"namespace\\\":\\\"default\\\"},\\\"spec\\\":{\\\"containers\\\":[{\\\"command\\\":[\\\"sleep\\\",\\\"3600\\\"],\\\"image\\\":\\\"curlimages/curl:7.75.0\\\",\\\"name\\\":\\\"nginx\\\"}],\\\"nodeName\\\":\\\"k8s-node\\\",\\\"serviceAccountName\\\":\\\"sa-example\\\"}}\\n\"\n        }\n      },\n      \"spec\": {\n        \"volumes\": [\n          {\n            \"name\": \"sa-example-token-lchv2\",\n            \"secret\": {\n              \"secretName\": \"sa-example-token-lchv2\",\n              \"defaultMode\": 420\n            }\n          }\n        ],\n        \"containers\": [\n          {\n            \"name\": \"nginx\",\n            \"image\": \"curlimages/curl:7.75.0\",\n            
\"command\": [\n              \"sleep\",\n              \"3600\"\n            ],\n            \"resources\": {\n\n            },\n            \"volumeMounts\": [\n              {\n                \"name\": \"sa-example-token-lchv2\",\n                \"readOnly\": true,\n                \"mountPath\": \"/var/run/secrets/kubernetes.io/serviceaccount\"\n              }\n            ],\n            \"terminationMessagePath\": \"/dev/termination-log\",\n            \"terminationMessagePolicy\": \"File\",\n            \"imagePullPolicy\": \"IfNotPresent\"\n          }\n        ],\n        \"restartPolicy\": \"Always\",\n        \"terminationGracePeriodSeconds\": 30,\n        \"dnsPolicy\": \"ClusterFirst\",\n        \"serviceAccountName\": \"sa-example\",\n        \"serviceAccount\": \"sa-example\",\n        \"nodeName\": \"k8s-node\",\n        \"securityContext\": {\n\n        },\n        \"schedulerName\": \"default-scheduler\",\n        \"tolerations\": [\n          {\n            \"key\": \"node.kubernetes.io/not-ready\",\n            \"operator\": \"Exists\",\n            \"effect\": \"NoExecute\",\n            \"tolerationSeconds\": 300\n          },\n          {\n            \"key\": \"node.kubernetes.io/unreachable\",\n            \"operator\": \"Exists\",\n            \"effect\": \"NoExecute\",\n            \"tolerationSeconds\": 300\n          }\n        ],\n        \"priority\": 0,\n        \"enableServiceLinks\": true\n      },\n      \"status\": {\n        \"phase\": \"Running\",\n        \"conditions\": [\n          {\n            \"type\": \"Initialized\",\n            \"status\": \"True\",\n            \"lastProbeTime\": null,\n            \"lastTransitionTime\": \"2021-10-25T13:57:11Z\"\n          },\n          {\n            \"type\": \"Ready\",\n            \"status\": \"True\",\n            \"lastProbeTime\": null,\n            \"lastTransitionTime\": \"2021-10-25T13:57:13Z\"\n          },\n          {\n            \"type\": \"ContainersReady\",\n    
        \"status\": \"True\",\n            \"lastProbeTime\": null,\n            \"lastTransitionTime\": \"2021-10-25T13:57:13Z\"\n          },\n          {\n            \"type\": \"PodScheduled\",\n            \"status\": \"True\",\n            \"lastProbeTime\": null,\n            \"lastTransitionTime\": \"2021-10-25T13:57:11Z\"\n          }\n        ],\n        \"hostIP\": \"192.168.0.5\",\n        \"podIP\": \"10.244.1.7\",\n        \"podIPs\": [\n          {\n            \"ip\": \"10.244.1.7\"\n          }\n        ],\n        \"startTime\": \"2021-10-25T13:57:11Z\",\n        \"containerStatuses\": [\n          {\n            \"name\": \"nginx\",\n            \"state\": {\n              \"running\": {\n                \"startedAt\": \"2021-10-25T13:57:12Z\"\n              }\n            },\n            \"lastState\": {\n\n            },\n            \"ready\": true,\n            \"restartCount\": 0,\n            \"image\": \"curlimages/curl:7.75.0\",\n            \"imageID\": \"docker-pullable://curlimages/curl@sha256:28ec2dae8001949f657dbb36141508d65572f382dbd587f868289e2ceb0d47dd\",\n            \"containerID\": \"docker://d6e4cc4acfa4b3093d3ee82286cf67da117f7f6ce23fd47254ee64a79d8ff29f\",\n            \"started\": true\n          }\n        ],\n        \"qosClass\": \"BestEffort\"\n      }\n    }\n  ]\n}/ $\n```\n\n### 4. 如何通过client-go使用sa\n\n直接调用client-go/rest的InClusterConfig\n\n```\n    // creates the in-cluster config\n    config, err := rest.InClusterConfig()\n    if err != nil {\n        panic(err.Error())\n    }\n    // creates the clientset\n    clientset, err := kubernetes.NewForConfig(config)\n    if err != nil {\n        panic(err.Error())\n    }\n```\n\nInClusterConfig的源码分析，这里定义了tokenFile和rootCAFile\n\n```\n// InClusterConfig returns a config object which uses the service account\n// kubernetes gives to pods. It's intended for clients that expect to be\n// running inside a pod running on kubernetes. 
It will return ErrNotInCluster\n// if called from a process not running in a kubernetes environment.\nfunc InClusterConfig() (*Config, error) {\n\tconst (\n\t\ttokenFile  = \"/var/run/secrets/kubernetes.io/serviceaccount/token\"\n\t\trootCAFile = \"/var/run/secrets/kubernetes.io/serviceaccount/ca.crt\"\n\t)\n\thost, port := os.Getenv(\"KUBERNETES_SERVICE_HOST\"), os.Getenv(\"KUBERNETES_SERVICE_PORT\")\n\tif len(host) == 0 || len(port) == 0 {\n\t\treturn nil, ErrNotInCluster\n\t}\n\n\ttoken, err := ioutil.ReadFile(tokenFile)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\ttlsClientConfig := TLSClientConfig{}\n\n\tif _, err := certutil.NewPool(rootCAFile); err != nil {\n\t\tklog.Errorf(\"Expected to load root CA config from %s, but got err: %v\", rootCAFile, err)\n\t} else {\n\t\ttlsClientConfig.CAFile = rootCAFile\n\t}\n\n\treturn &Config{\n\t\t// TODO: switch to using cluster DNS.\n\t\tHost:            \"https://\" + net.JoinHostPort(host, port),\n\t\tTLSClientConfig: tlsClientConfig,\n\t\tBearerToken:     string(token),\n\t\tBearerTokenFile: tokenFile,\n\t}, nil\n}\n```\n\n"
  },
  {
    "path": "k8s/kube-apiserver/18 event的定义.md",
    "content": "\n\nk8s集群中，controller-manage、kube-proxy、kube-scheduler、kubelet等组件都会产生大量的event。这些event对查看集群对象状态或者监控告警等等都非常有用。本章写一下自己对k8s中event的理解。\n\n### 1. event的定义\n\nevent定义在：k8s.io/api/core/v1/types.go中\n\n```\ntype Event struct {\n    metav1.TypeMeta `json:\",inline\"`\n    metav1.ObjectMeta `json:\"metadata\" protobuf:\"bytes,1,opt,name=metadata\"`\n    InvolvedObject ObjectReference `json:\"involvedObject\" protobuf:\"bytes,2,opt,name=involvedObject\"`\n    Reason string `json:\"reason,omitempty\" protobuf:\"bytes,3,opt,name=reason\"`\n    Message string `json:\"message,omitempty\" protobuf:\"bytes,4,opt,name=message\"`\n    Source EventSource `json:\"source,omitempty\" protobuf:\"bytes,5,opt,name=source\"`\n    FirstTimestamp metav1.Time `json:\"firstTimestamp,omitempty\" protobuf:\"bytes,6,opt,name=firstTimestamp\"`\n    LastTimestamp metav1.Time `json:\"lastTimestamp,omitempty\" protobuf:\"bytes,7,opt,name=lastTimestamp\"`\n    Count int32 `json:\"count,omitempty\" protobuf:\"varint,8,opt,name=count\"`\n    Type string `json:\"type,omitempty\" protobuf:\"bytes,9,opt,name=type\"`\n    EventTime metav1.MicroTime `json:\"eventTime,omitempty\" protobuf:\"bytes,10,opt,name=eventTime\"`\n    Series *EventSeries `json:\"series,omitempty\" protobuf:\"bytes,11,opt,name=series\"`\n    Action string `json:\"action,omitempty\" protobuf:\"bytes,12,opt,name=action\"`\n    Related *ObjectReference `json:\"related,omitempty\" protobuf:\"bytes,13,opt,name=related\"`\n    ReportingController string `json:\"reportingComponent\" protobuf:\"bytes,14,opt,name=reportingComponent\"`\n    ReportingInstance string `json:\"reportingInstance\" protobuf:\"bytes,15,opt,name=reportingInstance\"`\n    ReportingInstance string `json:\"reportingInstance\" protobuf:\"bytes,15,opt,name=reportingInstance\"`\n}\n```\n\nCount，firstTimestamp和lasteTimestamp 表示事件重复了多少次\n\nMessage 详细的事件信息\n\nReason 简单的事件原因\n\nType  目前只支持：Normal和Warning俩种\n\nSource 事件发出的来源\n\nInvolvedObject 
引用的另一个Kubernetes对象，例如Pod或者Deployment\n\n<br>\n\n### 2. kubectl自定义输出k8s事件 - （该方法适用于所有对象）\n\n通常我们是通过kubectl 查看事件，如下：\n\n```\nroot@k8s-master:~# kubectl get event\nLAST SEEN   TYPE     REASON    OBJECT                        MESSAGE\n40m         Normal   Pulled    pod/zx-hpa-7b56cddd95-5j6r4   Container image \"busybox:latest\" already present on machine\n40m         Normal   Created   pod/zx-hpa-7b56cddd95-5j6r4   Created container busybox\n40m         Normal   Started   pod/zx-hpa-7b56cddd95-5j6r4   Started container busybox\n40m         Normal   Pulled    pod/zx-hpa-7b56cddd95-lthbz   Container image \"busybox:latest\" already present on machine\n40m         Normal   Created   pod/zx-hpa-7b56cddd95-lthbz   Created container busybox\n40m         Normal   Started   pod/zx-hpa-7b56cddd95-lthbz   Started container busybox\n29m         Normal   Pulled    pod/zx-hpa-7b56cddd95-n9ft9   Container image \"busybox:latest\" already present on machine\n29m         Normal   Created   pod/zx-hpa-7b56cddd95-n9ft9   Created container busybox\n29m         Normal   Started   pod/zx-hpa-7b56cddd95-n9ft9   Started container busybox\n```\n\n补充两点注意：\n\n（1）event 也是有 namespace 的，如果 kubectl get event 没有找到预期的事件，检查一下是否加上了对应的 ns\n\n（2）自定义event的输出\n\n默认的kubectl get event只输出了五列，有时并没有我们想看到的内容，这个时候可以利用kubectl 的强大输出功能，输出自己想看到的信息。\n\n```\n根据 kubectl 操作，支持以下输出格式：\n\nOutput format\tDescription\n-o custom-columns=<spec>\t使用逗号分隔的自定义列列表打印表。\n-o custom-columns-file=<filename>\t使用 <filename> 文件中的自定义列模板打印表。\n-o json\t输出 JSON 格式的 API 对象\n-o jsonpath=<template>\t打印 jsonpath 表达式定义的字段\n-o jsonpath-file=<filename>\t打印 <filename> 文件中 jsonpath 表达式定义的字段。\n-o name\t仅打印资源名称而不打印任何其他内容。\n-o wide\t以纯文本格式输出，包含任何附加信息。对于 pod 包含节点名。\n-o yaml\t输出 YAML 格式的 API 对象。\n```\n\n举例：这里我们想看到 event的 Count 和 name, 以及namespaces。\n\n**首先**，我通过 kubectl get event -oyaml查看到了event的所有字段，这里发现需要的字段分别是 count、metadata.name 和 metadata.namespace\n\n```\n- apiVersion: v1\n  count: 308\n  eventTime: null\n  firstTimestamp: \"2021-06-13T13:42:19Z\"\n  involvedObject:\n    
apiVersion: v1\n    fieldPath: spec.containers{busybox}\n    kind: Pod\n    name: zx-hpa-7b56cddd95-n9ft9\n    namespace: default\n    resourceVersion: \"1590656\"\n    uid: 379ef34e-3277-4367-a0e2-34645397590c\n  kind: Event\n  lastTimestamp: \"2021-06-26T08:46:33Z\"\n  message: Container image \"busybox:latest\" already present on machine\n  metadata:\n    creationTimestamp: \"2021-06-13T13:42:19Z\"\n    name: zx-hpa-7b56cddd95-n9ft9.16882815c5dbca52\n    namespace: default\n    resourceVersion: \"4136244\"\n    selfLink: /api/v1/namespaces/default/events/zx-hpa-7b56cddd95-n9ft9.16882815c5dbca52\n    uid: 9f4becb5-984d-4b3d-9308-aa6bde3e3d87\n  reason: Pulled\n  reportingComponent: \"\"\n  reportingInstance: \"\"\n  source:\n    component: kubelet\n    host: 192.168.0.5\n  type: Normal\n```\n\n<br>\n\n**然后** kubectl 自定义输出\n\n```\nroot@k8s-master:~# kubectl get event -o custom-columns=count:count,ns:metadata.namespace,name:metadata.name\ncount   ns        name\n51      default   test-pod2.1685b3d2de5432c9\n51      default   test-pod2.1685b3d2e7bdb58c\n50      default   test-pod2.1685d490dface300\n309     default   zx-hpa-7b56cddd95-5j6r4.168827811e1dfc40\n309     default   zx-hpa-7b56cddd95-5j6r4.168827812259f1fc\n309     default   zx-hpa-7b56cddd95-5j6r4.168827812ae481ff\n309     default   zx-hpa-7b56cddd95-lthbz.168827811fb9b32d\n309     default   zx-hpa-7b56cddd95-lthbz.16882781231b0eaf\n309     default   zx-hpa-7b56cddd95-lthbz.168827812adf08c9\n309     default   zx-hpa-7b56cddd95-n9ft9.16882815c5dbca52\n309     default   zx-hpa-7b56cddd95-n9ft9.16882815ca83dabb\n309     default   zx-hpa-7b56cddd95-n9ft9.16882815d2c77d3b\n```\n"
  },
  {
    "path": "k8s/kube-apiserver/19. secret对象详解.md",
    "content": "- [1. Secret 介绍-分为三大类](#1-secret---------)\n  * [1.1 Opaque Secret方式](#11-opaque-secret--)\n    + [1.1.1 通过volume挂载和环境变量的区别](#111---volume----------)\n    + [1.1.2 Secret 与 ConfigMap 对比](#112-secret---configmap---)\n  * [1.2 kubernetes.io/dockerconfigjson](#12-kubernetesio-dockerconfigjson)\n  * [1.3 Service Account类型](#13-service-account--)\n  * [1.4 secret三种类型的原理](#14-secret-------)\n- [3.附录](#3--)\n  * [3.1 K8S Configmap 和 Secret 作为 Volume 的热更新原理](#31-k8s-configmap---secret----volume-------)\n    + [热更新原理](#-----)\n    + [参考文献](#----)\n\n### 1. Secret 介绍-分为三大类\n\nSecret解决了密码、token、密钥等敏感数据的配置问题，而不需要把这些敏感数据暴露到镜像或者Pod Spec中。Secret可以以Volume或者环境变量的方式使用。\n\nSecret有三种类型：\n\n- Service Account：用来访问Kubernetes API，由Kubernetes自动创建，并且会自动挂载到Pod的`/run/secrets/kubernetes.io/serviceaccount`目录中；\n- Opaque：base64编码格式的Secret，用来存储密码、密钥等；\n- `kubernetes.io/dockerconfigjson`：用来存储私有docker registry的认证信息。\n\n具体详见结构体定义：type Secret struct \n\n#### 1.1 Opaque Secret方式\n\nOpaque类型的数据是一个map类型，要求value是base64编码格式：\n\n```\n$ echo -n \"admin\" | base64\nYWRtaW4=\n$ echo -n \"1f2d1e2e67df\" | base64\nMWYyZDFlMmU2N2Rm\n```\n\nsecrets.yml\n\n```\napiVersion: v1\nkind: Secret\nmetadata:\n  name: mysecret\ntype: Opaque\ndata:\n  password: MWYyZDFlMmU2N2Rm\n  username: YWRtaW4=\n```\n\n接着，就可以创建secret了：`kubectl create -f secrets.yml`。\n\n创建好secret之后，有两种方式来使用它：\n\n- 以Volume方式\n- 以环境变量方式\n\n（1）volume方式\n\n```\n#test-projected-volume.yaml\n \napiVersion: v1\nkind: Pod\nmetadata:\n  name: test-projected-volume \nspec:\n  containers:\n  - name: test-secret-volume\n    image: busybox\n    args:\n    - sleep\n    - \"86400\"\n    volumeMounts:\n    - name: mysql-cred\n      mountPath: \"/projected-volume\"\n      readOnly: true\n  volumes:\n  - name: mysql-cred\n    projected:\n      sources:\n      - secret:\n          name: user\n      - secret:\n          name: pass\n```\n\n当 Pod 变成 Running 状态之后，我们再验证一下这些 Secret 对象是不是已经在容器里了：\n\n```\n$ kubectl exec -it test-projected-volume -- /bin/sh\n$ ls 
/projected-volume/\nuser\npass\n$ cat /projected-volume/user\nadmin\n$ cat /projected-volume/pass\n```\n\n（2）通过环境变量\n\n```\n#pod-secret-env.yaml\napiVersion: v1\nkind: Pod\nmetadata:\n  name: pod-secret-env\nspec:\n  containers:\n  - name: myapp\n    image: busybox\n    args:\n    - sleep\n    - \"86400\"\n    env:\n    - name: SECRET_USERNAME\n      valueFrom:\n        secretKeyRef:\n          name: mysecret\n          key: username\n    - name: SECRET_PASSWORD\n      valueFrom:\n        secretKeyRef:\n          name: mysecret\n          key: password\n  restartPolicy: Never\n```\n\npod运行成功后：\n\n$ kubectl exec -it pod-secret-env -- /bin/sh\n\n进入容器中查看环境变量: env\n\n##### 1.1.1 通过volume挂载和环境变量的区别\n\n通过Volume挂载到容器内部时，当该Secret的值发生变化时，容器内部具备自动更新的能力，但是通过环境变量设置到容器内部该值不具备自动更新的能力。所以一般推荐使用Volume挂载的方式使用Secret。\n\n**热更新原理参考附录**\n\n##### 1.1.2 Secret 与 ConfigMap 对比\n\n最后我们来对比下Secret和ConfigMap这两种资源对象的异同点：\n\n**相同点：**\n\nkey/value的形式\n\n属于某个特定的namespace\n\n可以导出到环境变量\n\n可以通过目录/文件形式挂载\n\n通过 volume 挂载的配置信息均可热更新\n\n**不同点：**\n\nSecret 可以被 ServiceAccount 关联\n\nSecret 可以存储 docker registry 的鉴权信息，用在 ImagePullSecret 参数中，用于拉取私有仓库的镜像\n\nSecret 的 value 以 Base64 编码存储（注意：Base64 只是编码，并不是加密）\n\nSecret 分为 kubernetes.io/service-account-token、kubernetes.io/dockerconfigjson、Opaque 三种类型，而 Configmap 不区分类型\n\n#### 1.2 kubernetes.io/dockerconfigjson\n\n这个是为了应付 pull 私有image时候的权限问题。常见用法是：\n\n（1）创建secret\n\n```\n$ cat ~/.docker/config.json | base64\n$ cat > myregistrykey.yaml <<EOF\napiVersion: v1\nkind: Secret\nmetadata:\n  name: myregistrykey\ndata:\n  .dockerconfigjson: UmVhbGx5IHJlYWxseSByZWVlZWVlZWVlZWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWxsbGxsbGxsbGxsbGxsbGxsbGxsbGxsbGxsbGxsbGx5eXl5eXl5eXl5eXl5eXl5eXl5eSBsbGxsbGxsbGxsbGxsbG9vb29vb29vb29vb29vb29vb29vb29vb29vb25ubm5ubm5ubm5ubm5ubm5ubm5ubm5ubmdnZ2dnZ2dnZ2dnZ2dnZ2dnZ2cgYXV0aCBrZXlzCg==\ntype: kubernetes.io/dockerconfigjson\nEOF\n$ kubectl create -f myregistrykey.yaml\n```\n\n(2) 将这个secret和serviceaccount绑定\n\n```\nroot@cld-kmaster1-1022:/home/ngadm# kubectl get serviceaccount -n test-nsp-gzchenyifan\nNAME    
  SECRETS   AGE\ndefault   1         3h28m\n\n\nroot@cld-kmaster1-1022:/home/ngadm# kubectl get serviceaccount -n test-nsp-gzchenyifan -oyaml\napiVersion: v1\nitems:\n- apiVersion: v1\n  imagePullSecrets:\n  - name: myregistrykey\n  kind: ServiceAccount\n  metadata:\n    creationTimestamp: \"2022-11-03T06:00:30Z\"\n    name: default\n    namespace: test-test\n    resourceVersion: \"1540683279\"\n    selfLink: /api/v1/namespaces/test-nsp-gzchenyifan/serviceaccounts/default\n    uid: 957739c2-cac9-4bae-bad9-0862ca413dd2\n  secrets:\n  - name: default-token-tb8xx   //默认的secret\nkind: List\nmetadata:\n  resourceVersion: \"\"\n  selfLink: \"\"\n  \n// 再查看pod yaml的时候，就会发现指定了这个myregistrykey\n imagePullSecrets:\n  - name: myregistrykey\n```\n\n#### 1.3 Service Account类型\n\nService Account用来访问Kubernetes API，由Kubernetes自动创建，并且会自动挂载到Pod的`/run/secrets/kubernetes.io/serviceaccount`目录中。\n\n```\n$ kubectl run nginx --image nginx\ndeployment \"nginx\" created\n$ kubectl get pods\nNAME                     READY     STATUS    RESTARTS   AGE\nnginx-3137573019-md1u2   1/1       Running   0          13s\n$ kubectl exec nginx-3137573019-md1u2 ls /run/secrets/kubernetes.io/serviceaccount\nca.crt\nnamespace\ntoken\n```\n\n**serviceAccount资源介绍**\n\n参考github源码分析 https://github.com/zoux86/learning-k8s-source-code/blob/master/k8s/kube-apiserver/17-k8s%E4%B9%8Bserviceaccount.md\n\n#### 1.4 secret三种类型的原理\n\n其实都是kubelet 的secret plugin在起作用。如果是dockerconfigjson类型，它通过拉取secret的值，为pod提供拉取私有镜像的认证信息；如果是service account类型，它则拉取sa对应的token等值挂载到容器中。\n\n具体代码：pkg/volume/secret/secret.go\n\n### 3.附录\n\n#### 3.1 K8S Configmap 和 Secret 作为 Volume 的热更新原理\n\nconfigmap/secret 作为 volume 挂载在容器内，如果 configmap 值发生变化，最大等待时间在 kubelet resyncInterval(60s) 内 该 mount 的 key 就会变成最新值。比如 cilium pod 挂载 cilium-config configmap，如果修改该 configmap 的 debug:false 为 true， 最多等待 60s，容器内该 debug 文件值就是 true。\n\n但是作为环境变量 env 和 volume subpath 不支持热更新，环境变量在初始化过程就固定了。\n\n##### 热更新原理\n\n(1) kubelet 会在每 60s 内去 syncPod()，检查 pod 的 volume 
kubelet.volumeManager.WaitForAttachAndMount(pod)， [github.com/kubernetes/…](https://link.juejin.cn?target=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fblob%2Fv1.19.7%2Fpkg%2Fkubelet%2Fkubelet.go%23L1592-L1600) [github.com/kubernetes/…](https://link.juejin.cn?target=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fblob%2Fv1.19.7%2Fpkg%2Fkubelet%2Fvolumemanager%2Fvolume_manager.go%23L375-L378)\n\n这里重点是 ReprocessPod()，会把这个 pod 又标记为未处理，等待 desiredStateOfWorldPopulator 下一次循环去 MarkRemountRequired()\n\n(2) desiredStateOfWorldPopulator 下一次循环，会走 findAndAddNewPods() -> processPodVolumes() 这里重点是 dswp.actualStateOfWorld.MarkRemountRequired(uniquePodName)，在 actual 里 MarkRemountRequired [github.com/kubernetes/…](https://link.juejin.cn?target=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fblob%2Fv1.19.7%2Fpkg%2Fkubelet%2Fvolumemanager%2Fpopulator%2Fdesired_state_of_world_populator.go%23L358-L364) 这里会判断每一个 volumePlugin.RequiresRemount()，而对于 configmap/secret volume 是 true，对于 csi 是 false [github.com/kubernetes/…](https://link.juejin.cn?target=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fblob%2Fv1.19.7%2Fpkg%2Fkubelet%2Fvolumemanager%2Fcache%2Factual_state_of_world.go%23L541-L566) [github.com/kubernetes/…](https://link.juejin.cn?target=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fblob%2Fv1.19.7%2Fpkg%2Fvolume%2Fconfigmap%2Fconfigmap.go%23L81-L83) [github.com/kubernetes/…](https://link.juejin.cn?target=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fblob%2Fv1.19.7%2Fpkg%2Fvolume%2Fcsi%2Fcsi_plugin.go%23L337-L339)\n\n(3) 然后再下一次循环里去 mountAttachVolumes() PodExistsInVolume() 会走 podObj.remountRequired，因为 MarkRemountRequired() 已经设置了需要 remount，然后 mountAttachVolumes() 里走 MountVolume() 逻辑：[github.com/kubernetes/…](https://link.juejin.cn?target=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fblob%2Fv1.19.7%2Fpkg%2Fkubelet%2Fvolumemanager%2Freconciler%2Freconciler.go%23L247-L273) 这样就走 configmap/secret mount 逻辑。\n\n(4) configmap/secret mount 会使用 emptyDir 
plugin 来创建落盘目录 configmap 用的 v1.StorageMediumDefault [github.com/kubernetes/…](https://link.juejin.cn?target=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fblob%2Fv1.19.7%2Fpkg%2Fvolume%2Fconfigmap%2Fconfigmap.go%23L166-L174)\n\nsecret 用的 v1.StorageMediumMemory [github.com/kubernetes/…](https://link.juejin.cn?target=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fblob%2Fv1.19.7%2Fpkg%2Fvolume%2Fsecret%2Fsecret.go%23L51-L55) ， 对于 secret 首次 mount 会使用命令 `mount -t tmpfs xxx`: [github.com/kubernetes/…](https://link.juejin.cn?target=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fblob%2Fv1.19.7%2Fpkg%2Fvolume%2Femptydir%2Fempty_dir.go%23L232-L233) [github.com/kubernetes/…](https://link.juejin.cn?target=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fblob%2Fv1.19.7%2Fpkg%2Fvolume%2Femptydir%2Fempty_dir.go%23L265-L286)\n\n对于 configmap 这里的 wrapped 是 emptyDir，主要用来创建文件和权限\n\n```go\nwrapped, err := b.plugin.host.NewWrapperMounter(b.volName, wrappedVolumeSpec(), &b.pod, *b.opts)\nwrapped.SetUpAt(dir, mounterArgs)\n\n// 这里的 getConfigMap 是 configmapManager 的 configMapManager.GetConfigMap()\n// https://github.com/kubernetes/kubernetes/blob/v1.19.7/pkg/kubelet/configmap/configmap_manager.go#L82-L91\n// 注意，kubelet 默认使用 kubeletconfiginternal.WatchChangeDetectionStrategy 的 configmapManager，所以 configmap\n// 发生变化，configmapManager 立刻拿到最新的值：https://github.com/kubernetes/kubernetes/blob/v1.19.7/pkg/kubelet/kubelet.go#L538-L540\n// 只是需要等待 kubelet 每次的 resyncInterval 60s 去 syncPod，所以每次修改 configmap 最大等待时间是 60s。\nconfigMap, err := b.getConfigMap(b.pod.Namespace, b.source.Name)\n\n// 然后把最新的 configmap 对象数据写到每一个文件里\npayload, err := MakePayload(b.source.Items, configMap, b.source.DefaultMode, optional)\nerr = writer.Write(payload)\n```\n\n##### 
参考文献\n\n**[mounted-configmaps-are-updated-automatically](https://link.juejin.cn?target=https%3A%2F%2Fkubernetes.io%2Fdocs%2Ftasks%2Fconfigure-pod-container%2Fconfigure-pod-configmap%2F%23mounted-configmaps-are-updated-automatically)**\n\n**[mounted-configmaps-are-updated-automatically](https://link.juejin.cn?target=https%3A%2F%2Fkubernetes.io%2Fdocs%2Fconcepts%2Fconfiguration%2Fconfigmap%2F%23mounted-configmaps-are-updated-automatically)**\n\n**[Kubernetes Pod 中的 ConfigMap 配置更新](https://link.juejin.cn?target=https%3A%2F%2Fdockone.io%2Farticle%2F8632)**\n\n**[分别测试使用 ConfigMap 挂载 Env 和 Volume 的情况](https://link.juejin.cn?target=https%3A%2F%2Fcodeantenna.com%2Fa%2Fpf1zJAzHF6)**\n\n开始只有 NewCachingConfigMapManager()，除了 kubelet resyncInterval 时间还有个 ttl 时间，经过讨论后期加了 NewWatchingConfigMapManager, 直接 watch 立刻拿到最新的 configmap，只需要等待最大 kubelet resyncInterval 时间。下面链接是 issue 和 pr：\n\n**[Kubelet watches necessary secrets/configmaps instead of periodic polling](https://link.juejin.cn?target=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fpull%2F64752)**\n\n**[Migrate kubelet to ConfigMapManager interface and use TTL-based caching manager](https://link.juejin.cn?target=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fpull%2F46470)**\n\n**[kubelet refresh times for configmaps is long and random](https://link.juejin.cn?target=https%3A%2F%2Fgithub.com%2Fkubernetes%2Fkubernetes%2Fissues%2F30189)**"
  },
  {
    "path": "k8s/kube-apiserver/2-kube-apiserver概述.md",
    "content": "* [Table of Contents](#table-of-contents)\n    * [1\\. kube\\-apiserver组件整体功能](#1-kube-apiserver组件整体功能)\n    * [2\\. bootstrap\\-controller](#2-bootstrap-controller)\n      * [2\\.1 NewBootstrapController](#21-newbootstrapcontroller)\n      * [2\\.2 BootstrapController\\.PostStartHook](#22-bootstrapcontrollerpoststarthook)\n      * [2\\.3 四个函数](#23-四个函数)\n        * [1\\-RunKubernetesNamespaces](#1-runkubernetesnamespaces)\n        * [2\\- RunKubernetesService](#2--runkubernetesservice)\n        * [3\\- repairClusterIPs\\.RunUntil](#3--repairclusteripsrununtil)\n        * [4\\-repairNodePorts\\.RunUntil](#4-repairnodeportsrununtil)\n      * [2\\.4 总结](#24-总结)\n    * [3\\. KubeAPIServer](#3-kubeapiserver)\n    * [4\\.aggregatorServer](#4aggregatorserver)\n    * [5\\. apiExtensionsServer](#5-apiextensionsserver)\n    * [6\\.总结](#6总结)\n      * [6\\.1 kubeAPIServer, apiExtensionsServer, aggregatorServer 总结](#61-kubeapiserver-apiextensionsserver-aggregatorserver-总结)\n      * [6\\.2 bootstrap\\-controller](#62-bootstrap-controller)\n    * [7\\. 参考文档](#7-参考文档)\n\n**本章重点：**\n\n（1）对kube-apiserver进行简单介绍\n\n（2）介绍kube-apiserver的四个组成部分：kubeApiServer, aggregatorServer, apiExtensionsServer, 以及bootstrap-controller（这个一般很少关注到）\n\nbootstrap-controller主要有以下四个功能：\n\n- 创建 kubernetes service；\n- 创建 default、kube-system 和 kube-public 以及 kube-node-lease 命名空间；\n- 提供基于 Service ClusterIP 的修复及检查功能；\n- 提供基于 Service NodePort 的修复及检查功能；\n\n<br>\n\n### 1. 
kube-apiserver组件整体功能\n\nkube-apiserver 是唯一一个和 etcd打交道的组件。其他的组件都是通过apiserver提供的RESTful APIs间接操作集群中的资源，主要有以下的功能：\n\n* 获取请求内容\n\n* 请求内容检查\n\n* 认证、audit、授权\n\n* 修改式准入控制\n\n* 路由\n\n* 验证式准入控制\n\n* 资源的格式转换\n\n* 持久化存储到etcd等功能\n\n![image-20210128112659986](../images/apiserver-construct.png)\n\nk8s中api-server实际上包括四个部分：\n\n- **KubeApiServer**： 负责对请求的一些通用处理，包括：认证、鉴权以及各个内建资源(pod, deployment, service 等)的REST服务等\n- **bootstrap-controller**，主要负责Kubernetes default apiserver service的创建以及管理。\n- **ApiExtensionsServer**   负责CustomResourceDefinition（CRD）apiResources以及apiVersions的注册，同时处理CRD以及相应CustomResource（CR）的REST请求(如果对应CR不能被处理的话则会返回404)，也是apiserver Delegation的最后一环\n- **AggregatorServer**   负责处理 `apiregistration.k8s.io` 组下的APIService资源请求，同时将来自用户的请求拦截转发给aggregated server(AA)\n\n其中**KubeApiServer** , **ApiExtensionsServer**  ,**AggregatorServer**   通过链条的形式组合起来（**使用了责任链模式**）。\n\n![image-20210223173400900](../images/apiserver-construct-1.png)\n\n接下来将对四个组件进行分析\n\n<br>\n\n### 2. bootstrap-controller\n\nbootstrap-controller主要有以下四个功能：\n\n- 创建 kubernetes service；\n- 创建 default、kube-system、kube-public 以及 kube-node-lease 命名空间；\n- 提供基于 Service ClusterIP 的修复及检查功能；\n- 提供基于 Service NodePort 的修复及检查功能；\n\n**创建 kubernetes service就是下面这个 svc. 
这个用于集群内部资源的访问**\n\n```\n[root@k8s-master ~]# kubectl get svc\nNAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE\nkubernetes   ClusterIP   10.0.0.1     <none>        443/TCP   108d\n[root@k8s-master ~]# kubectl get ep -oyaml\napiVersion: v1\nitems:\n- apiVersion: v1\n  kind: Endpoints\n  metadata:\n    creationTimestamp: 2020-12-23T12:34:11Z\n    name: kubernetes\n    namespace: default\n    resourceVersion: \"34\"\n    selfLink: /api/v1/namespaces/default/endpoints/kubernetes\n    uid: 287f22bd-451b-11eb-bb05-fa270004b00d\n  subsets:\n  - addresses:\n    - ip: 192.168.0.4\n    ports:\n    - name: https\n      port: 6443\n      protocol: TCP\nkind: List\nmetadata:\n  resourceVersion: \"\"\n  selfLink: \"\"\n```\n\n<br>\n\n- **apiserver bootstrap-controller** 创建&运行逻辑在k8s.io/kubernetes/pkg/master目录\n- **bootstrap-controller** 主要用于创建以及维护内部kubernetes default apiserver service (就是 default命名空间下的 kubernetes服务)\n- **kubernetes default apiserver service spec.selector**为空，这是default apiserver service与其它正常service的最大区别，表明了这个特殊的service对应的endpoints不由endpoints controller控制，而是直接受kube-apiserver bootstrap-controller管理(maintained by this code, not by the pod selector)\n\n```\n// CreateServerChain creates the apiservers connected via delegation.\nfunc CreateServerChain(completedOptions completedServerRunOptions, stopCh <-chan struct{}) (*aggregatorapiserver.APIAggregator, error) {\n\n\tapiExtensionsServer, err := createAPIExtensionsServer(apiExtensionsConfig, genericapiserver.NewEmptyDelegate())\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n    // 1.创建KubeAPIServer的时候，调用了CreateKubeAPIServer\n\tkubeAPIServer, err := CreateKubeAPIServer(kubeAPIServerConfig, apiExtensionsServer.GenericAPIServer)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\t// aggregator comes last in the chain\n\taggregatorConfig, err := createAggregatorConfig(*kubeAPIServerConfig.GenericConfig, completedOptions.ServerRunOptions, kubeAPIServerConfig.ExtraConfig.VersionedInformers, serviceResolver, 
proxyTransport, pluginInitializer)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\t....\n\n\treturn aggregatorServer, nil\n}\n\n## 2.调用了New\n// CreateKubeAPIServer creates and wires a workable kube-apiserver\nfunc CreateKubeAPIServer(kubeAPIServerConfig *master.Config, delegateAPIServer genericapiserver.DelegationTarget) (*master.Master, error) {\n\tkubeAPIServer, err := kubeAPIServerConfig.Complete().New(delegateAPIServer)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\treturn kubeAPIServer, nil\n}\n\n\n## 3. 其中有一步调用了InstallLegacyAPI\n// New returns a new instance of Master from the given config.\n// Certain config fields will be set to a default value if unset.\n// Certain config fields must be specified, including:\n//   KubeletClientConfig\nfunc (c completedConfig) New(delegationTarget genericapiserver.DelegationTarget) (*Master, error) {\n\t....\n\t\n\t\tif err := m.InstallLegacyAPI(&c, c.GenericConfig.RESTOptionsGetter, legacyRESTStorageProvider); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n\n\treturn m, nil\n}\n\n\n## 4.InstallLegacyAPI 中将 bootstrap-controller的启动和停止，添加到了apiserver 的 PostStartHook 和 PreShutdownHook \n// InstallLegacyAPI will install the legacy APIs for the restStorageProviders if they are enabled.\nfunc (m *Master) InstallLegacyAPI(c *completedConfig, restOptionsGetter generic.RESTOptionsGetter, legacyRESTStorageProvider corerest.LegacyRESTStorageProvider) error {\n\n   ## 1.将bootstrap-controller的启停添加到apiserver 的 PostStartHook 和 PreShutdownHook \n\tcontrollerName := \"bootstrap-controller\"\n\tcoreClient := corev1client.NewForConfigOrDie(c.GenericConfig.LoopbackClientConfig)\n\t\n\t## 2. 
New一个BootstrapController\n\tbootstrapController := c.NewBootstrapController(legacyRESTStorage, coreClient, coreClient, coreClient, coreClient.RESTClient())\n\t\n\tm.GenericAPIServer.AddPostStartHookOrDie(controllerName, bootstrapController.PostStartHook)\n\tm.GenericAPIServer.AddPreShutdownHookOrDie(controllerName, bootstrapController.PreShutdownHook)\n\n\treturn nil\n}\n\n## postStartHooks 会在 kube-apiserver 的启动方法 prepared.Run 中调用 RunPostStartHooks 启动所有 Hook\n\n// NonBlockingRun spawns the secure http server. An error is\n// returned if the secure port cannot be listened on.\nfunc (s preparedGenericAPIServer) NonBlockingRun(stopCh <-chan struct{}) error {\n\t// Use an stop channel to allow graceful shutdown without dropping audit events\n\t// after http server shutdown.\n\tauditStopCh := make(chan struct{})\n\n\t// Start the audit backend before any request comes in. This means we must call Backend.Run\n\t// before http server start serving. Otherwise the Backend.ProcessEvents call might block.\n\tif s.AuditBackend != nil {\n\t\tif err := s.AuditBackend.Run(auditStopCh); err != nil {\n\t\t\treturn fmt.Errorf(\"failed to run the audit backend: %v\", err)\n\t\t}\n\t}\n\n\t// Use an internal stop channel to allow cleanup of the listeners on error.\n\tinternalStopCh := make(chan struct{})\n\tvar stoppedCh <-chan struct{}\n\tif s.SecureServingInfo != nil && s.Handler != nil {\n\t\tvar err error\n\t\tstoppedCh, err = s.SecureServingInfo.Serve(s.Handler, s.ShutdownTimeout, internalStopCh)\n\t\tif err != nil {\n\t\t\tclose(internalStopCh)\n\t\t\tclose(auditStopCh)\n\t\t\treturn err\n\t\t}\n\t}\n\n\t// Now that listener have bound successfully, it is the\n\t// responsibility of the caller to close the provided channel to\n\t// ensure cleanup.\n\tgo func() {\n\t\t<-stopCh\n\t\tclose(internalStopCh)\n\t\tif stoppedCh != nil {\n\t\t\t<-stoppedCh\n\t\t}\n\t\ts.HandlerChainWaitGroup.Wait()\n\t\tclose(auditStopCh)\n\t}()\n\n\ts.RunPostStartHooks(stopCh)\n\n\tif _, err := 
systemd.SdNotify(true, \"READY=1\\n\"); err != nil {\n\t\tklog.Errorf(\"Unable to send systemd daemon successful start message: %v\\n\", err)\n\t}\n\n\treturn nil\n}\n\n// RunPostStartHooks runs the PostStartHooks for the server\nfunc (s *GenericAPIServer) RunPostStartHooks(stopCh <-chan struct{}) {\n\ts.postStartHookLock.Lock()\n\tdefer s.postStartHookLock.Unlock()\n\ts.postStartHooksCalled = true\n\n\tcontext := PostStartHookContext{\n\t\tLoopbackClientConfig: s.LoopbackClientConfig,\n\t\tStopCh:               stopCh,\n\t}\n\n\tfor hookName, hookEntry := range s.postStartHooks {\n\t\tgo runPostStartHook(hookName, hookEntry, context)\n\t}\n}\n```\n\nbootstrap controller 的初始化以及启动是在 `CreateKubeAPIServer` 调用链的 `InstallLegacyAPI` 方法中完成的，bootstrap controller 的启停是由 apiserver 的 `PostStartHook` 和 `PreShutdownHook` 进行控制的\n\n<br>\n\n#### 2.1 NewBootstrapController\n\n bootstrap controller 在初始化时需要设定多个参数，主要有 PublicIP、ServiceCIDR、PublicServicePort 等。PublicIP 是通过命令行参数 `--advertise-address` 指定的，PublicServicePort 通过 `--secure-port` 启动参数来指定（默认为 6443），ServiceCIDR 通过 `--service-cluster-ip-range` 参数指定（默认为 10.0.0.0/24） \n\n```\n// k8s.io/kubernetes/pkg/master/controller.go:87\n// NewBootstrapController returns a controller for watching the core capabilities of the master\nfunc (c *completedConfig) NewBootstrapController(legacyRESTStorage corerest.LegacyRESTStorage, serviceClient corev1client.ServicesGetter, nsClient corev1client.NamespacesGetter, eventClient corev1client.EventsGetter, healthClient rest.Interface) *Controller {\n\t// 1、获取 PublicServicePort  \n\t_, publicServicePort, err := c.GenericConfig.SecureServing.HostPort()\n\tif err != nil {\n\t\tklog.Fatalf(\"failed to get listener address: %v\", err)\n\t}\n\n\t// 2、指定需要创建的kube-system，kube-public以及kube-node-lease namespace\n\tsystemNamespaces := []string{metav1.NamespaceSystem, metav1.NamespacePublic, corev1.NamespaceNodeLease}\n\n\treturn &Controller{\n\t\tServiceClient:   serviceClient,\n\t\tNamespaceClient: 
nsClient,\n\t\tEventClient:     eventClient,\n\t\thealthClient:    healthClient,\n\n\t\tEndpointReconciler: c.ExtraConfig.EndpointReconcilerConfig.Reconciler,\n\t\tEndpointInterval:   c.ExtraConfig.EndpointReconcilerConfig.Interval,\n\n\t\tSystemNamespaces:         systemNamespaces,\n\t\tSystemNamespacesInterval: 1 * time.Minute,\n\n\t\tServiceClusterIPRegistry:          legacyRESTStorage.ServiceClusterIPAllocator,\n\t\t// ServiceCIDR 通过 --service-cluster-ip-range 参数指定  \n\t\tServiceClusterIPRange:             c.ExtraConfig.ServiceIPRange,\n\t\tSecondaryServiceClusterIPRegistry: legacyRESTStorage.SecondaryServiceClusterIPAllocator,\n\t\tSecondaryServiceClusterIPRange:    c.ExtraConfig.SecondaryServiceIPRange,\n\n\t\tServiceClusterIPInterval: 3 * time.Minute,\n\n\t\tServiceNodePortRegistry: legacyRESTStorage.ServiceNodePortAllocator,\n\t\tServiceNodePortRange:    c.ExtraConfig.ServiceNodePortRange,\n\t\tServiceNodePortInterval: 3 * time.Minute,\n\n\t\t// API Server 绑定的IP，这个IP会作为kubernetes service的Endpoint的IP，通过--advertise-address指定   \n\t\tPublicIP: c.GenericConfig.PublicAddress,\n\n\t\t// 取 clusterIP range 中的第一个 IP    \n\t\tServiceIP:                 c.ExtraConfig.APIServerServiceIP,\n\t\t// 默认为 443    \n\t\tServicePort:               c.ExtraConfig.APIServerServicePort,\n\t\tExtraServicePorts:         c.ExtraConfig.ExtraServicePorts,\n\t\tExtraEndpointPorts:        c.ExtraConfig.ExtraEndpointPorts,\n\t\t// 通过--secure-port指定，默认为6443\n\t\tPublicServicePort:         publicServicePort,\n\t\t// 缺省是基于 ClusterIP 启动模式，这里为0    \n\t\tKubernetesServiceNodePort: c.ExtraConfig.KubernetesServiceNodePort,\n\t}\n}\n```\n\n<br>\n\n#### 2.2 BootstrapController.PostStartHook\n\nbootstrapController.PostStartHook 就是下面的 Start()函数.\n\nkube-apiserver 运行起来之前会调用BootstrapController.PostStartHook，该函数涵盖了bootstrapController的核心功能，主要包括：修复 ClusterIP、修复 NodePort、更新 kubernetes service以及创建系统所需要的名字空间（default、kube-system、kube-public）。bootstrap controller 在启动后首先会完成一次 ClusterIP、NodePort 和 Kubernetes 
服务的处理，然后异步循环运行上面的4个工作。以下是其 `PostStartHook`方法： \n\n```\n// k8s.io/kubernetes/pkg/master/controller.go:142\n// Start begins the core controller loops that must exist for bootstrapping\n// a cluster.\nfunc (c *Controller) Start() {\n\tif c.runner != nil {\n\t\treturn\n\t}\n\n\t// 1、首次启动时首先从 kubernetes endpoints 中移除自身的配置，此时 kube-apiserver 可能处于非 ready 状态\n\t// Reconcile during first run removing itself until server is ready.\n\tendpointPorts := createEndpointPortSpec(c.PublicServicePort, \"https\", c.ExtraEndpointPorts)\n\tif err := c.EndpointReconciler.RemoveEndpoints(kubernetesServiceName, c.PublicIP, endpointPorts); err != nil {\n\t\tklog.Errorf(\"Unable to remove old endpoints from kubernetes service: %v\", err)\n\t}\n\n\t// 2、初始化 repairClusterIPs 和 repairNodePorts 对象  \n\trepairClusterIPs := servicecontroller.NewRepair(c.ServiceClusterIPInterval, c.ServiceClient, c.EventClient, &c.ServiceClusterIPRange, c.ServiceClusterIPRegistry, &c.SecondaryServiceClusterIPRange, c.SecondaryServiceClusterIPRegistry)\n\trepairNodePorts := portallocatorcontroller.NewRepair(c.ServiceNodePortInterval, c.ServiceClient, c.EventClient, c.ServiceNodePortRange, c.ServiceNodePortRegistry)\n\n\t// 3、首先运行一次 repairClusterIPs 和 repairNodePorts，即进行初始化  \n\t// run all of the controllers once prior to returning from Start.\n\tif err := repairClusterIPs.RunOnce(); err != nil {\n\t\t// If we fail to repair cluster IPs apiserver is useless. We should restart and retry.\n\t\tklog.Fatalf(\"Unable to perform initial IP allocation check: %v\", err)\n\t}\n\tif err := repairNodePorts.RunOnce(); err != nil {\n\t\t// If we fail to repair node ports apiserver is useless. 
We should restart and retry.\n\t\tklog.Fatalf(\"Unable to perform initial service nodePort check: %v\", err)\n\t}\n\n  // 4、定期执行 bootstrap controller 主要的四个功能(reconciliation)  \n\tc.runner = async.NewRunner(c.RunKubernetesNamespaces, c.RunKubernetesService, repairClusterIPs.RunUntil, repairNodePorts.RunUntil)\n\tc.runner.Start()\n}\n\n// NewRunner makes a runner for the given function(s). The function(s) should loop until\n// the channel is closed.\nfunc NewRunner(f ...func(stop chan struct{})) *Runner {\n\treturn &Runner{loopFuncs: f}\n}\n\n// Start begins running.\nfunc (r *Runner) Start() {\n\tr.lock.Lock()\n\tdefer r.lock.Unlock()\n\tif r.stop == nil {\n\t\tc := make(chan struct{})\n\t\tr.stop = &c\n\t\tfor i := range r.loopFuncs {\n\t\t\tgo r.loopFuncs[i](*r.stop)\n\t\t}\n\t}\n}\n```\n\n<br>\n\n#### 2.3 四个函数\n\n##### 1-RunKubernetesNamespaces\n\n`c.RunKubernetesNamespaces` 主要功能是通过createNamespaceIfNeeded创建 kube-system，kube-public 以及 kube-node-lease 命名空间，之后每隔一分钟检查一次：\n\n```\n// RunKubernetesNamespaces periodically makes sure that all internal namespaces exist\nfunc (c *Controller) RunKubernetesNamespaces(ch chan struct{}) {\n\twait.Until(func() {\n\t\t// Loop the system namespace list, and create them if they do not exist\n\t\tfor _, ns := range c.SystemNamespaces {\n\t\t\tif err := createNamespaceIfNeeded(c.NamespaceClient, ns); err != nil {\n\t\t\t\truntime.HandleError(fmt.Errorf(\"unable to create required kubernetes system namespace %s: %v\", ns, err))\n\t\t\t}\n\t\t}\n\t}, c.SystemNamespacesInterval, ch)\n}\n\n\n// k8s.io/kubernetes/pkg/master/client_util.go:27\nfunc createNamespaceIfNeeded(c corev1client.NamespacesGetter, ns string) error {\n\tif _, err := c.Namespaces().Get(context.TODO(), ns, metav1.GetOptions{}); err == nil {\n\t\t// the namespace already exists\n\t\treturn nil\n\t}\n\tnewNs := &corev1.Namespace{\n\t\tObjectMeta: metav1.ObjectMeta{\n\t\t\tName:      ns,\n\t\t\tNamespace: \"\",\n\t\t},\n\t}\n\t_, err := 
c.Namespaces().Create(context.TODO(), newNs, metav1.CreateOptions{})\n\tif err != nil && errors.IsAlreadyExists(err) {\n\t\terr = nil\n\t}\n\treturn err\n}\n```\n\n<br>\n\n##### 2- RunKubernetesService\n\n `c.RunKubernetesService` 主要是检查 kubernetes service 是否处于正常状态，并定期执行同步操作。首先调用 `/healthz` 接口检查 apiserver 当前是否处于 ready 状态，若处于 ready 状态然后调用 `c.UpdateKubernetesService` 服务更新 kubernetes service 状态 \n\n```\n// RunKubernetesService periodically updates the kubernetes service\nfunc (c *Controller) RunKubernetesService(ch chan struct{}) {\n\t// wait until process is ready\n\twait.PollImmediateUntil(100*time.Millisecond, func() (bool, error) {\n\t\tvar code int\n\t\tc.healthClient.Get().AbsPath(\"/healthz\").Do().StatusCode(&code)\n\t\treturn code == http.StatusOK, nil\n\t}, ch)\n\n\twait.NonSlidingUntil(func() {\n\t\t// Service definition is not reconciled after first\n\t\t// run, ports and type will be corrected only during\n\t\t// start.\n\t\tif err := c.UpdateKubernetesService(false); err != nil {\n\t\t\truntime.HandleError(fmt.Errorf(\"unable to sync kubernetes service: %v\", err))\n\t\t}\n\t}, c.EndpointInterval, ch)\n}\n```\n\n`c.UpdateKubernetesService` 的主要逻辑为：\n\n- 1、调用 `createNamespaceIfNeeded` 创建 default namespace；\n- 2、调用 `c.CreateOrUpdateMasterServiceIfNeeded` 为 master 创建 kubernetes service；\n- 3、调用 `c.EndpointReconciler.ReconcileEndpoints` 更新 master 的 endpoint；\n\n```\n// UpdateKubernetesService attempts to update the default Kube service.\nfunc (c *Controller) UpdateKubernetesService(reconcile bool) error {\n\t// Update service & endpoint records.\n\t// TODO: when it becomes possible to change this stuff,\n\t// stop polling and start watching.\n\t// TODO: add endpoints of all replicas, not just the elected master.\n\tif err := createNamespaceIfNeeded(c.NamespaceClient, metav1.NamespaceDefault); err != nil {\n\t\treturn err\n\t}\n\n\tservicePorts, serviceType := createPortAndServiceSpec(c.ServicePort, c.PublicServicePort, c.KubernetesServiceNodePort, \"https\", 
c.ExtraServicePorts)
	if err := c.CreateOrUpdateMasterServiceIfNeeded(kubernetesServiceName, c.ServiceIP, servicePorts, serviceType, reconcile); err != nil {
		return err
	}
	endpointPorts := createEndpointPortSpec(c.PublicServicePort, "https", c.ExtraEndpointPorts)
	if err := c.EndpointReconciler.ReconcileEndpoints(kubernetesServiceName, c.PublicIP, endpointPorts, reconcile); err != nil {
		return err
	}
	return nil
}
```

这里通过 createPortAndServiceSpec 创建了 ServicePort，为 Kubernetes default service 的创建做准备。

接着调用 CreateOrUpdateMasterServiceIfNeeded 创建 kubernetes default service：

```go
const kubernetesServiceName = "kubernetes"

// CreateOrUpdateMasterServiceIfNeeded will create the specified service if it
// doesn't already exist.
func (c *Controller) CreateOrUpdateMasterServiceIfNeeded(serviceName string, serviceIP net.IP, servicePorts []corev1.ServicePort, serviceType corev1.ServiceType, reconcile bool) error {
	if s, err := c.ServiceClient.Services(metav1.NamespaceDefault).Get(context.TODO(), serviceName, metav1.GetOptions{}); err == nil {
		// The service already exists.
		if reconcile {
			if svc, updated := reconcilers.GetMasterServiceUpdateIfNeeded(s, servicePorts, serviceType); updated {
				klog.Warningf("Resetting master service %q to %#v", serviceName, svc)
				_, err := c.ServiceClient.Services(metav1.NamespaceDefault).Update(context.TODO(), svc, metav1.UpdateOptions{})
				return err
			}
		}
		return nil
	}
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      serviceName,
			Namespace: metav1.NamespaceDefault,
			Labels:    map[string]string{"provider": "kubernetes", "component": "apiserver"},
		},
		Spec: corev1.ServiceSpec{
			Ports: servicePorts,
			// maintained by this code, not by the pod selector
			Selector:        nil,
			ClusterIP:       serviceIP.String(),
			SessionAffinity: 
corev1.ServiceAffinityNone,\n\t\t\tType:            serviceType,\n\t\t},\n\t}\n\n\t_, err := c.ServiceClient.Services(metav1.NamespaceDefault).Create(context.TODO(), svc, metav1.CreateOptions{})\n\tif errors.IsAlreadyExists(err) {\n\t\treturn c.CreateOrUpdateMasterServiceIfNeeded(serviceName, serviceIP, servicePorts, serviceType, reconcile)\n\t}\n\treturn err\n} \n```\n\n逻辑很清晰，先判断是否存在default kubernetes service，如果不存在则创建该service：\n\n```\napiVersion: v1\nkind: Service\nmetadata:\n  labels:\n    component: apiserver\n    provider: kubernetes\n  name: kubernetes\n  namespace: default\nspec:\n  clusterIP: 10.96.0.1\n  ports:\n  - name: https\n    port: 443\n    protocol: TCP\n    targetPort: 6443\n  sessionAffinity: None\n  type: ClusterIP\n```\n\n<br>\n\n注意这里spec.selector为空，这是default kubernetes service与其它正常service的最大区别，表明了这个特殊的service对应的endpoints不由endpoints controller控制，而是直接受kube-apiserver bootstrap-controller管理(maintained by this code, not by the pod selector)\n\n在创建完default kubernetes service之后，会构建default kubernetes endpoint(c.EndpointReconciler.ReconcileEndpoints)\n\nEndpointReconciler 的具体实现由 `EndpointReconcilerType` 决定，`EndpointReconcilerType` 是 `--endpoint-reconciler-type` 参数指定的，可选的参数有 `master-count, lease, none`，每种类型对应不同的 EndpointReconciler 实例，在 v1.18 中默认为 lease，此处仅分析 lease 对应的 EndpointReconciler 的实现\n\n一个集群中可能会有多个 apiserver 实例，因此需要统一管理 apiserver service 的 endpoints，`c.EndpointReconciler.ReconcileEndpoints` 就是用来管理 apiserver endpoints 的。一个集群中 apiserver 的所有实例会在 etcd 中的对应目录下创建 key，并定期更新这个 key 来上报自己的心跳信息，ReconcileEndpoints 会从 etcd 中获取 apiserver 的实例信息并更新 endpoint：\n\n\n ```\n// createEndpointPortSpec creates an array of endpoint ports\nfunc createEndpointPortSpec(endpointPort int, endpointPortName string, extraEndpointPorts []corev1.EndpointPort) []corev1.EndpointPort {\n\tendpointPorts := []corev1.EndpointPort{{Protocol: corev1.ProtocolTCP,\n\t\tPort: int32(endpointPort),\n\t\tName: endpointPortName,\n\t}}\n\tif extraEndpointPorts != nil {\n\t\tendpointPorts = 
append(endpointPorts, extraEndpointPorts...)\n\t}\n\treturn endpointPorts\n}\n\n// NewLeaseEndpointReconciler creates a new LeaseEndpoint reconciler\nfunc NewLeaseEndpointReconciler(epAdapter EndpointsAdapter, masterLeases Leases) EndpointReconciler {\n\treturn &leaseEndpointReconciler{\n\t\tepAdapter:             epAdapter,\n\t\tmasterLeases:          masterLeases,\n\t\tstopReconcilingCalled: false,\n\t}\n}\n\nfunc (c *Config) createLeaseReconciler() reconcilers.EndpointReconciler {\n\tendpointClient := corev1client.NewForConfigOrDie(c.GenericConfig.LoopbackClientConfig)\n\tvar endpointSliceClient discoveryclient.EndpointSlicesGetter\n\tif utilfeature.DefaultFeatureGate.Enabled(features.EndpointSlice) {\n\t\tendpointSliceClient = discoveryclient.NewForConfigOrDie(c.GenericConfig.LoopbackClientConfig)\n\t}\n\tendpointsAdapter := reconcilers.NewEndpointsAdapter(endpointClient, endpointSliceClient)\n\n\tttl := c.ExtraConfig.MasterEndpointReconcileTTL\n\tconfig, err := c.ExtraConfig.StorageFactory.NewConfig(api.Resource(\"apiServerIPInfo\"))\n\tif err != nil {\n\t\tklog.Fatalf(\"Error determining service IP ranges: %v\", err)\n\t}\n\tleaseStorage, _, err := storagefactory.Create(*config)\n\tif err != nil {\n\t\tklog.Fatalf(\"Error creating storage factory: %v\", err)\n\t}\n\tmasterLeases := reconcilers.NewLeases(leaseStorage, \"/masterleases/\", ttl)\n\n\treturn reconcilers.NewLeaseEndpointReconciler(endpointsAdapter, masterLeases)\n}\n\nfunc (c *Config) createEndpointReconciler() reconcilers.EndpointReconciler {\n\tklog.Infof(\"Using reconciler: %v\", c.ExtraConfig.EndpointReconcilerType)\n\tswitch c.ExtraConfig.EndpointReconcilerType {\n\t// there are numerous test dependencies that depend on a default controller\n\tcase \"\", reconcilers.MasterCountReconcilerType:\n\t\treturn c.createMasterCountReconciler()\n\tcase reconcilers.LeaseEndpointReconcilerType:\n\t\treturn c.createLeaseReconciler()\n\tcase reconcilers.NoneEndpointReconcilerType:\n\t\treturn 
c.createNoneReconciler()\n\tdefault:\n\t\tklog.Fatalf(\"Reconciler not implemented: %v\", c.ExtraConfig.EndpointReconcilerType)\n\t}\n\treturn nil\n}\n\n// ReconcileEndpoints lists keys in a special etcd directory.\n// Each key is expected to have a TTL of R+n, where R is the refresh interval\n// at which this function is called, and n is some small value.  If an\n// apiserver goes down, it will fail to refresh its key's TTL and the key will\n// expire. ReconcileEndpoints will notice that the endpoints object is\n// different from the directory listing, and update the endpoints object\n// accordingly.\nfunc (r *leaseEndpointReconciler) ReconcileEndpoints(serviceName string, ip net.IP, endpointPorts []corev1.EndpointPort, reconcilePorts bool) error {\n\tr.reconcilingLock.Lock()\n\tdefer r.reconcilingLock.Unlock()\n\n\tif r.stopReconcilingCalled {\n\t\treturn nil\n\t}\n\n\t// 更新masterleases key TTL\n\t// Refresh the TTL on our key, independently of whether any error or\n\t// update conflict happens below. This makes sure that at least some of\n\t// the masters will add our endpoint.\n\tif err := r.masterLeases.UpdateLease(ip.String()); err != nil {\n\t\treturn err\n\t}\n\n\treturn r.doReconcile(serviceName, endpointPorts, reconcilePorts)\n}\n\nfunc (r *leaseEndpointReconciler) doReconcile(serviceName string, endpointPorts []corev1.EndpointPort, reconcilePorts bool) error {\n\t// 获取default kubernetes endpoints  \n\te, err := r.epAdapter.Get(corev1.NamespaceDefault, serviceName, metav1.GetOptions{})\n\tshouldCreate := false\n\tif err != nil {\n\t\tif !errors.IsNotFound(err) {\n\t\t\treturn err\n\t\t}\n\n\t\t// 如果不存在，则创建endpoints    \n\t\tshouldCreate = true\n\t\te = &corev1.Endpoints{\n\t\t\tObjectMeta: metav1.ObjectMeta{\n\t\t\t\tName:      serviceName,\n\t\t\t\tNamespace: corev1.NamespaceDefault,\n\t\t\t},\n\t\t}\n\t}\n\t\n  // 从etcd中获取master IP keys(代表了kube-apiserver数目)  \n\t// ... 
and the list of master IP keys from etcd\n\tmasterIPs, err := r.masterLeases.ListLeases()\n\tif err != nil {\n\t\treturn err\n\t}\n\t  \n\t// Since we just refreshed our own key, assume that zero endpoints\n\t// returned from storage indicates an issue or invalid state, and thus do\n\t// not update the endpoints list based on the result.\n\tif len(masterIPs) == 0 {\n\t\treturn fmt.Errorf(\"no master IPs were listed in storage, refusing to erase all endpoints for the kubernetes service\")\n\t}\n\n\t// 将dafault kubernetes endpoint与masterIP列表以及端口列表进行比较，验证已经存在的endpoint有效性\n\t// Next, we compare the current list of endpoints with the list of master IP keys\n\tformatCorrect, ipCorrect, portsCorrect := checkEndpointSubsetFormatWithLease(e, masterIPs, endpointPorts, reconcilePorts)\n\tif formatCorrect && ipCorrect && portsCorrect {\n\t\treturn r.epAdapter.EnsureEndpointSliceFromEndpoints(corev1.NamespaceDefault, e)\n\t}\n\n\t// 如果不正确，则重新创建endpoint  \n\tif !formatCorrect {\n\t\t// Something is egregiously wrong, just re-make the endpoints record.\n\t\te.Subsets = []corev1.EndpointSubset{{\n\t\t\tAddresses: []corev1.EndpointAddress{},\n\t\t\tPorts:     endpointPorts,\n\t\t}}\n\t}\n\n\tif !formatCorrect || !ipCorrect {\n\t\t// repopulate the addresses according to the expected IPs from etcd\n\t\te.Subsets[0].Addresses = make([]corev1.EndpointAddress, len(masterIPs))\n\t\tfor ind, ip := range masterIPs {\n\t\t\te.Subsets[0].Addresses[ind] = corev1.EndpointAddress{IP: ip}\n\t\t}\n\n\t\t// Lexicographic order is retained by this step.\n\t\te.Subsets = endpointsv1.RepackSubsets(e.Subsets)\n\t}\n\n\tif !portsCorrect {\n\t\t// Reset ports.\n\t\te.Subsets[0].Ports = endpointPorts\n\t}\n\n\t// 创建或者更新default kubernetes endpoint  \n\tklog.Warningf(\"Resetting endpoints for master service %q to %v\", serviceName, masterIPs)\n\tif shouldCreate {\n\t\tif _, err = r.epAdapter.Create(corev1.NamespaceDefault, e); errors.IsAlreadyExists(err) {\n\t\t\terr = nil\n\t\t}\n\t} else {\n\t\t_, err = 
r.epAdapter.Update(corev1.NamespaceDefault, e)
	}
	return err
}

// checkEndpointSubsetFormatWithLease determines if the endpoint is in the
// format ReconcileEndpoints expects when the controller is using leases.
//
// Return values:
// * formatCorrect is true if exactly one subset is found.
// * ipsCorrect when the addresses in the endpoints match the expected addresses list
// * portsCorrect is true when endpoint ports exactly match provided ports.
//     portsCorrect is only evaluated when reconcilePorts is set to true.
func checkEndpointSubsetFormatWithLease(e *corev1.Endpoints, expectedIPs []string, ports []corev1.EndpointPort, reconcilePorts bool) (formatCorrect bool, ipsCorrect bool, portsCorrect bool) {
	if len(e.Subsets) != 1 {
		return false, false, false
	}
	sub := &e.Subsets[0]
	portsCorrect = true
	if reconcilePorts {
		if len(sub.Ports) != len(ports) {
			portsCorrect = false
		} else {
			for i, port := range ports {
				if port != sub.Ports[i] {
					portsCorrect = false
					break
				}
			}
		}
	}

	ipsCorrect = true
	if len(sub.Addresses) != len(expectedIPs) {
		ipsCorrect = false
	} else {
		// check the actual content of the addresses
		// present addrs is used as a set (the keys) and to indicate if a
		// value was already found (the values)
		presentAddrs := make(map[string]bool, len(expectedIPs))
		for _, ip := range expectedIPs {
			presentAddrs[ip] = false
		}

		// uniqueness is assumed amongst all Addresses.
		for _, addr := range sub.Addresses {
			if alreadySeen, ok := presentAddrs[addr.IP]; alreadySeen || !ok {
				ipsCorrect = false
				break
			}

			presentAddrs[addr.IP] = true
		}
	}

	return true, ipsCorrect, portsCorrect
}
 ```

leaseEndpointReconciler.ReconcileEndpoints 的流程如上所示：

- 更新 masterleases key TTL
- 获取 default kubernetes endpoints
- 如果不存在，则创建 endpoints
- 将 default kubernetes 
endpoint与masterIP列表以及端口列表进行比较，验证已经存在的endpoint有效性\n- 如果不正确，则修正endpoint字段并更新\n\n```\n$ ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key get --prefix --keys-only /registry/masterleases\n/registry/masterleases/192.168.60.21\n/registry/masterleases/192.168.60.22\n/registry/masterleases/192.168.60.23\n```\n\n 这里再次总结RunKubernetesService的逻辑：检查 kubernetes service 是否处于正常状态，并定期执行同步操作。首先调用 `/healthz` 接口检查 apiserver 当前是否处于 ready 状态，若处于 ready 状态然后调用 `c.UpdateKubernetesService` 服务更新 kubernetes service 状态（创建 default namespace => 创建 kubernetes service => 更新 master 的 endpoint） \n\n<br>\n\n##### 3- repairClusterIPs.RunUntil\n\n 在Controller.Start函数中， \n\n```\n// Start begins the core controller loops that must exist for bootstrapping\n// a cluster.\nfunc (c *Controller) Start() {\n\n\trepairClusterIPs := servicecontroller.NewRepair(c.ServiceClusterIPInterval, c.ServiceClient, c.EventClient, &c.ServiceClusterIPRange, c.ServiceClusterIPRegistry, &c.SecondaryServiceClusterIPRange, c.SecondaryServiceClusterIPRegistry)\n\trepairNodePorts := portallocatorcontroller.NewRepair(c.ServiceNodePortInterval, c.ServiceClient, c.EventClient, c.ServiceNodePortRange, c.ServiceNodePortRegistry)\n\n\n\tc.runner = async.NewRunner(c.RunKubernetesNamespaces, c.RunKubernetesService, repairClusterIPs.RunUntil, repairNodePorts.RunUntil)\n\tc.runner.Start()\n}\n```\n\n这里会先创建repairClusterIPs，然后执行repairClusterIPs.RunUntil来提供基于 Service ClusterIP 的修复及检查功能：\n\n```\n// k8s.io/kubernetes/pkg/registry/core/service/ipallocator/controller/repair.go:76\n// NewRepair creates a controller that periodically ensures that all clusterIPs are uniquely allocated across the cluster\n// and generates informational warnings for a cluster that is not in sync.\nfunc NewRepair(interval time.Duration, serviceClient corev1client.ServicesGetter, eventClient corev1client.EventsGetter, network *net.IPNet, 
alloc rangeallocation.RangeRegistry, secondaryNetwork *net.IPNet, secondaryAlloc rangeallocation.RangeRegistry) *Repair {\n\teventBroadcaster := record.NewBroadcaster()\n\teventBroadcaster.StartRecordingToSink(&corev1client.EventSinkImpl{Interface: eventClient.Events(\"\")})\n\trecorder := eventBroadcaster.NewRecorder(legacyscheme.Scheme, v1.EventSource{Component: \"ipallocator-repair-controller\"})\n\n\treturn &Repair{\n\t\tinterval:      interval,\n\t\tserviceClient: serviceClient,\n\n\t\tnetwork:          network,\n\t\talloc:            alloc,\n\t\tsecondaryNetwork: secondaryNetwork,\n\t\tsecondaryAlloc:   secondaryAlloc,\n\n\t\tleaks:    map[string]int{},\n\t\trecorder: recorder,\n\t}\n}\n\n// RunUntil starts the controller until the provided ch is closed.\nfunc (c *Repair) RunUntil(ch chan struct{}) {\n\twait.Until(func() {\n\t\tif err := c.RunOnce(); err != nil {\n\t\t\truntime.HandleError(err)\n\t\t}\n\t}, c.interval, ch)\n}\n\n// RunOnce verifies the state of the cluster IP allocations and returns an error if an unrecoverable problem occurs.\nfunc (c *Repair) RunOnce() error {\n\treturn retry.RetryOnConflict(retry.DefaultBackoff, c.runOnce)\n}\n\n// runOnce verifies the state of the cluster IP allocations and returns an error if an unrecoverable problem occurs.\nfunc (c *Repair) runOnce() error {\n\t// TODO: (per smarterclayton) if Get() or ListServices() is a weak consistency read,\n\t// or if they are executed against different leaders,\n\t// the ordering guarantee required to ensure no IP is allocated twice is violated.\n\t// ListServices must return a ResourceVersion higher than the etcd index Get triggers,\n\t// and the release code must not release services that have had IPs allocated but not yet been created\n\t// See #8295\n\n\t// If etcd server is not running we should wait for some time and fail only then. 
This is particularly\n\t// important when we start apiserver and etcd at the same time.\n\tvar snapshot *api.RangeAllocation\n\tvar secondarySnapshot *api.RangeAllocation\n\n\tvar stored, secondaryStored ipallocator.Interface\n\tvar err, secondaryErr error\n\n\t// 1、首先从 etcd 中获取已经使用 ClusterIP 的快照  \n\terr = wait.PollImmediate(time.Second, 10*time.Second, func() (bool, error) {\n\t\tvar err error\n\t\tsnapshot, err = c.alloc.Get()\n\t\tif err != nil {\n\t\t\treturn false, err\n\t\t}\n\n\t\tif c.shouldWorkOnSecondary() {\n\t\t\tsecondarySnapshot, err = c.secondaryAlloc.Get()\n\t\t\tif err != nil {\n\t\t\t\treturn false, err\n\t\t\t}\n\t\t}\n\n\t\treturn true, nil\n\t})\n\tif err != nil {\n\t\treturn fmt.Errorf(\"unable to refresh the service IP block: %v\", err)\n\t}\n\t// 2、判断 snapshot 是否已经初始化  \n\t// If not yet initialized.\n\tif snapshot.Range == \"\" {\n\t\tsnapshot.Range = c.network.String()\n\t}\n\n\tif c.shouldWorkOnSecondary() && secondarySnapshot.Range == \"\" {\n\t\tsecondarySnapshot.Range = c.secondaryNetwork.String()\n\t}\n\t// Create an allocator because it is easy to use.\n\n\tstored, err = ipallocator.NewFromSnapshot(snapshot)\n\tif c.shouldWorkOnSecondary() {\n\t\tsecondaryStored, secondaryErr = ipallocator.NewFromSnapshot(secondarySnapshot)\n\t}\n\n\tif err != nil || secondaryErr != nil {\n\t\treturn fmt.Errorf(\"unable to rebuild allocator from snapshots: %v\", err)\n\t}\n\n\t// 3、获取 service list  \n\t// We explicitly send no resource version, since the resource version\n\t// of 'snapshot' is from a different collection, it's not comparable to\n\t// the service collection. 
The caching layer keeps per-collection RVs,\n\t// and this is proper, since in theory the collections could be hosted\n\t// in separate etcd (or even non-etcd) instances.\n\tlist, err := c.serviceClient.Services(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})\n\tif err != nil {\n\t\treturn fmt.Errorf(\"unable to refresh the service IP block: %v\", err)\n\t}\n\n\t// 4、将 CIDR 转换为对应的 IP range 格式  \n\tvar rebuilt, secondaryRebuilt *ipallocator.Range\n\trebuilt, err = ipallocator.NewCIDRRange(c.network)\n\tif err != nil {\n\t\treturn fmt.Errorf(\"unable to create CIDR range: %v\", err)\n\t}\n\n\tif c.shouldWorkOnSecondary() {\n\t\tsecondaryRebuilt, err = ipallocator.NewCIDRRange(c.secondaryNetwork)\n\t}\n\n\tif err != nil {\n\t\treturn fmt.Errorf(\"unable to create CIDR range: %v\", err)\n\t}\n\n\t// 5、检查每个 Service 的 ClusterIP，保证其处于正常状态  \n\t// Check every Service's ClusterIP, and rebuild the state as we think it should be.\n\tfor _, svc := range list.Items {\n\t\tif !helper.IsServiceIPSet(&svc) {\n\t\t\t// didn't need a cluster IP\n\t\t\tcontinue\n\t\t}\n\t\tip := net.ParseIP(svc.Spec.ClusterIP)\n\t\tif ip == nil {\n\t\t\t// cluster IP is corrupt\n\t\t\tc.recorder.Eventf(&svc, v1.EventTypeWarning, \"ClusterIPNotValid\", \"Cluster IP %s is not a valid IP; please recreate service\", svc.Spec.ClusterIP)\n\t\t\truntime.HandleError(fmt.Errorf(\"the cluster IP %s for service %s/%s is not a valid IP; please recreate\", svc.Spec.ClusterIP, svc.Name, svc.Namespace))\n\t\t\tcontinue\n\t\t}\n\n\t\t// mark it as in-use\n\t\tactualAlloc := c.selectAllocForIP(ip, rebuilt, secondaryRebuilt)\n\t\tswitch err := actualAlloc.Allocate(ip); err {\n\t\t// 6、检查 ip 是否泄漏      \n\t\tcase nil:\n\t\t\tactualStored := c.selectAllocForIP(ip, stored, secondaryStored)\n\t\t\tif actualStored.Has(ip) {\n\t\t\t\t// remove it from the old set, so we can find leaks\n\t\t\t\tactualStored.Release(ip)\n\t\t\t} else {\n\t\t\t\t// cluster IP doesn't seem to be 
allocated\n\t\t\t\tc.recorder.Eventf(&svc, v1.EventTypeWarning, \"ClusterIPNotAllocated\", \"Cluster IP %s is not allocated; repairing\", ip)\n\t\t\t\truntime.HandleError(fmt.Errorf(\"the cluster IP %s for service %s/%s is not allocated; repairing\", ip, svc.Name, svc.Namespace))\n\t\t\t}\n\t\t\tdelete(c.leaks, ip.String()) // it is used, so it can't be leaked\n\t\t// 7、ip 重复分配      \n\t\tcase ipallocator.ErrAllocated:\n\t\t\t// cluster IP is duplicate\n\t\t\tc.recorder.Eventf(&svc, v1.EventTypeWarning, \"ClusterIPAlreadyAllocated\", \"Cluster IP %s was assigned to multiple services; please recreate service\", ip)\n\t\t\truntime.HandleError(fmt.Errorf(\"the cluster IP %s for service %s/%s was assigned to multiple services; please recreate\", ip, svc.Name, svc.Namespace))\n\t\t// 8、ip 超出范围      \n\t\tcase err.(*ipallocator.ErrNotInRange):\n\t\t\t// cluster IP is out of range\n\t\t\tc.recorder.Eventf(&svc, v1.EventTypeWarning, \"ClusterIPOutOfRange\", \"Cluster IP %s is not within the service CIDR %s; please recreate service\", ip, c.network)\n\t\t\truntime.HandleError(fmt.Errorf(\"the cluster IP %s for service %s/%s is not within the service CIDR %s; please recreate\", ip, svc.Name, svc.Namespace, c.network))\n \t\t// 9、ip 已经分配完     \n\t\tcase ipallocator.ErrFull:\n\t\t\t// somehow we are out of IPs\n\t\t\tcidr := actualAlloc.CIDR()\n\t\t\tc.recorder.Eventf(&svc, v1.EventTypeWarning, \"ServiceCIDRFull\", \"Service CIDR %v is full; you must widen the CIDR in order to create new services\", cidr)\n\t\t\treturn fmt.Errorf(\"the service CIDR %v is full; you must widen the CIDR in order to create new services\", cidr)\n\t\tdefault:\n\t\t\tc.recorder.Eventf(&svc, v1.EventTypeWarning, \"UnknownError\", \"Unable to allocate cluster IP %s due to an unknown error\", ip)\n\t\t\treturn fmt.Errorf(\"unable to allocate cluster IP %s for service %s/%s due to an unknown error, exiting: %v\", ip, svc.Name, svc.Namespace, err)\n\t\t}\n\t}\n\n\t// 10、对比是否有泄漏 ip  
\n\tc.checkLeaked(stored, rebuilt)\n\tif c.shouldWorkOnSecondary() {\n\t\tc.checkLeaked(secondaryStored, secondaryRebuilt)\n\t}\n\n\t// 11、更新快照  \n\t// Blast the rebuilt state into storage.\n\terr = c.saveSnapShot(rebuilt, c.alloc, snapshot)\n\tif err != nil {\n\t\treturn err\n\t}\n\n\tif c.shouldWorkOnSecondary() {\n\t\terr := c.saveSnapShot(secondaryRebuilt, c.secondaryAlloc, secondarySnapshot)\n\t\tif err != nil {\n\t\t\treturn nil\n\t\t}\n\t}\n\treturn nil\n}\n```\n\nrepairClusterIP 主要解决的问题有：\n\n- 保证集群中所有的 ClusterIP 都是唯一分配的；\n- 保证分配的 ClusterIP 不会超出指定范围；\n- 确保已经分配给 service 但是因为 crash 等其它原因没有正确创建 ClusterIP\n\n##### 4-repairNodePorts.RunUntil\n\n```\n// PreShutdownHook triggers the actions needed to shut down the API Server cleanly.\nfunc (c *Controller) PreShutdownHook() error {\n\tc.Stop()\n\treturn nil\n}\n\n// Stop cleans up this API Servers endpoint reconciliation leases so another master can take over more quickly.\nfunc (c *Controller) Stop() {\n\tif c.runner != nil {\n\t\tc.runner.Stop()\n\t}\n\tendpointPorts := createEndpointPortSpec(c.PublicServicePort, \"https\", c.ExtraEndpointPorts)\n\tfinishedReconciling := make(chan struct{})\n\tgo func() {\n\t\tdefer close(finishedReconciling)\n\t\tklog.Infof(\"Shutting down kubernetes service endpoint reconciler\")\n\t\tc.EndpointReconciler.StopReconciling()\n\t\tif err := c.EndpointReconciler.RemoveEndpoints(kubernetesServiceName, c.PublicIP, endpointPorts); err != nil {\n\t\t\tklog.Error(err)\n\t\t}\n\t}()\n\n\tselect {\n\tcase <-finishedReconciling:\n\t\t// done\n\tcase <-time.After(2 * c.EndpointInterval):\n\t\t// don't block server shutdown forever if we can't reach etcd to remove ourselves\n\t\tklog.Warning(\"RemoveEndpoints() timed out\")\n\t}\n}\n\nfunc (r *leaseEndpointReconciler) RemoveEndpoints(serviceName string, ip net.IP, endpointPorts []corev1.EndpointPort) error {\n\tif err := r.masterLeases.RemoveLease(ip.String()); err != nil {\n\t\treturn err\n\t}\n\n\treturn r.doReconcile(serviceName, 
endpointPorts, true)\n}\n\nfunc (r *leaseEndpointReconciler) StopReconciling() {\n\tr.reconcilingLock.Lock()\n\tdefer r.reconcilingLock.Unlock()\n\tr.stopReconcilingCalled = true\n}\n\n// ReconcileEndpoints lists keys in a special etcd directory.\n// Each key is expected to have a TTL of R+n, where R is the refresh interval\n// at which this function is called, and n is some small value.  If an\n// apiserver goes down, it will fail to refresh its key's TTL and the key will\n// expire. ReconcileEndpoints will notice that the endpoints object is\n// different from the directory listing, and update the endpoints object\n// accordingly.\nfunc (r *leaseEndpointReconciler) ReconcileEndpoints(serviceName string, ip net.IP, endpointPorts []corev1.EndpointPort, reconcilePorts bool) error {\n\tr.reconcilingLock.Lock()\n\tdefer r.reconcilingLock.Unlock()\n\n\tif r.stopReconcilingCalled {\n\t\treturn nil\n\t}\n\n\t// Refresh the TTL on our key, independently of whether any error or\n\t// update conflict happens below. 
This makes sure that at least some of\n\t// the masters will add our endpoint.\n\tif err := r.masterLeases.UpdateLease(ip.String()); err != nil {\n\t\treturn err\n\t}\n\n\treturn r.doReconcile(serviceName, endpointPorts, reconcilePorts)\n}\n```\n\n可以看到PreShutdownHook会先停止ReconcileEndpoints，然后清理掉default Kubernetes endpoint中本身masterIP的记录(cleans up this API Servers endpoint)\n\n#### 2.4 总结\n\n- apiserver bootstrap-controller创建&运行逻辑在k8s.io/kubernetes/pkg/master目录\n- bootstrap-controller主要用于创建以及维护内部kubernetes apiserver service\n- default kubernetes service spec.selector为空，这是default kubernetes service与其它正常service的最大区别，表明了这个特殊的service对应的endpoints不由endpoints controller控制，而是直接受kube-apiserver bootstrap-controller管理(maintained by this code, not by the pod selector)\n- bootstrap-controller的几个主要功能如下：\n  - 创建 default、kube-system 和 kube-public 以及 kube-node-lease 命名空间\n  - 创建&维护 default kubernetes service以及对应的endpoint\n  - 提供基于 Service ClusterIP 的修复及检查功能(`--service-cluster-ip-range`指定范围)\n  - 提供基于 Service NodePort 的修复及检查功能(`--service-node-port-range`指定范围)\n\n<br>\n\n### 3. 
KubeAPIServer\n\nKubeAPIServer主要提供对内建API Resources的操作请求，为Kubernetes中各API Resources注册路由信息，同时暴露RESTful API，使集群中以及集群外的服务都可以通过RESTful API操作Kubernetes中的资源\n\n另外，kubeAPIServer是整个Kubernetes apiserver的核心，下面将要讲述的aggregatorServer以及apiExtensionsServer都是建立在kubeAPIServer基础上进行扩展的(补充了Kubernetes对用户自定义资源的能力支持)\n\nkubeAPIServer最核心的功能是为Kubernetes内置资源添加路由，如下：\n\n- 调用 `m.InstallLegacyAPI` 将核心 API Resources添加到路由中，在apiserver中即是以 `/api` 开头的 resource；\n- 调用 `m.InstallAPIs` 将扩展的 API Resources添加到路由中，在apiserver中即是以 `/apis` 开头的 resource；\n\n```\n// k8s.io/kubernetes/pkg/master/master.go:332\n// New returns a new instance of Master from the given config.\n// Certain config fields will be set to a default value if unset.\n// Certain config fields must be specified, including:\n//   KubeletClientConfig\nfunc (c completedConfig) New(delegationTarget genericapiserver.DelegationTarget) (*Master, error) {\n    ...\n    // 安装 LegacyAPI(core API)\n    // install legacy rest storage\n    if c.ExtraConfig.APIResourceConfigSource.VersionEnabled(apiv1.SchemeGroupVersion) {\n        legacyRESTStorageProvider := corerest.LegacyRESTStorageProvider{\n            StorageFactory:              c.ExtraConfig.StorageFactory,\n            ProxyTransport:              c.ExtraConfig.ProxyTransport,\n            KubeletClientConfig:         c.ExtraConfig.KubeletClientConfig,\n            EventTTL:                    c.ExtraConfig.EventTTL,\n            ServiceIPRange:              c.ExtraConfig.ServiceIPRange,\n            SecondaryServiceIPRange:     c.ExtraConfig.SecondaryServiceIPRange,\n            ServiceNodePortRange:        c.ExtraConfig.ServiceNodePortRange,\n            LoopbackClientConfig:        c.GenericConfig.LoopbackClientConfig,\n            ServiceAccountIssuer:        c.ExtraConfig.ServiceAccountIssuer,\n            ServiceAccountMaxExpiration: c.ExtraConfig.ServiceAccountMaxExpiration,\n            APIAudiences:                c.GenericConfig.Authentication.APIAudiences,\n        }\n        if err := 
m.InstallLegacyAPI(&c, c.GenericConfig.RESTOptionsGetter, legacyRESTStorageProvider); err != nil {\n            return nil, err\n        }\n    }\n    ...\n    // 安装 APIs(named groups apis)\n    if err := m.InstallAPIs(c.ExtraConfig.APIResourceConfigSource, c.GenericConfig.RESTOptionsGetter, restStorageProviders...); err != nil {\n        return nil, err\n    }\n    ...\n    return m, nil\n}\n```\n\n整个kubeAPIServer提供了三类API Resource接口：\n\n- core group：主要在 `/api/v1` 下；\n- named groups：其 path 为 `/apis/$GROUP/$VERSION`；\n- 系统状态的一些 API：如`/metrics` 、`/version` 等；\n\n而API的URL大致由 `/apis/{group}/{version}/namespaces/{namespace}/resource/{name}` 组成，结构如下图所示：\n\n![image-20210128172701863](../images/apiserver-construct-2.png)\n\nkubeAPIServer会为每种API资源创建对应的RESTStorage，RESTStorage的目的是将每种资源的访问路径及其后端存储的操作对应起来：通过构造的REST Storage实现的接口判断该资源可以执行哪些操作（如：create、update等），将其对应的操作存入到actions数组中，每一个操作对应一个标准的REST method，如create对应REST method为POST，而update对应REST method为PUT。最终根据actions数组依次遍历，对每一个操作添加一个handler(handler对应REST Storage实现的相关接口)，并注册到route，最终对外提供RESTful API，如下：\n\n```\n// m.GenericAPIServer.InstallLegacyAPIGroup --> s.installAPIResources --> apiGroupVersion.InstallREST --> installer.Install --> a.registerResourceHandlers\n// k8s.io/kubernetes/staging/src/k8s.io/apiserver/pkg/endpoints/installer.go:181\nfunc (a *APIInstaller) registerResourceHandlers(path string, storage rest.Storage, ws *restful.WebService) (*metav1.APIResource, error) {\n    ...\n    // 1、判断该 resource 实现了哪些 REST 操作接口，以此来判断其支持的 verbs 以便为其添加路由\n    // what verbs are supported by the storage, used to know what verbs we support per path\n    creater, isCreater := storage.(rest.Creater)\n    namedCreater, isNamedCreater := storage.(rest.NamedCreater)\n    lister, isLister := storage.(rest.Lister)\n    getter, isGetter := storage.(rest.Getter)\n    ...\n    // 2、为 resource 添加对应的 actions(+根据是否支持 namespace)\n    // Get the list of actions for the given scope.\n    switch {\n    case !namespaceScoped:\n        // 
Handle non-namespace scoped resources like nodes.\n        resourcePath := resource\n        resourceParams := params\n        itemPath := resourcePath + \"/{name}\"\n        nameParams := append(params, nameParam)\n        proxyParams := append(nameParams, pathParam)\n        ...\n        // Handler for standard REST verbs (GET, PUT, POST and DELETE).\n        // Add actions at the resource path: /api/apiVersion/resource\n        actions = appendIf(actions, action{\"LIST\", resourcePath, resourceParams, namer, false}, isLister)\n        actions = appendIf(actions, action{\"POST\", resourcePath, resourceParams, namer, false}, isCreater)\n        ...\n    }\n    ...\n    // 3、从 rest.Storage 到 restful.Route 映射\n    // 为每个操作添加对应的 handler\n    for _, action := range actions {\n        ...\n        switch action.Verb {\n        ...\n        case \"POST\": // Create a resource.\n            var handler restful.RouteFunction\n            // 4、初始化 handler\n            if isNamedCreater {\n                handler = restfulCreateNamedResource(namedCreater, reqScope, admit)\n            } else {\n                handler = restfulCreateResource(creater, reqScope, admit)\n            }\n            handler = metrics.InstrumentRouteFunc(action.Verb, group, version, resource, subresource, requestScope, metrics.APIServerComponent, handler)\n            ...\n            // 5、route 与 handler 进行绑定    \n            route := ws.POST(action.Path).To(handler).\n                Doc(doc).\n                Param(ws.QueryParameter(\"pretty\", \"If 'true', then the output is pretty printed.\")).\n                Operation(\"create\"+namespaced+kind+strings.Title(subresource)+operationSuffix).\n                Produces(append(storageMeta.ProducesMIMETypes(action.Verb), mediaTypes...)...).\n                Returns(http.StatusOK, \"OK\", producedObject).\n                // TODO: in some cases, the API may return a v1.Status instead of the versioned object\n                // but currently 
go-restful can't handle multiple different objects being returned.\n                Returns(http.StatusCreated, \"Created\", producedObject).\n                Returns(http.StatusAccepted, \"Accepted\", producedObject).\n                Reads(defaultVersionedObject).\n                Writes(producedObject)\n            if err := AddObjectParams(ws, route, versionedCreateOptions); err != nil {\n                return nil, err\n            }\n            addParams(route, action.Params)\n            // 6、添加到路由中    \n            routes = append(routes, route)\n        case \"DELETE\": // Delete a resource.\n        ...\n        default:\n            return nil, fmt.Errorf(\"unrecognized action verb: %s\", action.Verb)\n        }\n        for _, route := range routes {\n            route.Metadata(ROUTE_META_GVK, metav1.GroupVersionKind{\n                Group:   reqScope.Kind.Group,\n                Version: reqScope.Kind.Version,\n                Kind:    reqScope.Kind.Kind,\n            })\n            route.Metadata(ROUTE_META_ACTION, strings.ToLower(action.Verb))\n            ws.Route(route)\n        }\n        // Note: update GetAuthorizerAttributes() when adding a custom handler.\n    }\n    ...\n}\n```\n\n<br>\n\nkubeAPIServer代码结构整理如下：\n\n```\n1. apiserver整体启动逻辑 k8s.io/kubernetes/cmd/kube-apiserver\n2. apiserver bootstrap-controller创建&运行逻辑 k8s.io/kubernetes/pkg/master\n3. API Resource对应后端RESTStorage(based on genericregistry.Store)创建 k8s.io/kubernetes/pkg/registry\n4. aggregated-apiserver创建&处理逻辑 k8s.io/kubernetes/staging/src/k8s.io/kube-aggregator\n5. extensions-apiserver创建&处理逻辑 k8s.io/kubernetes/staging/src/k8s.io/apiextensions-apiserver\n6. apiserver创建&运行 k8s.io/kubernetes/staging/src/k8s.io/apiserver/pkg/server\n7. 注册API Resource资源处理handler(InstallREST&Install&registerResourceHandlers) k8s.io/kubernetes/staging/src/k8s.io/apiserver/pkg/endpoints\n8. 创建存储后端(etcdv3) k8s.io/kubernetes/staging/src/k8s.io/apiserver/pkg/storage\n9. 
genericregistry.Store.CompleteWithOptions初始化 k8s.io/kubernetes/staging/src/k8s.io/apiserver/pkg/registry\n```\n\n![image-20210128174024333](../images/apiserver-code-1.png)\n\n<br>\n\n### 4.aggregatorServer\n\naggregatorServer主要用于处理扩展Kubernetes API Resources的第二种方式Aggregated APIServer(AA)，将CR请求代理给AA：\n\n![image-20210128174206677](../images/aggserver-1.png)\n\n这里结合Kubernetes官方给出的aggregated apiserver例子[sample-apiserver](https://github.com/kubernetes/sample-apiserver)，总结原理如下：\n\n- aggregatorServer通过APIServices对象关联到某个Service来进行请求的转发，其关联的Service类型进一步决定了请求转发的形式。aggregatorServer包括一个`GenericAPIServer`和维护自身状态的`Controller`。其中`GenericAPIServer`主要处理`apiregistration.k8s.io`组下的APIService资源请求，而Controller包括：\n  - `apiserviceRegistrationController`：负责根据APIService定义的aggregated server service构建代理，将CR的请求转发给后端的aggregated server\n  - `availableConditionController`：维护 APIServices 的可用状态，包括其引用 Service 是否可用等；\n  - `autoRegistrationController`：用于保持 API 中存在的一组特定的 APIServices；\n  - `crdRegistrationController`：负责将 CRD GroupVersions 自动注册到 APIServices 中；\n  - `openAPIAggregationController`：将 APIServices 资源的变化同步至提供的 OpenAPI 文档；\n\n- apiService有两种类型：Local(Service为空)以及Service(Service非空)。apiserviceRegistrationController负责对这两种类型apiService设置代理：Local类型会直接路由给kube-apiserver进行处理；而Service类型则会设置代理并将请求转化为对aggregated Service的请求(proxyPath := \"/apis/\" + apiService.Spec.Group + \"/\" + apiService.Spec.Version)。请求的负载均衡策略则是优先本地访问kube-apiserver(如果service为kubernetes default apiserver service:443) => 通过service ClusterIP:Port访问(默认)，或者通过随机选择service endpoint backend进行访问：\n\n```go\nfunc (s *APIAggregator) AddAPIService(apiService *v1.APIService) error {\n  ...\n    proxyPath := \"/apis/\" + apiService.Spec.Group + \"/\" + apiService.Spec.Version\n    // v1. is a special case for the legacy API.  
It proxies to a wider set of endpoints.\n    if apiService.Name == legacyAPIServiceName {\n        proxyPath = \"/api\"\n    }\n    // register the proxy handler\n    proxyHandler := &proxyHandler{\n        localDelegate:   s.delegateHandler,\n        proxyClientCert: s.proxyClientCert,\n        proxyClientKey:  s.proxyClientKey,\n        proxyTransport:  s.proxyTransport,\n        serviceResolver: s.serviceResolver,\n        egressSelector:  s.egressSelector,\n    }\n  ...\n    s.proxyHandlers[apiService.Name] = proxyHandler\n    s.GenericAPIServer.Handler.NonGoRestfulMux.Handle(proxyPath, proxyHandler)\n    s.GenericAPIServer.Handler.NonGoRestfulMux.UnlistedHandlePrefix(proxyPath+\"/\", proxyHandler)\n  ...\n    // it's time to register the group aggregation endpoint\n    groupPath := \"/apis/\" + apiService.Spec.Group\n    groupDiscoveryHandler := &apiGroupHandler{\n        codecs:    aggregatorscheme.Codecs,\n        groupName: apiService.Spec.Group,\n        lister:    s.lister,\n        delegate:  s.delegateHandler,\n    }\n    // aggregation is protected\n    s.GenericAPIServer.Handler.NonGoRestfulMux.Handle(groupPath, groupDiscoveryHandler)\n    s.GenericAPIServer.Handler.NonGoRestfulMux.UnlistedHandle(groupPath+\"/\", groupDiscoveryHandler)\n    s.handledGroups.Insert(apiService.Spec.Group)\n    return nil\n}\n// k8s.io/kubernetes/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_proxy.go:109\nfunc (r *proxyHandler) ServeHTTP(w http.ResponseWriter, req *http.Request) {\n    // 加载proxyHandlingInfo处理请求\n    value := r.handlingInfo.Load()\n    if value == nil {\n        r.localDelegate.ServeHTTP(w, req)\n        return\n    }\n    handlingInfo := value.(proxyHandlingInfo)\n  ...\n    // 判断APIService服务是否正常\n    if !handlingInfo.serviceAvailable {\n        proxyError(w, req, \"service unavailable\", http.StatusServiceUnavailable)\n        return\n    }\n    // 将原始请求转化为对APIService的请求\n    // write a new location based on the existing request pointed at 
the target service\n    location := &url.URL{}\n    location.Scheme = \"https\"\n    rloc, err := r.serviceResolver.ResolveEndpoint(handlingInfo.serviceNamespace, handlingInfo.serviceName, handlingInfo.servicePort)\n    if err != nil {\n        klog.Errorf(\"error resolving %s/%s: %v\", handlingInfo.serviceNamespace, handlingInfo.serviceName, err)\n        proxyError(w, req, \"service unavailable\", http.StatusServiceUnavailable)\n        return\n    }\n    location.Host = rloc.Host\n    location.Path = req.URL.Path\n    location.RawQuery = req.URL.Query().Encode()\n    newReq, cancelFn := newRequestForProxy(location, req)\n    defer cancelFn()\n   ...\n    proxyRoundTripper = transport.NewAuthProxyRoundTripper(user.GetName(), user.GetGroups(), user.GetExtra(), proxyRoundTripper)\n    handler := proxy.NewUpgradeAwareHandler(location, proxyRoundTripper, true, upgrade, &responder{w: w})\n    handler.ServeHTTP(w, newReq)\n}\n```\n\n<br>\n\n```\n$ kubectl get APIService           \nNAME                                   SERVICE                      AVAILABLE   AGE\n...\nv1.apps                                Local                        True        50d\n...\nv1beta1.metrics.k8s.io                 kube-system/metrics-server   True        50d\n...\n```\n\n```\n# default APIServices\n$ kubectl get -o yaml APIService/v1.apps\napiVersion: apiregistration.k8s.io/v1\nkind: APIService\nmetadata:\n  labels:\n    kube-aggregator.kubernetes.io/automanaged: onstart\n  name: v1.apps\n  selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v1.apps\nspec:\n  group: apps\n  groupPriorityMinimum: 17800\n  version: v1\n  versionPriority: 15\nstatus:\n  conditions:\n  - lastTransitionTime: \"2020-10-20T10:39:48Z\"\n    message: Local APIServices are always available\n    reason: Local\n    status: \"True\"\n    type: Available\n\n# aggregated server    \n$ kubectl get -o yaml APIService/v1beta1.metrics.k8s.io\napiVersion: apiregistration.k8s.io/v1\nkind: APIService\nmetadata:\n  
labels:\n    addonmanager.kubernetes.io/mode: Reconcile\n    kubernetes.io/cluster-service: \"true\"\n  name: v1beta1.metrics.k8s.io\n  selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v1beta1.metrics.k8s.io\nspec:\n  group: metrics.k8s.io\n  groupPriorityMinimum: 100\n  insecureSkipTLSVerify: true\n  service:\n    name: metrics-server\n    namespace: kube-system\n    port: 443\n  version: v1beta1\n  versionPriority: 100\nstatus:\n  conditions:\n  - lastTransitionTime: \"2020-12-05T00:50:48Z\"\n    message: all checks passed\n    reason: Passed\n    status: \"True\"\n    type: Available\n\n# CRD\n$ kubectl get -o yaml APIService/v1.duyanghao.example.com\napiVersion: apiregistration.k8s.io/v1\nkind: APIService\nmetadata:\n  labels:\n    kube-aggregator.kubernetes.io/automanaged: \"true\"\n  name: v1.duyanghao.example.com\n  selfLink: /apis/apiregistration.k8s.io/v1/apiservices/v1.duyanghao.example.com\nspec:\n  group: duyanghao.example.com\n  groupPriorityMinimum: 1000\n  version: v1\n  versionPriority: 100\nstatus:\n  conditions:\n  - lastTransitionTime: \"2020-12-11T08:45:37Z\"\n    message: Local APIServices are always available\n    reason: Local\n    status: \"True\"\n    type: Available\n```\n\n- aggregatorServer创建过程中会根据所有kube-apiserver定义的API资源创建默认的APIService列表，名称即是`$VERSION.$GROUP`，这些APIService都会有标签`kube-aggregator.kubernetes.io/automanaged: onstart`，例如：v1.apps apiService。autoRegistrationController创建并维护这些列表中的APIService，也即我们看到的Local apiService；对于自定义的APIService(aggregated server)，则不会对其进行处理\n- aggregated server实现CR(自定义API资源) 的CRUD API接口，并可以灵活选择后端存储，可以与core kube-apiserver一起公用etcd，也可自己独立部署etcd数据库或者其它数据库。aggregated server实现的CR API路径为：/apis/$GROUP/$VERSION，具体到sample apiserver为：/apis/wardle.example.com/v1alpha1，下面的资源类型有：flunders以及fischers\n- aggregated server通过部署APIService类型资源，service fields指向对应的aggregated server service实现与core kube-apiserver的集成与交互\n\nsample-apiserver目录结构如下，可参考编写自己的aggregated server：\n\n```\nstaging/src/k8s.io/sample-apiserver\n├── artifacts\n│  
 ├── example\n│   │   ├── apiservice.yaml\n│   │   ...\n├── hack\n├── main.go\n└── pkg\n    ├── admission\n    ├── apis\n    ├── apiserver\n    ├── cmd\n    ├── generated\n    │   ├── clientset\n    │   │   └── versioned\n    │   │       ...\n    │   │       └── typed\n    │   │           └── wardle\n    │   │               ├── v1alpha1\n    │   │               └── v1beta1\n    │   ├── informers\n    │   │   └── externalversions\n    │   │       └── wardle\n    │   │           ├── v1alpha1\n    │   │           └── v1beta1\n    │   ├── listers\n    │   │   └── wardle\n    │   │       ├── v1alpha1\n    │   │       └── v1beta1\n    └── registry\n```\n\n- 其中，artifacts目录用于存放部署yaml示例\n- hack目录存放自动化脚本(eg: update-codegen)\n- main.go是aggregated server启动入口；pkg/cmd负责启动aggregated server的具体逻辑；pkg/apiserver用于aggregated server初始化以及路由注册\n- pkg/apis负责相关CR的结构体定义，自动生成(update-codegen)\n- pkg/admission负责准入的相关代码\n- pkg/generated负责生成访问CR的clientset，informers，以及listers\n- pkg/registry目录负责CR相关的RESTStorage实现\n\n更多代码原理详情，参考 [kubernetes-reading-notes](https://github.com/duyanghao/kubernetes-reading-notes/tree/master/core/api-server) 。\n\n<br>\n\n### 5. 
apiExtensionsServer\n\napiExtensionsServer主要负责CustomResourceDefinition（CRD）apiResources以及apiVersions的注册，同时处理CRD以及相应CustomResource（CR）的REST请求(如果对应CR不能被处理的话则会返回404)，也是apiserver Delegation的最后一环\n\n原理总结如下：\n\n- Custom Resource，简称CR，是Kubernetes自定义资源类型，与之相对应的就是Kubernetes内置的各种资源类型，例如Pod、Service等。利用CR我们可以定义任何想要的资源类型\n- CRD通过yaml文件的形式向Kubernetes注册CR实现自定义api-resources，属于第二种扩展Kubernetes API资源的方式，也是普遍使用的一种\n- `crdRegistrationController`负责将CRD GroupVersions自动注册到APIServices中。具体逻辑为：枚举所有CRDs，然后根据CRD定义的crd.Spec.Group以及crd.Spec.Versions字段构建APIService，并添加到autoRegisterController.apiServicesToSync中，由autoRegisterController进行创建以及维护操作。这也是为什么创建完CRD后会产生对应的APIService对象\n- APIExtensionServer包含的controller以及功能如下所示：\n  - `openapiController`：将 crd 资源的变化同步至提供的 OpenAPI 文档，可通过访问 `/openapi/v2` 进行查看；\n  - `crdController`：负责将 crd 信息注册到 apiVersions 和 apiResources 中，两者的信息可通过 `kubectl api-versions` 和 `kubectl api-resources` 查看；\n  - `kubectl api-versions`命令返回所有Kubernetes集群资源的版本信息（实际发出了两个请求，分别是`https://127.0.0.1:6443/api`以及`https://127.0.0.1:6443/apis`，并在最后将两个请求的返回结果进行了合并）\n\n```\n$ kubectl -v=8 api-versions \nI1211 11:44:50.276446   22493 loader.go:375] Config loaded from file:  /root/.kube/config\nI1211 11:44:50.277005   22493 round_trippers.go:420] GET https://127.0.0.1:6443/api?timeout=32s\n...\nI1211 11:44:50.290265   22493 request.go:1068] Response Body: {\"kind\":\"APIVersions\",\"versions\":[\"v1\"],\"serverAddressByClientCIDRs\":[{\"clientCIDR\":\"0.0.0.0/0\",\"serverAddress\":\"x.x.x.x:6443\"}]}\nI1211 11:44:50.293673   22493 round_trippers.go:420] GET https://127.0.0.1:6443/apis?timeout=32s\n...\nI1211 11:44:50.298360   22493 request.go:1068] Response Body: 
{\"kind\":\"APIGroupList\",\"apiVersion\":\"v1\",\"groups\":[{\"name\":\"apiregistration.k8s.io\",\"versions\":[{\"groupVersion\":\"apiregistration.k8s.io/v1\",\"version\":\"v1\"},{\"groupVersion\":\"apiregistration.k8s.io/v1beta1\",\"version\":\"v1beta1\"}],\"preferredVersion\":{\"groupVersion\":\"apiregistration.k8s.io/v1\",\"version\":\"v1\"}},{\"name\":\"extensions\",\"versions\":[{\"groupVersion\":\"extensions/v1beta1\",\"version\":\"v1beta1\"}],\"preferredVersion\":{\"groupVersion\":\"extensions/v1beta1\",\"version\":\"v1beta1\"}},{\"name\":\"apps\",\"versions\":[{\"groupVersion\":\"apps/v1\",\"version\":\"v1\"}],\"preferredVersion\":{\"groupVersion\":\"apps/v1\",\"version\":\"v1\"}},{\"name\":\"events.k8s.io\",\"versions\":[{\"groupVersion\":\"events.k8s.io/v1beta1\",\"version\":\"v1beta1\"}],\"preferredVersion\":{\"groupVersion\":\"events.k8s.io/v1beta1\",\"version\":\"v1beta1\"}},{\"name\":\"authentication.k8s.io\",\"versions\":[{\"groupVersion\":\"authentication.k8s.io/v1\",\"version\":\"v1\"},{\"groupVersion\":\"authentication.k8s.io/v1beta1\",\"version\":\"v1beta1\"}],\"preferredVersion\":{\"groupVersion\":\"authentication.k8s.io/v1\",\" [truncated 4985 chars]\napiextensions.k8s.io/v1\napiextensions.k8s.io/v1beta1\napiregistration.k8s.io/v1\napiregistration.k8s.io/v1beta1\napps/v1\nauthentication.k8s.io/v1beta1\n...\nstorage.k8s.io/v1\nstorage.k8s.io/v1beta1\nv1\n\n```\n\n\n\n`kubectl api-resources`命令就是先获取所有API版本信息，然后对每一个API版本调用接口获取该版本下的所有API资源类型\n\n```\n$ kubectl -v=8 api-resources\n 5077 loader.go:375] Config loaded from file:  /root/.kube/config\n I1211 15:19:47.593450   15077 round_trippers.go:420] GET https://127.0.0.1:6443/api?timeout=32s\n I1211 15:19:47.602273   15077 request.go:1068] Response Body: {\"kind\":\"APIVersions\",\"versions\":[\"v1\"],\"serverAddressByClientCIDRs\":[{\"clientCIDR\":\"0.0.0.0/0\",\"serverAddress\":\"x.x.x.x:6443\"}]}\n I1211 15:19:47.606279   15077 round_trippers.go:420] GET https://127.0.0.1:6443/apis?timeout=32s\n 
I1211 15:19:47.610333   15077 request.go:1068] Response Body: {\"kind\":\"APIGroupList\",\"apiVersion\":\"v1\",\"groups\":[{\"name\":\"apiregistration.k8s.io\",\"versions\":[{\"groupVersion\":\"apiregistration.k8s.io/v1\",\"version\":\"v1\"},{\"groupVersion\":\"apiregistration.k8s.io/v1beta1\",\"version\":\"v1beta1\"}],\"preferredVersion\":{\"groupVersion\":\"apiregistration.k8s.io/v1\",\"version\":\"v1\"}},{\"name\":\"extensions\",\"versions\":[{\"groupVersion\":\"extensions/v1beta1\",\"version\":\"v1beta1\"}],\"preferredVersion\":{\"groupVersion\":\"extensions/v1beta1\",\"version\":\"v1beta1\"}},{\"name\":\"apps\",\"versions\":[{\"groupVersion\":\"apps/v1\",\"version\":\"v1\"}],\"preferredVersion\":{\"groupVersion\":\"apps/v1\",\"version\":\"v1\"}},{\"name\":\"events.k8s.io\",\"versions\":[{\"groupVersion\":\"events.k8s.io/v1beta1\",\"version\":\"v1beta1\"}],\"preferredVersion\":{\"groupVersion\":\"events.k8s.io/v1beta1\",\"version\":\"v1beta1\"}},{\"name\":\"authentication.k8s.io\",\"versions\":[{\"groupVersion\":\"authentication.k8s.io/v1\",\"version\":\"v1\"},{\"groupVersion\":\"authentication.k8s.io/v1beta1\",\"version\":\"v1beta1\"}],\"preferredVersion\":{\"groupVersion\":\"authentication.k8s.io/v1\",\" [truncated 4985 chars]\n I1211 15:19:47.614700   15077 round_trippers.go:420] GET https://127.0.0.1:6443/apis/batch/v1?timeout=32s\n I1211 15:19:47.614804   15077 round_trippers.go:420] GET https://127.0.0.1:6443/apis/authentication.k8s.io/v1?timeout=32s\n I1211 15:19:47.615687   15077 round_trippers.go:420] GET https://127.0.0.1:6443/apis/auth.tkestack.io/v1?timeout=32s\n https://127.0.0.1:6443/apis/authentication.k8s.io/v1beta1?timeout=32s\n I1211 15:19:47.616794   15077 round_trippers.go:420] GET https://127.0.0.1:6443/apis/coordination.k8s.io/v1?timeout=32s\n I1211 15:19:47.616863   15077 round_trippers.go:420] GET https://127.0.0.1:6443/apis/apps/v1?timeout=32s\n ...\n NAME                              SHORTNAMES   APIGROUP                       
NAMESPACED   KIND\n bindings                                                                      true         Binding\n endpoints                         ep                                          true         Endpoints\n events                            ev                                          true         Event\n limitranges                       limits                                      true         LimitRange\n namespaces                        ns                                          false        Namespace\n nodes                             no                                          false        Node\n ...\n```\n\n- `namingController`：检查 crd obj 中是否有命名冲突，可在 crd `.status.conditions` 中查看；\n- `establishingController`：检查 crd 是否处于正常状态，可在 crd `.status.conditions` 中查看；\n- `nonStructuralSchemaController`：检查 crd obj 结构是否正常，可在 crd `.status.conditions` 中查看；\n- `apiApprovalController`：检查 crd 是否遵循 Kubernetes API 声明策略，可在 crd `.status.conditions` 中查看；\n- `finalizingController`：类似于 finalizes 的功能，与 CRs 的删除有关；\n\n总结CR CRUD APIServer处理逻辑如下：\n\n- createAPIExtensionsServer=>NewCustomResourceDefinitionHandler=>crdHandler=>注册CR CRUD API接口：\n\n```\n// New returns a new instance of CustomResourceDefinitions from the given config.\nfunc (c completedConfig) New(delegationTarget genericapiserver.DelegationTarget) (*CustomResourceDefinitions, error) {\n  ...\n    crdHandler, err := NewCustomResourceDefinitionHandler(\n      versionDiscoveryHandler,\n        groupDiscoveryHandler,\n      s.Informers.Apiextensions().V1().CustomResourceDefinitions(),\n        delegateHandler,\n      c.ExtraConfig.CRDRESTOptionsGetter,\n        c.GenericConfig.AdmissionControl,\n      establishingController,\n        c.ExtraConfig.ServiceResolver,\n      c.ExtraConfig.AuthResolverWrapper,\n        c.ExtraConfig.MasterCount,\n        s.GenericAPIServer.Authorizer,\n        c.GenericConfig.RequestTimeout,\n        time.Duration(c.GenericConfig.MinRequestTimeout)*time.Second,\n        
apiGroupInfo.StaticOpenAPISpec,\n        c.GenericConfig.MaxRequestBodyBytes,\n    )\n    if err != nil {\n        return nil, err\n    }\n    s.GenericAPIServer.Handler.NonGoRestfulMux.Handle(\"/apis\", crdHandler)\n    s.GenericAPIServer.Handler.NonGoRestfulMux.HandlePrefix(\"/apis/\", crdHandler)\n    ...\n    return s, nil\n}\n\n```\n\ncrdHandler处理逻辑如下：\n\n- 解析req(GET /apis/duyanghao.example.com/v1/namespaces/default/students)，根据请求路径中的group(duyanghao.example.com)，version(v1)，以及resource字段(students)获取对应CRD内容(crd, err := r.crdLister.Get(crdName))\n- 通过crd.UID以及crd.Name获取crdInfo，若不存在则创建对应的crdInfo(crdInfo, err := r.getOrCreateServingInfoFor(crd.UID, crd.Name))。crdInfo中包含了CRD定义以及该CRD对应Custom Resource的customresource.REST storage\n- customresource.REST storage由CR对应的Group(duyanghao.example.com)，Version(v1)，Kind(Student)，Resource(students)等创建完成，由于CR在Kubernetes代码中并没有具体结构体定义，所以这里会先初始化一个范型结构体Unstructured(用于保存所有类型的Custom Resource)，并对该结构体进行SetGroupVersionKind操作(设置具体Custom Resource Type)\n- 从customresource.REST storage获取Unstructured结构体后会对其进行相应转换然后返回\n\n```\n// k8s.io/kubernetes/staging/src/k8s.io/apiextensions-apiserver/pkg/apiserver/customresource_handler.go:223\nfunc (r *crdHandler) ServeHTTP(w http.ResponseWriter, req *http.Request) {\n  ctx := req.Context()\n  requestInfo, ok := apirequest.RequestInfoFrom(ctx)\n  ...\n  crdName := requestInfo.Resource + \".\" + requestInfo.APIGroup\n  crd, err := r.crdLister.Get(crdName)\n  ...\n  crdInfo, err := r.getOrCreateServingInfoFor(crd.UID, crd.Name)\n  verb := strings.ToUpper(requestInfo.Verb)\n  resource := requestInfo.Resource\n  subresource := requestInfo.Subresource\n  scope := metrics.CleanScope(requestInfo)\n  ...\n  switch {\n  case subresource == \"status\" && subresources != nil && subresources.Status != nil:\n      handlerFunc = r.serveStatus(w, req, requestInfo, crdInfo, terminating, supportedTypes)\n  case subresource == \"scale\" && subresources != nil && subresources.Scale != nil:\n      handlerFunc = 
r.serveScale(w, req, requestInfo, crdInfo, terminating, supportedTypes)\n  case len(subresource) == 0:\n      handlerFunc = r.serveResource(w, req, requestInfo, crdInfo, terminating, supportedTypes)\n  default:\n      responsewriters.ErrorNegotiated(\n          apierrors.NewNotFound(schema.GroupResource{Group: requestInfo.APIGroup, Resource: requestInfo.Resource}, requestInfo.Name),\n          Codecs, schema.GroupVersion{Group: requestInfo.APIGroup, Version: requestInfo.APIVersion}, w, req,\n      )\n  }\n  if handlerFunc != nil {\n      handlerFunc = metrics.InstrumentHandlerFunc(verb, requestInfo.APIGroup, requestInfo.APIVersion, resource, subresource, scope, metrics.APIServerComponent, handlerFunc)\n      handler := genericfilters.WithWaitGroup(handlerFunc, longRunningFilter, crdInfo.waitGroup)\n      handler.ServeHTTP(w, req)\n      return\n  }\n}\n\n```\n\n<br>\n\n### 6.总结\n\n#### 6.1 kubeAPIServer, apiExtensionsServer, aggregatorServer 总结\n\nkubeAPIServer：处理Pod、Service、Deployment等k8s内置资源对象。\n\napiExtensionsServer：处理CRD以及对应CR相关的对象。\n\naggregatorServer：处理 `apiregistration.k8s.io/v1` 组下的APIService对象，并将请求代理给对应的aggregated apiserver。\n\n需要aggregatorServer的原因在于：\n\n（1）APIService是集群中的一个对象，用户可以创建这个对象来扩展k8s的apiserver\n\n举例来说，HPA在获取自定义metric数据时，可以通过部署aggregated server并注册APIService来实现：HPA往apiserver发送请求时，请求会先到aggregatorServer，再被代理到我们自定义的aggregated server，由它从监控系统获取数据并返回。\n\n<br>\n\n```\nroot@k8s-master:~# kubectl get APIService\nNAME                                   SERVICE                AVAILABLE                  AGE\nv1.                                    
Local                  True                       52d\nv1.admissionregistration.k8s.io        Local                  True                       52d\nv1.apiextensions.k8s.io                Local                  True                       52d\nv1.apps                                Local                  True                       52d\nv1.authentication.k8s.io               Local                  True                       52d\nv1.authorization.k8s.io                Local                  True                       52d\nv1.autoscaling                         Local                  True                       52d\nv1.batch                               Local                  True                       52d\nv1.coordination.k8s.io                 Local                  True                       52d\nv1.networking.k8s.io                   Local                  True                       52d\nv1.rbac.authorization.k8s.io           Local                  True                       52d\nv1.scheduling.k8s.io                   Local                  True                       52d\nv1.storage.k8s.io                      Local                  True                       52d\nv1alpha1.auditregistration.k8s.io      Local                  True                       52d\nv1alpha1.node.k8s.io                   Local                  True                       52d\nv1alpha1.rbac.authorization.k8s.io     Local                  True                       52d\nv1alpha1.scheduling.k8s.io             Local                  True                       52d\nv1alpha1.settings.k8s.io               Local                  True                       52d\nv1alpha1.storage.k8s.io                Local                  True                       52d\nv1beta1.admissionregistration.k8s.io   Local                  True                       52d\nv1beta1.apiextensions.k8s.io           Local                  True                       52d\nv1beta1.apps                           Local                  
True                       52d\nv1beta1.authentication.k8s.io          Local                  True                       52d\nv1beta1.authorization.k8s.io           Local                  True                       52d\nv1beta1.batch                          Local                  True                       52d\nv1beta1.certificates.k8s.io            Local                  True                       52d\nv1beta1.coordination.k8s.io            Local                  True                       52d\nv1beta1.custom.metrics.k8s.io          kube-system/kube-hpa   False (MissingEndpoints)   41d\nv1beta1.discovery.k8s.io               Local                  True                       52d\nv1beta1.events.k8s.io                  Local                  True                       52d\nv1beta1.extensions                     Local                  True                       52d\nv1beta1.networking.k8s.io              Local                  True                       52d\nv1beta1.node.k8s.io                    Local                  True                       52d\nv1beta1.policy                         Local                  True                       52d\nv1beta1.rbac.authorization.k8s.io      Local                  True                       52d\nv1beta1.scheduling.k8s.io              Local                  True                       52d\nv1beta1.storage.k8s.io                 Local                  True                       52d\nv1beta2.apps                           Local                  True                       52d\nv2alpha1.batch                         Local                  True                       52d\nv2beta1.autoscaling                    Local                  True                       52d\nv2beta2.autoscaling                    Local                  True                       52d\nroot@k8s-master:~#\n\n这个就是apiregistration.k8s.io/v1\nroot@k8s-master:~# kubectl get APIService v1beta1.custom.metrics.k8s.io -oyaml\napiVersion: apiregistration.k8s.io/v1\nkind: 
APIService\n...\n\n这个就是apiregistration.k8s.io/v1\nroot@k8s-master:~# kubectl get APIService v2beta2.autoscaling -oyaml\napiVersion: apiregistration.k8s.io/v1\nkind: APIService\n```\n\n<br>\n\n#### 6.2 bootstrap-controller\n\n(1) bootstrap-controller主要用于创建以及维护内部kubernetes apiserver service\n\n(2) 维护Service的ClusterIP和NodePort\n\n### 7. 参考文档\n\nhttps://github.com/duyanghao/kubernetes-reading-notes/blob/master/core/api-server/extension/bootstrap_controller.md\n\n下面的文档更详细地描述了k8s是如何管理service的IP和nodeport的：\n\n[k8s apiserver对service的IP和nodeport的管理](https://segmentfault.com/a/1190000021836886)\n\nkubernetes源码剖析：https://weread.qq.com/web/reader/f1e3207071eeeefaf1e138akb5332110237b53b3a3d68d2"
  },
  {
    "path": "k8s/kube-apiserver/20. kubectl exec原理介绍.md",
    "content": "- [0. 章节目标](#0-----)\n- [1. kubectl 端做的操作](#1-kubectl------)\n  * [1.1 remotecommand包简介](#11-remotecommand---)\n  * [1.2 SPDY协议的大致原理](#12-spdy-------)\n  * [1.3 kubectl exec请求长什么样子](#13-kubectl-exec-------)\n  * [1.4 模仿kubectl 实现一个exec](#14---kubectl-----exec)\n- [2. kube-apiserver端](#2-kube-apiserver-)\n  * [2.1 pod/exec的路由注册](#21-pod-exec-----)\n- [3. kubelet端exec实现](#3-kubelet-exec--)\n- [4. 参考文章](#4-----)\n\n### 0. 章节目标\n\n弄清楚kubectl exec -it  podName -n  namespace bash的整个过程\n\n### 1. kubectl 端做的操作\n\nkubectl 相关源代码在 kubectl exec command实现里面。核心代码如下，可以看到就做了两件事：\n\n（1）get pod，确定pod存在，以及pod状态是非completed的\n\n（2）调用remotecommand.NewSPDYExecutor，往apiserver post exec这个SubResource的请求\n\n```\n// Run executes a validated remote execution against a pod.\nfunc (p *ExecOptions) Run() error {\n    \n    // 1. 判断pod是否存在\n\t\tp.Pod, err = p.PodClient.Pods(p.Namespace).Get(p.PodName, metav1.GetOptions{})\n\t\t\n\t\t// 2. 调用\n\t\t// TODO: consider abstracting into a client invocation or client helper\n\t\treq := restClient.Post().\n\t\t\tResource(\"pods\").\n\t\t\tName(pod.Name).\n\t\t\tNamespace(pod.Namespace).\n\t\t\tSubResource(\"exec\")\n\t\treq.VersionedParams(&corev1.PodExecOptions{\n\t\t\tContainer: containerName,\n\t\t\tCommand:   p.Command,\n\t\t\tStdin:     p.Stdin,\n\t\t\tStdout:    p.Out != nil,\n\t\t\tStderr:    p.ErrOut != nil,\n\t\t\tTTY:       t.Raw,\n\t\t}, scheme.ParameterCodec)\n\n\t\treturn p.Executor.Execute(\"POST\", req.URL(), p.Config, p.In, p.Out, p.ErrOut, t.Raw, sizeQueue)\n\t}\n\n\tif err := t.Safe(fn); err != nil {\n\t\treturn err\n\t}\n\n\treturn nil\n}\n\n// Execute期间是remotecommand.NewSPDYExecutor\nfunc (*DefaultRemoteExecutor) Execute(method string, url *url.URL, config *restclient.Config, stdin io.Reader, stdout, stderr io.Writer, tty bool, terminalSizeQueue remotecommand.TerminalSizeQueue) error {\n\texec, err := remotecommand.NewSPDYExecutor(config, method, url)\n\tif err != nil {\n\t\treturn err\n\t}\n\treturn 
exec.Stream(remotecommand.StreamOptions{\n\t\tStdin:             stdin,\n\t\tStdout:            stdout,\n\t\tStderr:            stderr,\n\t\tTty:               tty,\n\t\tTerminalSizeQueue: terminalSizeQueue,\n\t})\n}\n```\n\n<br>\n\n#### 1.1 remotecommand包简介\n\n`k8s.io/client-go/tools/remotecommand` 是 kubernetes client-go 提供的 remotecommand 包，提供了与集群中的容器建立长连接的方法，并可以设置容器的 stdin，stdout 等。\nremotecommand 包提供基于 [SPDY](https://en.wikipedia.org/wiki/SPDY) 协议的 Executor interface，进行和 pod 终端的流的传输。初始化一个 Executor 很简单，只需要调用 remotecommand 的 NewSPDYExecutor 并传入对应参数。\nExecutor 的 Stream 方法，会建立一个流传输的连接，直到服务端和调用端有一端关闭连接，才会停止传输。常用的做法是定义一个如下 `PtyHandler` 的 interface，然后使用你想用的客户端实现该 interface 对应的`Read(p []byte) (int, error)`和`Write(p []byte) (int, error)`方法即可，调用 Stream 方法时，只要将 StreamOptions 的 Stdin Stdout 都设置为 ptyHandler，Executor 就会通过你定义的 write 和 read 方法来传输数据。\n\n#### 1.2 SPDY协议的大致原理\n\nSPDY协议可以类比为websocket：它也是一种在单个 TCP 连接上进行全双工通信的协议。\n\n![image-20221113221917526](../images/image-20221113221917526.png)\n\n这里实际上就是通过spdy，使得stdin, stdout, stderr都通过一个tcp连接来通信，通过streamId来区分。\n\n具体的实现一般是在握手阶段将 `http.ResponseWriter` 断言为 `http.Hijacker` 接口并调用其中的 `Hijack()` 方法，拿到原始tcp连接对象并进行接管。\n\nHijack的作用就是：接管http 的tcp连接。\n\n`Hijack()`可以将HTTP对应的TCP连接取出，连接在`Hijack()`之后，HTTP的相关操作就会受到影响，调用方需要负责去关闭连接。\n\n所以需要调用方自己去负责怎么处理tcp流、关闭连接。\n\n正是因为可以自己去处理tcp流，kubectl 通过spdy就建立了 kubectl <-> apiserver的双向流连接。\n\n<br>\n\n#### 1.3 kubectl exec请求长什么样子\n\n从下面日志可以看出来，这和上面的代码是对应的。先get, 然后再post exec请求。这里主要关注post请求。post请求的url里面有命令，containerName, stdin, tty等信息。\n\n同时头部带有了： X-Stream-Protocol-Version: v4.channel.k8s.io，v3,v2,v1等。这些是k8s基于spdy实现的subprotocol，用于远程建立双向流。v1是第一版，后面每一个版本都进行了优化或者修复。\n\n```\n# kubectl exec -it zx-nginx-6b9bf7fc6d-lzhh9 -n test-test bash -v 8\nI1113 17:15:18.683343 3409932 loader.go:375] Config loaded from file:  /root/.kube/config\n\n// 第一步，先get pod\nI1113 17:15:18.696772 3409932 round_trippers.go:420] GET https://xxx:xxx/api/v1/namespaces/test-test/pods/zx-nginx-6b9bf7fc6d-lzhh9\nI1113 17:15:18.696812 3409932 
round_trippers.go:427] Request Headers:\nI1113 17:15:18.696823 3409932 round_trippers.go:431]     Accept: application/json, */*\nI1113 17:15:18.696832 3409932 round_trippers.go:431]     User-Agent: kubectl/v1.17.4 (linux/amd64) kubernetes/6a41ada\nI1113 17:15:18.706429 3409932 round_trippers.go:446] Response Status: 200 OK in 9 milliseconds\nI1113 17:15:18.706456 3409932 round_trippers.go:449] Response Headers:\nI1113 17:15:18.706462 3409932 round_trippers.go:452]     Content-Type: application/json\nI1113 17:15:18.706467 3409932 round_trippers.go:452]     Date: Sun, 13 Nov 2022 09:15:18 GMT\nI1113 17:15:18.706584 3409932 request.go:1017] Response Body: {\"kind\":\"Pod\",\"apiVersion\":\"v1\",\"metadata\":{\"name\":\"zx-nginx-6b9bf7fc6d-lzhh9\",\"generateName\":\"zx-nginx-6b9bf7fc6d-\",\"namespace\":\"test-test\",\"selfLink\":\"/api/v1/namespaces/test-test/pods/zx-nginx-6b9bf7fc6d-lzhh9\",\"uid\":\"b496d8fb-0d5e-42e3-ac23-e25d02ae262b\",\"resourceVersion\":\"1172314995\",\"creationTimestamp\":\"2022-10-20T02:25:25Z\",\"labels\":{\"app\":\"zx-nginx\",\"pod-template-hash\":\"6b9bf7fc6d\",\"project\":\"test\",\"uuid\":\"64cc50af-dbf8-4a75-b7a4-2807606045a5\"},\"annotations\":{\"pod.symphony.com/project\":\"test\",\"symphony.netease.com/last-update-time\":\"1666232725\",\"v2-subnet\":\"6a5e6bde-650a-4081-992a-d2aaec2080d5\",\"v2-tenant\":\"dcf722f63d0249f8b154876a79e1ce05\",\"v2-vpc\":\"e0a80d55-57c7-4b7f-b9cc-0e7c09647409\"},\"ownerReferences\":[{\"apiVersion\":\"apps/v1\",\"kind\":\"ReplicaSet\",\"name\":\"zx-nginx-6b9bf7fc6d\",\"uid\":\"5c48aca3-f80f-4a21-9853-64c348866edd\",\"controller\":true,\"blockOwnerDeletion\":true}]},\"spec\":{\"volumes\":[{\"name\":\"lxc\",\"hostPath\":{\"path\":\"/lxcfs\", [truncated 3124 chars]\n\n\n// 第二步post请求\nI1113 17:15:18.713105 3409932 round_trippers.go:420] POST https://xxx:xxx/api/v1/namespaces/test-test/pods/zx-nginx-6b9bf7fc6d-lzhh9/exec?command=bash&container=zx-router&stdin=true&stdout=true&tty=true\n// request信息\nI1113 
17:15:18.713132 3409932 round_trippers.go:427] Request Headers:\nI1113 17:15:18.713138 3409932 round_trippers.go:431]     User-Agent: kubectl/v1.17.4 (linux/amd64) kubernetes/6a41ada\nI1113 17:15:18.713151 3409932 round_trippers.go:431]     X-Stream-Protocol-Version: v4.channel.k8s.io\nI1113 17:15:18.713161 3409932 round_trippers.go:431]     X-Stream-Protocol-Version: v3.channel.k8s.io\nI1113 17:15:18.713166 3409932 round_trippers.go:431]     X-Stream-Protocol-Version: v2.channel.k8s.io\nI1113 17:15:18.713175 3409932 round_trippers.go:431]     X-Stream-Protocol-Version: channel.k8s.io\n\n\n// 返回信息里有一个 Upgrade: SPDY/3.1\nI1113 17:15:18.736268 3409932 round_trippers.go:446] Response Status: 101 Switching Protocols in 23 milliseconds\nI1113 17:15:18.736297 3409932 round_trippers.go:449] Response Headers:\nI1113 17:15:18.736303 3409932 round_trippers.go:452]     Connection: Upgrade\nI1113 17:15:18.736309 3409932 round_trippers.go:452]     Upgrade: SPDY/3.1\nI1113 17:15:18.736315 3409932 round_trippers.go:452]     X-Stream-Protocol-Version: v4.channel.k8s.io\nI1113 17:15:18.736337 3409932 round_trippers.go:452]     Date: Sun, 13 Nov 2022 09:15:18 GMT\n```\n\nHTTP/1.1中允许在同一个连接上通过Header头中的Connection配合Upgrade来实现协议的转换，简单来说就是允许在通过HTTP建立的连接之上使用其他的协议来进行通信，这也是k8s命令中实现协议升级的关键。\n\n在HTTP协议中除了我们常见的HTTP1.1，还支持websocket/spdy等协议。那服务端和客户端如何在http之上完成不同协议的切换呢？首先第一个要素就是这里的101 (Switching Protocols)状态码，即服务端告知客户端：我们切换到Upgrade头定义的协议上来进行通信（复用当前连接）。\n\n#### 1.4 模仿kubectl 实现一个exec\n\n从客户端看来，想要实现exec是非常简单的。我post 
exec这个子资源，并且利用remotecommand这个包的stream函数就可以实现。\n\n当然核心工作是stream函数做的，他直接通过Hijacker获取到两个底层的tcp的readerwriter之后，就可以直接通过io.copy在两个流上完成对应数据的拷贝，这样就不需要在apiserver这个地方进行协议的转换，而是直接通过tcp的流对拷就可以实现请求和结果的转发。\n\n```\nstreamProtocolV4的实现\nfunc (p *streamProtocolV4) stream(conn streamCreator) error {\n\tif err := p.createStreams(conn); err != nil {\n\t\treturn err\n\t}\n\n\t// now that all the streams have been created, proceed with reading & copying\n\n\terrorChan := watchErrorStream(p.errorStream, &errorDecoderV4{})\n\n\tp.handleResizes()\n\n\tp.copyStdin()\n\n\tvar wg sync.WaitGroup\n\tp.copyStdout(&wg)\n\tp.copyStderr(&wg)\n\n\t// we're waiting for stdout/stderr to finish copying\n\twg.Wait()\n\n\t// waits for errorStream to finish reading with an error or nil\n\treturn <-errorChan\n}\n```\n\n**go实现kubectl exec 示范代码**\n\n```\npackage main\n\nimport (\n\t\"flag\"\n\t\"fmt\"\n\t\"io\"\n\t\"os\"\n\t\"path/filepath\"\n\n\t\"golang.org/x/crypto/ssh/terminal\"\n\tcorev1 \"k8s.io/api/core/v1\"\n\t\"k8s.io/client-go/kubernetes\"\n\t\"k8s.io/client-go/kubernetes/scheme\"\n\t\"k8s.io/client-go/tools/clientcmd\"\n\t\"k8s.io/client-go/tools/remotecommand\"\n\t\"k8s.io/client-go/util/homedir\"\n)\n\nfunc main() {\n\n\tvar kubeconfig *string\n\tif home := homedir.HomeDir(); home != \"\" {\n\t\tkubeconfig = flag.String(\"kubeconfig\", filepath.Join(home, \"go/src/kubeconfig/xx\", \"kubeconfig\"), \"(optional) absolute path to the kubeconfig file\")\n\t} else {\n\t\tkubeconfig = flag.String(\"kubeconfig\", \"\", \"absolute path to the kubeconfig file\")\n\t}\n\tflag.Parse()\n\n\tconfig, err := clientcmd.BuildConfigFromFlags(\"\", *kubeconfig)\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\tclientset, err := kubernetes.NewForConfig(config)\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\n\t// 初始化pod所在的corev1资源组，发送请求\n\t// PodExecOptions struct 包括Container stdout stdout  Command 等结构\n\t// scheme.ParameterCodec 应该是pod 的GVK （GroupVersion & Kind）之类的\n\treq := 
clientset.CoreV1().RESTClient().Post().\n\t\tResource(\"pods\").\n\t\tName(\"fix-validate-cm-5b58cf68cd-6mckn\").\n\t\tNamespace(\"test-test\").\n\t\tSubResource(\"exec\").\n\t\tVersionedParams(&corev1.PodExecOptions{\n\t\t\tCommand: []string{\"ls\"},\n\t\t\tStdin:   true,\n\t\t\tStdout:  true,\n\t\t\tStderr:  true,\n\t\t\tTTY:     false,\n\t\t}, scheme.ParameterCodec)\n\n\t// remotecommand 主要实现了http 转 SPDY，添加X-Stream-Protocol-Version相关header 并发送请求\n\texec, err := remotecommand.NewSPDYExecutor(config, \"POST\", req.URL())\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\n\t// 检查是不是终端\n\tif !terminal.IsTerminal(0) || !terminal.IsTerminal(1) {\n\t\tfmt.Println(\"stdin/stdout should be terminal\")\n\t}\n\t// 这个应该是处理Ctrl + C 这种特殊键位\n\toldState, err := terminal.MakeRaw(0)\n\tif err != nil {\n\t\tfmt.Println(err)\n\t}\n\tdefer terminal.Restore(0, oldState)\n\n\t// 用IO读写替换 os stdout\n\tscreen := struct {\n\t\tio.Reader\n\t\tio.Writer\n\t}{os.Stdin, os.Stdout}\n\n\t// 建立连接之后从请求的stream中发送、读取数据\n\tif err = exec.Stream(remotecommand.StreamOptions{\n\t\tStdin:  screen,\n\t\tStdout: screen,\n\t\tStderr: screen,\n\t\tTty:    false,\n\t}); err != nil {\n\t\tfmt.Print(err)\n\t}\n}\n```\n\n运行效果（是可以成功执行的）：\n\n```\n# go run main.go\nbin  boot dev docker-entrypoint.d docker-entrypoint.sh etc home ....\n```\n\n### 2. 
kube-apiserver端\n\n#### 2.1 pod/exec的路由注册\n\n这里可以参考 [**11-kube-apiserver 启动http和https服务**](https://github.com/zoux86/learning-k8s-source-code/blob/master/k8s/kube-apiserver/11-kube-apiserver%20%E5%90%AF%E5%8A%A8http%E5%92%8Chttps%E6%9C%8D%E5%8A%A1.md)\n\n了解一下apiserver是如何注册路由的。\n\n简单说一下就是：kube-apiserver调用registerResourceHandlers为pod创建路由。就是访问某个path，对应哪个处理函数。\n\n对于 pod/exec而言，对应的就是CONNECT动作，执行CONNECT的handler。\n\n**为啥是CONNECT呢**\n\n核心代码都在：staging/src/k8s.io/apiserver/pkg/endpoints/installer.go 的registerResourceHandlers\n\nregisterResourceHandlers 函数首先会进行一堆判断，根据资源storage判断某个资源支持哪些操作。\n\n而查看 pod/exec 这个资源的storage，它只实现了Connect方法，所以它只支持CONNECT动作。\n\n```\ncreater, isCreater := storage.(rest.Creater)\n\tnamedCreater, isNamedCreater := storage.(rest.NamedCreater)\n\tlister, isLister := storage.(rest.Lister)\n\tgetter, isGetter := storage.(rest.Getter)\n\tgetterWithOptions, isGetterWithOptions := storage.(rest.GetterWithOptions)\n\tgracefulDeleter, isGracefulDeleter := storage.(rest.GracefulDeleter)\n\tcollectionDeleter, isCollectionDeleter := storage.(rest.CollectionDeleter)\n\tupdater, isUpdater := storage.(rest.Updater)\n\tpatcher, isPatcher := storage.(rest.Patcher)\n\twatcher, isWatcher := storage.(rest.Watcher)\n\tconnecter, isConnecter := storage.(rest.Connecter)\n\tstorageMeta, isMetadata := storage.(rest.StorageMetadata)\n\tstorageVersionProvider, isStorageVersionProvider := storage.(rest.StorageVersionProvider)\n```\n\n所以，当客户端post exec的时候，最终调用的是下面的方法。\n\n这里的核心就是调用restfulConnectResource方法，实现kube-apiserver <-> kubelet流的建立。\n\n这样kubectl <-> kube-apiserver <-> kubelet就打通了。\n\n```\ncase \"CONNECT\":\n   for _, method := range connecter.ConnectMethods() {\n      connectProducedObject := storageMeta.ProducesObject(method)\n      if connectProducedObject == nil {\n         connectProducedObject = \"string\"\n      }\n      doc := \"connect \" + method + \" requests to \" + kind\n      if isSubresource {\n         doc = \"connect \" + method + \" requests to \" + subresource + \" of \" 
+ kind\n      }\n      handler := metrics.InstrumentRouteFunc(action.Verb, group, version, resource, subresource, requestScope, metrics.APIServerComponent, restfulConnectResource(connecter, reqScope, admit, path, isSubresource))\n      route := ws.Method(method).Path(action.Path).\n         To(handler).\n         Doc(doc).\n         Operation(\"connect\" + strings.Title(strings.ToLower(method)) + namespaced + kind + strings.Title(subresource) + operationSuffix).\n         Produces(\"*/*\").\n         Consumes(\"*/*\").\n         Writes(connectProducedObject)\n      if versionedConnectOptions != nil {\n         if err := AddObjectParams(ws, route, versionedConnectOptions); err != nil {\n            return nil, err\n         }\n      }\n      addParams(route, action.Params)\n      routes = append(routes, route)\n\n      // transform ConnectMethods to kube verbs\n      if kubeVerb, found := toDiscoveryKubeVerb[method]; found {\n         if len(kubeVerb) != 0 {\n            kubeVerbs[kubeVerb] = struct{}{}\n         }\n      }\n   }\n```\n\nrestfulConnectResource的调用路线如下：\n\n```\nrestfulConnectResource -> ConnectResource->ExecREST.Connect \n\n\n// 核心就是调用ExecLocation.streamLocation来根据pod获取node信息，主要是kubelet ip+Port。然后获取kubelet流的连接\n// 然后调用newThrottledUpgradeAwareProxyHandler升级流的协议，这就是response看到的101升级信息\n// Connect returns a handler for the pod exec proxy\nfunc (r *ExecREST) Connect(ctx context.Context, name string, opts runtime.Object, responder rest.Responder) (http.Handler, error) {\n\texecOpts, ok := opts.(*api.PodExecOptions)\n\tif !ok {\n\t\treturn nil, fmt.Errorf(\"invalid options object: %#v\", opts)\n\t}\n\tlocation, transport, err := pod.ExecLocation(r.Store, r.KubeletConn, ctx, name, execOpts)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\treturn newThrottledUpgradeAwareProxyHandler(location, transport, false, true, true, responder), nil\n}\n\n\n\nfunc streamLocation(\n\tgetter ResourceGetter,\n\tconnInfo client.ConnectionInfoGetter,\n\tctx 
context.Context,\n\tname string,\n\topts runtime.Object,\n\tcontainer,\n\tpath string,\n) (*url.URL, http.RoundTripper, error) {\n\tpod, err := getPod(getter, ctx, name)\n\tif err != nil {\n\t\treturn nil, nil, err\n\t}\n\n\t// Try to figure out a container\n\t// If a container was provided, it must be valid\n\tif container == \"\" {\n\t\tswitch len(pod.Spec.Containers) {\n\t\tcase 1:\n\t\t\tcontainer = pod.Spec.Containers[0].Name\n\t\tcase 0:\n\t\t\treturn nil, nil, errors.NewBadRequest(fmt.Sprintf(\"a container name must be specified for pod %s\", name))\n\t\tdefault:\n\t\t\tcontainerNames := getContainerNames(pod.Spec.Containers)\n\t\t\tinitContainerNames := getContainerNames(pod.Spec.InitContainers)\n\t\t\terr := fmt.Sprintf(\"a container name must be specified for pod %s, choose one of: [%s]\", name, containerNames)\n\t\t\tif len(initContainerNames) > 0 {\n\t\t\t\terr += fmt.Sprintf(\" or one of the init containers: [%s]\", initContainerNames)\n\t\t\t}\n\t\t\treturn nil, nil, errors.NewBadRequest(err)\n\t\t}\n\t} else {\n\t\tif !podHasContainerWithName(pod, container) {\n\t\t\treturn nil, nil, errors.NewBadRequest(fmt.Sprintf(\"container %s is not valid for pod %s\", container, name))\n\t\t}\n\t}\n\tnodeName := types.NodeName(pod.Spec.NodeName)\n\tif len(nodeName) == 0 {\n\t\t// If pod has not been assigned a host, return an empty location\n\t\treturn nil, nil, errors.NewBadRequest(fmt.Sprintf(\"pod %s does not have a host assigned\", name))\n\t}\n\tnodeInfo, err := connInfo.GetConnectionInfo(ctx, nodeName)\n\tif err != nil {\n\t\treturn nil, nil, err\n\t}\n\tparams := url.Values{}\n\tif err := streamParams(params, opts); err != nil {\n\t\treturn nil, nil, err\n\t}\n\tloc := &url.URL{\n\t\tScheme:   nodeInfo.Scheme,\n\t\tHost:     net.JoinHostPort(nodeInfo.Hostname, nodeInfo.Port),\n\t\tPath:     fmt.Sprintf(\"/%s/%s/%s/%s\", path, pod.Namespace, pod.Name, container),\n\t\tRawQuery: params.Encode(),\n\t}\n\treturn loc, nodeInfo.Transport, nil\n}\n```\n\n### 3. 
kubelet端exec实现\n\n参考：https://www.kubernetes.org.cn/7195.html\n\n### 4. 参考文章\n\n[图解kubernetes命令执行核心实现](https://www.kubernetes.org.cn/7195.html)\n\n[kubernetes exec源码简析](https://blog.csdn.net/nangonghen/article/details/110411187)\n\n[Kubernetes首个重要安全漏洞研究&百度云全量修复报告](https://zhuanlan.zhihu.com/p/52268484)\n\n[Kubectl exec 背后到底发生了什么？](https://cloud.tencent.com/developer/article/1632735#:~:text=kubectl%20exec%20%E7%9A%84%E5%B7%A5%E4%BD%9C%E5%8E%9F%E7%90%86%E7%94%A8%E4%B8%80%E5%BC%A0%E5%9B%BE%E5%B0%B1%E5%8F%AF%E4%BB%A5%E8%A1%A8%E7%A4%BA%EF%BC%9A%20%E7%AC%AC%E4%B8%80%E4%B8%AA%20kubectl%20exec%20%E5%9C%A8%E5%AE%B9%E5%99%A8%E5%86%85%E6%89%A7%E8%A1%8C%E4%BA%86%20date,Pod%20%E4%BF%A1%E6%81%AF%E3%80%82%20POST%20%E8%AF%B7%E6%B1%82%E8%B0%83%E7%94%A8%20Pod%20%E7%9A%84%E5%AD%90%E8%B5%84%E6%BA%90%20exec%20%E5%9C%A8%E5%AE%B9%E5%99%A8%E5%86%85%E6%89%A7%E8%A1%8C%E5%91%BD%E4%BB%A4%E3%80%82)\n\n[Kubectl exec 的工作原理解读](https://juejin.cn/post/6844904168860155911)\n\n[使用 client-go 实现 kubectl port-forward](https://www.modb.pro/db/137716)\n\n[自己动手实现一个 kubectl exec](https://cloud.tencent.com/developer/article/1824992)\n\n"
  },
  {
    "path": "k8s/kube-apiserver/21-kube-apiserver list-watch源码分析.md",
    "content": "- [0. 背景](#0---)\n- [1. List-watch api定义](#1-list-watch-api--)\n- [2. 核心handler函数-ListResource](#2---handler---listresource)\n  * [2.1 serveWatch](#21-servewatch)\n  * [2.2 ServeHTTP](#22-servehttp)\n  * [2.3 初步总结](#23-----)\n  * [2.3 rw.Watch()](#23-rwwatch--)\n    + [2.3.1 WatchPredicate](#231-watchpredicate)\n    + [2.3.2 WatchList](#232-watchlist)\n  * [2.4 cache的初始化](#24-cache----)\n    + [2.4.1 Cacher结构体如下](#241-cacher-----)\n    + [2.4.2 cache是如何初始化的-NewCacherFromConfig](#242-cache--------newcacherfromconfig)\n    + [2.4.3 Cache的watchCache结构体](#243-cache-watchcache---)\n- [3. 参考文档](#3---参考)\n\n### 0. 背景\n\n从代码这边探究一下list-watch的实现方式。为性能优化提供思路\n\n### 1. List-watch api定义\n\n提供list和watch服务的入口是同一个，在API接口中是通过 `GET /pods?watch=true`这种方式来区分是list还是watch\n\n和其他接口意义，这一部分都是定义在 registerResourceHandlers函数中\n\n```\n// staging/src/k8s.io/apiserver/pkg/endpoints/installer.go      \n\ncase \"LIST\": // List all resources of a kind.\n\t\t\tdoc := \"list objects of kind \" + kind\n\t\t\tif isSubresource {\n\t\t\t\tdoc = \"list \" + subresource + \" of objects of kind \" + kind\n\t\t\t}\n\t\t\thandler := metrics.InstrumentRouteFunc(action.Verb, group, version, resource, subresource, requestScope, metrics.APIServerComponent, restfulListResource(lister, watcher, reqScope, false, a.minRequestTimeout))\n\t\t\troute := ws.GET(action.Path).To(handler).\n\t\t\t\tDoc(doc).\n\t\t\t\tParam(ws.QueryParameter(\"pretty\", \"If 'true', then the output is pretty printed.\")).\n\t\t\t\tOperation(\"list\"+namespaced+kind+strings.Title(subresource)+operationSuffix).\n\t\t\t\tProduces(append(storageMeta.ProducesMIMETypes(action.Verb), allMediaTypes...)...).\n\t\t\t\tReturns(http.StatusOK, \"OK\", versionedList).\n\t\t\t\tWrites(versionedList)\n\t\t\tif err := AddObjectParams(ws, route, versionedListOptions); err != nil {\n\t\t\t\treturn nil, err\n\t\t\t}\n\t\t\tswitch {\n\t\t\tcase isLister && isWatcher:\n\t\t\t\tdoc := \"list or watch objects of kind \" + kind\n\t\t\t\tif 
isSubresource {\n\t\t\t\t\tdoc = \"list or watch \" + subresource + \" of objects of kind \" + kind\n\t\t\t\t}\n\t\t\t\troute.Doc(doc)\n\t\t\tcase isWatcher:\n\t\t\t\tdoc := \"watch objects of kind \" + kind\n\t\t\t\tif isSubresource {\n\t\t\t\t\tdoc = \"watch \" + subresource + \"of objects of kind \" + kind\n\t\t\t\t}\n\t\t\t\troute.Doc(doc)\n\t\t\t}\n\t\t\taddParams(route, action.Params)\n\t\t\troutes = append(routes, route)\n```\n\n### 2. 核心handler函数-ListResource\n\n上面的handler函数是restfulListResource，核心是调用了ListResource函数。\n\n```\nfunc restfulListResource(r rest.Lister, rw rest.Watcher, scope handlers.RequestScope, forceWatch bool, minRequestTimeout time.Duration) restful.RouteFunction {\n\treturn func(req *restful.Request, res *restful.Response) {\n\t\thandlers.ListResource(r, rw, &scope, forceWatch, minRequestTimeout)(res.ResponseWriter, req.Request)\n\t}\n}\n```\n\nListResource具体逻辑如下：\n\n（1）第一步获取ns, name。获取不到ns直接报错，获取不到name，则表示监听多个资源。可以get pods -w 也可以get  pod podA -w\n\n（2）第二步提取label。如果没有label，有name其实也是一个label。  list-watch经常会指定lable减少apiserver压力\n\n（3）第三步，如果是watch，v3级别log会有打印。最终会调用serveWatch函数。注意，如果是list-watch，最终还是只调用rw.watch不会走到第四步。看起来watch包括了list\n\n（4）第四步，如果只是list,调用 r.List(ctx, &opts)\n\n```\nfunc ListResource(r rest.Lister, rw rest.Watcher, scope *RequestScope, forceWatch bool, minRequestTimeout time.Duration) http.HandlerFunc {\n\treturn func(w http.ResponseWriter, req *http.Request) {\n\t\t// For performance tracking purposes.\n\t\ttrace := utiltrace.New(\"List\", utiltrace.Field{Key: \"url\", Value: req.URL.Path}, utiltrace.Field{Key: \"user-agent\", Value: &lazyTruncatedUserAgent{req}}, utiltrace.Field{Key: \"client\", Value: &lazyClientIP{req}})\n    // 1.第一步获取ns, name。获取不到ns直接报错，获取不到name，则表示监听多个资源\n\t\tnamespace, err := scope.Namer.Namespace(req)\n\t\tif err != nil {\n\t\t\tscope.err(err, w, req)\n\t\t\treturn\n\t\t}\n\n\t\t// Watches for single objects are routed to this function.\n\t\t// Treat a name parameter the same as a field selector 
entry.\n\t\thasName := true\n\t\t_, name, err := scope.Namer.Name(req)\n\t\tif err != nil {\n\t\t\thasName = false\n\t\t}\n\n\t\tctx := req.Context()\n\t\tctx = request.WithNamespace(ctx, namespace)\n  \n\t\toutputMediaType, _, err := negotiation.NegotiateOutputMediaType(req, scope.Serializer, scope)\n\t\tif err != nil {\n\t\t\tscope.err(err, w, req)\n\t\t\treturn\n\t\t}\n    \n    \n    // 2.第二步提取label。如果没有label，有name其实也是一个label。  list-watch经常会指定lable减少apiserver压力\n\t\topts := metainternalversion.ListOptions{}\n\t\tif err := metainternalversionscheme.ParameterCodec.DecodeParameters(req.URL.Query(), scope.MetaGroupVersion, &opts); err != nil {\n\t\t\terr = errors.NewBadRequest(err.Error())\n\t\t\tscope.err(err, w, req)\n\t\t\treturn\n\t\t}\n  \n\t\t// transform fields\n\t\t// TODO: DecodeParametersInto should do this.\n\t\tif opts.FieldSelector != nil {\n\t\t\tfn := func(label, value string) (newLabel, newValue string, err error) {\n\t\t\t\treturn scope.Convertor.ConvertFieldLabel(scope.Kind, label, value)\n\t\t\t}\n\t\t\tif opts.FieldSelector, err = opts.FieldSelector.Transform(fn); err != nil {\n\t\t\t\t// TODO: allow bad request to set field causes based on query parameters\n\t\t\t\terr = errors.NewBadRequest(err.Error())\n\t\t\t\tscope.err(err, w, req)\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\n\t\tif hasName {\n\t\t\t// metadata.name is the canonical internal name.\n\t\t\t// SelectionPredicate will notice that this is a request for\n\t\t\t// a single object and optimize the storage query accordingly.\n\t\t\tnameSelector := fields.OneTermEqualSelector(\"metadata.name\", name)\n\n\t\t\t// Note that fieldSelector setting explicitly the \"metadata.name\"\n\t\t\t// will result in reaching this branch (as the value of that field\n\t\t\t// is propagated to requestInfo as the name parameter.\n\t\t\t// That said, the allowed field selectors in this branch are:\n\t\t\t// nil, fields.Everything and field selector matching metadata.name\n\t\t\t// for our name.\n\t\t\tif 
opts.FieldSelector != nil && !opts.FieldSelector.Empty() {\n\t\t\t\tselectedName, ok := opts.FieldSelector.RequiresExactMatch(\"metadata.name\")\n\t\t\t\tif !ok || name != selectedName {\n\t\t\t\t\tscope.err(errors.NewBadRequest(\"fieldSelector metadata.name doesn't match requested name\"), w, req)\n\t\t\t\t\treturn\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\topts.FieldSelector = nameSelector\n\t\t\t}\n\t\t}\n\t\t\n\t\t// 3.第三步，如果是watch，v3级别log会有打印。最终会调用rw.Watch函数\n\t\tif opts.Watch || forceWatch {\n\t\t\tif rw == nil {\n\t\t\t\tscope.err(errors.NewMethodNotSupported(scope.Resource.GroupResource(), \"watch\"), w, req)\n\t\t\t\treturn\n\t\t\t}\n\t\t\t// TODO: Currently we explicitly ignore ?timeout= and use only ?timeoutSeconds=.\n\t\t\ttimeout := time.Duration(0)\n\t\t\tif opts.TimeoutSeconds != nil {\n\t\t\t\ttimeout = time.Duration(*opts.TimeoutSeconds) * time.Second\n\t\t\t}\n\t\t\tif timeout == 0 && minRequestTimeout > 0 {\n\t\t\t\ttimeout = time.Duration(float64(minRequestTimeout) * (rand.Float64() + 1.0))\n\t\t\t}\n\t\t\tklog.V(3).Infof(\"Starting watch for %s, rv=%s labels=%s fields=%s timeout=%s\", req.URL.Path, opts.ResourceVersion, opts.LabelSelector, opts.FieldSelector, timeout)\n\t\t\tctx, cancel := context.WithTimeout(ctx, timeout)\n\t\t\tdefer cancel()\n\t\t\twatcher, err := rw.Watch(ctx, &opts)\n\t\t\tif err != nil {\n\t\t\t\tscope.err(err, w, req)\n\t\t\t\treturn\n\t\t\t}\n\t\t\trequestInfo, _ := request.RequestInfoFrom(ctx)\n\t\t\tmetrics.RecordLongRunning(req, requestInfo, metrics.APIServerComponent, func() {\n\t\t\t\tserveWatch(watcher, scope, outputMediaType, req, w, timeout)\n\t\t\t})\n\t\t\treturn\n\t\t}\n\n\t\t// Log only long List requests (ignore Watch).\n\t\tdefer trace.LogIfLong(500 * time.Millisecond)\n\t\ttrace.Step(\"About to List from storage\")\n\t\tresult, err := r.List(ctx, &opts)\n\t\tif err != nil {\n\t\t\tscope.err(err, w, req)\n\t\t\treturn\n\t\t}\n\t\ttrace.Step(\"Listing from storage done\")\n\n\t\ttransformResponseObject(ctx, scope, 
trace, req, w, http.StatusOK, outputMediaType, result)\n\t\ttrace.Step(\"Writing http response done\", utiltrace.Field{\"count\", meta.LenList(result)})\n\t}\n}\n```\n\n\n\n每次有一个watch的url请求过来，都会调用`rw.Watch()`创建一个`watcher`，然后使用`serveWatch()`来处理这个请求。**watcher的生命周期是每个http请求的**，这一点非常重要。\n\n#### 2.1 serveWatch\n\n这里只是实例化一个WatchServer结构体。核心是ServeHTTP函数\n\n```\n// serveWatch will serve a watch response.\n// TODO: the functionality in this method and in WatchServer.Serve is not cleanly decoupled.\nfunc serveWatch(watcher watch.Interface, scope *RequestScope, mediaTypeOptions negotiation.MediaTypeOptions, req *http.Request, w http.ResponseWriter, timeout time.Duration) {\n   options, err := optionsForTransform(mediaTypeOptions, req)\n   if err != nil {\n      scope.err(err, w, req)\n      return\n   }\n\n   // negotiate for the stream serializer from the scope's serializer\n   serializer, err := negotiation.NegotiateOutputMediaTypeStream(req, scope.Serializer, scope)\n   if err != nil {\n      scope.err(err, w, req)\n      return\n   }\n   framer := serializer.StreamSerializer.Framer\n   streamSerializer := serializer.StreamSerializer.Serializer\n   encoder := scope.Serializer.EncoderForVersion(streamSerializer, scope.Kind.GroupVersion())\n   useTextFraming := serializer.EncodesAsText\n   if framer == nil {\n      scope.err(fmt.Errorf(\"no framer defined for %q available for embedded encoding\", serializer.MediaType), w, req)\n      return\n   }\n   // TODO: next step, get back mediaTypeOptions from negotiate and return the exact value here\n   mediaType := serializer.MediaType\n   if mediaType != runtime.ContentTypeJSON {\n      mediaType += \";stream=watch\"\n   }\n\n   // locate the appropriate embedded encoder based on the transform\n   var embeddedEncoder runtime.Encoder\n   contentKind, contentSerializer, transform := targetEncodingForTransform(scope, mediaTypeOptions, req)\n   if transform {\n      info, ok := 
runtime.SerializerInfoForMediaType(contentSerializer.SupportedMediaTypes(), serializer.MediaType)\n      if !ok {\n         scope.err(fmt.Errorf(\"no encoder for %q exists in the requested target %#v\", serializer.MediaType, contentSerializer), w, req)\n         return\n      }\n      embeddedEncoder = contentSerializer.EncoderForVersion(info.Serializer, contentKind.GroupVersion())\n   } else {\n      embeddedEncoder = scope.Serializer.EncoderForVersion(serializer.Serializer, contentKind.GroupVersion())\n   }\n\n   ctx := req.Context()\n\n   server := &WatchServer{\n      Watching: watcher,\n      Scope:    scope,\n\n      UseTextFraming:  useTextFraming,\n      MediaType:       mediaType,\n      Framer:          framer,\n      Encoder:         encoder,\n      EmbeddedEncoder: embeddedEncoder,\n\n      Fixup: func(obj runtime.Object) runtime.Object {\n         result, err := transformObject(ctx, obj, options, mediaTypeOptions, scope, req)\n         if err != nil {\n            utilruntime.HandleError(fmt.Errorf(\"failed to transform object %v: %v\", reflect.TypeOf(obj), err))\n            return obj\n         }\n         // When we are transformed to a table, use the table options as the state for whether we\n         // should print headers - on watch, we only want to print table headers on the first object\n         // and omit them on subsequent events.\n         if tableOptions, ok := options.(*metav1beta1.TableOptions); ok {\n            tableOptions.NoHeaders = true\n         }\n         return result\n      },\n\n      TimeoutFactory: &realTimeoutFactory{timeout},\n   }\n\n   server.ServeHTTP(w, req)\n}\n```\n\n#### 2.2 ServeHTTP\n\n这里核心就是从`watcher`的结果channel中读取一个event对象，然后持续不断的编码写入到http response的流当中。\n\n```\nch := s.Watching.ResultChan()\n\tfor {\n\t\tselect {\n\t\tcase <-cn.CloseNotify():\n\t\t\treturn\n\t\tcase <-timeoutCh:\n\t\t\treturn\n\t\tcase event, ok := <-ch:\n\t\t\tif !ok {\n\t\t\t\t// End of 
results.\n\t\t\t\treturn\n\t\t\t}\n\t\t\tmetrics.WatchEvents.WithLabelValues(kind.Group, kind.Version, kind.Kind).Inc()\n\n\t\t\tobj := s.Fixup(event.Object)\n\t\t\tif err := s.EmbeddedEncoder.Encode(obj, buf); err != nil {\n\t\t\t\t// unexpected error\n\t\t\t\tutilruntime.HandleError(fmt.Errorf(\"unable to encode watch object %T: %v\", obj, err))\n\t\t\t\treturn\n\t\t\t}\n\n\t\t\t// ContentType is not required here because we are defaulting to the serializer\n\t\t\t// type\n\t\t\tunknown.Raw = buf.Bytes()\n\t\t\tevent.Object = &unknown\n\t\t\tmetrics.WatchEventsSizes.WithLabelValues(kind.Group, kind.Version, kind.Kind).Observe(float64(len(unknown.Raw)))\n\n\t\t\t*outEvent = metav1.WatchEvent{}\n\n\t\t\t// create the external type directly and encode it.  Clients will only recognize the serialization we provide.\n\t\t\t// The internal event is being reused, not reallocated so its just a few extra assignments to do it this way\n\t\t\t// and we get the benefit of using conversion functions which already have to stay in sync\n\t\t\t*internalEvent = metav1.InternalEvent(event)\n\t\t\terr := metav1.Convert_v1_InternalEvent_To_v1_WatchEvent(internalEvent, outEvent, nil)\n\t\t\tif err != nil {\n\t\t\t\tutilruntime.HandleError(fmt.Errorf(\"unable to convert watch object: %v\", err))\n\t\t\t\t// client disconnect.\n\t\t\t\treturn\n\t\t\t}\n\t\t\tif err := e.Encode(outEvent); err != nil {\n\t\t\t\tutilruntime.HandleError(fmt.Errorf(\"unable to encode watch object %T: %v (%#v)\", outEvent, err, e))\n\t\t\t\t// client disconnect.\n\t\t\t\treturn\n\t\t\t}\n\t\t\tif len(ch) == 0 {\n\t\t\t\tflusher.Flush()\n\t\t\t}\n\n\t\t\tbuf.Reset()\n\t\t}\n\t}\n```\n\n#### 2.3 
初步总结\n\n目前看起来的流程就是这样的：\n\n用户发起一个watch请求，apiserver会初始化一个WatchServer对象。WatchServer.ServeHTTP会初始化一个流，然后监听一个channel，源源不断地把channel中出现的event发送到流中。这样客户端就接收到了watch事件。\n\n目前还有2个问题需要进一步确定：\n\n（1）`rw.Watch()`创建一个`watcher`。创建的watcher到底是什么样子？\n\n（2）它是怎么从etcd中获得变化的数据的？又是怎么过滤条件的？\n\n![image-20221206113009053](../images/image-20221206113009053.png)\n\n#### 2.3 rw.Watch()\n\n```\n// Watcher should be implemented by all Storage objects that\n// want to offer the ability to watch for changes through the watch api.\ntype Watcher interface {\n   // 'label' selects on labels; 'field' selects on the object's fields. Not all fields\n   // are supported; an error should be returned if 'field' tries to select on a field that\n   // isn't supported. 'resourceVersion' allows for continuing/starting a watch at a\n   // particular version.\n   Watch(ctx context.Context, options *metainternalversion.ListOptions) (watch.Interface, error)\n}\n```\n\n查看这个函数的实现，最终是：\n\n```\n// Watch makes a matcher for the given label and field, and calls\n// WatchPredicate. If possible, you should customize PredicateFunc to produce\n// a matcher that matches by key. 
SelectionPredicate does this for you\n// automatically.\nfunc (e *Store) Watch(ctx context.Context, options *metainternalversion.ListOptions) (watch.Interface, error) {\n\tlabel := labels.Everything()\n\tif options != nil && options.LabelSelector != nil {\n\t\tlabel = options.LabelSelector\n\t}\n\tfield := fields.Everything()\n\tif options != nil && options.FieldSelector != nil {\n\t\tfield = options.FieldSelector\n\t}\n\tpredicate := e.PredicateFunc(label, field)\n\n\tresourceVersion := \"\"\n\tif options != nil {\n\t\tresourceVersion = options.ResourceVersion\n\t\tpredicate.AllowWatchBookmarks = options.AllowWatchBookmarks\n\t}\n\treturn e.WatchPredicate(ctx, predicate, resourceVersion)\n}\n```\n\n这里也很简单最终调用了e.WatchPredicate。\n\n其中有个参数是一个函数：predicate，这个也是每个对象storage需要实现的。\n\n以pod为例：predicate对应的函数为MatchPod，其中GetAttrs判断一个对象是否为pod\n\n```\n// MatchPod returns a generic matcher for a given label and field selector.\nfunc MatchPod(label labels.Selector, field fields.Selector) storage.SelectionPredicate {\n\treturn storage.SelectionPredicate{\n\t\tLabel:       label,\n\t\tField:       field,\n\t\tGetAttrs:    GetAttrs,\n\t\tIndexFields: []string{\"spec.nodeName\"},\n\t}\n}\n```\n\n##### 2.3.1 WatchPredicate\n\n```\n// WatchPredicate starts a watch for the items that matches.\nfunc (e *Store) WatchPredicate(ctx context.Context, p storage.SelectionPredicate, resourceVersion string) (watch.Interface, error) {\n\tif name, ok := p.MatchesSingle(); ok {\n\t\tif key, err := e.KeyFunc(ctx, name); err == nil {\n\t\t\tw, err := e.Storage.Watch(ctx, key, resourceVersion, p)\n\t\t\tif err != nil {\n\t\t\t\treturn nil, err\n\t\t\t}\n\t\t\tif e.Decorator != nil {\n\t\t\t\treturn newDecoratedWatcher(w, e.Decorator), nil\n\t\t\t}\n\t\t\treturn w, nil\n\t\t}\n\t\t// if we cannot extract a key based on the current context, the\n\t\t// optimization is skipped\n\t}\n\n\tw, err := e.Storage.WatchList(ctx, e.KeyRootFunc(ctx), resourceVersion, p)\n\tif err != nil {\n\t\treturn nil, 
err\n\t}\n\tif e.Decorator != nil {\n\t\treturn newDecoratedWatcher(w, e.Decorator), nil\n\t}\n\treturn w, nil\n}\n```\n\n这里有2个点，第一个就是e.Storage.WatchList函数。\n\n第二个就是newDecoratedWatcher(w, e.Decorator)。\n\n##### 2.3.2 WatchList\n\n```\n// 返回的就是一个watcher对象。第一个参数ctx是上下文信息。第二个e.KeyRootFunc(ctx)其实就是取出来ns/podname。第三个是resourceversion，指定了就按指定值来，没有指定则是空字符串。最后一个参数p就是MatchPod\nw, err := e.Storage.WatchList(ctx, e.KeyRootFunc(ctx), resourceVersion, p)\n\n\n// 到这里Storage只是一个接口。Storage包含了Delete、Create、WatchList等等的接口。接下来以pod为例来看具体实现\nfunc (s *DryRunnableStorage) WatchList(ctx context.Context, key string, resourceVersion string, p storage.SelectionPredicate) (watch.Interface, error) {\n\treturn s.Storage.WatchList(ctx, key, resourceVersion, p)\n}\n```\n\nWatchList最终调用 Watch函数，在cacher.go中。核心就是：\n\n（1）生成一个新的watcher\n\n（2）找出所有已有的event\n\n（3）将watcher注册到所有的watchers中去\n\n（4）进行watch逻辑的处理\n\n看到这里可能有点懵，因为这是自下而上的分析。\n\n这里有个前提是: 在apiserver运行的时候，apiserver内部有一个大cache，监听保存了所有etcd的数据。\n\n所以，在新来一个watcher3之前，可能已有的结构就是下图这样了。\n\n所以对于一个新的watch，做的事情就是：\n\n（1）生成一个新的watcher3。对象有filter，watch.Event channel, type资源类型等等\n\n（2）对于新来的watcher3，要将cache已有的数据传给watcher3处理，这里就是下面的initEvents\n\n（3）新来的watcher3，也要注册到watcher集合里面，这样etcd后面的add/update/delete事件也能分发给这个watcher3。\n\n```mermaid\nflowchart LR\nA[etcd] -->|add/update/delete event| B(cache)\nB --> C{watcher集合}\nC -->|One| D[watcher1]\nC -->|Two| E[watcher2]\n```\n\n```\nstaging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go\n// Watch implements storage.Interface.\nfunc (c *Cacher) Watch(ctx context.Context, key string, resourceVersion string, pred storage.SelectionPredicate) (watch.Interface, error) {\n\twatchRV, err := c.versioner.ParseResourceVersion(resourceVersion)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\tc.ready.wait()\n\n\ttriggerValue, triggerSupported := \"\", false\n\tif c.indexedTrigger != nil {\n\t\tfor _, field := range pred.IndexFields {\n\t\t\tif field == c.indexedTrigger.indexName {\n\t\t\t\tif value, ok := pred.Field.RequiresExactMatch(field); ok 
{\n\t\t\t\t\ttriggerValue, triggerSupported = value, true\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\n\t// If there is indexedTrigger defined, but triggerSupported is false,\n\t// we can't narrow the amount of events significantly at this point.\n\t//\n\t// That said, currently indexedTrigger is defined only for couple resources:\n\t// Pods, Nodes, Secrets and ConfigMaps and there is only a constant\n\t// number of watchers for which triggerSupported is false (excluding those\n\t// issued explicitly by users).\n\t// Thus, to reduce the risk of those watchers blocking all watchers of a\n\t// given resource in the system, we increase the sizes of buffers for them.\n\tchanSize := 10\n\tif c.indexedTrigger != nil && !triggerSupported {\n\t\t// TODO: We should tune this value and ideally make it dependent on the\n\t\t// number of objects of a given type and/or their churn.\n\t\tchanSize = 1000\n\t}\n\n\t// Determine watch timeout('0' means deadline is not set, ignore checking)\n\tdeadline, _ := ctx.Deadline()\n\t// Create a watcher here to reduce memory allocations under lock,\n\t// given that memory allocation may trigger GC and block the thread.\n\t// Also note that emptyFunc is a placeholder, until we will be able\n\t// to compute watcher.forget function (which has to happen under lock).\n\t// 1.生成一个新的watcher\n\twatcher := newCacheWatcher(chanSize, filterWithAttrsFunction(key, pred), emptyFunc, c.versioner, deadline, pred.AllowWatchBookmarks, c.objectType)\n\n\t// We explicitly use thread unsafe version and do locking ourself to ensure that\n\t// no new events will be processed in the meantime. The watchCache will be unlocked\n\t// on return from this function.\n\t// Note that we cannot do it under Cacher lock, to avoid a deadlock, since the\n\t// underlying watchCache is calling processEvent under its lock.\n\tc.watchCache.RLock()\n\tdefer c.watchCache.RUnlock()\n\t\n\t// 2. 
找出所有已有的event\n\tinitEvents, err := c.watchCache.GetAllEventsSinceThreadUnsafe(watchRV)\n\tif err != nil {\n\t\t// To match the uncached watch implementation, once we have passed authn/authz/admission,\n\t\t// and successfully parsed a resource version, other errors must fail with a watch event of type ERROR,\n\t\t// rather than a directly returned error.\n\t\treturn newErrWatcher(err), nil\n\t}\n\n\t// With some events already sent, update resourceVersion so that\n\t// events that were buffered and not yet processed won't be delivered\n\t// to this watcher second time causing going back in time.\n\tif len(initEvents) > 0 {\n\t\twatchRV = initEvents[len(initEvents)-1].ResourceVersion\n\t}\n  \n  \n  // 3.将watcher注册到所有的watchers中去\n\tfunc() {\n\t\tc.Lock()\n\t\tdefer c.Unlock()\n\t\t// Update watcher.forget function once we can compute it.\n\t\twatcher.forget = forgetWatcher(c, c.watcherIdx, triggerValue, triggerSupported)\n\t\tc.watchers.addWatcher(watcher, c.watcherIdx, triggerValue, triggerSupported)\n\n\t\t// Add it to the queue only when the client support watch bookmarks.\n\t\tif watcher.allowWatchBookmarks {\n\t\t\tc.bookmarkWatchers.addWatcher(watcher)\n\t\t}\n\t\tc.watcherIdx++\n\t}()\n\t\n\t// 4.进行watch逻辑的处理\n\tgo watcher.process(ctx, initEvents, watchRV)\n\treturn watcher, nil\n}\n\n```\n\n#### 2.4 cache的初始化\n\n对于一个watch请求，apiserver初始化一个watcher，然后建立一个流，不断发送event。\n\n初始化watch的逻辑出来了。但是还差一步，就是cache的逻辑处理是怎样的。接下来分析这个。\n\n##### 2.4.1 Cacher结构体如下\n\n这里有几个核心的结构体：\n\n\t// 和etcd打交道的\n\t// Underlying storage.Interface.\n\tstorage storage.Interface\n\n\n\t// 从这里看出来，不是缓存所有对象的数据，而是一种资源类型对应一个缓存\n\t// Expected type of objects in the underlying cache.\n\tobjectType reflect.Type\n\t// \"sliding window\" of recent changes of objects and the current state.\n\t// 内存数据\n\twatchCache *watchCache\n\treflector  *cache.Reflector\n\n\n\ttype indexedWatchers struct {\n\t\tallWatchers   watchersMap\n\t\tvalueWatchers 
map[string]watchersMap\n\t}\n\n完整定义在：staging/src/k8s.io/apiserver/pkg/storage/cacher/cacher.go\n\n其实从上面的分析我们就知道，cache 除了需要维护watcher集合外，更重要的是如何和etcd打交道、存储etcd的数据。\n\n##### 2.4.2 cache是如何初始化的-NewCacherFromConfig\n\n大概的调用链路为 func (e *Store) CompleteWithOptions -> GetRESTOptions  -> StorageWithCacher ->  NewCacherFromConfig \n\n在每个对象完成store 补全的时候，通过NewCacherFromConfig生成了cache。\n\n```\n// 1.从StorageWithCacher可以看出来，这里的Storage直接和etcd打交道\n// Creates a cacher based given storageConfig.\nfunc StorageWithCacher(capacity int) generic.StorageDecorator {\n\t\t// TODO: we would change this later to make storage always have cacher and hide low level KV layer inside.\n\t\t// Currently it has two layers of same storage interface -- cacher and low level kv.\n\t\tcacherConfig := cacherstorage.Config{\n\t\t\tCacheCapacity:  capacity,\n\t\t\tStorage:        s,\n\t\t\tVersioner:      etcd3.APIObjectVersioner{},  //使用etcd3\n\t\t\tResourcePrefix: resourcePrefix,\n\t\t\tKeyFunc:        keyFunc,\n\t\t\tNewFunc:        newFunc,\n\t\t\tNewListFunc:    newListFunc,\n\t\t\tGetAttrsFunc:   getAttrsFunc,\n\t\t\tIndexerFuncs:   triggerFuncs,\n\t\t\tCodec:          storageConfig.Codec,\n\t\t}\n\t\tcacher, err := cacherstorage.NewCacherFromConfig(cacherConfig)\n\n}\n```\n\n<br>\n\nNewCacherFromConfig 初始化reflector后，通过startCaching函数从etcd获取数据，具体是调用了c.reflector.ListAndWatch函数。这里又多了一层封装，细节先不展开。\n\n```\n// NewCacherFromConfig creates a new Cacher responsible for servicing WATCH and LIST requests from\n// its internal cache and updating its cache in the background based on the\n// given configuration.\nfunc NewCacherFromConfig(config Config) (*Cacher, error) {\n   // ...\n   // Ensure that timer is stopped.\n   if !cacher.timer.Stop() {\n      // Consume triggered (but not yet received) timer event\n      // so that future reuse does not get a spurious timeout.\n      <-cacher.timer.C\n   }\n\n   watchCache := newWatchCache(\n      config.CacheCapacity, config.KeyFunc, cacher.processEvent, 
config.GetAttrsFunc, config.Versioner)\n   listerWatcher := NewCacherListerWatcher(config.Storage, config.ResourcePrefix, config.NewListFunc)\n   reflectorName := \"storage/cacher.go:\" + config.ResourcePrefix\n\n   reflector := cache.NewNamedReflector(reflectorName, listerWatcher, obj, watchCache, 0)\n   // Configure reflector's pager to for an appropriate pagination chunk size for fetching data from\n   // storage. The pager falls back to full list if paginated list calls fail due to an \"Expired\" error.\n   reflector.WatchListPageSize = storageWatchListPageSize\n\n   cacher.watchCache = watchCache\n   cacher.reflector = reflector\n\n   go cacher.dispatchEvents()\n\n   cacher.stopWg.Add(1)\n   go func() {\n      defer cacher.stopWg.Done()\n      defer cacher.terminateAllWatchers()\n      wait.Until(\n         func() {\n            if !cacher.isStopped() {\n               cacher.startCaching(stopCh)\n            }\n         }, time.Second, stopCh,\n      )\n   }()\n\n   return cacher, nil\n}\n```\n\n##### 2.4.3 Cache的watchCache结构体\n\n这个是保存在内存中的数据，可以看出来核心就是一个环形的切片。watchCache实现了Add、Update、Delete函数（对应etcd的数据变化），用来更新缓存数据，具体不再展开了。\n\n```\n// watchCache implements a Store interface.\n// However, it depends on the elements implementing runtime.Object interface.\n//\n// watchCache is a \"sliding window\" (with a limited capacity) of objects\n// observed from a watch.\ntype watchCache struct {\n   sync.RWMutex\n\n   // Condition on which lists are waiting for the fresh enough\n   // resource version.\n   cond *sync.Cond\n\n   // Maximum size of history window.\n   capacity int\n\n   // keyFunc is used to get a key in the underlying storage for a given object.\n   keyFunc func(runtime.Object) (string, error)\n\n   // getAttrsFunc is used to get labels and fields of an object.\n   getAttrsFunc func(runtime.Object) (labels.Set, fields.Set, error)\n\n   // cache is used a cyclic buffer - its first element (with the smallest\n   // resourceVersion) is defined by startIndex, its 
last element is defined\n   // by endIndex (if cache is full it will be startIndex + capacity).\n   // Both startIndex and endIndex can be greater than buffer capacity -\n   // you should always apply modulo capacity to get an index in cache array.\n   cache      []*watchCacheEvent\n   startIndex int\n   endIndex   int\n\n   // store will effectively support LIST operation from the \"end of cache\n   // history\" i.e. from the moment just after the newest cached watched event.\n   // It is necessary to effectively allow clients to start watching at now.\n   // NOTE: We assume that <store> is thread-safe.\n   store cache.Store\n\n   // ResourceVersion up to which the watchCache is propagated.\n   resourceVersion uint64\n\n   // ResourceVersion of the last list result (populated via Replace() method).\n   listResourceVersion uint64\n\n   // This handler is run at the end of every successful Replace() method.\n   onReplace func()\n\n   // This handler is run at the end of every Add/Update/Delete method\n   // and additionally gets the previous value of the object.\n   eventHandler func(*watchCacheEvent)\n\n   // for testing timeouts.\n   clock clock.Clock\n\n   // An underlying storage.Versioner.\n   versioner storage.Versioner\n}\n\n\n\nfunc newWatchCache(\n\tcapacity int,\n\tkeyFunc func(runtime.Object) (string, error),\n\teventHandler func(*watchCacheEvent),\n\tgetAttrsFunc func(runtime.Object) (labels.Set, fields.Set, error),\n\tversioner storage.Versioner) *watchCache {\n\twc := &watchCache{\n\t\tcapacity:            capacity,\n\t\tkeyFunc:             keyFunc,\n\t\tgetAttrsFunc:        getAttrsFunc,\n\t\tcache:               make([]*watchCacheEvent, capacity),\n\t\tstartIndex:          0,\n\t\tendIndex:            0,\n\t\tstore:               cache.NewStore(storeElementKey),\n\t\tresourceVersion:     0,\n\t\tlistResourceVersion: 0,\n\t\teventHandler:        eventHandler,\n\t\tclock:               clock.RealClock{},\n\t\tversioner:           
versioner,\n\t}\n\twc.cond = sync.NewCond(wc.RLocker())\n\treturn wc\n}\n```\n\n### 3 参考\n\nhttps://developer.aliyun.com/article/680204"
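把上面 watch 的主干流程（1 生成新 watcher、2 回放 initEvents、3 注册到 watcher 集合、4 后台分发事件）串起来，可以写成下面这段可运行的 Go 示意代码。注意：这只是一个极简 sketch，其中 Event、Cacher、Broadcast 等类型和函数名都是本文虚构的，并非 k8s 源码；真实实现中分发是非阻塞的，还带有 filter、bookmark、超时等处理。

```go
package main

import (
	"fmt"
	"sync"
)

// Event 模拟 watchCacheEvent（简化：只保留类型、key 和 resourceVersion）
type Event struct {
	Type string // ADDED / MODIFIED / DELETED
	Key  string
	RV   int
}

// Cacher 模拟 apiserver 内部的大 cache：保存历史事件，
// 并把新事件分发给所有注册的 watcher
type Cacher struct {
	mu       sync.Mutex
	history  []Event // 对应 watchCache 的滑动窗口
	watchers map[int]chan Event
	nextID   int
}

func NewCacher() *Cacher {
	return &Cacher{watchers: map[int]chan Event{}}
}

// Watch 对应 Cacher.Watch 的几步：建 watcher、回放 initEvents、注册
func (c *Cacher) Watch(sinceRV int) <-chan Event {
	c.mu.Lock()
	defer c.mu.Unlock()
	ch := make(chan Event, 16) // 1. 生成新 watcher（带缓冲 channel）
	// 2. 找出已有的、resourceVersion 晚于 sinceRV 的事件，先行回放
	for _, e := range c.history {
		if e.RV > sinceRV {
			ch <- e
		}
	}
	// 3. 注册到 watcher 集合，后续新事件也会发给它
	c.watchers[c.nextID] = ch
	c.nextID++
	return ch
}

// Broadcast 模拟 etcd 的 add/update/delete 事件进入 cache 后的分发
func (c *Cacher) Broadcast(e Event) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.history = append(c.history, e)
	for _, ch := range c.watchers {
		ch <- e // 真实实现里这里是非阻塞发送 + 超时处理
	}
}

func main() {
	c := NewCacher()
	c.Broadcast(Event{Type: "ADDED", Key: "default/pod-1", RV: 1})
	w := c.Watch(0) // 新 watcher 先收到回放的历史事件
	c.Broadcast(Event{Type: "MODIFIED", Key: "default/pod-1", RV: 2})
	fmt.Println((<-w).Type, (<-w).Type) // 输出: ADDED MODIFIED
}
```

可以看到，新 watcher 先收到回放的历史事件，再收到注册之后的增量事件，这正是 initEvents + 注册这一先后顺序要保证的语义。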
  },
  {
    "path": "k8s/kube-apiserver/3-k8s之资源介绍.md",
    "content": "* [1\\. 资源的表示](#1-资源的表示)\n  * [1\\.1 Group](#11-group)\n  * [1\\.2 version](#12-version)\n  * [1\\.3 Resource](#13-resource)\n  * [1\\.4 gvk, gvr是什么](#14-gvk-gvr是什么)\n  * [1\\.4 resource和kind的区别](#14-resource和kind的区别)\n* [2\\. 资源外部版本与内部版本](#2-资源外部版本与内部版本)\n* [3\\. 资源代码定义](#3-资源代码定义)\n* [4\\. 资源注册](#4-资源注册)\n* [5\\. k8s内置资源全展](#5-k8s内置资源全展)\n* [6\\. 资源转换](#6-资源转换)\n* [7\\. 总结](#7-总结)\n\n**重点：**\n\n（1）k8s中资源是如何表示的\n\n（2）内部资源和外部资源的作用以及相互转换\n\n<br>\n\n### 1. 资源的表示\n\n资源是Kubernetes的核心概念，Kubernetes将资源再次分组和版本化，形成Group（资源组）、Version（资源版本）、Resource（资源）。Group、Version、Resource核心数据结构如图3-1所示。\n\n* Group：被称为资源组，在Kubernetes API Server中也可称其为APIGroup。\n\n* Version：被称为资源版本，在Kubernetes API Server中也可称其为APIVersions。\n\n* Resource：被称为资源，在Kubernetes API Server中也可称其为APIResource。\n\n* Kind：资源种类，描述Resource的种类，与Resource为同一级别。\n\n![](../images/source-kind.png)\n\n\n\nKubernetes系统支持多个Group，每个Group支持多个Version，每个Version支持多个Resource，其中部分资源同时会拥有自己的子资源（即SubResource）。例如，Deployment资源拥有Status子资源。\n\n<br>\n\n资源对象由“资源组+资源版本+资源种类”组成，并在实例化后表达一个资源对象，例如Deployment资源实例化后\n\n拥有资源组、资源版本及资源种类，其表现形式为<group>/<version>，Kind=<kind>。\n\n例如 apps/v1，Kind=Deployment。 Apps 表示group, v1表示version, deployment表示kind。\n\n<br>\n\n#### 1.1 Group\n\n```\ntype APIGroup struct {\n\tTypeMeta `json:\",inline\"`\n\t\n\tName string `json:\"name\" protobuf:\"bytes,1,opt,name=name\"`\n\t// 当前这个组支持的所有version\n\tVersions []GroupVersionForDiscovery `json:\"versions\" protobuf:\"bytes,2,rep,name=versions\"`\n  \n  // GroupVersionForDiscovery是一个结构体，只包含GroupVersion和version两个字段\n\tPreferredVersion GroupVersionForDiscovery `json:\"preferredVersion,omitempty\" protobuf:\"bytes,3,opt,name=preferredVersion\"`\n\n  // CIDR相关\n\t// +optional\n\tServerAddressByClientCIDRs []ServerAddressByClientCIDR `json:\"serverAddressByClientCIDRs,omitempty\" protobuf:\"bytes,4,rep,name=serverAddressByClientCIDRs\"`\n}\n```\n\nGroup（资源组），在Kubernetes API Server中也可称其为APIGroup。Kubernetes系统中定义了许多资源组，\n\n这些资源组按照不同功能将资源进行了划分，资源组特点如下。\n\n* 
将众多资源按照功能划分成不同的资源组，并允许单独启用/禁用资源组。当然也可以单独启用/禁用资源组中的资源。\n* 支持不同资源组中拥有不同的资源版本。这方便组内的资源根据版本进行迭代升级。\n* 支持同名的资源种类（即Kind）存在于不同的资源组内。\n* 资源组与资源版本通过Kubernetes API Server对外暴露，允许开发者通过HTTP协议进行交互并通过动态客户端（即DynamicClient）进行资源发现。\n* 支持CRD自定义资源扩展。\n\nk8s中存在没有组的资源，例如pod，请求路径直接就是 /api/v1/pods。\n\n其他的资源，例如deploy，请求路径是 /apis/apps/v1/deployments，apps 就是一个组。\n\n<br>\n\n#### 1.2 version\n\nKubernetes的资源版本控制可分为3种，分别是Alpha、Beta、Stable，它们之间的迭代顺序为Alpha→Beta→Stable，其通常用来表示软件测试过程中的3个阶段。Alpha是第1个阶段，一般用于内部测试；Beta是第2个阶段，该版本已经修复了大部分不完善之处，但仍有可能存在缺陷和漏洞，一般由特定的用户群来进行测试；Stable是第3个阶段，此时基本形成了产品并达到了一定的成熟度，可稳定运行。Kubernetes资源版本控制详情如下。\n\n<br>\n\nAlpha版本名称一般为v1alpha1、v1alpha2、v2alpha1等。\n\nBeta版本命名一般为v1beta1、v1beta2、v2beta1。\n\nStable版本命名一般为v1、v2、v3。\n\n<br>\n\n#### 1.3 Resource\n\nResource是一个整体的描述，能直接看出来的点就是 资源名字，缩写，可以支持的操作，组名（Group）等等。\n\nGroup负责分组。\n\nversion负责标记版本。\n\nkind是对象本身。\n\nVerbs: 这个资源支持的操作，get, list, watch, create, update, patch, delete等等。\n\n```\n// APIResource specifies the name of a resource and whether it is namespaced.\ntype APIResource struct {\n\t// name is the plural name of the resource.\n\tName string `json:\"name\" protobuf:\"bytes,1,opt,name=name\"`\n\t// singularName is the singular name of the resource.  This allows clients to handle plural and singular opaquely.\n\t// The singularName is more correct for reporting status on a single item and both singular and plural are allowed\n\t// from the kubectl CLI interface.\n\tSingularName string `json:\"singularName\" protobuf:\"bytes,6,opt,name=singularName\"`\n\t// namespaced indicates if a resource is namespaced or not.\n\tNamespaced bool `json:\"namespaced\" protobuf:\"varint,2,opt,name=namespaced\"`\n\t// group is the preferred group of the resource.  Empty implies the group of the containing resource list.\n\t// For subresources, this may have a different value, for example: Scale\".\n\tGroup string `json:\"group,omitempty\" protobuf:\"bytes,8,opt,name=group\"`\n\t// version is the preferred version of the resource.  
Empty implies the version of the containing resource list\n\t// For subresources, this may have a different value, for example: v1 (while inside a v1beta1 version of the core resource's group)\".\n\tVersion string `json:\"version,omitempty\" protobuf:\"bytes,9,opt,name=version\"`\n\t// kind is the kind for the resource (e.g. 'Foo' is the kind for a resource 'foo')\n\t\n\tKind string `json:\"kind\" protobuf:\"bytes,3,opt,name=kind\"`\n\t\n\t// verbs is a list of supported kube verbs (this includes get, list, watch, create,\n\t// update, patch, delete, deletecollection, and proxy)\n\tVerbs Verbs `json:\"verbs\" protobuf:\"bytes,4,opt,name=verbs\"`\n\t// shortNames is a list of suggested short names of the resource.\n\t\n\tShortNames []string `json:\"shortNames,omitempty\" protobuf:\"bytes,5,rep,name=shortNames\"`\n\t// categories is a list of the grouped resources this resource belongs to (e.g. 'all')\n\tCategories []string `json:\"categories,omitempty\" protobuf:\"bytes,7,rep,name=categories\"`\n}\n```\n\n**Resource**  更多是方便HTTP 协议和 JSON 格式传输的资源展现形式，可以以单个资源对象展现，例如 `.../namespaces/default`，也可以以列表的形式展现，例如 `.../jobs`。要正确的请求资源对象，API-Server 必须知道 `apiVersion` 与请求的资源，这样 API-Server 才能正确地解码请求信息，这些信息正是处于请求的资源路径中。一般来说，把 API Group、API Version 以及 Resource 组合成为 GVR 可以区分特定的资源请求路径，例如 `/apis/batch/v1/jobs` 就是请求所有的 jobs 信息。\n\n`Resource` 是 `Kind` 在 API 中的标识，通常情况下 `Kind` 和 `Resource` 是一一对应的, 但是有时候相同的 `Kind` 可能对应多个 `Resources`, 比如 Scale Kind 可能对应很多 Resources：deployments/scale 或者 replicasets/scale, 但是在 CRD 中，每个 `Kind` 只会对应一种 `Resource`。\n\n`Resource` 始终是小写形式，并且通常情况下是 `Kind` 的小写形式。\n\n当我们使用 kubectl 操作 API 时，操作的就是 `Resource`，比如 `kubectl get pods`, 这里的 `pods` 就是指 `Resource`。\n\n而我们在编写 YAML 文件时，会编写类似 `Kind: Pod` 这样的内容，这里 `Pod` 就是 `Kind`\n\n#### 1.4 gvk, gvr是什么\n\n```\ntype GroupVersionKind struct {\n\tGroup   string\n\tVersion string\n\tKind    string\n}\n\ntype GroupVersionResource struct {\n\tGroup    string\n\tVersion  string\n\tResource string\n}\n```\n\nGvk: group, 
version,kind。用于定位一种资源的种类。用面向对象的思想就是：gvk是一个类，gvk并没有实例化。\n\n比如一个yaml文件中的对象就是一个GVK。apiVersion定义了group和version；kind定义了kind。\n\n```\nroot@k8s-master:~# cat pod.yaml\napiVersion: v1\nkind: Pod\n```\n\nGVR: 是K8S中的一个资源对象，用于定位到一个资源, gvr常用于组合成 RESTful API 请求路径。\n\n例如，针对应用程序 v1 deployment部署的 RESTful API 请求如下所示：\n\n```fallback\nGET /apis/apps/v1/namespaces/{namespace}/deployments/{name}\n```\n\n#### 1.4 resource和kind的区别\n\nresource有单复数的区别，resource的单数形式其实就是kind的小写。\n\n<br>\n\n以Pod为例，GVK是{Group:\"\", Version: \"v1\", Kind: \"Pod\"}。\n\n那么singular是{Group:\"\", Version: \"v1\", Resource: \"pod\"},  plural则是{Group：\"\", Version：\"v1\", Resource:\"pods\"}。\n\n<br>\n\n### 2. 资源外部版本与内部版本\n\nKubernetes资源代码定义在pkg/apis目录下，在详解资源代码定义之前，先来了解一下资源的外部版本（External Version）与内部版本（Internal Version）。在Kubernetes系统中，同一资源对应着两个版本，分别是外部版本和内部版本。例如，Deployment资源，它所属的外部版本表现形式为apps/v1，内部版本表现形式为apps/__internal。\n\n* External Object：外部版本资源对象，也称为Versioned Object（即拥有资源版本的资源对象）。外部版本用于对外暴露给用户请求的接口所使用的资源对象，例如，用户在通过YAML或JSON格式的描述文件创建资源对象时，所使用的是外部版本的资源对象。外部版本的资源对象通过资源版本（Alpha、Beta、Stable）进行标识。\n\n* Internal Object：内部版本资源对象。内部版本不对外暴露，仅在Kubernetes API Server内部使用。内部版本用于多资源版本的转换，例如将v1beta1版本转换为v1版本，其过程为v1beta1→internal→v1，即先将v1beta1转换为内部版本（internal），再由内部版本（internal）转换为v1版本。内部版本资源对象通过runtime.APIVersionInternal（即__internal）进行标识。\n\n<br>\n\n**提示：**在Kubernetes源码中，外部版本的资源类型定义在vendor/k8s.io/api目录下，其完整描述路径为vendor/k8s.io/api/<group>/<version>/<resource file>。例如，Pod资源的外部版本，定义在vendor/k8s.io/api/core/v1/目录下。\n\n这里的各个字段，都提供了 json或者protobuf序列化之后的名称。这是外部的版本。\n\n```\nk8s.io/api/core/v1/types.go\n\n// Pod is a collection of containers that can run on a host. 
This resource is created\n// by clients and scheduled onto hosts.\ntype Pod struct {\n\tmetav1.TypeMeta `json:\",inline\"`\n\t// Standard object's metadata.\n\t// More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata\n\t// +optional\n\tmetav1.ObjectMeta `json:\"metadata,omitempty\" protobuf:\"bytes,1,opt,name=metadata\"`\n\n\t// Specification of the desired behavior of the pod.\n\t// More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#spec-and-status\n\t// +optional\n\tSpec PodSpec `json:\"spec,omitempty\" protobuf:\"bytes,2,opt,name=spec\"`\n\n\t// Most recently observed status of the pod.\n\t// This data may not be up to date.\n\t// Populated by the system.\n\t// Read-only.\n\t// More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#spec-and-status\n\t// +optional\n\tStatus PodStatus `json:\"status,omitempty\" protobuf:\"bytes,3,opt,name=status\"`\n}\n```\n\n<br>\n\npkg/apis/core/types.go\n\n这里也定义了pod，这是pod的内部版本定义。这里就没有任何json,protobuf的名称定义了。\n\n```\n// Pod is a collection of containers, used as either input (create, update) or as output (list, get).\ntype Pod struct {\n\tmetav1.TypeMeta\n\t// +optional\n\tmetav1.ObjectMeta\n\n\t// Spec defines the behavior of a pod.\n\t// +optional\n\tSpec PodSpec\n\n\t// Status represents the current information about a pod. This data may not be up\n\t// to date.\n\t// +optional\n\tStatus PodStatus\n}\n```\n\n<br>\n\n### 3. 
资源代码定义\n\nKubernetes资源代码定义在pkg/apis目录下，同一资源对应着内部版本和外部版本，内部版本和外部版本的资源代码结构并不相同。资源的内部版本定义了所支持的资源类型（types.go）、资源验证方法（validation.go）、资源注册至资源注册表的方法（install/install.go）等。而资源的外部版本定义了资源的转换方法（conversion.go）、资源的默认值（defaults.go）等。\n\n（1）以Deployment资源为例，它的内部版本定义在pkg/apis/apps/目录下，其资源代码结构如下：\n\n![image-20210223104358356](../images/deploy.png)\n\n● doc.go：GoDoc文件，定义了当前包的注释信息。在Kubernetes资源包中，它还担当了代码生成器的全局Tags描述文件。\n\n● register.go：定义了资源组、资源版本及资源的注册信息。\n\n● types.go：定义了在当前资源组、资源版本下所支持的资源类型。\n\n● v1、v1beta1、v1beta2：定义了资源组下拥有的资源版本的资源（即外部版本）。\n\n● install：把当前资源组下的所有资源注册到资源注册表中。\n\n● validation：定义了资源的验证方法。\n\n● zz_generated.deepcopy.go：定义了资源的深复制操作，该文件由代码生成器自动生成。\n\n每一个Kubernetes资源目录，都通过register.go代码文件定义所属的资源组和资源版本，内部版本资源对象通过runtime.APIVersionInternal（即__internal）标识，代码示例如下：\n\n```\nvar SchemeGroupVersion = schema.GroupVersion{Group: GroupName, Version: runtime.APIVersionInternal}\n```\n\n每一个Kubernetes资源目录，都通过types.go代码文件定义当前资源组/资源版本下所支持的资源类型，代码示例如下：\n\n代码路径：pkg/apis/apps/types.go\n\n```\ntype Deployment struct {}\ntype StatefulSet struct {}\ntype DaemonSet struct {}\n```\n\n（2）以Deployment资源为例，它的外部版本定义在pkg/apis/apps/{v1，v1beta1，v1beta2}目录下，其资源代码结构如下：\n\n其中doc.go和register.go的功能与内部版本资源代码结构中的相似，故不再赘述。外部版本的资源代码结构说明如下。\n\n● conversion.go：定义了资源的转换函数（默认转换函数），并将默认转换函数注册到资源注册表中。\n\n● zz_generated.conversion.go：定义了资源的转换函数（自动生成的转换函数），并将生成的转换函数注册到资源注册表中。该文件由代码生成器自动生成。\n\n● defaults.go：定义了资源的默认值函数，并将默认值函数注册到资源注册表中。\n\n● zz_generated.defaults.go：定义了资源的默认值函数（自动生成的默认值函数），并将生成的默认值函数注册到资源注册表中。该文件由代码生成器自动生成。\n\n<br>\n\n外部版本与内部版本资源类型相同，都通过register.go代码文件定义所属的资源组和资源版本，外部版本资源对象通过资源版本（Alpha、Beta、Stable）标识，代码示例如下（代码路径：pkg/apis/apps/v1/register.go）：\n\n```\n// GroupName is the group name use in this package\nconst GroupName = \"apps\"\n\n// SchemeGroupVersion is group version used to register these objects\nvar SchemeGroupVersion = schema.GroupVersion{Group: GroupName, Version: \"v1\"}\n```\n\n<br>\n\n### 4. 
资源注册\n\n在每一个Kubernetes资源组目录中，都拥有一个install/install.go代码文件，它负责将资源信息注册到资源注册表（Scheme）中。以apps资源组为例，代码示例如下：\n\n```\nfunc init() {\n\tInstall(legacyscheme.Scheme)\n}\n\n// Install registers the API group and adds types to a scheme\nfunc Install(scheme *runtime.Scheme) {\n\tutilruntime.Must(apps.AddToScheme(scheme))\n\tutilruntime.Must(v1beta1.AddToScheme(scheme))\n\tutilruntime.Must(v1beta2.AddToScheme(scheme))\n\tutilruntime.Must(v1.AddToScheme(scheme))\n\tutilruntime.Must(scheme.SetVersionPriority(v1.SchemeGroupVersion, v1beta2.SchemeGroupVersion, v1beta1.SchemeGroupVersion))\n}\n```\n\n这里注册了kind：\n\n```\n// Adds the list of known types to the given scheme.\nfunc addKnownTypes(scheme *runtime.Scheme) error {\n\t// TODO this will get cleaned up with the scheme types are fixed\n\tscheme.AddKnownTypes(SchemeGroupVersion,\n\t\t&DaemonSet{},\n\t\t&DaemonSetList{},\n\t\t&Deployment{},\n\t\t&DeploymentList{},\n\t\t&DeploymentRollback{},\n\t\t&autoscaling.Scale{},\n\t\t&StatefulSet{},\n\t\t&StatefulSetList{},\n\t\t&ControllerRevision{},\n\t\t&ControllerRevisionList{},\n\t\t&ReplicaSet{},\n\t\t&ReplicaSetList{},\n\t)\n\treturn nil\n}\n```\n\nlegacyscheme.Scheme是kube-apiserver组件的全局资源注册表，Kubernetes的所有资源信息都交给资源注册表统一管理。\n\napps.AddToScheme函数注册apps资源组内部版本的资源；v1、v1beta1等的AddToScheme函数注册对应外部版本的资源。\n\nscheme.SetVersionPriority函数注册资源组的版本顺序，如有多个资源版本，排在最前面的为资源首选版本。\n\n**这里先留个疑问，scheme到底是什么。然后再看为什么要注册到这里。**\n\n<br>\n\n### 5. 
k8s内置资源全展\n\nKubernetes系统内置了众多“资源组、资源版本、资源”，这才有了现在功能强大的资源管理系统。可通过如下方式获得当前Kubernetes系统所支持的内置资源。\n\n● kubectl api-versions：列出当前Kubernetes系统支持的资源组和资源版本，其表现形式为<group>/<version>。\n\n● kubectl api-resources：列出当前Kubernetes系统支持的Resource资源列表。\n\n```\n[root@k8s-master ~]# kubectl api-versions\nadmissionregistration.k8s.io/v1\nadmissionregistration.k8s.io/v1beta1\napiextensions.k8s.io/v1\napiextensions.k8s.io/v1beta1\napiregistration.k8s.io/v1\napiregistration.k8s.io/v1beta1\napps/v1\napps/v1beta1\napps/v1beta2\nargoproj.io/v1alpha1\nauditregistration.k8s.io/v1alpha1\nauthentication.istio.io/v1alpha1\nauthentication.k8s.io/v1\nauthentication.k8s.io/v1beta1\nauthorization.k8s.io/v1\nauthorization.k8s.io/v1beta1\nautoscaling/v1\nautoscaling/v2beta1\nautoscaling/v2beta2\nbatch/v1\nbatch/v1beta1\nbatch/v2alpha1\ncertificates.k8s.io/v1beta1\ncertmanager.k8s.io/v1alpha1\ncomcast.github.io/v1\nconfig.istio.io/v1alpha2\nconfiguration.konghq.com/v1\nconfiguration.konghq.com/v1beta1\ncoordination.k8s.io/v1\ncoordination.k8s.io/v1beta1\ncrdlbcontroller.k8s.io/v1alpha1\ncustom.metrics.k8s.io/v1beta1\ndiscovery.k8s.io/v1beta1\nevents.k8s.io/v1beta1\nextensions/v1beta1\nloadbalancer.k8s.io/v1alpha1\nnetworking.istio.io/v1alpha3\nnetworking.k8s.io/v1\nnetworking.k8s.io/v1beta1\nnetworking.symphony.netease.com/v1alpha1\nnode.k8s.io/v1alpha1\nnode.k8s.io/v1beta1\npolicy/v1beta1\nrbac.authorization.k8s.io/v1\nrbac.authorization.k8s.io/v1alpha1\nrbac.authorization.k8s.io/v1beta1\nrbac.istio.io/v1alpha1\nschedular.istio.io/v1\nscheduling.k8s.io/v1\nscheduling.k8s.io/v1alpha1\nscheduling.k8s.io/v1beta1\nsecurity.istio.io/v1beta1\nsecurity.symphony.netease.com/v1\nsettings.k8s.io/v1alpha1\nstorage.k8s.io/v1\nstorage.k8s.io/v1alpha1\nstorage.k8s.io/v1beta1\nv1\n```\n\n<br>\n\n```\n[root@k8s-master ~]# kubectl api-resources\nNAME                              SHORTNAMES           APIGROUP                          NAMESPACED   
KIND\nbindings                                                                                 true         Binding\ncomponentstatuses                 cs                                                     false        ComponentStatus\nconfigmaps                        cm                                                     true         ConfigMap\nendpoints                         ep                                                     true         Endpoints\nevents                            ev                                                     true         Event\nlimitranges                       limits                                                 true         LimitRange\nnamespaces                        ns                                                     false        Namespace\nnodes                             no                                                     false        Node\npersistentvolumeclaims            pvc                                                    true         PersistentVolumeClaim\npersistentvolumes                 pv                                                     false        PersistentVolume\npods                              po                                                     true         Pod\npodtemplates                                                                             true         PodTemplate\nreplicationcontrollers            rc                                                     true         ReplicationController\nresourcequotas                    quota                                                  true         ResourceQuota\nsecrets                                                                                  true         Secret\nserviceaccounts                   sa                                                     true         ServiceAccount\nservices                          svc                                                    true         Service\nmutatingwebhookconfigurations                 
         admissionregistration.k8s.io      false        MutatingWebhookConfiguration\nvalidatingwebhookconfigurations                        admissionregistration.k8s.io      false        ValidatingWebhookConfiguration\ncustomresourcedefinitions         crd,crds             apiextensions.k8s.io              false        CustomResourceDefinition\napiservices                                            apiregistration.k8s.io            false        APIService\ncontrollerrevisions                                    apps                              true         ControllerRevision\ndaemonsets                        ds                   apps                              true         DaemonSet\ndeployments                       deploy               apps                              true         Deployment\nreplicasets                       rs                   apps                              true         ReplicaSet\nstatefulsets                      sts                  apps                              true         StatefulSet\nclusterworkflowtemplates          clusterwftmpl,cwft   argoproj.io                       false        ClusterWorkflowTemplate\ncronworkflows                     cwf,cronwf           argoproj.io                       true         CronWorkflow\nworkfloweventbindings             wfeb                 argoproj.io                       true         WorkflowEventBinding\nworkflows                         wf                   argoproj.io                       true         Workflow\nworkflowtasksets                  wfts                 argoproj.io                       true         WorkflowTaskSet\nworkflowtemplates                 wftmpl               argoproj.io                       true         WorkflowTemplate\nauditsinks                                             auditregistration.k8s.io          false        AuditSink\nmeshpolicies                                           authentication.istio.io           false        MeshPolicy\npolicies       
                                        authentication.istio.io           true         Policy\ntokenreviews                                           authentication.k8s.io             false        TokenReview\n```\n\n这里没有展示版本，如果想看某个对象支持哪些版本可以使用 kubectl explain \n\n```\nroot@k8s-master:~# kubectl explain pods\nKIND:     Pod\nVERSION:  v1\n\nDESCRIPTION:\n     Pod is a collection of containers that can run on a host. This resource is\n     created by clients and scheduled onto hosts.\n\nFIELDS:\n   apiVersion\t<string>\n     APIVersion defines the versioned schema of this representation of an\n     object. Servers should convert recognized schemas to the latest internal\n     value, and may reject unrecognized values. More info:\n     https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources\n\n   kind\t<string>\n     Kind is a string value representing the REST resource this object\n     represents. Servers may infer this from the endpoint the client submits\n     requests to. Cannot be updated. In CamelCase. More info:\n     https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds\n\n   metadata\t<Object>\n     Standard object's metadata. More info:\n     https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata\n\n   spec\t<Object>\n     Specification of the desired behavior of the pod. More info:\n     https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status\n\n   status\t<Object>\n     Most recently observed status of the pod. This data may not be up to date.\n     Populated by the system. Read-only. More info:\n     https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#spec-and-status\n```\n\n<br>\n\n### 6. 
资源转换\n\nk8s中每种资源都在对应版本包的conversion.go（例如kubernetes/pkg/apis/apps/v1/conversion.go）中定义了转换函数。\n\n内部版本（__internal）作为中间桥梁。如下，v1alpha1要想转到v1必须先转换成内部版本（__internal）。这样的好处在于，以后新增的版本，只要能和__internal互相转换即可。\n\n![image-20210223104621788](../images/resource-convert.png)\n\n```\n// 转换函数定义在 kubernetes/pkg/apis/apps/v1/conversion.go\nfunc addConversionFuncs(scheme *runtime.Scheme) error {\n\t// Add non-generated conversion functions to handle the *int32 -> int32\n\t// conversion. A pointer is useful in the versioned type so we can default\n\t// it, but a plain int32 is more convenient in the internal type. These\n\t// functions are the same as the autogenerated ones in every other way.\n\terr := scheme.AddConversionFuncs(\n\t\tConvert_v1_StatefulSetSpec_To_apps_StatefulSetSpec,\n\t\tConvert_apps_StatefulSetSpec_To_v1_StatefulSetSpec,\n\t\tConvert_v1_StatefulSetUpdateStrategy_To_apps_StatefulSetUpdateStrategy,\n\t\tConvert_apps_StatefulSetUpdateStrategy_To_v1_StatefulSetUpdateStrategy,\n\t\tConvert_extensions_RollingUpdateDaemonSet_To_v1_RollingUpdateDaemonSet,\n\t\tConvert_v1_RollingUpdateDaemonSet_To_extensions_RollingUpdateDaemonSet,\n\t\tConvert_v1_StatefulSetStatus_To_apps_StatefulSetStatus,\n\t\tConvert_apps_StatefulSetStatus_To_v1_StatefulSetStatus,\n\t\tConvert_v1_Deployment_To_extensions_Deployment,\n\t\tConvert_extensions_Deployment_To_v1_Deployment,\n\t\tConvert_extensions_DaemonSet_To_v1_DaemonSet,\n\t\tConvert_v1_DaemonSet_To_extensions_DaemonSet,\n\t\tConvert_extensions_DaemonSetSpec_To_v1_DaemonSetSpec,\n\t\tConvert_v1_DaemonSetSpec_To_extensions_DaemonSetSpec,\n\t\tConvert_extensions_DaemonSetUpdateStrategy_To_v1_DaemonSetUpdateStrategy,\n\t\tConvert_v1_DaemonSetUpdateStrategy_To_extensions_DaemonSetUpdateStrategy,\n\t\t// extensions\n\t\t// TODO: below conversions should be dropped in favor of auto-generated\n\t\t// ones, see https://github.com/kubernetes/kubernetes/issues/39865\n\t\tConvert_v1_DeploymentSpec_To_extensions_DeploymentSpec,\n\t\tConvert_extensions_DeploymentSpec_To_v1_DeploymentSpec,\n\t\tConvert_v1_DeploymentStrategy_To_extensions_DeploymentStrategy,\n\t\tConvert_extensions_DeploymentStrategy_To_v1_DeploymentStrategy,\n\t\tConvert_v1_RollingUpdateDeployment_To_extensions_RollingUpdateDeployment,\n\t\tConvert_extensions_RollingUpdateDeployment_To_v1_RollingUpdateDeployment,\n\t\tConvert_extensions_ReplicaSetSpec_To_v1_ReplicaSetSpec,\n\t\tConvert_v1_ReplicaSetSpec_To_extensions_ReplicaSetSpec,\n\t)\n\tif err != nil {\n\t\treturn err\n\t}\n\treturn nil\n}\n```\n\n### 7. 总结\n\n（1）k8s中的资源有内部版本和外部版本之分\n\n（2）通过划分group, version，让一个对象可以拥有多个版本，利于对象的演进，例如alpha->v1\n\n（3）gvk(group, version, kind),  gvr(group, version, resource) 可以方便集群内部和http传输的时候定位到一个对象或者对象列表\n\n<br>\n\n**参考文档**： Kubernetes源码剖析，郑东旭\n\nhttps://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md\n\n"
  },
  {
    "path": "k8s/kube-apiserver/4-scheme介绍.md",
"content": "[toc]\n\n<br>\n\n### 1. scheme简介-内存型的资源注册表\n\nKubernetes系统拥有众多资源，每一种资源就是一个资源类型，这些资源类型需要有统一的注册、存储、查询、管理等机制。目前Kubernetes系统中的所有资源类型都已注册到Scheme资源注册表中，其是一个内存型的资源注册表，拥有如下特点。\n\n● 支持注册多种资源类型，包括内部版本和外部版本。\n\n● 支持多种版本转换机制。\n\n● 支持不同资源的序列化/反序列化机制。\n\nScheme资源注册表支持两种资源类型（Type）的注册，分别是UnversionedType和KnownType资源类型，分别介绍如下。\n\n● UnversionedType：无版本资源类型，这是一个早期Kubernetes系统中的概念，它主要应用于某些没有版本的资源类型，该类型的资源对象并不需要进行转换。在目前的Kubernetes发行版本中，无版本类型已被弱化，几乎所有的资源对象都拥有版本，但在metav1元数据中还有部分类型，它们既属于meta.k8s.io/v1又属于UnversionedType无版本资源类型，例如metav1.Status、metav1.APIVersions、metav1.APIGroupList、metav1.APIGroup、metav1.APIResourceList。\n\n● KnownType：是目前Kubernetes最常用的资源类型，也可称其为“拥有版本的资源类型”。\n\n在Scheme资源注册表中，UnversionedType资源类型的对象通过scheme.AddUnversionedTypes方法进行注册，KnownType资源类型的对象通过scheme.AddKnownTypes方法进行注册。\n\n<br>\n\n### 2. scheme数据结构\n\n```\n// Scheme defines methods for serializing and deserializing API objects, a type\n// registry for converting group, version, and kind information to and from Go\n// schemas, and mappings between Go schemas of different versions. A scheme is the\n// foundation for a versioned API and versioned configuration over time.\n//\n// In a Scheme, a Type is a particular Go struct, a Version is a point-in-time\n// identifier for a particular representation of that Type (typically backwards\n// compatible), a Kind is the unique name for that Type within the Version, and a\n// Group identifies a set of Versions, Kinds, and Types that evolve over time. 
An\n// Unversioned Type is one that is not yet formally bound to a type and is promised\n// to be backwards compatible (effectively a \"v1\" of a Type that does not expect\n// to break in the future).\n//\n// Schemes are not expected to change at runtime and are only threadsafe after\n// registration is complete.\ntype Scheme struct {\n\t// versionMap allows one to figure out the go type of an object with\n\t// the given version and name.\n\t// 1.存储GVK与Type的映射关系。  Int, float就是一种type\n\tgvkToType map[schema.GroupVersionKind]reflect.Type\n\n\t// typeToGroupVersion allows one to find metadata for a given go object.\n\t// The reflect.Type we index by should *not* be a pointer.\n\t// 2. 存储Type与GVK的映射关系，一个Type会对应一个或多个GVK。 因为一个kind可能有多个版本\n\ttypeToGVK map[reflect.Type][]schema.GroupVersionKind\n  \n  \n\t// unversionedTypes are transformed without conversion in ConvertToVersion.\n\tunversionedTypes map[reflect.Type]schema.GroupVersionKind\n\n\t// unversionedKinds are the names of kinds that can be created in the context of any group\n\t// or version\n\t// TODO: resolve the status of unversioned types.\n\tunversionedKinds map[string]reflect.Type\n\n\t// Map from version and resource to the corresponding func to convert\n\t// resource field labels in that version to internal version.\n\tfieldLabelConversionFuncs map[schema.GroupVersionKind]FieldLabelConversionFunc\n\n\t// defaulterFuncs is an array of interfaces to be called with an object to provide defaulting\n\t// the provided object must be a pointer.\n\tdefaulterFuncs map[reflect.Type]func(interface{})\n\n\t// converter stores all registered conversion functions. 
It also has\n\t// default converting behavior.\n\tconverter *conversion.Converter\n\n\t// versionPriority is a map of groups to ordered lists of versions for those groups indicating the\n\t// default priorities of these versions as registered in the scheme\n\tversionPriority map[string][]string\n\n\t// observedVersions keeps track of the order we've seen versions during type registration\n\tobservedVersions []schema.GroupVersion\n\n\t// schemeName is the name of this scheme.  If you don't specify a name, the stack of the NewScheme caller will be used.\n\t// This is useful for error reporting to indicate the origin of the scheme.\n\tschemeName string\n}\n```\n\nscheme主要有两大功能：资源的注册，和内外部版本的转换。\n\n#### 2.1 资源的注册\n\nScheme资源注册表结构字段说明如下。\n\n● gvkToType：存储GVK与Type的映射关系。  Int, float就是一种type\n\n● typeToGVK：存储Type与GVK的映射关系，一个Type会对应一个或多个GVK。\n\n● unversionedTypes：存储UnversionedType与GVK的映射关系。\n\n● unversionedKinds：存储Kind（资源种类）名称与UnversionedType的映射关系。\n\nScheme资源注册表通过Go语言的map结构实现映射关系，这些映射关系可以实现高效的正向和反向检索，从Scheme资源注册表中检索某个GVK的Type，它的时间复杂度为O（1）。资源注册表映射关系如下所示。\n\n![image-20210223104824373](../images/kind-convert.png)\n\n<br>\n\n##### 2.1.1 k8s资源注册原理\n\n这里先用一个例子说明这种注册模式。下一节详细说一下k8s中是如何注册的\n\n```\n|____scheme\n| |____test.go\n|____test1\n| |____test1.go\n|____test2\n| |____Test2.go\n|____test.go\n```\n\nscheme/test.go文件如下：很简单，就是定义了一个TestScheme结构体，和一个TestSchemeA变量。\n\n```\npackage scheme\n\ntype TestScheme struct{\n\tT map[int]int\n}\n\nvar (\n\tTestSchemeA = NewTestScheme()\n)\n\nfunc NewTestScheme() *TestScheme{\n\treturn &TestScheme{ T: map[int]int{}}\n}\n```\n\n<br>\n\nTest1/test1.go文件如下：\n\n```\npackage test1\n\nimport (\n\t\"Practice/scheme\"\n\t\"fmt\"\n)\n\nfunc init() {\n\tfmt.Println(\"before add test1\")\n\tfmt.Println(scheme.TestSchemeA)\n\tAdd(scheme.TestSchemeA)\n\tfmt.Println(\"after add test1\")\n\tfmt.Println(scheme.TestSchemeA)\n}\n\n\nfunc Add(a *scheme.TestScheme) {\n\ta.T[1] = 1\n}\n```\n\nTest2/test2.go文件如下：\n\n```\npackage test2\n\nimport 
(\n\t\"Practice/scheme\"\n\t\"fmt\"\n)\n\nfunc init() {\n\tfmt.Println(\"before add test2\")\n\tfmt.Println(scheme.TestSchemeA)\n\tAdd(scheme.TestSchemeA)\n\tfmt.Println(\"after add test2\")\n\tfmt.Println(scheme.TestSchemeA)\n}\n\n\nfunc Add(a *scheme.TestScheme) {\n\ta.T[2] = 2\n}\n```\n\n<br>\n\nTest.go文件如下：\n\n```\npackage main\n\nimport (\n\t\"Practice/scheme\"\n\t_ \"Practice/test1\"\n\t_ \"Practice/test2\"\n\t\"fmt\"\n)\n\n\nfunc main() {\n\tfmt.Println(scheme.TestSchemeA)\n}\n```\n\n<br>\n\n运行test.go文件。输出如下：\n\n```\nbefore add test1\n&{map[]}\nafter add test1\n&{map[1:1]}\nbefore add test2\n&{map[1:1]}\nafter add test2\n&{map[1:1 2:2]}\n&{map[1:1 2:2]}\n```\n\n所以可以看出来，通过go包的 import 和 init 机制，test1, test2 的内容被自动注册到了 TestSchemeA 这个全局的map中。\n\n##### 2.1.2 k8s资源注册过程\n\n(1) 初始化scheme资源注册表\n\n在legacyscheme包中，定义了Scheme资源注册表、Codecs编解码器以及ParameterCodec参数编解码器。它们被定义为全局变量，这些变量在kube-apiserver的任何地方都可以被调用。\n\n```\npkg/api/legacyscheme/scheme.go\nvar (\n\t// Scheme is the default instance of runtime.Scheme to which types in the Kubernetes API are already registered.\n\t// NOTE: If you are copying this file to start a new api group, STOP! Copy the\n\t// extensions group instead. 
This Scheme is special and should appear ONLY in\n\t// the api group, unless you really know what you're doing.\n\t// TODO(lavalamp): make the above error impossible.\n\tScheme = runtime.NewScheme()\n\n\t// Codecs provides access to encoding and decoding for the scheme\n\tCodecs = serializer.NewCodecFactory(Scheme)\n\n\t// ParameterCodec handles versioning of objects that are converted to query parameters.\n\tParameterCodec = runtime.NewParameterCodec(Scheme)\n)\n```\n\n（2）注册kubernetes所支持的资源\n\nKube-apiserver启动的时候导入了master包。master包中的import_known_versions.go文件调用了所有资源的Install包。通过包的导入机制触发Init函数，从而完成了注册。\n\n```\npkg/master/import_known_versions.go\n\npackage master\n\nimport (\n\t// These imports are the API groups the API server will support.\n\t_ \"k8s.io/kubernetes/pkg/apis/admission/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/admissionregistration/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/apps/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/auditregistration/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/authentication/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/authorization/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/autoscaling/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/batch/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/certificates/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/coordination/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/core/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/discovery/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/events/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/extensions/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/flowcontrol/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/imagepolicy/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/networking/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/node/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/policy/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/rbac/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/scheduling/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/settings/install\"\n\t_ 
\"k8s.io/kubernetes/pkg/apis/storage/install\"\n)\n```\n\n随便找一个install包, 可以发现这里引入了全局的legacyscheme.Scheme，然后调用了AddToScheme函数进行了注册\n\n```\npackage install\n\nimport (\n\t\"k8s.io/apimachinery/pkg/runtime\"\n\tutilruntime \"k8s.io/apimachinery/pkg/util/runtime\"\n\t\"k8s.io/kubernetes/pkg/api/legacyscheme\"\n\t\"k8s.io/kubernetes/pkg/apis/core\"\n\t\"k8s.io/kubernetes/pkg/apis/core/v1\"\n)\n\nfunc init() {\n\tInstall(legacyscheme.Scheme)\n}\n\n// Install registers the API group and adds types to a scheme\nfunc Install(scheme *runtime.Scheme) {\n\tutilruntime.Must(core.AddToScheme(scheme))\n\tutilruntime.Must(v1.AddToScheme(scheme))\n\tutilruntime.Must(scheme.SetVersionPriority(v1.SchemeGroupVersion))\n}\n\n\npkg/apis/core/register.go\nvar (\n\t// SchemeBuilder object to register various known types\n\tSchemeBuilder = runtime.NewSchemeBuilder(addKnownTypes)\n\n\t// AddToScheme represents a func that can be used to apply all the registered\n\t// funcs in a scheme\n\tAddToScheme = SchemeBuilder.AddToScheme\n)\n\nfunc addKnownTypes(scheme *runtime.Scheme) error {\n\tif err := scheme.AddIgnoredConversionType(&metav1.TypeMeta{}, &metav1.TypeMeta{}); err != nil {\n\t\treturn err\n\t}\n\tscheme.AddKnownTypes(SchemeGroupVersion,\n\t\t&Pod{},\n\t\t&PodList{},\n\t\t&PodStatusResult{},\n\t   。。。\n\t)\n\n\treturn nil\n}\n```\n\n<br>\n\n##### 2.1.3 每种资源的kind，resource是如何转换的\n\n在register.go函数中注册了资源。而K8S中的每种资源都必须有apiVersion:和Kind字段。所以只要注册一种资源就有了GVK在scheme中\n\n```\npkg/apis/apps/register.go\n// Adds the list of known types to the given scheme.\nfunc addKnownTypes(scheme *runtime.Scheme) error {\n\t// TODO this will get cleaned up with the scheme types are 
fixed\n\tscheme.AddKnownTypes(SchemeGroupVersion,\n\t\t&DaemonSet{},\n\t\t&DaemonSetList{},\n\t\t&Deployment{},\n\t\t&DeploymentList{},\n\t\t&DeploymentRollback{},\n\t\t&autoscaling.Scale{},\n\t\t&StatefulSet{},\n\t\t&StatefulSetList{},\n\t\t&ControllerRevision{},\n\t\t&ControllerRevisionList{},\n\t\t&ReplicaSet{},\n\t\t&ReplicaSetList{},\n\t)\n\treturn nil\n}\n\n// AddKnownTypes registers all types passed in 'types' as being members of version 'version'.\n// All objects passed to types should be pointers to structs. The name that go reports for\n// the struct becomes the \"kind\" field when encoding. Version may not be empty - use the\n// APIVersionInternal constant if you have a type that does not have a formal version.\nfunc (s *Scheme) AddKnownTypes(gv schema.GroupVersion, types ...Object) {\n\ts.addObservedVersion(gv)\n\tfor _, obj := range types {\n\t\tt := reflect.TypeOf(obj)\n\t\tif t.Kind() != reflect.Ptr {\n\t\t\tpanic(\"All types must be pointers to structs.\")\n\t\t}\n\t\tt = t.Elem()\n\t\ts.AddKnownTypeWithName(gv.WithKind(t.Name()), obj)\n\t}\n}\n```\n\nscheme只要有gvk就行了，因为gvk->gvr的转换很简单：知道了kind，就知道了resource。resource就是kind的小写，有单数和复数形式。\n\nrestmapper中定义了转换方法。\n\nstaging/src/k8s.io/apimachinery/pkg/api/meta/restmapper.go\n\n```\n// ResourceSingularizer implements RESTMapper\n// It converts a resource name from plural to singular (e.g., from pods to pod)\nfunc (m *DefaultRESTMapper) ResourceSingularizer(resourceType string) (string, error) {\n\tpartialResource := schema.GroupVersionResource{Resource: resourceType}\n\tresources, err := m.ResourcesFor(partialResource)\n\tif err != nil {\n\t\treturn resourceType, err\n\t}\n\n\tsingular := schema.GroupVersionResource{}\n\tfor _, curr := range resources {\n\t\tcurrSingular, ok := m.pluralToSingular[curr]\n\t\tif !ok {\n\t\t\tcontinue\n\t\t}\n\t\tif singular.Empty() {\n\t\t\tsingular = currSingular\n\t\t\tcontinue\n\t\t}\n\n\t\tif currSingular.Resource != singular.Resource {\n\t\t\treturn resourceType, 
fmt.Errorf(\"multiple possible singular resources (%v) found for %v\", resources, resourceType)\n\t\t}\n\t}\n\n\tif singular.Empty() {\n\t\treturn resourceType, fmt.Errorf(\"no singular of resource %v has been defined\", resourceType)\n\t}\n\n\treturn singular.Resource, nil\n}\n```\n\n#### 2.2 内部版本和外部版本的转换\n\nscheme结构体中，有一个converter的结构体，这个结构体包含了所有的转换函数。\n\n```\n\t// converter stores all registered conversion functions. It also has\n\t// default converting behavior.\n\tconverter *conversion.Converter\n\t\n\t// Converter knows how to convert one type to another.\ntype Converter struct {\n\t// Map from the conversion pair to a function which can\n\t// do the conversion.\n\tconversionFuncs          ConversionFuncs\n\tgeneratedConversionFuncs ConversionFuncs\n\n\t// Set of conversions that should be treated as a no-op\n\tignoredConversions map[typePair]struct{}\n\n\t// This is a map from a source field type and name, to a list of destination\n\t// field type and name.\n\tstructFieldDests map[typeNamePair][]typeNamePair\n\n\t// Allows for the opposite lookup of structFieldDests. So that SourceFromDest\n\t// copy flag also works. So this is a map of destination field name, to potential\n\t// source field name and type to look for.\n\tstructFieldSources map[typeNamePair][]typeNamePair\n\n\t// Map from an input type to a function which can apply a key name mapping\n\tinputFieldMappingFuncs map[reflect.Type]FieldMappingFunc\n\n\t// Map from an input type to a set of default conversion flags.\n\tinputDefaultFlags map[reflect.Type]FieldMatchingFlags\n\n\t// If non-nil, will be called to print helpful debugging info. Quite verbose.\n\tDebug DebugLogger\n\n\t// nameFunc is called to retrieve the name of a type; this name is used for the\n\t// purpose of deciding whether two types match or not (i.e., will we attempt to\n\t// do a conversion). 
The default returns the go type name.\n\tnameFunc func(t reflect.Type) string\n}\n```\n\nconverter转换函数需要提前注册，每个资源都需要注册自己的内部<->外部转换函数。目前scheme支持5种注册转换函数的方法，分别如下：\n\n（1）scheme.AddIgnoredConversionType: 注册忽略的资源类型，即不对该类资源对象执行转换操作\n\n（2）scheme.AddConversionFuncs: 注册多个conversion Func转换函数\n\n（3）scheme.AddConversionFunc: 注册单个conversion Func转换函数\n\n（4）scheme.AddGeneratedConversionFunc: 注册自动生成的转换函数\n\n（5）scheme.AddFieldLabelConversionFunc: 注册字段标签的转换函数\n\n为什么需要这个，可以参考：https://github.com/kubernetes/kubernetes/pull/4575\n\n<br>\n\n##### 2.2.1 k8s转换函数是如何注册的\n\n也是同样的机制：以core/v1下面的资源为例。\n\n所有的转换函数都定义在这个包中。pkg/apis/core/v1/zz_generated.conversion.go\n\n以Affinity为例，\n\nConvert_core_Affinity_To_v1_Affinity是内部转v1版本的转换函数\n\nConvert_v1_Affinity_To_core_Affinity是v1转内部版本的转换函数。\n\n这些通过init函数进行了注册。\n\n```\npkg/apis/core/v1/register.go\nvar (\n\tlocalSchemeBuilder = &v1.SchemeBuilder\n\tAddToScheme        = localSchemeBuilder.AddToScheme\n)\n\n\npkg/apis/core/v1/zz_generated.conversion.go\nfunc init() {\n\tlocalSchemeBuilder.Register(RegisterConversions)\n}\n\n// RegisterConversions adds conversion functions to the given scheme.\n// Public to allow building arbitrary schemes.\nfunc RegisterConversions(s *runtime.Scheme) error {\n\tif err := s.AddGeneratedConversionFunc((*v1.AWSElasticBlockStoreVolumeSource)(nil), (*core.AWSElasticBlockStoreVolumeSource)(nil), func(a, b interface{}, scope conversion.Scope) error {\n\t\treturn Convert_v1_AWSElasticBlockStoreVolumeSource_To_core_AWSElasticBlockStoreVolumeSource(a.(*v1.AWSElasticBlockStoreVolumeSource), b.(*core.AWSElasticBlockStoreVolumeSource), scope)\n\t}); err != nil {\n\t\treturn err\n\t}\n\tif err := s.AddGeneratedConversionFunc((*core.AWSElasticBlockStoreVolumeSource)(nil), (*v1.AWSElasticBlockStoreVolumeSource)(nil), func(a, b interface{}, scope conversion.Scope) error {\n\t\treturn Convert_core_AWSElasticBlockStoreVolumeSource_To_v1_AWSElasticBlockStoreVolumeSource(a.(*core.AWSElasticBlockStoreVolumeSource), 
b.(*v1.AWSElasticBlockStoreVolumeSource), scope)\n\t}); err != nil {\n\t\treturn err\n\t}\n\tif err := s.AddGeneratedConversionFunc((*v1.Affinity)(nil), (*core.Affinity)(nil), func(a, b interface{}, scope conversion.Scope) error {\n\t\treturn Convert_v1_Affinity_To_core_Affinity(a.(*v1.Affinity), b.(*core.Affinity), scope)\n\t}); err != nil {\n\t\treturn err\n\t}\n\tif err := s.AddGeneratedConversionFunc((*core.Affinity)(nil), (*v1.Affinity)(nil), func(a, b interface{}, scope conversion.Scope) error {\n\t\treturn Convert_core_Affinity_To_v1_Affinity(a.(*core.Affinity), b.(*v1.Affinity), scope)\n\t}); err != nil {\n\t\treturn err\n\t}\n\tif err := s.AddGeneratedConversionFunc((*v1.AttachedVolume)(nil), (*core.AttachedVolume)(nil), func(a, b interface{}, scope conversion.Scope) error {\n\t\treturn Convert_v1_AttachedVolume_To_core_AttachedVolume(a.(*v1.AttachedVolume), b.(*core.AttachedVolume), scope)\n\t}); err != nil {\n\t\treturn err\n\t}\n\tif err := s.AddGeneratedConversionFunc((*core.AttachedVolume)(nil), (*v1.AttachedVolume)(nil), func(a, b interface{}, scope conversion.Scope) error {\n\t\treturn Convert_core_AttachedVolume_To_v1_AttachedVolume(a.(*core.AttachedVolume), b.(*v1.AttachedVolume), scope)\n\t}); err != nil {\n\t\treturn err\n\t}\n\tif err := s.AddGeneratedConversionFunc((*v1.AvoidPods)(nil), (*core.AvoidPods)(nil), func(a, b interface{}, scope conversion.Scope) error {\n\t\treturn Convert_v1_AvoidPods_To_core_AvoidPods(a.(*v1.AvoidPods), b.(*core.AvoidPods), scope)\n\t}); err != nil {\n\t\treturn err\n\t}\n\tif err := s.AddGeneratedConversionFunc((*core.AvoidPods)(nil), (*v1.AvoidPods)(nil), func(a, b interface{}, scope conversion.Scope) error {\n\t\treturn Convert_core_AvoidPods_To_v1_AvoidPods(a.(*core.AvoidPods), b.(*v1.AvoidPods), scope)\n\t}); err != nil {\n\t\treturn err\n\t}\n\t....\n}\n```\n\n<br>\n\n### 3.  
Scheme代码展示\n\n```\npackage main\n\nimport (\n\tappsv1 \"k8s.io/api/apps/v1\"\n\tcorev1 \"k8s.io/api/core/v1\"\n\tmetav1 \"k8s.io/apimachinery/pkg/apis/meta/v1\"\n\t\"k8s.io/apimachinery/pkg/runtime\"\n\t\"k8s.io/apimachinery/pkg/runtime/schema\"\n)\n\nfunc main() {\n\t// KnownType external\n\tcoreGV := schema.GroupVersion{Group: \"\", Version: \"v1\"}\n\textensionsGV := schema.GroupVersion{Group: \"extensions\", Version: \"v1beta1\"}\n\t\n\t// KnownType internal\n\tcoreInternalGV := schema.GroupVersion{Group: \"\", Version: runtime.APIVersionInternal}\n\t\n\t// UnversionedType\n\tUnversioned := schema.GroupVersion{Group: \"\", Version: \"v1\"}\n\t\n\tscheme := runtime.NewScheme()\n\tscheme.AddKnownTypes(coreGV, &corev1.Pod{})\n\tscheme.AddKnownTypes(extensionsGV, &appsv1.DaemonSet{})\n\tscheme.AddKnownTypes(coreInternalGV, &corev1.Pod{})\n\tscheme.AddUnversionedTypes(Unversioned, &metav1.Status{})\n}\n```\n\n在上述代码中，首先定义了两种类型的GV（资源组、资源版本）。KnownType类型有coreGV、extensionsGV、coreInternalGV对象，其中coreInternalGV对象属于内部版本（即runtime.APIVersionInternal），而UnversionedType类型有Unversioned对象。通过runtime.NewScheme实例化一个新的Scheme资源注册表。注册资源类型到Scheme资源注册表有两种方式：第一种通过scheme.AddKnownTypes方法注册KnownType类型的对象，第二种通过scheme.AddUnversionedTypes方法注册UnversionedType类型的对象。在Scheme Example代码示例中，我们往Scheme资源注册表中分别注册了Pod、DaemonSet、Pod（内部版本）及Status（无版本资源类型）类型对象，那么这些资源的映射关系如下所示。\n\n![image-20210223104858927](../images/01.png)\n\n通过这个代码可以看出来，通过AddKnownTypes将资源的对应关系注册到了scheme中。其实就是补充对应的map。\n\n```\n// AddKnownTypes registers all types passed in 'types' as being members of version 'version'.\n// All objects passed to types should be pointers to structs. The name that go reports for\n// the struct becomes the \"kind\" field when encoding. 
Version may not be empty - use the\n// APIVersionInternal constant if you have a type that does not have a formal version.\nfunc (s *Scheme) AddKnownTypes(gv schema.GroupVersion, types ...Object) {\n\ts.addObservedVersion(gv)\n\tfor _, obj := range types {\n\t\tt := reflect.TypeOf(obj)\n\t\tif t.Kind() != reflect.Ptr {\n\t\t\tpanic(\"All types must be pointers to structs.\")\n\t\t}\n\t\tt = t.Elem()\n\t\ts.AddKnownTypeWithName(gv.WithKind(t.Name()), obj)\n\t}\n}\n```\n\n为了更深刻地展示scheme的作用，我写了一个小的demo，展示如何利用scheme 进行deploy v1beta1-> v1版本，以及hpa v1beta2->v1版本的转换。详情：https://github.com/zoux86/k8sConvert\n\n<br>\n\n### 4. 总结\n\nscheme就是一个内存型的资源版本数据库。scheme可以做这样的事情：\n\n（1）来一个资源，通过scheme就能知道该资源有没有注册到scheme中\n\n（2）如果注册了，就知道这个资源的type, gvk\n\n（3）知道这个资源如何进行内部和外部版本的转换\n\n所以，有了scheme，apiserver就解决了资源多版本的管理问题。\n\n<br>\n\n**参考文档：** Kubernetes源码剖析，郑东旭"
  },
  {
    "path": "k8s/kube-apiserver/5-kube-apiserver启动流程汇总.md",
"content": "本节算是apiserver源码分析的总纲，apiserver启动流程如下所示：\n\n（1）Pod, svc, node 等资源注册\n\n（2）apiserver cobra命令行解析\n\n（3）RunE运行Run(completedOptions, genericapiserver.SetupSignalHandler()) 核心函数\n\nRun函数核心调用链逻辑如下：\n\n* CreateServerChain\n  * createNodeDialer\n  * CreateKubeAPIServerConfig\n  * createAPIExtensionsConfig\n  * createAPIExtensionsServer\n  * CreateKubeAPIServer\n  * createAggregatorConfig\n  * createAggregatorServer\n  * BuildInsecureHandlerChain\n\n* PrepareRun\n  * 添加openAPI，installHealthz，installLivez，AddPreShutdownHook。可以认为是一些健康检查，swagger接口等准备工作\n* Run\n  * 运行NonBlockingRun函数，核心是开启审计服务，并且开启https服务\n\n<br>\n\n因此接下来的源码分析归纳为以下流程：\n\n（1）资源注册。\n\n（2）Cobra命令行参数解析\n\n（3）创建APIServer通用配置\n\n（4）创建APIExtensionsServer\n\n（5）创建KubeAPIServer\n\n（6）创建AggregatorServer\n\n（7）启动HTTP服务。\n\n（8）启动HTTPS服务"
  },
  {
    "path": "k8s/kube-apiserver/6-kube-apiserver启动流程-资源注册+命令行初始.md",
"content": "* [Table of Contents](#table-of-contents)\n    * [1\\. 资源注册](#1-资源注册)\n    * [2\\. Cobra命令行参数解析](#2-cobra命令行参数解析)\n      * [2\\.1\\.  入口函数 main\\-&gt;NewAPIServerCommand](#21--入口函数-main-newapiservercommand)\n      * [2\\.2 options\\.NewServerRunOptions](#22-optionsnewserverrunoptions)\n      * [2\\.3 cmd\\.Flags()](#23-cmdflags)\n      * [2\\.3\\.1 C\\.Name](#231-cname)\n      * [2\\.3\\.2 NewFlagSet](#232-newflagset)\n      * [2\\.4 s\\.Flags()](#24-sflags)\n      * [2\\.5 command\\.Execute() 真正的参数解析](#25-commandexecute-真正的参数解析)\n      * [2\\.6 总结](#26-总结)\n    * [3  RunE](#3--rune)\n      * [3\\.1 completedOptions, err := Complete(s)](#31-completedoptions-err--completes)\n      * [3\\.2 validate](#32-validate)\n    * [4\\. 总结](#4-总结)\n\n**本章重点：**\n\n（1）kube-apiserver启动过程中，前两个步骤：资源注册和命令行解析\n\n<br>\n\n在kube-apiserver组件启动过程中，代码逻辑可分为8个步骤，分别介绍如下。\n\n（1）资源注册。\n\n（2）Cobra命令行参数解析。\n\n（3）创建APIServer通用配置。\n\n（4）创建APIExtensionsServer。\n\n（5）创建KubeAPIServer。\n\n（6）创建AggregatorServer。\n\n（7）启动HTTP服务。\n\n（8）启动HTTPS服务\n\n\n\n对应的流程图如下：\n\n![image-20210223193026586](../images/apiserver-liucheng-1.png)\n\n\n\n### 1. 
资源注册\n\nkube-apiserver组件启动后的第一件事情是将Kubernetes所支持的资源注册到Scheme资源注册表中，这样后面启动的逻辑才能够从Scheme资源注册表中拿到资源信息并启动和运行APIExtensionsServer、KubeAPIServer、AggregatorServer这3种服务。资源的注册过程并不是通过函数调用触发的，而是通过Go语言的导入（import）和初始化（init）机制触发的。导入和初始化机制如下图所示。\n\n![image-20210223193147681](../images/apiserver-liucheng-2.png)\n\n\n\nkube-apiserver的资源注册过程就利用了 import 和 init机制，代码示例如下：\n\n在 kube-apiserver 入口函数所在的文件中 cmd\\kube-apiserver\\app\\server.go  （NewAPIServerCommand 函数就在这个文件中）\n\n```\n\t\"k8s.io/kubernetes/pkg/api/legacyscheme\"\n\t\"k8s.io/kubernetes/pkg/master\"\n```\n\nkube-apiserver导入了legacyscheme和master包。kube-apiserver资源注册分为两步：第1步，初始化Scheme资源注册表；第2步，注册Kubernetes所支持的资源。\n\n**初始化资源注册表 scheme**\n\n因为   server.go  import了 legacyscheme包，所以 legacyscheme包里面的 var会被初始化。\n\n而在legacyscheme包中，就初始化了 scheme表。\n\n```\npackage legacyscheme\n\nimport (\n\t\"k8s.io/apimachinery/pkg/runtime\"\n\t\"k8s.io/apimachinery/pkg/runtime/serializer\"\n)\n\n// Scheme is the default instance of runtime.Scheme to which types in the Kubernetes API are already registered.\n// NOTE: If you are copying this file to start a new api group, STOP! Copy the\n// extensions group instead. 
This Scheme is special and should appear ONLY in\n// the api group, unless you really know what you're doing.\n// TODO(lavalamp): make the above error impossible.\nvar Scheme = runtime.NewScheme()\n\n// Codecs provides access to encoding and decoding for the scheme\nvar Codecs = serializer.NewCodecFactory(Scheme)\n\n// ParameterCodec handles versioning of objects that are converted to query parameters.\nvar ParameterCodec = runtime.NewParameterCodec(Scheme)\n```\n\n<br>\n\n同样，server.go 也import了 master包，所以 master包引入的包也会运行 init、var 等初始化过程。\n\n在 master包中，有一个 pkg\master\import_known_versions.go 文件。这个文件只引入了包，不做其他事情，目的就是触发被引入包的 init 和 var 初始化。\n\n```\npackage master\n\n// These imports are the API groups the API server will support.\nimport (\n\t_ \"k8s.io/kubernetes/pkg/apis/admission/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/admissionregistration/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/apps/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/authentication/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/authorization/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/autoscaling/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/batch/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/certificates/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/coordination/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/core/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/events/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/extensions/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/imagepolicy/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/networking/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/policy/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/rbac/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/scheduling/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/settings/install\"\n\t_ \"k8s.io/kubernetes/pkg/apis/storage/install\"\n)\n```\n\n随便拿一个为例，例如 k8s.io/kubernetes/pkg/apis/core/install 包\n\n```\npackage install\n\nimport (\n\t\"k8s.io/apimachinery/pkg/runtime\"\n\tutilruntime 
\"k8s.io/apimachinery/pkg/util/runtime\"\n\t\"k8s.io/kubernetes/pkg/api/legacyscheme\"\n\t\"k8s.io/kubernetes/pkg/apis/core\"\n\t\"k8s.io/kubernetes/pkg/apis/core/v1\"\n)\n\nfunc init() {\n\tInstall(legacyscheme.Scheme)\n}\n\n// Install registers the API group and adds types to a scheme\nfunc Install(scheme *runtime.Scheme) {\n\tutilruntime.Must(core.AddToScheme(scheme))\n\tutilruntime.Must(v1.AddToScheme(scheme))\n\tutilruntime.Must(scheme.SetVersionPriority(v1.SchemeGroupVersion))\n}\n```\n\n<br>\n\n所以上面的一个 utilruntime.Must(core.AddToScheme(scheme))。\n\n就将pod,podlist等等都注册到了scheme中去。\n\n```\nvar (\n   // SchemeBuilder object to register various known types\n   SchemeBuilder = runtime.NewSchemeBuilder(addKnownTypes)\n\n   // AddToScheme represents a func that can be used to apply all the registered\n   // funcs in a scheme\n   AddToScheme = SchemeBuilder.AddToScheme\n)\n\nfunc addKnownTypes(scheme *runtime.Scheme) error {\n   if err := scheme.AddIgnoredConversionType(&metav1.TypeMeta{}, &metav1.TypeMeta{}); err != nil {\n      return err\n   }\n   scheme.AddKnownTypes(SchemeGroupVersion,\n      &Pod{},\n      &PodList{},\n      &PodStatusResult{},\n      &PodTemplate{},\n      &PodTemplateList{},\n      &ReplicationControllerList{},\n      &ReplicationController{},\n      &ServiceList{},\n      &Service{},\n      &ServiceProxyOptions{},\n      &NodeList{},\n      &Node{},\n      &NodeProxyOptions{},\n      &Endpoints{},\n      &EndpointsList{},\n      &Binding{},\n      &Event{},\n      &EventList{},\n      &List{},\n      &LimitRange{},\n      &LimitRangeList{},\n      &ResourceQuota{},\n      &ResourceQuotaList{},\n      &Namespace{},\n      &NamespaceList{},\n      &ServiceAccount{},\n      &ServiceAccountList{},\n      &Secret{},\n      &SecretList{},\n      &PersistentVolume{},\n      &PersistentVolumeList{},\n      &PersistentVolumeClaim{},\n      &PersistentVolumeClaimList{},\n      &PodAttachOptions{},\n      &PodLogOptions{},\n      &PodExecOptions{},\n    
  &PodPortForwardOptions{},\n      &PodProxyOptions{},\n      &ComponentStatus{},\n      &ComponentStatusList{},\n      &SerializedReference{},\n      &RangeAllocation{},\n      &ConfigMap{},\n      &ConfigMapList{},\n      &EphemeralContainers{},\n   )\n\n   return nil\n}\n\n// GroupName is the group name use in this package\nconst GroupName = \"\"\n\n// SchemeGroupVersion is group version used to register these objects\nvar SchemeGroupVersion = schema.GroupVersion{Group: GroupName, Version: runtime.APIVersionInternal}\n```\n\n\n\n这里可以看出来，init中 就注册了 core资源。在上述代码中，core.AddToScheme函数注册了core资源组内部版本的资源，v1.AddToScheme函数注册了core资源\n\n组外部版本的资源，scheme.SetVersionPriority函数注册了资源组的版本顺序。如果有多个资源版本，排在最前面的为资源首选版本。\n\n提示：除将KubeAPIServer（API核心服务）注册至legacyscheme.Scheme资源注册表以外，还需要了解APIExtensionsServer和AggregatorServer资源注册过程。\n\n●  将APIExtensionsServer （API扩展服务）注册至extensionsapiserver.Scheme资源注册表，注册过程定义在vendor/k8s.io/apiextensions-apiserver/pkg/apiserver/apiserver.go中。\n\n●  将AggregatorServer（API聚合服务）注册至aggregatorscheme.Scheme资源注册表，注册过程定义在vendor/k8s.io/kube-aggregator/pkg/apiserver/scheme/scheme.go中。\n\n```\nvendor/k8s.io/kube-aggregator/pkg/apiserver/scheme/scheme.go\n\nvar (\n\tScheme = runtime.NewScheme()\n\tCodecs = serializer.NewCodecFactory(Scheme)\n\n\t// if you modify this, make sure you update the crEncoder\n\tunversionedVersion = schema.GroupVersion{Group: \"\", Version: \"v1\"}\n\tunversionedTypes   = []runtime.Object{\n\t\t&metav1.Status{},\n\t\t&metav1.WatchEvent{},\n\t\t&metav1.APIVersions{},\n\t\t&metav1.APIGroupList{},\n\t\t&metav1.APIGroup{},\n\t\t&metav1.APIResourceList{},\n\t}\n)\n\nfunc init() {\n\tinstall.Install(Scheme)\n\n\t// we need to add the options to empty v1\n\tmetav1.AddToGroupVersion(Scheme, schema.GroupVersion{Group: \"\", Version: \"v1\"})\n\n\tScheme.AddUnversionedTypes(unversionedVersion, unversionedTypes...)\n}\n```\n\n<br>\n\n### 2. Cobra命令行参数解析\n\n#### 2.1.  
入口函数 main->NewAPIServerCommand\n\ncmd\\kube-apiserver\\apiserver.go\n\n```\nfunc main() {\n\trand.Seed(time.Now().UTC().UnixNano())\n\n\tcommand := app.NewAPIServerCommand(server.SetupSignalHandler())\n\n\t// TODO: once we switch everything over to Cobra commands, we can go back to calling\n\t// utilflag.InitFlags() (by removing its pflag.Parse() call). For now, we have to set the\n\t// normalize func and add the go flag set by hand.\n\tpflag.CommandLine.SetNormalizeFunc(utilflag.WordSepNormalizeFunc)\n\tpflag.CommandLine.AddGoFlagSet(goflag.CommandLine)\n\t// utilflag.InitFlags()\n\tlogs.InitLogs()\n\tdefer logs.FlushLogs()\n\n\tif err := command.Execute(); err != nil {\n\t\tfmt.Fprintf(os.Stderr, \"error: %v\\n\", err)\n\t\tos.Exit(1)\n\t}\n}\n```\n\n<br>\n\n**NewAPIServerCommand**\n\n```\n// NewAPIServerCommand creates a *cobra.Command object with default parameters\nfunc NewAPIServerCommand() *cobra.Command {\n    // 1. 定义 NewServerRunOptions。就是定义所有参数的结构体对象。详见：2.2\n\ts := options.NewServerRunOptions()\n\t\n\t// 2. 定义一个 cmd\n\tcmd := &cobra.Command{\n\t\tUse: \"kube-apiserver\",\n\t\tLong: `The Kubernetes API server validates and configures data\nfor the api objects which include pods, services, replicationcontrollers, and\nothers. The API Server services REST operations and provides the frontend to the\ncluster's shared state through which all other components interact.`,\n\t\tRunE: func(cmd *cobra.Command, args []string) error {\n\t\t\tverflag.PrintAndExitIfRequested()\n\t\t\tutilflag.PrintFlags(cmd.Flags())\n\n\t\t\t// set default options\n\t\t\tcompletedOptions, err := Complete(s)\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\n\t\t\t// validate options\n\t\t\tif errs := completedOptions.Validate(); len(errs) != 0 {\n\t\t\t\treturn utilerrors.NewAggregate(errs)\n\t\t\t}\n\n\t\t\treturn Run(completedOptions, genericapiserver.SetupSignalHandler())\n\t\t},\n\t}\n    \n    \n    // 3. 
初始化cmd的 flagset。这里就是定义一个空的，以 kube-apiserver 为名字的 flagset。详见2.3\n\tfs := cmd.Flags()\n\n\t// 4. 绑定 kube-apiserver 各个组件（etcd, CloudProvider 等等），详见 2.4\n\tnamedFlagSets := s.Flags()\n\n    // 5. 将各个组件的 flagset 加入 fs 中。这样 fs 就有了 kube-apiserver 以及各个组件的 flagset 了。\n\tfor _, f := range namedFlagSets.FlagSets {\n\t\tfs.AddFlagSet(f)\n\t}\n\n    // 6. 设置打印使用函数\n\tusageFmt := \"Usage:\\n  %s\\n\"\n\tcols, _, _ := term.TerminalSize(cmd.OutOrStdout())\n\tcmd.SetUsageFunc(func(cmd *cobra.Command) error {\n\t\tfmt.Fprintf(cmd.OutOrStderr(), usageFmt, cmd.UseLine())\n\t\tcliflag.PrintSections(cmd.OutOrStderr(), namedFlagSets, cols)\n\t\treturn nil\n\t})\n\n\t// 7. 设置help函数。\n\tcmd.SetHelpFunc(func(cmd *cobra.Command, args []string) {\n\t\tfmt.Fprintf(cmd.OutOrStdout(), \"%s\\n\\n\"+usageFmt, cmd.Long, cmd.UseLine())\n\t\tcliflag.PrintSections(cmd.OutOrStdout(), namedFlagSets, cols)\n\t})\n\n\treturn cmd\n}\n```\n\n<br>\n\n#### 2.2 options.NewServerRunOptions\n\n实例化一个对象，这里包括了所有的部分：GenericServerRunOptions, Etcd, SecureServing...CloudProvider 等等。\n\n```\n// NewServerRunOptions creates a new ServerRunOptions object with default parameters\nfunc NewServerRunOptions() *ServerRunOptions {\n\ts := ServerRunOptions{\n\t\tGenericServerRunOptions: genericoptions.NewServerRunOptions(),\n\t\tEtcd:                 genericoptions.NewEtcdOptions(storagebackend.NewDefaultConfig(kubeoptions.DefaultEtcdPathPrefix, nil)),\n\t\tSecureServing:        kubeoptions.NewSecureServingOptions(),\n\t\tInsecureServing:      kubeoptions.NewInsecureServingOptions(),\n\t\tAudit:                genericoptions.NewAuditOptions(),\n\t\tFeatures:             genericoptions.NewFeatureOptions(),\n\t\tAdmission:            kubeoptions.NewAdmissionOptions(),\n\t\tAuthentication:       kubeoptions.NewBuiltInAuthenticationOptions().WithAll(),\n\t\tAuthorization:        kubeoptions.NewBuiltInAuthorizationOptions(),\n\t\tCloudProvider:        kubeoptions.NewCloudProviderOptions(),\n\t\tStorageSerialization: 
kubeoptions.NewStorageSerializationOptions(),\n\t\tAPIEnablement:        genericoptions.NewAPIEnablementOptions(),\n\n\t\tEnableLogsHandler:      true,\n\t\tEventTTL:               1 * time.Hour,\n\t\tMasterCount:            1,\n\t\tEndpointReconcilerType: string(reconcilers.LeaseEndpointReconcilerType),\n\t\tKubeletConfig: kubeletclient.KubeletClientConfig{\n\t\t\tPort:         ports.KubeletPort,\n\t\t\tReadOnlyPort: ports.KubeletReadOnlyPort,\n\t\t\tPreferredAddressTypes: []string{\n\t\t\t\t// --override-hostname\n\t\t\t\tstring(api.NodeHostName),\n\n\t\t\t\t// internal, preferring DNS if reported\n\t\t\t\tstring(api.NodeInternalDNS),\n\t\t\t\tstring(api.NodeInternalIP),\n\n\t\t\t\t// external, preferring DNS if reported\n\t\t\t\tstring(api.NodeExternalDNS),\n\t\t\t\tstring(api.NodeExternalIP),\n\t\t\t},\n\t\t\tEnableHttps: true,\n\t\t\tHTTPTimeout: time.Duration(5) * time.Second,\n\t\t},\n\t\tServiceNodePortRange: kubeoptions.DefaultServiceNodePortRange,\n\t}\n\ts.ServiceClusterIPRange = kubeoptions.DefaultServiceIPCIDR\n\n\t// Overwrite the default for storage data format.\n\ts.Etcd.DefaultStorageMediaType = \"application/vnd.kubernetes.protobuf\"\n\n\treturn &s\n}\n```\n\n以etcd为例，可以看出来，就是实例化etcd对象，然后有默认值的就赋默认值。\n\n```\nfunc NewEtcdOptions(backendConfig *storagebackend.Config) *EtcdOptions {\n\toptions := &EtcdOptions{\n\t\tStorageConfig:           *backendConfig,\n\t\tDefaultStorageMediaType: \"application/json\",\n\t\tDeleteCollectionWorkers: 1,\n\t\tEnableGarbageCollection: true,\n\t\tEnableWatchCache:        true,\n\t\tDefaultWatchCacheSize:   100,\n\t}\n\toptions.StorageConfig.CountMetricPollPeriod = time.Minute\n\treturn options\n}\n```\n\n```\ntype EtcdOptions struct {\n\t// The value of Paging on StorageConfig will be overridden by the\n\t// calculated feature gate value.\n\tStorageConfig                    storagebackend.Config\n\tEncryptionProviderConfigFilepath string\n\n\tEtcdServersOverrides []string\n\n\t// To enable protobuf as storage format, it 
is enough\n\t// to set it to \"application/vnd.kubernetes.protobuf\".\n\tDefaultStorageMediaType string\n\tDeleteCollectionWorkers int\n\tEnableGarbageCollection bool\n\n\t// Set EnableWatchCache to false to disable all watch caches\n\tEnableWatchCache bool\n\t// Set DefaultWatchCacheSize to zero to disable watch caches for those resources that have no explicit cache size set\n\tDefaultWatchCacheSize int\n\t// WatchCacheSizes represents override to a given resource\n\tWatchCacheSizes []string\n}\n```\n\n\n\n#### 2.3 cmd.Flags()\n\n就是定义一个 kube-apiserver 为名字的 flagset。\n\n```\n// Flags returns the complete FlagSet that applies\n// to this command (local and persistent declared here and by all parents).\nfunc (c *Command) Flags() *flag.FlagSet {\n\tif c.flags == nil {\n\t\tc.flags = flag.NewFlagSet(c.Name(), flag.ContinueOnError)\n\t\tif c.flagErrorBuf == nil {\n\t\t\tc.flagErrorBuf = new(bytes.Buffer)\n\t\t}\n\t\tc.flags.SetOutput(c.flagErrorBuf)\n\t}\n\n\treturn c.flags\n}\n```\n\n#### 2.3.1 C.Name\n\n Command.Name = Command.Use。所以这里就是  c.Name = \"kube-apiserver\"\n\n```\n// Name returns the command's name: the first word in the use line.\nfunc (c *Command) Name() string {\n\tname := c.Use\n\ti := strings.Index(name, \" \")\n\tif i >= 0 {\n\t\tname = name[:i]\n\t}\n\treturn name\n}\n```\n\n<br>\n\n#### 2.3.2 NewFlagSet\n\n就是返回一个 flagset对象。\n\n```\n// NewFlagSet returns a new, empty flag set with the specified name,\n// error handling property and SortFlags set to true.\nfunc NewFlagSet(name string, errorHandling ErrorHandling) *FlagSet {\n   f := &FlagSet{\n      name:          name,\n      errorHandling: errorHandling,\n      argsLenAtDash: -1,\n      interspersed:  true,\n      SortFlags:     true,\n   }\n   return f\n}\n```\n\n\n\n```\n// A FlagSet represents a set of defined flags.\ntype FlagSet struct {\n\t// Usage is the function called when an error occurs while parsing flags.\n\t// The field is a function (not a method) that may be changed to point to\n\t// a 
custom error handler.\n\tUsage func()\n\n\t// SortFlags is used to indicate, if user wants to have sorted flags in\n\t// help/usage messages.\n\tSortFlags bool\n\n\t// ParseErrorsWhitelist is used to configure a whitelist of errors\n\tParseErrorsWhitelist ParseErrorsWhitelist\n\n\tname              string\n\tparsed            bool\n\tactual            map[NormalizedName]*Flag\n\torderedActual     []*Flag\n\tsortedActual      []*Flag\n\tformal            map[NormalizedName]*Flag\n\torderedFormal     []*Flag\n\tsortedFormal      []*Flag\n\tshorthands        map[byte]*Flag\n\targs              []string // arguments after flags\n\targsLenAtDash     int      // len(args) when a '--' was located when parsing, or -1 if no --\n\terrorHandling     ErrorHandling\n\toutput            io.Writer // nil means stderr; use out() accessor\n\tinterspersed      bool      // allow interspersed option/non-option args\n\tnormalizeNameFunc func(f *FlagSet, name string) NormalizedName\n\n\taddedGoFlagSets []*goflag.FlagSet\n}\n```\n\n通过打印日志发现，这个时候都还是空的。\n\n```\nI0127 10:51:41.053661    5612 flag.go:1209] zxtest f.name is kube-apiserver\nI0127 10:51:41.053666    5612 flag.go:1210] zxtest f.actual is map[]\nI0127 10:51:41.053677    5612 flag.go:1211] zxtest f.orderedActual is []\nI0127 10:51:41.053684    5612 flag.go:1212] zxtest f.sortedActual is []\nI0127 10:51:41.053689    5612 flag.go:1213] zxtest f.formal is map[]\nI0127 10:51:41.053695    5612 flag.go:1214] zxtest f.orderedFormal is []\nI0127 10:51:41.053702    5612 flag.go:1215] zxtest f.sortedFormal is []\nI0127 10:51:41.053708    5612 flag.go:1216] zxtest f.shorthands is map[]\nI0127 10:51:41.053714    5612 flag.go:1217] zxtest f.args is []\n```\n\n<br>\n\n#### 2.4 s.Flags()\n\ns.Flags 就是把结构体中的参数和启动时输入的命令行参数进行绑定。\n\nfss（apiserverflag.NamedFlagSets）是一个 map，key 是组件名，value 是对应的 flagSet，可以认为是：\n\n```\nfss = {\n    \"etcd\": flagset1,\n    \"secure serving\": flagset2,\n}\n```\n\n```\n// Flags returns flags for a specific APIServer by 
section name\nfunc (s *ServerRunOptions) Flags() (fss apiserverflag.NamedFlagSets) {\n\t// Add the generic flags.\n\ts.GenericServerRunOptions.AddUniversalFlags(fss.FlagSet(\"generic\"))\n\ts.Etcd.AddFlags(fss.FlagSet(\"etcd\"))\n\ts.SecureServing.AddFlags(fss.FlagSet(\"secure serving\"))\n\ts.InsecureServing.AddFlags(fss.FlagSet(\"insecure serving\"))\n\ts.InsecureServing.AddUnqualifiedFlags(fss.FlagSet(\"insecure serving\")) // TODO: remove it until kops stops using `--address`\n\ts.Audit.AddFlags(fss.FlagSet(\"auditing\"))\n\ts.Features.AddFlags(fss.FlagSet(\"features\"))\n\ts.Authentication.AddFlags(fss.FlagSet(\"authentication\"))\n\ts.Authorization.AddFlags(fss.FlagSet(\"authorization\"))\n\ts.CloudProvider.AddFlags(fss.FlagSet(\"cloud provider\"))\n\ts.StorageSerialization.AddFlags(fss.FlagSet(\"storage\"))\n\ts.APIEnablement.AddFlags(fss.FlagSet(\"api enablement\"))\n\ts.Admission.AddFlags(fss.FlagSet(\"admission\"))\n\n\t// Note: the weird \"\"+ in below lines seems to be the only way to get gofmt to\n\t// arrange these text blocks sensibly. Grrr.\n\tfs := fss.FlagSet(\"misc\")\n\tfs.DurationVar(&s.EventTTL, \"event-ttl\", s.EventTTL,\n\t\t\"Amount of time to retain events.\")\n\n\tfs.BoolVar(&s.AllowPrivileged, \"allow-privileged\", s.AllowPrivileged,\n\t\t\"If true, allow privileged containers. 
[default=false]\")\n\n\tfs.BoolVar(&s.EnableLogsHandler, \"enable-logs-handler\", s.EnableLogsHandler,\n\t\t\"If true, install a /logs handler for the apiserver logs.\")\n\n\t// Deprecated in release 1.9\n\tfs.StringVar(&s.SSHUser, \"ssh-user\", s.SSHUser,\n\t\t\"If non-empty, use secure SSH proxy to the nodes, using this user name\")\n\tfs.MarkDeprecated(\"ssh-user\", \"This flag will be removed in a future version.\")\n\n\t// Deprecated in release 1.9\n\tfs.StringVar(&s.SSHKeyfile, \"ssh-keyfile\", s.SSHKeyfile,\n\t\t\"If non-empty, use secure SSH proxy to the nodes, using this user keyfile\")\n\tfs.MarkDeprecated(\"ssh-keyfile\", \"This flag will be removed in a future version.\")\n\n\tfs.Int64Var(&s.MaxConnectionBytesPerSec, \"max-connection-bytes-per-sec\", s.MaxConnectionBytesPerSec, \"\"+\n\t\t\"If non-zero, throttle each user connection to this number of bytes/sec. \"+\n\t\t\"Currently only applies to long-running requests.\")\n\n\tfs.IntVar(&s.MasterCount, \"apiserver-count\", s.MasterCount,\n\t\t\"The number of apiservers running in the cluster, must be a positive number. (In use when --endpoint-reconciler-type=master-count is enabled.)\")\n\n\tfs.StringVar(&s.EndpointReconcilerType, \"endpoint-reconciler-type\", string(s.EndpointReconcilerType),\n\t\t\"Use an endpoint reconciler (\"+strings.Join(reconcilers.AllTypes.Names(), \", \")+\")\")\n\n\t// See #14282 for details on how to test/try this option out.\n\t// TODO: remove this comment once this option is tested in CI.\n\tfs.IntVar(&s.KubernetesServiceNodePort, \"kubernetes-service-node-port\", s.KubernetesServiceNodePort, \"\"+\n\t\t\"If non-zero, the Kubernetes master service (which apiserver creates/maintains) will be \"+\n\t\t\"of type NodePort, using this as the value of the port. 
If zero, the Kubernetes master \"+\n\t\t\"service will be of type ClusterIP.\")\n\n\tfs.IPNetVar(&s.ServiceClusterIPRange, \"service-cluster-ip-range\", s.ServiceClusterIPRange, \"\"+\n\t\t\"A CIDR notation IP range from which to assign service cluster IPs. This must not \"+\n\t\t\"overlap with any IP ranges assigned to nodes for pods.\")\n\n\tfs.Var(&s.ServiceNodePortRange, \"service-node-port-range\", \"\"+\n\t\t\"A port range to reserve for services with NodePort visibility. \"+\n\t\t\"Example: '30000-32767'. Inclusive at both ends of the range.\")\n\n\t// Kubelet related flags:\n\tfs.BoolVar(&s.KubeletConfig.EnableHttps, \"kubelet-https\", s.KubeletConfig.EnableHttps,\n\t\t\"Use https for kubelet connections.\")\n\n\tfs.StringSliceVar(&s.KubeletConfig.PreferredAddressTypes, \"kubelet-preferred-address-types\", s.KubeletConfig.PreferredAddressTypes,\n\t\t\"List of the preferred NodeAddressTypes to use for kubelet connections.\")\n\n\tfs.UintVar(&s.KubeletConfig.Port, \"kubelet-port\", s.KubeletConfig.Port,\n\t\t\"DEPRECATED: kubelet port.\")\n\tfs.MarkDeprecated(\"kubelet-port\", \"kubelet-port is deprecated and will be removed.\")\n\n\tfs.UintVar(&s.KubeletConfig.ReadOnlyPort, \"kubelet-read-only-port\", s.KubeletConfig.ReadOnlyPort,\n\t\t\"DEPRECATED: kubelet port.\")\n\n\tfs.DurationVar(&s.KubeletConfig.HTTPTimeout, \"kubelet-timeout\", s.KubeletConfig.HTTPTimeout,\n\t\t\"Timeout for kubelet operations.\")\n\n\tfs.StringVar(&s.KubeletConfig.CertFile, \"kubelet-client-certificate\", s.KubeletConfig.CertFile,\n\t\t\"Path to a client cert file for TLS.\")\n\n\tfs.StringVar(&s.KubeletConfig.KeyFile, \"kubelet-client-key\", s.KubeletConfig.KeyFile,\n\t\t\"Path to a client key file for TLS.\")\n\n\tfs.StringVar(&s.KubeletConfig.CAFile, \"kubelet-certificate-authority\", s.KubeletConfig.CAFile,\n\t\t\"Path to a cert file for the certificate authority.\")\n\n\t// TODO: delete this flag in 1.13\n\trepair := false\n\tfs.BoolVar(&repair, \"repair-malformed-updates\", 
false, \"deprecated\")\n\tfs.MarkDeprecated(\"repair-malformed-updates\", \"This flag will be removed in a future version\")\n\n\tfs.StringVar(&s.ProxyClientCertFile, \"proxy-client-cert-file\", s.ProxyClientCertFile, \"\"+\n\t\t\"Client certificate used to prove the identity of the aggregator or kube-apiserver \"+\n\t\t\"when it must call out during a request. This includes proxying requests to a user \"+\n\t\t\"api-server and calling out to webhook admission plugins. It is expected that this \"+\n\t\t\"cert includes a signature from the CA in the --requestheader-client-ca-file flag. \"+\n\t\t\"That CA is published in the 'extension-apiserver-authentication' configmap in \"+\n\t\t\"the kube-system namespace. Components receiving calls from kube-aggregator should \"+\n\t\t\"use that CA to perform their half of the mutual TLS verification.\")\n\tfs.StringVar(&s.ProxyClientKeyFile, \"proxy-client-key-file\", s.ProxyClientKeyFile, \"\"+\n\t\t\"Private key for the client certificate used to prove the identity of the aggregator or kube-apiserver \"+\n\t\t\"when it must call out during a request. This includes proxying requests to a user \"+\n\t\t\"api-server and calling out to webhook admission plugins.\")\n\n\tfs.BoolVar(&s.EnableAggregatorRouting, \"enable-aggregator-routing\", s.EnableAggregatorRouting,\n\t\t\"Turns on aggregator routing requests to endpoints IP rather than cluster IP.\")\n\n\tfs.StringVar(&s.ServiceAccountSigningKeyFile, \"service-account-signing-key-file\", s.ServiceAccountSigningKeyFile, \"\"+\n\t\t\"Path to the file that contains the current private key of the service account token issuer. The issuer will sign issued ID tokens with this private key. 
(Requires the 'TokenRequest' feature gate.)\")\n\n\treturn fss\n}\n```\n\n<br>\n\nmap中如果没有就新建一个\n\n```\n// FlagSet returns the flag set with the given name and adds it to the\n// ordered name list if it is not in there yet.\nfunc (nfs *NamedFlagSets) FlagSet(name string) *pflag.FlagSet {\n\tif nfs.FlagSets == nil {\n\t\tnfs.FlagSets = map[string]*pflag.FlagSet{}\n\t}\n\tif _, ok := nfs.FlagSets[name]; !ok {\n\t\tnfs.FlagSets[name] = pflag.NewFlagSet(name, pflag.ExitOnError)\n\t\tnfs.Order = append(nfs.Order, name)\n\t}\n\treturn nfs.FlagSets[name]\n}\n```\n\n\n\n还是以etcd为例，这里就是将 EtcdOptions结构体中的一个一个变量和 输入的参数进行绑定。\n\n```\n// AddEtcdFlags adds flags related to etcd storage for a specific APIServer to the specified FlagSet\nfunc (s *EtcdOptions) AddFlags(fs *pflag.FlagSet) {\n   if s == nil {\n      return\n   }\n\n   fs.StringSliceVar(&s.EtcdServersOverrides, \"etcd-servers-overrides\", s.EtcdServersOverrides, \"\"+\n      \"Per-resource etcd servers overrides, comma separated. The individual override \"+\n      \"format: group/resource#servers, where servers are URLs, semicolon separated.\")\n\n   fs.StringVar(&s.DefaultStorageMediaType, \"storage-media-type\", s.DefaultStorageMediaType, \"\"+\n      \"The media type to use to store objects in storage. \"+\n      \"Some resources or storage backends may only support a specific media type and will ignore this setting.\")\n   fs.IntVar(&s.DeleteCollectionWorkers, \"delete-collection-workers\", s.DeleteCollectionWorkers,\n      \"Number of workers spawned for DeleteCollection call. These are used to speed up namespace cleanup.\")\n\n   fs.BoolVar(&s.EnableGarbageCollection, \"enable-garbage-collector\", s.EnableGarbageCollection, \"\"+\n      \"Enables the generic garbage collector. 
MUST be synced with the corresponding flag \"+\n      \"of the kube-controller-manager.\")\n\n   fs.BoolVar(&s.EnableWatchCache, \"watch-cache\", s.EnableWatchCache,\n      \"Enable watch caching in the apiserver\")\n\n   fs.IntVar(&s.DefaultWatchCacheSize, \"default-watch-cache-size\", s.DefaultWatchCacheSize,\n      \"Default watch cache size. If zero, watch cache will be disabled for resources that do not have a default watch size set.\")\n\n   fs.StringSliceVar(&s.WatchCacheSizes, \"watch-cache-sizes\", s.WatchCacheSizes, \"\"+\n      \"List of watch cache sizes for every resource (pods, nodes, etc.), comma separated. \"+\n      \"The individual override format: resource[.group]#size, where resource is lowercase plural (no version), \"+\n      \"group is optional, and size is a number. It takes effect when watch-cache is enabled. \"+\n      \"Some resources (replicationcontrollers, endpoints, nodes, pods, services, apiservices.apiregistration.k8s.io) \"+\n      \"have system defaults set by heuristics, others default to default-watch-cache-size\")\n\n   fs.StringVar(&s.StorageConfig.Type, \"storage-backend\", s.StorageConfig.Type,\n      \"The storage backend for persistence. 
Options: 'etcd3' (default), 'etcd2'.\")\n\n   fs.IntVar(&s.StorageConfig.DeserializationCacheSize, \"deserialization-cache-size\", s.StorageConfig.DeserializationCacheSize,\n      \"Number of deserialized json objects to cache in memory.\")\n\n   fs.StringSliceVar(&s.StorageConfig.ServerList, \"etcd-servers\", s.StorageConfig.ServerList,\n      \"List of etcd servers to connect with (scheme://ip:port), comma separated.\")\n\n   fs.StringVar(&s.StorageConfig.Prefix, \"etcd-prefix\", s.StorageConfig.Prefix,\n      \"The prefix to prepend to all resource paths in etcd.\")\n\n   fs.StringVar(&s.StorageConfig.KeyFile, \"etcd-keyfile\", s.StorageConfig.KeyFile,\n      \"SSL key file used to secure etcd communication.\")\n\n   fs.StringVar(&s.StorageConfig.CertFile, \"etcd-certfile\", s.StorageConfig.CertFile,\n      \"SSL certification file used to secure etcd communication.\")\n\n   fs.StringVar(&s.StorageConfig.CAFile, \"etcd-cafile\", s.StorageConfig.CAFile,\n      \"SSL Certificate Authority file used to secure etcd communication.\")\n\n   fs.BoolVar(&s.StorageConfig.Quorum, \"etcd-quorum-read\", s.StorageConfig.Quorum,\n      \"If true, enable quorum read. It defaults to true and is strongly recommended not setting to false.\")\n   fs.MarkDeprecated(\"etcd-quorum-read\", \"This flag is deprecated and the ability to switch off quorum read will be removed in a future release.\")\n\n   fs.StringVar(&s.EncryptionProviderConfigFilepath, \"experimental-encryption-provider-config\", s.EncryptionProviderConfigFilepath,\n      \"The file containing configuration for encryption providers to be used for storing secrets in etcd\")\n\n   fs.DurationVar(&s.StorageConfig.CompactionInterval, \"etcd-compaction-interval\", s.StorageConfig.CompactionInterval,\n      \"The interval of compaction requests. 
If 0, the compaction request from apiserver is disabled.\")\n\n   fs.DurationVar(&s.StorageConfig.CountMetricPollPeriod, \"etcd-count-metric-poll-period\", s.StorageConfig.CountMetricPollPeriod, \"\"+\n      \"Frequency of polling etcd for number of resources per type. 0 disables the metric collection.\")\n}\n```\n\n<br>\n\n#### 2.5 command.Execute() 真正的参数解析 \n\n以下的流程都是在该文件中：github.com\\spf13\\cobra\\command.go\n\n这些都是 cobra自动解析的。\n\ncommand.Execute() -> ExecuteC() -> cmd.execute(flags)\n\ncmd.execute 的大体流程如下：\n\n（1）解析参数\n\n（2）判断cmd是否设置了 run, runE函数。没有就直接返回\n\n（3）运行设置的初始化函数，preRun\n\n（4）运行RunE,或者Run函数。\n\n```\nfunc (c *Command) execute(a []string) (err error) {\n\tif c == nil {\n\t\treturn fmt.Errorf(\"Called Execute() on a nil Command\")\n\t}\n\n\tif len(c.Deprecated) > 0 {\n\t\tc.Printf(\"Command %q is deprecated, %s\\n\", c.Name(), c.Deprecated)\n\t}\n\n\t// initialize help and version flag at the last point possible to allow for user\n\t// overriding\n\tc.InitDefaultHelpFlag()\n\tc.InitDefaultVersionFlag()\n  \n    // 1. 解析参数\n\terr = c.ParseFlags(a)\n\tif err != nil {\n\t\treturn c.FlagErrorFunc()(c, err)\n\t}\n\n\t// If help is called, regardless of other flags, return we want help.\n\t// Also say we need help if the command isn't runnable.\n\thelpVal, err := c.Flags().GetBool(\"help\")\n\tif err != nil {\n\t\t// should be impossible to get here as we always declare a help\n\t\t// flag in InitDefaultHelpFlag()\n\t\tc.Println(\"\\\"help\\\" flag declared as non-bool. Please correct your code\")\n\t\treturn err\n\t}\n\n\tif helpVal {\n\t\treturn flag.ErrHelp\n\t}\n\n\t// for back-compat, only add version flag behavior if version is defined\n\tif c.Version != \"\" {\n\t\tversionVal, err := c.Flags().GetBool(\"version\")\n\t\tif err != nil {\n\t\t\tc.Println(\"\\\"version\\\" flag declared as non-bool. 
Please correct your code\")\n\t\t\treturn err\n\t\t}\n\t\tif versionVal {\n\t\t\terr := tmpl(c.OutOrStdout(), c.VersionTemplate(), c)\n\t\t\tif err != nil {\n\t\t\t\tc.Println(err)\n\t\t\t}\n\t\t\treturn err\n\t\t}\n\t}\n\n    // 2. 判断cmd是否设置了 Run, RunE函数。\n\tif !c.Runnable() {\n\t\treturn flag.ErrHelp\n\t}\n\n    // 3. 运行设置的初始化函数\n\tc.preRun()\n\n\targWoFlags := c.Flags().Args()\n\tif c.DisableFlagParsing {\n\t\targWoFlags = a\n\t}\n\n\tif err := c.ValidateArgs(argWoFlags); err != nil {\n\t\treturn err\n\t}\n\n    // 4. 开始运行 Run 或者 RunE 函数。可以看出来，RunE函数的优先级是大于Run的。\n\tfor p := c; p != nil; p = p.Parent() {\n\t\tif p.PersistentPreRunE != nil {\n\t\t\tif err := p.PersistentPreRunE(c, argWoFlags); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tbreak\n\t\t} else if p.PersistentPreRun != nil {\n\t\t\tp.PersistentPreRun(c, argWoFlags)\n\t\t\tbreak\n\t\t}\n\t}\n\tif c.PreRunE != nil {\n\t\tif err := c.PreRunE(c, argWoFlags); err != nil {\n\t\t\treturn err\n\t\t}\n\t} else if c.PreRun != nil {\n\t\tc.PreRun(c, argWoFlags)\n\t}\n\n\tif err := c.validateRequiredFlags(); err != nil {\n\t\treturn err\n\t}\n\tif c.RunE != nil {\n\t\tif err := c.RunE(c, argWoFlags); err != nil {\n\t\t\treturn err\n\t\t}\n\t} else {\n\t\tc.Run(c, argWoFlags)\n\t}\n\tif c.PostRunE != nil {\n\t\tif err := c.PostRunE(c, argWoFlags); err != nil {\n\t\t\treturn err\n\t\t}\n\t} else if c.PostRun != nil {\n\t\tc.PostRun(c, argWoFlags)\n\t}\n\tfor p := c; p != nil; p = p.Parent() {\n\t\tif p.PersistentPostRunE != nil {\n\t\t\tif err := p.PersistentPostRunE(c, argWoFlags); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tbreak\n\t\t} else if p.PersistentPostRun != nil {\n\t\t\tp.PersistentPostRun(c, argWoFlags)\n\t\t\tbreak\n\t\t}\n\t}\n\n\treturn nil\n}\n```\n\n#### 2.6 总结\n\n（1）这里主要就是利用 cobra 工具进行初始化。options.NewServerRunOptions 函数将 cobra 和 kube-apiserver 的参数进行了解耦。\n\n（2）Execute() 里面才会真正进行参数解析，所以 Run 函数外面的都是没有解析的值，打印出来确实都是默认值；Run 函数里面的都是参数解析完的。\n\n<br>\n\n### 3  
RunE\n\n这个是NewAPIServerCommand中定义的RunE函数。\n\n```\nRunE: func(cmd *cobra.Command, args []string) error {\n\t\t\t// 1. 如果监测到输入了 --version，就打印当前的k8s版本信息，然后退出。\n\t\t\tverflag.PrintAndExitIfRequested()\n\n\t\t\t// 2. 打印flags\n\t\t\tutilflag.PrintFlags(cmd.Flags())\n\n\t\t\t// 3. 补全 s 的配置，这里是补充默认的配置。（s := options.NewServerRunOptions()），详见3.1\n\t\t\t// set default options\n\t\t\tcompletedOptions, err := Complete(s)\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\n\t\t\t// 4. 分组件validate，主要验证每个组件是否缺失一些重要的参数，以及参数是否符合规范等。详见3.2\n\t\t\t// validate options\n\t\t\tif errs := completedOptions.Validate(); len(errs) != 0 {\n\t\t\t\treturn utilerrors.NewAggregate(errs)\n\t\t\t}\n\n\t\t\t// 5. 这里已经获得了所有的配置，并且通过验证，然后真正可以运行 api-server 函数。详见下文\n\t\t\treturn Run(completedOptions, stopCh)\n\t\t},\n```\n\n<br>\n\n#### 3.1 completedOptions, err := Complete(s)\n\ncompletedOptions 和 s 都是 ServerRunOptions。Complete 主要是通过默认的配置补全 s，同时还会做一些限制，比如 Etcd.StorageConfig.DeserializationCacheSize>=1000。如果用户设置了小于1000的值，这里会自动改为1000。\n\n```\nif s.Etcd.StorageConfig.DeserializationCacheSize < 1000 {\n\t\t\ts.Etcd.StorageConfig.DeserializationCacheSize = 1000\n\t\t}\n```\n\n<br>\n\n#### 3.2 validate\n\n分组件validate，主要验证每个组件是否缺失一些重要的参数，以及参数是否符合规范等。\n\n```\n// Validate checks ServerRunOptions and return a slice of found errors.\nfunc (s *ServerRunOptions) Validate() []error {\n\tvar errors []error\n\tif errs := s.Etcd.Validate(); len(errs) > 0 {\n\t\terrors = append(errors, errs...)\n\t}\n\tif errs := validateClusterIPFlags(s); len(errs) > 0 {\n\t\terrors = append(errors, errs...)\n\t}\n\tif errs := validateServiceNodePort(s); len(errs) > 0 {\n\t\terrors = append(errors, errs...)\n\t}\n\tif errs := s.SecureServing.Validate(); len(errs) > 0 {\n\t\terrors = append(errors, errs...)\n\t}\n\tif errs := s.Authentication.Validate(); len(errs) > 0 {\n\t\terrors = append(errors, errs...)\n\t}\n\tif errs := s.Authorization.Validate(); len(errs) > 0 {\n\t\terrors = append(errors, 
errs...)\n\t}\n\tif errs := s.Audit.Validate(); len(errs) > 0 {\n\t\terrors = append(errors, errs...)\n\t}\n\tif errs := s.Admission.Validate(); len(errs) > 0 {\n\t\terrors = append(errors, errs...)\n\t}\n\tif errs := s.InsecureServing.Validate(); len(errs) > 0 {\n\t\terrors = append(errors, errs...)\n\t}\n\tif s.MasterCount <= 0 {\n\t\terrors = append(errors, fmt.Errorf(\"--apiserver-count should be a positive number, but value '%d' provided\", s.MasterCount))\n\t}\n\tif errs := s.APIEnablement.Validate(legacyscheme.Scheme, apiextensionsapiserver.Scheme, aggregatorscheme.Scheme); len(errs) > 0 {\n\t\terrors = append(errors, errs...)\n\t}\n\n\treturn errors\n}\n```\n\n<br>\n\n### 4. 总结\n\nkube-apiserver启动过程分为8个步骤。这里先分析到前两个\n\n（1）资源注册\n\n（2）Cobra命令行参数解析\n\n通过这个分析，了解到了apiserver是如何感知pod，deploy等内置资源的存在\n\n<br>\n\n**参考文档：** Kubernetes源码剖析，郑东旭\n\n"
  },
  {
    "path": "k8s/kube-apiserver/7-kube-apiserver创建APIServer通用配置.md",
    "content": "* [Table of Contents](#table-of-contents)\n    * [1\\. 背景介绍](#1-背景介绍)\n      * [1\\.1 CreateServerChain](#11-createserverchain)\n        * [1\\.1\\.1 函数输入输出](#111-函数输入输出)\n        * [1\\.1\\.2 CreateServerChain 主体](#112-createserverchain-主体)\n        * [1\\.1\\.3  CreateNodeDialer](#113--createnodedialer)\n      * [1\\.2 PrepareRun](#12-preparerun)\n      * [1\\.3 Run](#13-run)\n        * [1\\.3\\.1 NonBlockingRun](#131-nonblockingrun)\n      * [1\\.4 总结](#14-总结)\n    * [2\\. 创建APIServer通用配置](#2-创建apiserver通用配置)\n      * [2\\.1  genericConfig实例化](#21--genericconfig实例化)\n      * [2\\.2 OpenAPI/Swagger配置](#22-openapiswagger配置)\n      * [2\\.3 StorageFactory存储（Etcd）配置](#23-storagefactory存储etcd配置)\n      * [2\\.4 Authentication认证配置](#24-authentication认证配置)\n      * [2\\.5 Authorization授权配置](#25-authorization授权配置)\n      * [2\\.6 Admission准入控制器配置](#26-admission准入控制器配置)\n\n**本章重点：**\n\n介绍kube-apiserver启动过程中第三个步骤-定义通用配置，包含如下配置：\n\n![image-20210225152550545](../images/apiserver-config-1.png)\n\n### 1. 背景介绍\n\n接上文分析，这里直接从Run函数开始分析。这部分主要从代码角度，进行 kube-apiserver第三个流程分析\n\n（1）资源注册。\n\n（2）Cobra命令行参数解析\n\n（3）创建APIServer通用配置\n\n（4）创建APIExtensionsServer\n\n（5）创建KubeAPIServer\n\n（6）创建AggregatorServer\n\n（7）启动HTTP服务。\n\n（8）启动HTTPS服务\n\n**Run**函数可以分为三个部分：\n\n（1）CreateServerChain\n\n（2）PrepareRun\n\n（3）Run \n\n```\n// Run runs the specified APIServer.  
This should never exit.\nfunc Run(completeOptions completedServerRunOptions, stopCh <-chan struct{}) error {\n   // To help debugging, immediately log version\n   glog.Infof(\"Version: %+v\", version.Get())\n\n   server, err := CreateServerChain(completeOptions, stopCh)\n   if err != nil {\n      return err\n   }\n\n   return server.PrepareRun().Run(stopCh)\n}\n```\n\n本节首先对apiserver的整体流程进行介绍\n\n<br>\n\n#### 1.1 CreateServerChain\n\n##### 1.1.1 函数输入输出\n\n**输入：**   completedOptions 完整的配置；  stopCh，退出信号，`stopCh` 最初是 `NewAPIServerCommand()` 中创建的：\n\n```\nstopCh := server.SetupSignalHandler()\n```\n\n很容易看出来这个 channel 跟系统信号量绑定了，即 `Ctrl+c` 或 `kill` 通知程序关闭的时候会 close 这个 channel ，然后调用 `<-stopCh` 的地方就会停止阻塞，做关闭程序需要的一些清理操作实现优雅关闭\n\n```\n// SetupSignalHandler registered for SIGTERM and SIGINT. A stop channel is returned\n// which is closed on one of these signals. If a second signal is caught, the program\n// is terminated with exit code 1.\nfunc SetupSignalHandler() (stopCh <-chan struct{}) {\n\tclose(onlyOneSignalHandler) // panics when called twice\n\n\tstop := make(chan struct{})\n\tc := make(chan os.Signal, 2)\n\tsignal.Notify(c, shutdownSignals...)\n\tgo func() {\n\t\t<-c\n\t\tclose(stop)\n\t\t<-c\n\t\tos.Exit(1) // second signal. Exit directly.\n\t}()\n\n\treturn stop\n}\n```\n\n<br>\n\n**输出：**GenericAPIServer\n\n```\n// GenericAPIServer contains state for a Kubernetes cluster api server.\ntype GenericAPIServer struct {\n\t// discoveryAddresses is used to build cluster IPs for discovery.\n\tdiscoveryAddresses discovery.Addresses\n\n\t// LoopbackClientConfig is a config for a privileged loopback connection to the API server\n\tLoopbackClientConfig *restclient.Config\n\n\t// minRequestTimeout is how short the request timeout can be.  This is used to build the RESTHandler\n\tminRequestTimeout time.Duration\n\n\t// ShutdownTimeout is the timeout used for server shutdown. 
This specifies the timeout before server\n\t// gracefully shutdown returns.\n\tShutdownTimeout time.Duration\n\n\t// legacyAPIGroupPrefixes is used to set up URL parsing for authorization and for validating requests\n\t// to InstallLegacyAPIGroup\n\tlegacyAPIGroupPrefixes sets.String\n\n\t// admissionControl is used to build the RESTStorage that backs an API Group.\n\tadmissionControl admission.Interface\n\n\t// SecureServingInfo holds configuration of the TLS server.\n\tSecureServingInfo *SecureServingInfo\n\n\t// ExternalAddress is the address (hostname or IP and port) that should be used in\n\t// external (public internet) URLs for this GenericAPIServer.\n\tExternalAddress string\n\n\t// Serializer controls how common API objects not in a group/version prefix are serialized for this server.\n\t// Individual APIGroups may define their own serializers.\n\tSerializer runtime.NegotiatedSerializer\n\n\t// \"Outputs\"\n\t// Handler holds the handlers being used by this API server\n\tHandler *APIServerHandler\n\n\t// listedPathProvider is a lister which provides the set of paths to show at /\n\tlistedPathProvider routes.ListedPathProvider\n\n\t// DiscoveryGroupManager serves /apis\n\tDiscoveryGroupManager discovery.GroupManager\n\n\t// Enable swagger and/or OpenAPI if these configs are non-nil.\n\tswaggerConfig *swagger.Config\n\topenAPIConfig *openapicommon.Config\n\n\t// PostStartHooks are each called after the server has started listening, in a separate go func for each\n\t// with no guarantee of ordering between them.  
The map key is a name used for error reporting.\n\t// It may kill the process with a panic if it wishes to by returning an error.\n\tpostStartHookLock      sync.Mutex\n\tpostStartHooks         map[string]postStartHookEntry\n\tpostStartHooksCalled   bool\n\tdisabledPostStartHooks sets.String\n\n\tpreShutdownHookLock    sync.Mutex\n\tpreShutdownHooks       map[string]preShutdownHookEntry\n\tpreShutdownHooksCalled bool\n\n\t// healthz checks\n\thealthzLock    sync.Mutex\n\thealthzChecks  []healthz.HealthzChecker\n\thealthzCreated bool\n\n\t// auditing. The backend is started after the server starts listening.\n\tAuditBackend audit.Backend\n\n\t// Authorizer determines whether a user is allowed to make a certain request. The Handler does a preliminary\n\t// authorization check using the request URI but it may be necessary to make additional checks, such as in\n\t// the create-on-update case\n\tAuthorizer authorizer.Authorizer\n\n\t// enableAPIResponseCompression indicates whether API Responses should support compression\n\t// if the client requests it via Accept-Encoding\n\tenableAPIResponseCompression bool\n\n\t// delegationTarget is the next delegate in the chain. This is never nil.\n\tdelegationTarget DelegationTarget\n\n\t// HandlerChainWaitGroup allows you to wait for all chain handlers finish after the server shutdown.\n\tHandlerChainWaitGroup *utilwaitgroup.SafeWaitGroup\n}\n```\n\n<br>\n\n##### 1.1.2 CreateServerChain 主体\n\n```go\n// CreateServerChain creates the apiservers connected via delegation.\nfunc CreateServerChain(completedOptions completedServerRunOptions, stopCh <-chan struct{}) (*genericapiserver.GenericAPIServer, error) {\n    \n    // 1.创建到节点拨号连接,目的为了和节点交互。在云平台中，则需要安装本机的SSH Key到Kubernetes集群中所有节点上，可通过用户名和私钥，SSH到node节点\n\tnodeTunneler, proxyTransport, err := CreateNodeDialer(completedOptions)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n    // 2. 
配置API Server的Config。\n\tkubeAPIServerConfig, insecureServingInfo, serviceResolver, pluginInitializer, admissionPostStartHook, err := CreateKubeAPIServerConfig(completedOptions, nodeTunneler, proxyTransport)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n    // 3.这里同时还配置了Extension API Server的Config，用于配置用户自己编写的API Server。\n\t// If additional API servers are added, they should be gated.\n\tapiExtensionsConfig, err := createAPIExtensionsConfig(*kubeAPIServerConfig.GenericConfig, kubeAPIServerConfig.ExtraConfig.VersionedInformers, pluginInitializer, completedOptions.ServerRunOptions, completedOptions.MasterCount)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\t// 4.创建APIExtensionsServer\n\tapiExtensionsServer, err := createAPIExtensionsServer(apiExtensionsConfig, genericapiserver.NewEmptyDelegate())\n\tif err != nil {\n\t\treturn nil, err\n\t}\n    \n    // 5.创建kubeapiserver，这里就是定义了 /apis/groups等这些api。\n\tkubeAPIServer, err := CreateKubeAPIServer(kubeAPIServerConfig, apiExtensionsServer.GenericAPIServer, admissionPostStartHook)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\t// otherwise go down the normal path of standing the aggregator up in front of the API server\n\t// this wires up openapi\n\t\n\t// 6. kubeAPIServer prepareRun\n\tkubeAPIServer.GenericAPIServer.PrepareRun()\n\n    // 7. apiExtensionsServer prepareRun\n\t// This will wire up openapi for extension api server\n\tapiExtensionsServer.GenericAPIServer.PrepareRun()\n\n    // 8. 
配置AA config，然后创建AA server。\n\t// aggregator comes last in the chain\n\taggregatorConfig, err := createAggregatorConfig(*kubeAPIServerConfig.GenericConfig, completedOptions.ServerRunOptions, kubeAPIServerConfig.ExtraConfig.VersionedInformers, serviceResolver, proxyTransport, pluginInitializer)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\t\n\t// 9.创建AA server.这里传入了参数 kube-apiserver, apiExtensionServer。\n\taggregatorServer, err := createAggregatorServer(aggregatorConfig, kubeAPIServer.GenericAPIServer, apiExtensionsServer.Informers)\n\tif err != nil {\n\t\t// we don't need special handling for innerStopCh because the aggregator server doesn't create any go routines\n\t\treturn nil, err\n\t}\n\n\tif insecureServingInfo != nil {\n\t\tinsecureHandlerChain := kubeserver.BuildInsecureHandlerChain(aggregatorServer.GenericAPIServer.UnprotectedHandler(), kubeAPIServerConfig.GenericConfig)\n\t\tif err := insecureServingInfo.Serve(insecureHandlerChain, kubeAPIServerConfig.GenericConfig.RequestTimeout, stopCh); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n\n\treturn aggregatorServer.GenericAPIServer, nil\n}\n```\n\ncreateServerChain的主要功能就是定义  各种URI（路径）。这里使用了委托模式，就是先定义 apiExtensionServer, kube-apiserver，但是最后定义AA 。最后返回AA server，并且执行后面的run函数。通过委托模式，就可以执行AA的run，但是 apiExtensionServer, kube-apiserver对应的 rest服务都起来了。\n\n委托模式是软件设计模式中的一项基本技巧。在委托模式中，有两个对象参与处理同一个请求，接受请求的对象将请求委托给另一个对象来处理。委托模式是一项基本技巧，许多其他的模式，如状态模式、策略模式、访问者模式本质上是在更特殊的场合采用了委托模式。委托模式使得我们可以用聚合来替代继承，它还使我们可以模拟mixin。\n\n委托模式参考：https://www.runoob.com/w3cnote/delegate-mode.html\n\n<br>\n\n##### 1.1.3  CreateNodeDialer\n\n函数定义如下：\n\n```\nfunc CreateNodeDialer(s completedServerRunOptions) (tunneler.Tunneler, *http.Transport, error) {\n```\n\n这里关注 **tunneler** 和 **transport**。\n\n**tunneler** 的最终定义如下，在 pkg\\master\\tunneler\\ssh.go。\n\n从该文件其他函数也可以看出来，这个作用就是 通过私钥公钥等信息，和node节点建立了一个通道。\n\n```\ntype SSHTunneler struct {\n\t// Important: Since these two int64 fields are using sync/atomic, they have to be at the top of the struct due to 
a bug on 32-bit platforms
	// See: https://golang.org/pkg/sync/atomic/ for more information
	lastSync       int64 // Seconds since Epoch
	lastSSHKeySync int64 // Seconds since Epoch

	SSHUser        string
	SSHKeyfile     string
	InstallSSHKey  InstallSSHKey
	HealthCheckURL *url.URL

	tunnels        *ssh.SSHTunnelList
	lastSyncMetric prometheus.GaugeFunc
	clock          clock.Clock

	getAddresses AddressFunc
	stopChan     chan struct{}
}


// Run establishes tunnel loops and returns
func (c *SSHTunneler) Run(getAddresses AddressFunc) {
	if c.stopChan != nil {
		return
	}
	c.stopChan = make(chan struct{})

	// Save the address getter
	if getAddresses != nil {
		c.getAddresses = getAddresses
	}

	// Usernames are capped @ 32
	if len(c.SSHUser) > 32 {
		glog.Warning("SSH User is too long, truncating to 32 chars")
		c.SSHUser = c.SSHUser[0:32]
	}
	glog.Infof("Setting up proxy: %s %s", c.SSHUser, c.SSHKeyfile)

	// public keyfile is written last, so check for that.
	publicKeyFile := c.SSHKeyfile + ".pub"
	exists, err := utilfile.FileExists(publicKeyFile)
	if err != nil {
		glog.Errorf("Error detecting if key exists: %v", err)
	} else if !exists {
		glog.Infof("Key doesn't exist, attempting to create")
		if err := generateSSHKey(c.SSHKeyfile, publicKeyFile); err != nil {
			glog.Errorf("Failed to create key pair: %v", err)
		}
	}

	c.tunnels = ssh.NewSSHTunnelList(c.SSHUser, c.SSHKeyfile, c.HealthCheckURL, c.stopChan)
	// Sync loop to ensure that the SSH key has been installed.
	c.lastSSHKeySync = c.clock.Now().Unix()
	c.installSSHKeySyncLoop(c.SSHUser, publicKeyFile)
	// Sync tunnelList w/ nodes.
	c.lastSync = c.clock.Now().Unix()
	c.nodesSyncLoop()
}
```

<br>

**transport** 则是基于该 tunneler 构造的 `http.Transport`，apiserver 后续通过它与节点上的服务（例如 kubelet）通信。整体来看，CreateNodeDialer 的作用就是为 apiserver 建立与 node 节点交互的通道。

<br>

#### 1.2 PrepareRun

定义一些 
服务器端的接口和处理函数。从名字也可以看出来，PrepareRun 就是在正式运行前补全一些接口。

```go
// PrepareRun does post API installation setup steps.
func (s *GenericAPIServer) PrepareRun() preparedGenericAPIServer {
	if s.swaggerConfig != nil {
		routes.Swagger{Config: s.swaggerConfig}.Install(s.Handler.GoRestfulContainer)
	}
	
	// 1.安装OpenAPI,就是定义openapi接口
	if s.openAPIConfig != nil {
		routes.OpenAPI{
			Config: s.openAPIConfig,
		}.Install(s.Handler.GoRestfulContainer, s.Handler.NonGoRestfulMux)
	}
    
    // 2. 安装 Healthz接口,就是定义healthz健康检查接口
	s.installHealthz()


	// Register audit backend preShutdownHook.
	if s.AuditBackend != nil {
		s.AddPreShutdownHook("audit-backend", func() error {
			s.AuditBackend.Shutdown()
			return nil
		})
	}

	return preparedGenericAPIServer{s}
}
```

<br>

openAPIConfig.Install函数如下：

```go
// Install adds the SwaggerUI webservice to the given mux.
func (oa OpenAPI) Install(c *restful.Container, mux *mux.PathRecorderMux) {
	// NOTE: [DEPRECATION] We will announce deprecation for format-separated endpoints for OpenAPI spec,
	// and switch to a single /openapi/v2 endpoint in Kubernetes 1.10. The design doc and deprecation process
	// are tracked at: https://docs.google.com/document/d/19lEqE9lc4yHJ3WJAJxS_G7TcORIJXGHyq3wpwcH28nU.
	_, err := handler.BuildAndRegisterOpenAPIService("/swagger.json", c.RegisteredWebServices(), oa.Config, mux)
	if err != nil {
		glog.Fatalf("Failed to register open api spec for root: %v", err)
	}
	_, err = handler.BuildAndRegisterOpenAPIVersionedService("/openapi/v2", c.RegisteredWebServices(), oa.Config, mux)
	if err != nil {
		glog.Fatalf("Failed to register versioned open api spec for root: %v", err)
	}
}
```

installHealthz()最终会调用InstallHandler函数。可以看出来，这里就是定义一个 URI 和对应的处理函数。

```go
// InstallHandler registers handlers for health checking on the path
// "/healthz" to mux. *All handlers* for mux must be specified in
// exactly one call to InstallHandler. 
Calling InstallHandler more\n// than once for the same mux will result in a panic.\nfunc InstallHandler(mux mux, checks ...HealthzChecker) {\n\tInstallPathHandler(mux, \"/healthz\", checks...)\n}\n```\n\n<br>\n\n#### 1.3 Run\n\n这里看起来和1.3.1有关联。先分析1.3.1。\n\n```go\n// Run spawns the secure http server. It only returns if stopCh is closed\n// or the secure port cannot be listened on initially.\nfunc (s preparedGenericAPIServer) Run(stopCh <-chan struct{}) error {\n\terr := s.NonBlockingRun(stopCh)\n\tif err != nil {\n\t\treturn err\n\t}\n\n\t<-stopCh\n\n\terr = s.RunPreShutdownHooks()\n\tif err != nil {\n\t\treturn err\n\t}\n\n\t// Wait for all requests to finish, which are bounded by the RequestTimeout variable.\n\ts.HandlerChainWaitGroup.Wait()\n\n\treturn nil\n}\n```\n\n<br>\n\n##### 1.3.1 NonBlockingRun\n\n`s.NonBlockingRun` 的主要逻辑为：\n\n- 1、判断是否要启动审计日志服务；\n- 2、调用 `s.SecureServingInfo.Serve` 配置并启动 https server；\n- 3、执行 postStartHooks；\n- 4、向 systemd 发送 ready 信号；\n\n```go\n// NonBlockingRun spawns the secure http server. An error is\n// returned if the secure port cannot be listened on.\nfunc (s preparedGenericAPIServer) NonBlockingRun(stopCh <-chan struct{}) error {\n\t// Use an stop channel to allow graceful shutdown without dropping audit events\n\t// after http server shutdown.\n\tauditStopCh := make(chan struct{})\n    \n    // 1、判断是否要启动审计日志\n\t// Start the audit backend before any request comes in. This means we must call Backend.Run\n\t// before http server start serving. 
Otherwise the Backend.ProcessEvents call might block.
	if s.AuditBackend != nil {
		if err := s.AuditBackend.Run(auditStopCh); err != nil {
			return fmt.Errorf("failed to run the audit backend: %v", err)
		}
	}

	// Use an internal stop channel to allow cleanup of the listeners on error.
	internalStopCh := make(chan struct{})

    // 2、启动 https server
	if s.SecureServingInfo != nil && s.Handler != nil {
		if err := s.SecureServingInfo.Serve(s.Handler, s.ShutdownTimeout, internalStopCh); err != nil {
			close(internalStopCh)
			return err
		}
	}

	// Now that listener have bound successfully, it is the
	// responsibility of the caller to close the provided channel to
	// ensure cleanup.
	go func() {
		<-stopCh
		close(internalStopCh)
		s.HandlerChainWaitGroup.Wait()
		close(auditStopCh)
	}()
   
    // 3、执行 postStartHooks；
	s.RunPostStartHooks(stopCh)
    
    // 4、向 systemd 发送 ready 信号
	if _, err := systemd.SdNotify(true, "READY=1\n"); err != nil {
		glog.Errorf("Unable to send systemd daemon successful start message: %v\n", err)
	}

	return nil
}
```

其中第二步 `s.SecureServingInfo.Serve` 最终会启动 https 服务。

<br>

#### 1.4 总结

从代码整体流程看，在命令行参数初始化之后，主要运行了 CreateServerChain 这个关键函数。它定义了委托链，可以认为就是定义了 URL 和对应的处理函数。

然后 PrepareRun、Run 负责启动服务。

本节要介绍的创建APIServer通用配置，就是CreateServerChain的第一个操作。

<br>

### 2. 
创建APIServer通用配置

APIServer通用配置是kube-apiserver不同模块实例化所需的配置，APIServer通用配置流程如下图所示。

![image-20210225152550545](../images/apiserver-config-1.png)



这里再回到CreateServerChain。CreateServerChain函数中先定义了通用配置，再启动各个apiserver。

```go
// CreateServerChain creates the apiservers connected via delegation.
func CreateServerChain(completedOptions completedServerRunOptions, stopCh <-chan struct{}) (*genericapiserver.GenericAPIServer, error) {
	nodeTunneler, proxyTransport, err := CreateNodeDialer(completedOptions)
	if err != nil {
		return nil, err
	}

	kubeAPIServerConfig, insecureServingInfo, serviceResolver, pluginInitializer, admissionPostStartHook, err := CreateKubeAPIServerConfig(completedOptions, nodeTunneler, proxyTransport)
	if err != nil {
		return nil, err
	}
	....
}



// CreateKubeAPIServerConfig creates all the resources for running the API server, but runs none of them
func CreateKubeAPIServerConfig(
	s completedServerRunOptions,
	nodeTunneler tunneler.Tunneler,
	proxyTransport *http.Transport,
) (
	config *master.Config,
	insecureServingInfo *genericapiserver.DeprecatedInsecureServingInfo,
	serviceResolver aggregatorapiserver.ServiceResolver,
	pluginInitializers []admission.PluginInitializer,
	admissionPostStartHook genericapiserver.PostStartHookFunc,
	lastErr error,
) {
	var genericConfig *genericapiserver.Config
	var storageFactory *serverstorage.DefaultStorageFactory
	var sharedInformers informers.SharedInformerFactory
	var versionedInformers clientgoinformers.SharedInformerFactory
	genericConfig, sharedInformers, versionedInformers, insecureServingInfo, serviceResolver, pluginInitializers, admissionPostStartHook, storageFactory, lastErr = buildGenericConfig(s.ServerRunOptions, proxyTransport)
	if lastErr != nil {
		return
	}
}

// 最终是 buildGenericConfig 生成了这些配置
// BuildGenericConfig takes the master server options and produces the genericapiserver.Config associated with it
func 
buildGenericConfig(\n\ts *options.ServerRunOptions,\n\tproxyTransport *http.Transport,\n) (\n\tgenericConfig *genericapiserver.Config,\n\tsharedInformers informers.SharedInformerFactory,\n\tversionedInformers clientgoinformers.SharedInformerFactory,\n\tinsecureServingInfo *genericapiserver.DeprecatedInsecureServingInfo,\n\tserviceResolver aggregatorapiserver.ServiceResolver,\n\tpluginInitializers []admission.PluginInitializer,\n\tadmissionPostStartHook genericapiserver.PostStartHookFunc,\n\tstorageFactory *serverstorage.DefaultStorageFactory,\n\tlastErr error,\n) {\n   \n    // 1. 生成 genericConfig,  用于决定k8s开启哪些资源\n\tgenericConfig = genericapiserver.NewConfig(legacyscheme.Codecs)\n\tgenericConfig.MergedResourceConfig = master.DefaultAPIResourceConfigSource()\n\n    // 2. 生成 OpenAPIConfig/swaggerConfig    swaggerapi 展示\n\tgenericConfig.OpenAPIConfig = genericapiserver.DefaultOpenAPIConfig(generatedopenapi.GetOpenAPIDefinitions, openapinamer.NewDefinitionNamer(legacyscheme.Scheme, extensionsapiserver.Scheme, aggregatorscheme.Scheme))\n\tgenericConfig.OpenAPIConfig.PostProcessSpec = postProcessOpenAPISpecForBackwardCompatibility\n\tgenericConfig.OpenAPIConfig.Info.Title = \"Kubernetes\"\n\tgenericConfig.SwaggerConfig = genericapiserver.DefaultSwaggerConfig()\n\t\n\t// 3. StorageFactory存储（Etcd）配置\n\tstorageFactoryConfig := kubeapiserver.NewStorageFactoryConfig()\n\tstorageFactoryConfig.ApiResourceConfig = genericConfig.MergedResourceConfig\n\tcompletedStorageFactoryConfig, err := storageFactoryConfig.Complete(s.Etcd, s.StorageSerialization)\n\tif err != nil {\n\t\tlastErr = err\n\t\treturn\n\t}\n\tstorageFactory, lastErr = completedStorageFactoryConfig.New()\n\n    // 4. auth认证\n\tgenericConfig.Authentication.Authenticator, genericConfig.OpenAPIConfig.SecurityDefinitions, err = BuildAuthenticator(s, clientgoExternalClient, sharedInformers)\n\tif err != nil {\n\t\tlastErr = fmt.Errorf(\"invalid authentication config: %v\", err)\n\t\treturn\n\t}\n   \n    // 5. 
auth授权\n\tgenericConfig.Authorization.Authorizer, genericConfig.RuleResolver, err = BuildAuthorizer(s, versionedInformers)\n\t\n\t\n   // 6. admission准入控制配置\n    pluginInitializers, admissionPostStartHook, err = BuildAdmissionPluginInitializers(\n\t\ts,\n\t\tclient,\n\t\tsharedInformers,\n\t\tserviceResolver,\n\t\twebhookAuthResolverWrapper,\n\t)\n\tif err != nil {\n\t\tlastErr = fmt.Errorf(\"failed to create admission plugin initializer: %v\", err)\n\t\treturn\n\t}\n\n\terr = s.Admission.ApplyTo(\n\t\tgenericConfig,\n\t\tversionedInformers,\n\t\tkubeClientConfig,\n\t\tlegacyscheme.Scheme,\n\t\tpluginInitializers...)\n\n\t\n\treturn\n}\n```\n\n#### 2.1  genericConfig实例化\n\nCreateServerChain  -> CreateKubeAPIServerConfig \n\nbuildGenericConfig 生成以下对象\n\n* genericConfig： 生成通用参数, 用于决定k8s开启哪些资源\n\n* versionedInformers:   client-go的sharedInformerFactory \n\n* insecureServingInfo: 用于开启http服务，高版本这个已经不支持\n\n* serviceResolver:  内部服务的dns解析器\n\n* pluginInitializers：admission-control参数下面指定的plugin\n\n --admission-control=NamespaceLifecycle,NamespaceExists,LimitRanger,ServiceAccount,ResourceQuota,DefaultStorageClass,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,EventRateLimit\n\n* admissionPostStartHook \n\n  MutatingAdmissionWebhook,ValidatingAdmissionWebhook的PostStartHook \n\n* storageFactory:  生成etcd storageFactory\n\n<br>\n\n其中genericConfig.MergedResourceConfig用于设置启用/禁用GV（资源组、资源版本）及其Resource （资源）。如果未在命令行参数中指定启用/禁用的GV，则通过master.DefaultAPIResourceConfigSource启用默认设置的GV及其资源。master.DefaultAPIResourceConfigSource将启用资源版本为Stable和Beta的资源，默认不启用Alpha资源版本的资源。通过EnableVersions函数启用指定资源，而通过DisableVersions函数禁用指定资源，代码示例如下：\n\n```go\n  // 1. 
生成 genericConfig,  用于决定k8s开启哪些资源\n\tgenericConfig = genericapiserver.NewConfig(legacyscheme.Codecs)\n\tgenericConfig.MergedResourceConfig = master.DefaultAPIResourceConfigSource()\n\n\npkg\\master\\master.go\nfunc DefaultAPIResourceConfigSource() *serverstorage.ResourceConfig {\n\tret := serverstorage.NewResourceConfig()\n\t// NOTE: GroupVersions listed here will be enabled by default. Don't put alpha versions in the list.\n\tret.EnableVersions(\n\t\tadmissionregistrationv1beta1.SchemeGroupVersion,\n\t\tapiv1.SchemeGroupVersion,\n\t\tappsv1beta1.SchemeGroupVersion,\n\t\tappsv1beta2.SchemeGroupVersion,\n\t\tappsv1.SchemeGroupVersion,\n\t\tauthenticationv1.SchemeGroupVersion,\n\t\tauthenticationv1beta1.SchemeGroupVersion,\n\t\tauthorizationapiv1.SchemeGroupVersion,\n\t\tauthorizationapiv1beta1.SchemeGroupVersion,\n\t\tautoscalingapiv1.SchemeGroupVersion,\n\t\tautoscalingapiv2beta1.SchemeGroupVersion,\n\t\tautoscalingapiv2beta2.SchemeGroupVersion,\n\t\tbatchapiv1.SchemeGroupVersion,\n\t\tbatchapiv1beta1.SchemeGroupVersion,\n\t\tcertificatesapiv1beta1.SchemeGroupVersion,\n\t\tcoordinationapiv1beta1.SchemeGroupVersion,\n\t\teventsv1beta1.SchemeGroupVersion,\n\t\textensionsapiv1beta1.SchemeGroupVersion,\n\t\tnetworkingapiv1.SchemeGroupVersion,\n\t\tpolicyapiv1beta1.SchemeGroupVersion,\n\t\trbacv1.SchemeGroupVersion,\n\t\trbacv1beta1.SchemeGroupVersion,\n\t\tstorageapiv1.SchemeGroupVersion,\n\t\tstorageapiv1beta1.SchemeGroupVersion,\n\t\tschedulingapiv1beta1.SchemeGroupVersion,\n\t)\n\t// disable alpha versions explicitly so we have a full list of what's possible to serve\n\tret.DisableVersions(\n\t\tadmissionregistrationv1alpha1.SchemeGroupVersion,\n\t\tbatchapiv2alpha1.SchemeGroupVersion,\n\t\trbacv1alpha1.SchemeGroupVersion,\n\t\tschedulingv1alpha1.SchemeGroupVersion,\n\t\tsettingsv1alpha1.SchemeGroupVersion,\n\t\tstorageapiv1alpha1.SchemeGroupVersion,\n\t)\n\n\treturn ret\n}\n```\n\n<br>\n\napiserver 通过  --runtime-config 指定支持哪些内置资源。 一般都是api/all=true\n\n```\n    
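# 示例（假设的启动参数写法，仅作说明）：
#   kube-apiserver ... --runtime-config=api/all=true,batch/v2alpha1=true
# 表示启用全部默认开启的 GA/Beta 版本，并额外启用 batch/v2alpha1。该参数的帮助说明如下：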
--runtime-config mapStringString\n                A set of key=value pairs that enable or disable built-in APIs. Supported options are:\n                v1=true|false for the core API group\n                <group>/<version>=true|false for a specific API group and version (e.g. apps/v1=true)\n                api/all=true|false controls all API versions\n                api/ga=true|false controls all API versions of the form v[0-9]+\n                api/beta=true|false controls all API versions of the form v[0-9]+beta[0-9]+\n                api/alpha=true|false controls all API versions of the form v[0-9]+alpha[0-9]+\n                api/legacy is deprecated, and will be removed in a future version\n```\n\n#### 2.2 OpenAPI/Swagger配置\n\n```go\n// 2. 生成 OpenAPIConfig/swaggerConfig    swaggerapi 展示\n\tgenericConfig.OpenAPIConfig = genericapiserver.DefaultOpenAPIConfig(generatedopenapi.GetOpenAPIDefinitions, openapinamer.NewDefinitionNamer(legacyscheme.Scheme, extensionsapiserver.Scheme, aggregatorscheme.Scheme))\n\tgenericConfig.OpenAPIConfig.PostProcessSpec = postProcessOpenAPISpecForBackwardCompatibility\n\tgenericConfig.OpenAPIConfig.Info.Title = \"Kubernetes\"\n\tgenericConfig.SwaggerConfig = genericapiserver.DefaultSwaggerConfig()\n```\n\ngenericConfig.OpenAPIConfig用于生成OpenAPI规范。在默认的情况下，通过DefaultOpenAPIConfig函数为其设置默认值，代码示例如下：\n\n```go\nfunc DefaultOpenAPIConfig(getDefinitions openapicommon.GetOpenAPIDefinitions, defNamer *apiopenapi.DefinitionNamer) *openapicommon.Config {\n\treturn &openapicommon.Config{\n\t\tProtocolList:   []string{\"https\"},\n\t\tIgnorePrefixes: []string{\"/swaggerapi\"},\n\t\tInfo: &spec.Info{\n\t\t\tInfoProps: spec.InfoProps{\n\t\t\t\tTitle: \"Generic API Server\",\n\t\t\t},\n\t\t},\n\t\tDefaultResponse: &spec.Response{\n\t\t\tResponseProps: spec.ResponseProps{\n\t\t\t\tDescription: \"Default Response.\",\n\t\t\t},\n\t\t},\n\t\tGetOperationIDAndTags: apiopenapi.GetOperationIDAndTags,\n\t\tGetDefinitionName:     
defNamer.GetDefinitionName,\n\t\tGetDefinitions:        getDefinitions,\n\t}\n}\n```\n\n其中 generatedopenapi.GetOpenAPIDefinitions 定义了OpenAPIDefinition文件（OpenAPI定义文件），该文件由openapi-gen代码生成器自动生成。\n\n<br>\n\n这里需要注意的是，从v1.14版本开始，官方已经抛弃了swagger接口，使用的是openapi规范，暴露的是 /openapi/v2。直接可以通过\n\nhttp://master-ip:apiserver-port/openapi/v2 查看\n\n#### 2.3 StorageFactory存储（Etcd）配置\n\nkube-apiserver组件使用Etcd作为Kubernetes系统集群的存储，系统中所有资源信息、集群状态、配置信息等都存储于Etcd中，代码示例如下：\n\n```go\n// 3. StorageFactory存储（Etcd）配置\n\tstorageFactoryConfig := kubeapiserver.NewStorageFactoryConfig()\n\tstorageFactoryConfig.ApiResourceConfig = genericConfig.MergedResourceConfig\n\tcompletedStorageFactoryConfig, err := storageFactoryConfig.Complete(s.Etcd, s.StorageSerialization)\n\tif err != nil {\n\t\tlastErr = err\n\t\treturn\n\t}\n\tstorageFactory, lastErr = completedStorageFactoryConfig.New()\n\n\n// Complete completes the StorageFactoryConfig with provided etcdOptions returning completedStorageFactoryConfig.\nfunc (c *StorageFactoryConfig) Complete(etcdOptions *serveroptions.EtcdOptions) (*completedStorageFactoryConfig, error) {\n\tc.StorageConfig = etcdOptions.StorageConfig\n\tc.DefaultStorageMediaType = etcdOptions.DefaultStorageMediaType\n\tc.EtcdServersOverrides = etcdOptions.EtcdServersOverrides\n\tc.EncryptionProviderConfigFilepath = etcdOptions.EncryptionProviderConfigFilepath\n\treturn &completedStorageFactoryConfig{c}, nil\n}\n```\n\nkubeapiserver.NewStorageFactoryConfig函数实例化了storageFactoryConfig对象，该对象定义了kube-apiserver与Etcd的交互方式，例如Etcd认证、Etcd地址（--etcd-servers）、存储前缀（ --etcd-prefix参数）等。另外，该对象也定义了资源存储方式，例如资源信息、资源编码类型、资源状态等。\n\n<br>\n\n#### 2.4 Authentication认证配置\n\nkube-apiserver作为Kubernetes集群的请求入口，接收组件与客户端的访问请求，每个请求都需要经过认证（Authentication）、授权（Authorization）及准入控制器（Admission 
Controller）3个阶段，之后才能真正地操作资源。

kube-apiserver目前提供了9种认证机制，分别是BasicAuth、ClientCA、TokenAuth、BootstrapToken、RequestHeader、WebhookTokenAuth、Anonymous、OIDC、ServiceAccountAuth。每一种认证机制被实例化后会成为认证器（Authenticator），每一个认证器都被封装在http.Handler请求处理函数中，它们接收组件或客户端的请求并认证请求。kube-apiserver通过BuildAuthenticator函数实例化认证器，代码示例如下：

```go
    // 4. auth认证
	genericConfig.Authentication.Authenticator, genericConfig.OpenAPIConfig.SecurityDefinitions, err = BuildAuthenticator(s, clientgoExternalClient, sharedInformers)
	if err != nil {
		lastErr = fmt.Errorf("invalid authentication config: %v", err)
		return
	}
```

BuildAuthenticator函数会生成认证器。在该函数中，首先生成认证器的配置文件，然后调用authenticatorConfig.New函数实例化认证器。认证实例化流程如下图所示。

![image-20210225155241127](../images/apiserver-auth-1.png)

<br>

authenticatorConfig.New函数在实例化认证器的过程中，会根据认证的配置信息（由flags命令行参数传入）决定是否启用认证方法，并对启用的认证方法生成对应的HTTP Handler函数，最后通过union函数将已启用的认证器合并到authenticators数组对象中（授权器的合并方式与之类似，见2.5节），代码示意如下：

```
authorizationConfig.New() = 
union.New(authorizers...), union.NewRuleResolvers(ruleResolvers...)
```

authenticators中存放的是已启用的认证器列表。union.New函数将authenticators合并成一个authenticator认证器，实际上将认证器列表存放在union结构的Handlers []authenticator.Request对象中。当客户端请求到达kube-apiserver时，kube-apiserver会遍历认证器列表，尝试执行每个认证器，当有一个认证器返回true时，则认证成功。

<br>

Authentication可以通过下面的参数配置，开启上述的认证。例如：--authentication-token-webhook-config-file 指定认证的webhook配置。一般是和公司的权限认证相关。

```
Authentication flags:

      --anonymous-auth
                Enables anonymous requests to the secure port of the API server. Requests that are not rejected by another authentication method are treated as anonymous
                requests. Anonymous requests have a username of system:anonymous, and a group name of system:unauthenticated. (default true)
      --api-audiences strings
                Identifiers of the API. The service account token authenticator will validate that tokens used against the API are bound to at least one of these audiences.
                If the --service-account-issuer flag is configured and this flag is not, this field defaults to a single element list containing the issuer URL .
      --authentication-token-webhook-cache-ttl duration
                The duration to cache responses from the webhook token authenticator. (default 2m0s)
      --authentication-token-webhook-config-file string
                File with webhook configuration for token authentication in kubeconfig format. The API server will query the remote service to determine authentication for
                bearer tokens.
      --authentication-token-webhook-version string
                The API version of the authentication.k8s.io TokenReview to send to and expect from the webhook. 
(default \"v1beta1\")\n      --client-ca-file string\n                If set, any request presenting a client certificate signed by one of the authorities in the client-ca-file is authenticated with an identity corresponding\n                to the CommonName of the client certificate.\n      --enable-bootstrap-token-auth\n                Enable to allow secrets of type 'bootstrap.kubernetes.io/token' in the 'kube-system' namespace to be used for TLS bootstrapping authentication.\n      --oidc-ca-file string\n                If set, the OpenID server's certificate will be verified by one of the authorities in the oidc-ca-file, otherwise the host's root CA set will be used.\n      --oidc-client-id string\n                The client ID for the OpenID Connect client, must be set if oidc-issuer-url is set.\n      --oidc-groups-claim string\n                If provided, the name of a custom OpenID Connect claim for specifying user groups. The claim value is expected to be a string or array of strings. This flag\n                is experimental, please see the authentication documentation for further details.\n      --oidc-groups-prefix string\n                If provided, all groups will be prefixed with this value to prevent conflicts with other authentication strategies.\n      --oidc-issuer-url string\n                The URL of the OpenID issuer, only HTTPS scheme will be accepted. If set, it will be used to verify the OIDC JSON Web Token (JWT).\n      --oidc-required-claim mapStringString\n                A key=value pair that describes a required claim in the ID Token. If set, the claim is verified to be present in the ID Token with a matching value. Repeat\n                this flag to specify multiple claims.\n      --oidc-signing-algs strings\n                Comma-separated list of allowed JOSE asymmetric signing algorithms. JWTs with a 'alg' header value not in this list will be rejected. 
Values are defined by\n                RFC 7518 https://tools.ietf.org/html/rfc7518#section-3.1. (default [RS256])\n      --oidc-username-claim string\n                The OpenID claim to use as the user name. Note that claims other than the default ('sub') is not guaranteed to be unique and immutable. This flag is\n                experimental, please see the authentication documentation for further details. (default \"sub\")\n      --oidc-username-prefix string\n                If provided, all usernames will be prefixed with this value. If not provided, username claims other than 'email' are prefixed by the issuer URL to avoid\n                clashes. To skip any prefixing, provide the value '-'.\n      --requestheader-allowed-names strings\n                List of client certificate common names to allow to provide usernames in headers specified by --requestheader-username-headers. If empty, any client\n                certificate validated by the authorities in --requestheader-client-ca-file is allowed.\n      --requestheader-client-ca-file string\n                Root certificate bundle to use to verify client certificates on incoming requests before trusting usernames in headers specified by\n                --requestheader-username-headers. WARNING: generally do not depend on authorization being already done for incoming requests.\n      --requestheader-extra-headers-prefix strings\n                List of request header prefixes to inspect. X-Remote-Extra- is suggested.\n      --requestheader-group-headers strings\n                List of request headers to inspect for groups. X-Remote-Group is suggested.\n      --requestheader-username-headers strings\n                List of request headers to inspect for usernames. X-Remote-User is common.\n      --service-account-issuer string\n                Identifier of the service account token issuer. The issuer will assert this identifier in \"iss\" claim of issued tokens. 
This value is a string or URI.
      --service-account-key-file stringArray
                File containing PEM-encoded x509 RSA or ECDSA private or public keys, used to verify ServiceAccount tokens. The specified file can contain multiple keys,
                and the flag can be specified multiple times with different files. If unspecified, --tls-private-key-file is used. Must be specified when
                --service-account-signing-key is provided
      --service-account-lookup
                If true, validate ServiceAccount tokens exist in etcd as part of authentication. (default true)
      --service-account-max-token-expiration duration
                The maximum validity duration of a token created by the service account token issuer. If an otherwise valid TokenRequest with a validity duration larger
                than this value is requested, a token will be issued with a validity duration of this value.
      --token-auth-file string
                If set, the file that will be used to secure the secure port of the API server via token authentication.
```

<br>

#### 2.5 Authorization授权配置

认证和授权的区别在于：张三发来了一个删除pod的请求。认证：证明张三是张三；授权：张三是master，有权限删除这个pod。

在Kubernetes系统组件或客户端请求通过认证阶段之后，会来到授权阶段。kube-apiserver同样支持多种授权机制，并支持同时开启多个授权功能，客户端发起一个请求，在经过授权阶段时，只要有一个授权器通过则授权成功。kube-apiserver目前提供了6种授权机制，分别是AlwaysAllow、AlwaysDeny、Webhook、Node、ABAC、RBAC。每一种授权机制被实例化后会成为授权器（Authorizer），每一个授权器都被封装在http.Handler请求处理函数中，它们接收组件或客户端的请求并授权请求。kube-apiserver通过BuildAuthorizer函数实例化授权器，代码示例如下：

```
// 5. 
auth授权\ngenericConfig.Authorization.Authorizer, genericConfig.RuleResolver, err = BuildAuthorizer(s, versionedInformers)\n```\n\nBuildAuthorizer函数会生成授权器。在该函数中，首先生成授权器的配置文件，然后调用authorizationConfig.New函数实例化授权器。授权器实例化流程如下图所示。\n\n![image-20210225160544222](../images/apiserver-auth-2.png)\n\nauthorizationConfig.New函数在实例化授权器的过程中，会根据--authorization-mode参数的配置信息（由flags命令行参数传入）决定是否启用授权方法，并对启用的授权方法生成对应的HTTP Handler函数，最后通过union函数将已启用的授权器合并到authorizers数组对象中，代码示例如下：\n\n```\n// BuildAuthorizer constructs the authorizer\nfunc BuildAuthorizer(s *options.ServerRunOptions, versionedInformers clientgoinformers.SharedInformerFactory) (authorizer.Authorizer, authorizer.RuleResolver, error) {\n\tauthorizationConfig := s.Authorization.ToAuthorizationConfig(versionedInformers)\n\treturn authorizationConfig.New()\n}\n\nauthorizationConfig.New()\nreturn union.New(authorizers...), union.NewRuleResolvers(ruleResolvers...), nil\n```\n\nauthorizers中存放的是已启用的授权器列表，ruleResolvers中存放的是已启用的授权器规则解析器，实际上分别将它们存放在union结构的[]authorizer.Authorizer和[]authorizer.RuleResolver对象中。当客户端请求到达kube-apiserver时，kube-apiserver会遍历授权器列表，并按照顺序执行授权器，排在前面的授权器具有更高的优先级（允许或拒绝请求）。客户端发起一个请求，在经过授权阶段时，只要有一个授权器通过，则授权成功。\n\n<br>\n\n**Kube-apiserver**通过--authorization-mode指定支持哪几种授权模式。\n\n**注意，这个是访问安全端口时候用到的，如果访问的是非安全端口，是不用通过授权验证的！！！**\n\n```\n--authorization-mode strings\n                Ordered list of plug-ins to do authorization on secure port. Comma-delimited list of: AlwaysAllow,AlwaysDeny,ABAC,Webhook,RBAC,Node. 
(default [AlwaysAllow])\n```\n\n<br>\n\n#### 2.6 Admission准入控制器配置\n\nKubernetes系统组件或客户端请求通过授权阶段之后，会来到准入控制器阶段，它会在认证和授权请求之后、对象被持久化之前，拦截kube-apiserver的请求，拦截后的请求进入准入控制器中处理，对请求的资源对象进行自定义（校验、修改或拒绝）等操作。kube-apiserver支持多种准入控制器机制，并支持同时开启多个准入控制器功能，如果开启了多个准入控制器，则按照顺序执行准入控制器。\n\nkube-apiserver目前提供了31种准入控制器，分别是AlwaysAdmit、AlwaysDeny、AlwaysPullImages、DefaultStorageClass、DefaultTolerationSeconds、DenyEscalatingExec、DenyExecOnPrivileged、EventRateLimit、ExtendedResourceToleration、ImagePolicyWebhook、LimitPodHardAntiAffinityTopology、LimitRanger、MutatingAdmissionWebhook、NamespaceAutoProvision、NamespaceExists、NamespaceLifecycle、NodeRestriction、OwnerReferencesPermissionEnforcement、PersistentVolumeClaimResize、PersistentVolumeLabel、PodNodeSelector、PodPreset、PodSecurityPolicy、PodTolerationRestriction、Priority、ResourceQuota、SecurityContextDeny、ServiceAccount、StorageObjectInUseProtection、TaintNodesByCondition、ValidatingAdmissionWebhook。\n\n启用准入控制器的代码在这里：\n\n```\n    pluginInitializers, admissionPostStartHook, err = BuildAdmissionPluginInitializers(\n\t\ts,\n\t\tclient,\n\t\tsharedInformers,\n\t\tserviceResolver,\n\t\twebhookAuthResolverWrapper,\n\t)\n\tif err != nil {\n\t\tlastErr = fmt.Errorf(\"failed to create admission plugin initializer: %v\", err)\n\t\treturn\n\t}\n\n\terr = s.Admission.ApplyTo(\n\t\tgenericConfig,\n\t\tversionedInformers,\n\t\tkubeClientConfig,\n\t\tlegacyscheme.Scheme,\n\t\tpluginInitializers...)\n```\n\n<br>\n\nkube-apiserver在启动时注册所有准入控制器，准入控制器通过Plugins数据结构统一注册、存放、管理所有的准入控制器。Plugins数据结构如下：\n\n```\n// Factory is a function that returns an Interface for admission decisions.\n// The config parameter provides an io.Reader handler to the factory in\n// order to load specific configurations. 
If no configuration is provided\n// the parameter is nil.\ntype Factory func(config io.Reader) (Interface, error)\n\ntype Plugins struct {\n\tlock     sync.Mutex\n\tregistry map[string]Factory\n}\n```\n\nPlugins数据结构字段说明如下。\n\n● registry：以键值对形式存放插件，key为准入控制器的名称，例如AlwaysPullImages、LimitRanger等；value为对应准入控制器的代码实现。\n\n● lock：用于保护registry字段的并发一致性。\n\n其中Factory为准入控制器实现的接口定义，它接收准入控制器的config配置信息（通过--admission-control-config-file参数指定准入控制器的配置文件），返回准入控制器的插件实现。Plugins数据结构提供了Register方法，为外部提供了准入控制器的注册方法。\n\nkube-apiserver提供了31种准入控制器，kube-apiserver组件在启动时分别在两个位置注册它们，代码示例如下：\n\n```\n// RegisterAllAdmissionPlugins registers all admission plugins\nfunc RegisterAllAdmissionPlugins(plugins *admission.Plugins) {\n\tlifecycle.Register(plugins)\n\tinitialization.Register(plugins)\n\tvalidatingwebhook.Register(plugins)\n\tmutatingwebhook.Register(plugins)\n}\n```\n\n每个准入控制器都实现了Register方法，通过Register方法可以在Plugins数据结构中注册当前准入控制器。以ImagePolicyWebhook准入控制器为例，注册方法代码示例如下：\n\n```\n// PluginName indicates name of admission plugin.\nconst PluginName = \"ImagePolicyWebhook\"\n\n// AuditKeyPrefix is used as the prefix for all audit keys handled by this\n// pluggin. 
Some well known suffixes are listed below.\nvar AuditKeyPrefix = strings.ToLower(PluginName) + \".image-policy.k8s.io/\"\n\nconst (\n\t// ImagePolicyFailedOpenKeySuffix in an annotation indicates the image\n\t// review failed open when the image policy webhook backend connection\n\t// failed.\n\tImagePolicyFailedOpenKeySuffix string = \"failed-open\"\n\n\t// ImagePolicyAuditRequiredKeySuffix in an annotation indicates the pod\n\t// should be audited.\n\tImagePolicyAuditRequiredKeySuffix string = \"audit-required\"\n)\n\nvar (\n\tgroupVersions = []schema.GroupVersion{v1alpha1.SchemeGroupVersion}\n)\n\n// Register registers a plugin\nfunc Register(plugins *admission.Plugins) {\n\tplugins.Register(PluginName, func(config io.Reader) (admission.Interface, error) {\n\t\tnewImagePolicyWebhook, err := NewImagePolicyWebhook(config)\n\t\tif err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t\treturn newImagePolicyWebhook, nil\n\t})\n}\n```\n\n<br>\n\n这个是通过--admission-control指定，例如：\n\n```\n--admission-control=NamespaceLifecycle,NamespaceExists,LimitRanger,ServiceAccount,ResourceQuota,DefaultStorageClass,Priority,MutatingAdmissionWebhook,ValidatingAdmissionWebhook\n```\n\n<br>\n\n**参考文档：**  Kubernetes源码剖析，郑东旭\n"
  },
  {
    "path": "k8s/kube-apiserver/8-kube-apiserver创建APIExtensionsServer.md",
    "content": "* [Table of Contents](#table-of-contents)\n    * [1\\. 背景回顾](#1-背景回顾)\n    * [2\\. 生成apiExtensionsConfig](#2-生成apiextensionsconfig)\n    * [3\\. createAPIExtensionsServer](#3-createapiextensionsserver)\n      * [3\\.1  创建GenericAPIServer](#31--创建genericapiserver)\n      * [3\\.2  实例化CustomResourceDefinitions](#32--实例化customresourcedefinitions)\n      * [3\\.3 实例化APIGroupInfo](#33-实例化apigroupinfo)\n      * [3\\.4 InstallAPIGroup注册APIGroup](#34-installapigroup注册apigroup)\n      * [3\\.5 启动crdController](#35-启动crdcontroller)\n    * [4总结](#4总结)\n    * [5\\. 参考链接](#5-参考链接)\n\n**本章重点：**分析第四个流程，创建APIExtensionsServer\n\n kube-apiserver整体启动流程如下：\n\n（1）资源注册。\n\n（2）Cobra命令行参数解析\n\n（3）创建APIServer通用配置\n\n（4）创建APIExtensionsServer\n\n（5）创建KubeAPIServer\n\n（6）创建AggregatorServer\n\n（7）启动HTTP服务。\n\n（8）启动HTTPS服务\n\n<br>\n\n### 1. 背景回顾\n\n再次回到 CreateServerChain。在生成配置参数后，CreateServerChain第一个做的就是创建APIExtensionsServer。核心包含2步骤：\n\n（1）生成apiExtensionsConfig\n\n（2）new一个APIExtensionsServer\n\n```go\n// CreateServerChain creates the apiservers connected via delegation.\nfunc CreateServerChain(completedOptions completedServerRunOptions, stopCh <-chan struct{}) (*genericapiserver.GenericAPIServer, error) {\n    \n    // 1.创建到节点拨号连接,目的为了和节点交互。在云平台中，则需要安装本机的SSH Key到Kubernetes集群中所有节点上，可通过用户名和私钥，SSH到node节点\n\tnodeTunneler, proxyTransport, err := CreateNodeDialer(completedOptions)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n    // 2. 配置API Server的Config。\n\tkubeAPIServerConfig, insecureServingInfo, serviceResolver, pluginInitializer, admissionPostStartHook, err := CreateKubeAPIServerConfig(completedOptions, nodeTunneler, proxyTransport)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n    // 3.这里同时还配置了Extension API Server的Config，用于配置用户自己编写的API Server。\n\t// If additional API servers are added, they should be gated.  
从这里深入挖下去\n\tapiExtensionsConfig, err := createAPIExtensionsConfig(*kubeAPIServerConfig.GenericConfig, kubeAPIServerConfig.ExtraConfig.VersionedInformers, pluginInitializer, completedOptions.ServerRunOptions, completedOptions.MasterCount)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\t// 4.创建APIExtensionsServer\n\tapiExtensionsServer, err := createAPIExtensionsServer(apiExtensionsConfig, genericapiserver.NewEmptyDelegate())\n\tif err != nil {\n\t\treturn nil, err\n\t}\n    \n    // 5.创建kubeapiserver，这里就是定义了 /apis/groups等这些api。\n\tkubeAPIServer, err := CreateKubeAPIServer(kubeAPIServerConfig, apiExtensionsServer.GenericAPIServer, admissionPostStartHook)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\t// otherwise go down the normal path of standing the aggregator up in front of the API server\n\t// this wires up openapi\n\t\n\t// 6. kubeAPIServer prepareRun\n\tkubeAPIServer.GenericAPIServer.PrepareRun()\n\n    // 7. apiExtensionsServer prepareRun\n\t// This will wire up openapi for extension api server\n\tapiExtensionsServer.GenericAPIServer.PrepareRun()\n\n    // 8. 配置AA config，然后创建AA server。\n\t// aggregator comes last in the chain\n\taggregatorConfig, err := createAggregatorConfig(*kubeAPIServerConfig.GenericConfig, completedOptions.ServerRunOptions, kubeAPIServerConfig.ExtraConfig.VersionedInformers, serviceResolver, proxyTransport, pluginInitializer)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\t\n\t// 9.创建AA server.这里传入了参数 kube-apiserver, apiExtensionServer。\n\taggregatorServer, err := createAggregatorServer(aggregatorConfig, kubeAPIServer.GenericAPIServer, apiExtensionsServer.Informers)\n\tif err != nil {\n\t\t// we don't need special handling for innerStopCh because the aggregator server doesn't create any go routines\n\t\treturn nil, err\n\t}\n  \n  // 10. 
启动http服务\n\tif insecureServingInfo != nil {\n\t\tinsecureHandlerChain := kubeserver.BuildInsecureHandlerChain(aggregatorServer.GenericAPIServer.UnprotectedHandler(), kubeAPIServerConfig.GenericConfig)\n\t\tif err := insecureServingInfo.Serve(insecureHandlerChain, kubeAPIServerConfig.GenericConfig.RequestTimeout, stopCh); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n\n\treturn aggregatorServer.GenericAPIServer, nil\n}\n```\n\n<br>\n\n### 2. 生成apiExtensionsConfig\n\n可以看到，顺序为：生成kube-apiserver config -> 生成apiExtensionsConfig -> new apiExtensionsServer -> new kube-apiserver。\n\n```\nkubeAPIServerConfig, insecureServingInfo, serviceResolver, pluginInitializer, err := CreateKubeAPIServerConfig(completedOptions, nodeTunneler, proxyTransport)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\t// If additional API servers are added, they should be gated.\n\tapiExtensionsConfig, err := createAPIExtensionsConfig(*kubeAPIServerConfig.GenericConfig, kubeAPIServerConfig.ExtraConfig.VersionedInformers, pluginInitializer, completedOptions.ServerRunOptions, completedOptions.MasterCount,\n\t\tserviceResolver, webhook.NewDefaultAuthenticationInfoResolverWrapper(proxyTransport, kubeAPIServerConfig.GenericConfig.LoopbackClientConfig))\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tapiExtensionsServer, err := createAPIExtensionsServer(apiExtensionsConfig, genericapiserver.NewEmptyDelegate())\n\tif err != nil {\n\t\treturn nil, err\n\t}\n```\n\n原因其实也很好理解：kube-apiserver config是通用的配置，apiExtensionsConfig在此基础上多了ExtraConfig配置，其核心是CRD资源如何和etcd打交道。\n\n```\napiextensionsConfig := &apiextensionsapiserver.Config{\n\t\tGenericConfig: &genericapiserver.RecommendedConfig{\n\t\t\tConfig:                genericConfig,\n\t\t\tSharedInformerFactory: externalInformers,\n\t\t},\n\t\tExtraConfig: apiextensionsapiserver.ExtraConfig{\n\t\t\tCRDRESTOptionsGetter: apiextensionsoptions.NewCRDRESTOptionsGetter(etcdOptions),\n\t\t\tMasterCount:          masterCount,\n\t\t\tAuthResolverWrapper:  
authResolverWrapper,\n\t\t\tServiceResolver:      serviceResolver,\n\t\t},\n\t}\n\t\n// NewCRDRESTOptionsGetter create a RESTOptionsGetter for CustomResources.\nfunc NewCRDRESTOptionsGetter(etcdOptions genericoptions.EtcdOptions) genericregistry.RESTOptionsGetter {\n\tret := apiserver.CRDRESTOptionsGetter{\n\t\tStorageConfig:           etcdOptions.StorageConfig,\n\t\tStoragePrefix:           etcdOptions.StorageConfig.Prefix,\n\t\tEnableWatchCache:        etcdOptions.EnableWatchCache,\n\t\tDefaultWatchCacheSize:   etcdOptions.DefaultWatchCacheSize,\n\t\tEnableGarbageCollection: etcdOptions.EnableGarbageCollection,\n\t\tDeleteCollectionWorkers: etcdOptions.DeleteCollectionWorkers,\n\t\tCountMetricPollPeriod:   etcdOptions.StorageConfig.CountMetricPollPeriod,\n\t}\n\tret.StorageConfig.Codec = unstructured.UnstructuredJSONScheme\n\n\treturn ret\n}\n```\n\n### 3. createAPIExtensionsServer\n\ncreateAPIExtensionsServer核心就是返回一个genericServer。delegateAPIServer是个空的，一开始啥也不干，直到有CRD注册进来才开始工作。它后续将被注册为kube-apiserver的一部分。\n\n核心步骤如下：\n\n（1）创建GenericAPIServer.  APIExtensionsServer的运行依赖于GenericAPIServer，通过c.GenericConfig.New函数创建名为apiextensions-apiserver的服务\n\n（2）实例化CustomResourceDefinitions. APIExtensionsServer（API扩展服务）通过CustomResourceDefinitions对象进行管理，实例化该对象后才能注册APIExtensionsServer下的资源。\n\n（3）实例化APIGroupInfo.\n\n（4）InstallAPIGroup注册APIGroup\n\n（5）启动crdController，处理CRD的创建/修改/删除\n\n```go\nfunc createAPIExtensionsServer(apiextensionsConfig *apiextensionsapiserver.Config, delegateAPIServer genericapiserver.DelegationTarget) (*apiextensionsapiserver.CustomResourceDefinitions, error) {\n   return apiextensionsConfig.Complete().New(delegateAPIServer)\n}\n```\n\nComplete()就是补全了上面的APIExtensionsServer config配置，这里主要关心New函数：\n\n```go\n// New returns a new instance of CustomResourceDefinitions from the given config.\nfunc (c completedConfig) New(delegationTarget genericapiserver.DelegationTarget) (*CustomResourceDefinitions, error) {\n  // 1. 创建GenericAPIServer.  
APIExtensionsServer的运行依赖于GenericAPIServer，通过c.GenericConfig.New函数创建名为apiextensions-apiserver的服务。\n\tgenericServer, err := c.GenericConfig.New(\"apiextensions-apiserver\", delegationTarget)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n  \n  \n  // 2.实例化CustomResourceDefinitions. APIExtensionsServer（API扩展服务）通过CustomResourceDefinitions对象进行管理，实例化该对象后才能注册APIExtensionsServer下的资源。\n\ts := &CustomResourceDefinitions{\n\t\tGenericAPIServer: genericServer,\n\t}\n\n\tapiResourceConfig := c.GenericConfig.MergedResourceConfig\n\t\n\t// 3. 实例化APIGroupInfo.  \n\tapiGroupInfo := genericapiserver.NewDefaultAPIGroupInfo(apiextensions.GroupName, Scheme, metav1.ParameterCodec, Codecs)\n\tif apiResourceConfig.VersionEnabled(v1beta1.SchemeGroupVersion) {\n\t\tstorage := map[string]rest.Storage{}\n\t\t// customresourcedefinitions\n\t\tcustomResourceDefintionStorage := customresourcedefinition.NewREST(Scheme, c.GenericConfig.RESTOptionsGetter)\n\t\tstorage[\"customresourcedefinitions\"] = customResourceDefintionStorage\n\t\tstorage[\"customresourcedefinitions/status\"] = customresourcedefinition.NewStatusREST(Scheme, customResourceDefintionStorage)\n\n\t\tapiGroupInfo.VersionedResourcesStorageMap[v1beta1.SchemeGroupVersion.Version] = storage\n\t}\n\tif apiResourceConfig.VersionEnabled(v1.SchemeGroupVersion) {\n\t\tstorage := map[string]rest.Storage{}\n\t\t// customresourcedefinitions\n\t\tcustomResourceDefintionStorage := customresourcedefinition.NewREST(Scheme, c.GenericConfig.RESTOptionsGetter)\n\t\tstorage[\"customresourcedefinitions\"] = customResourceDefintionStorage\n\t\tstorage[\"customresourcedefinitions/status\"] = customresourcedefinition.NewStatusREST(Scheme, customResourceDefintionStorage)\n\n\t\tapiGroupInfo.VersionedResourcesStorageMap[v1.SchemeGroupVersion.Version] = storage\n\t}\n  \n  \n  // 4. 
InstallAPIGroup注册APIGroup\n\tif err := s.GenericAPIServer.InstallAPIGroup(&apiGroupInfo); err != nil {\n\t\treturn nil, err\n\t}\n\n\tcrdClient, err := internalclientset.NewForConfig(s.GenericAPIServer.LoopbackClientConfig)\n\tif err != nil {\n\t\t// it's really bad that this is leaking here, but until we can fix the test (which I'm pretty sure isn't even testing what it wants to test),\n\t\t// we need to be able to move forward\n\t\treturn nil, fmt.Errorf(\"failed to create clientset: %v\", err)\n\t}\n\ts.Informers = internalinformers.NewSharedInformerFactory(crdClient, 5*time.Minute)\n\n\tdelegateHandler := delegationTarget.UnprotectedHandler()\n\tif delegateHandler == nil {\n\t\tdelegateHandler = http.NotFoundHandler()\n\t}\n\n\tversionDiscoveryHandler := &versionDiscoveryHandler{\n\t\tdiscovery: map[schema.GroupVersion]*discovery.APIVersionHandler{},\n\t\tdelegate:  delegateHandler,\n\t}\n\tgroupDiscoveryHandler := &groupDiscoveryHandler{\n\t\tdiscovery: map[string]*discovery.APIGroupHandler{},\n\t\tdelegate:  delegateHandler,\n\t}\n\testablishingController := establish.NewEstablishingController(s.Informers.Apiextensions().InternalVersion().CustomResourceDefinitions(), crdClient.Apiextensions())\n\tcrdHandler, err := NewCustomResourceDefinitionHandler(\n\t\tversionDiscoveryHandler,\n\t\tgroupDiscoveryHandler,\n\t\ts.Informers.Apiextensions().InternalVersion().CustomResourceDefinitions(),\n\t\tdelegateHandler,\n\t\tc.ExtraConfig.CRDRESTOptionsGetter,\n\t\tc.GenericConfig.AdmissionControl,\n\t\testablishingController,\n\t\tc.ExtraConfig.ServiceResolver,\n\t\tc.ExtraConfig.AuthResolverWrapper,\n\t\tc.ExtraConfig.MasterCount,\n\t\ts.GenericAPIServer.Authorizer,\n\t\tc.GenericConfig.RequestTimeout,\n\t\ttime.Duration(c.GenericConfig.MinRequestTimeout)*time.Second,\n\t\tapiGroupInfo.StaticOpenAPISpec,\n\t\tc.GenericConfig.MaxRequestBodyBytes,\n\t)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\ts.GenericAPIServer.Handler.NonGoRestfulMux.Handle(\"/apis\", 
crdHandler)\n\ts.GenericAPIServer.Handler.NonGoRestfulMux.HandlePrefix(\"/apis/\", crdHandler)\n  \n  \n  // 5.启动crdController\n\tcrdController := NewDiscoveryController(s.Informers.Apiextensions().InternalVersion().CustomResourceDefinitions(), versionDiscoveryHandler, groupDiscoveryHandler)\n\tnamingController := status.NewNamingConditionController(s.Informers.Apiextensions().InternalVersion().CustomResourceDefinitions(), crdClient.Apiextensions())\n\tnonStructuralSchemaController := nonstructuralschema.NewConditionController(s.Informers.Apiextensions().InternalVersion().CustomResourceDefinitions(), crdClient.Apiextensions())\n\tapiApprovalController := apiapproval.NewKubernetesAPIApprovalPolicyConformantConditionController(s.Informers.Apiextensions().InternalVersion().CustomResourceDefinitions(), crdClient.Apiextensions())\n\tfinalizingController := finalizer.NewCRDFinalizer(\n\t\ts.Informers.Apiextensions().InternalVersion().CustomResourceDefinitions(),\n\t\tcrdClient.Apiextensions(),\n\t\tcrdHandler,\n\t)\n\tvar openapiController *openapicontroller.Controller\n\tif utilfeature.DefaultFeatureGate.Enabled(apiextensionsfeatures.CustomResourcePublishOpenAPI) {\n\t\topenapiController = openapicontroller.NewController(s.Informers.Apiextensions().InternalVersion().CustomResourceDefinitions())\n\t}\n\n\ts.GenericAPIServer.AddPostStartHookOrDie(\"start-apiextensions-informers\", func(context genericapiserver.PostStartHookContext) error {\n\t\ts.Informers.Start(context.StopCh)\n\t\treturn nil\n\t})\n\ts.GenericAPIServer.AddPostStartHookOrDie(\"start-apiextensions-controllers\", func(context genericapiserver.PostStartHookContext) error {\n\t\t// OpenAPIVersionedService and StaticOpenAPISpec are populated in generic apiserver PrepareRun().\n\t\t// Together they serve the /openapi/v2 endpoint on a generic apiserver. 
A generic apiserver may\n\t\t// choose to not enable OpenAPI by having null openAPIConfig, and thus OpenAPIVersionedService\n\t\t// and StaticOpenAPISpec are both null. In that case we don't run the CRD OpenAPI controller.\n\t\tif utilfeature.DefaultFeatureGate.Enabled(apiextensionsfeatures.CustomResourcePublishOpenAPI) && s.GenericAPIServer.OpenAPIVersionedService != nil && s.GenericAPIServer.StaticOpenAPISpec != nil {\n\t\t\tgo openapiController.Run(s.GenericAPIServer.StaticOpenAPISpec, s.GenericAPIServer.OpenAPIVersionedService, context.StopCh)\n\t\t}\n\n\t\tgo crdController.Run(context.StopCh)\n\t\tgo namingController.Run(context.StopCh)\n\t\tgo establishingController.Run(context.StopCh)\n\t\tgo nonStructuralSchemaController.Run(5, context.StopCh)\n\t\tgo apiApprovalController.Run(5, context.StopCh)\n\t\tgo finalizingController.Run(5, context.StopCh)\n\t\treturn nil\n\t})\n\t// we don't want to report healthy until we can handle all CRDs that have already been registered.  Waiting for the informer\n\t// to sync makes sure that the lister will be valid before we begin.  There may still be races for CRDs added after startup,\n\t// but we won't go healthy until we can handle the ones already present.\n\ts.GenericAPIServer.AddPostStartHookOrDie(\"crd-informer-synced\", func(context genericapiserver.PostStartHookContext) error {\n\t\treturn wait.PollImmediateUntil(100*time.Millisecond, func() (bool, error) {\n\t\t\treturn s.Informers.Apiextensions().InternalVersion().CustomResourceDefinitions().Informer().HasSynced(), nil\n\t\t}, context.StopCh)\n\t})\n\n\treturn s, nil\n}\n```\n\n<br>\n\n#### 3.1  创建GenericAPIServer\n\n```\n  // 1. 创建GenericAPIServer.  
APIExtensionsServer的运行依赖于GenericAPIServer，通过c.GenericConfig.New函数创建名为apiextensions-apiserver的服务。\n\tgenericServer, err := c.GenericConfig.New(\"apiextensions-apiserver\", delegationTarget)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n```\n\n在创建另外两个apiserver的时候，都用到了这个，最后再统一分析。\n\n<br>\n\n#### 3.2  实例化CustomResourceDefinitions\n\n```\ns := &CustomResourceDefinitions{\n\t\tGenericAPIServer: genericServer,\n\t}\n\t\n\n// CustomResourceDefinitions进行了另外的封装\ntype CustomResourceDefinitions struct {\n\tGenericAPIServer *genericapiserver.GenericAPIServer\n\n\t// provided for easier embedding\n\tInformers internalinformers.SharedInformerFactory\n}\n```\n\nAPIExtensionsServer（API扩展服务）通过CustomResourceDefinitions对象进行管理，实例化该对象后才能注册APIExtensionsServer下的资源。\n\n<br>\n\n#### 3.3 实例化APIGroupInfo\n\n```go\n// 3. 实例化APIGroupInfo.  \n\tapiGroupInfo := genericapiserver.NewDefaultAPIGroupInfo(apiextensions.GroupName, Scheme, metav1.ParameterCodec, Codecs)\n\tif apiResourceConfig.VersionEnabled(v1beta1.SchemeGroupVersion) {\n\t\tstorage := map[string]rest.Storage{}\n\t\t// customresourcedefinitions\n\t\tcustomResourceDefintionStorage := customresourcedefinition.NewREST(Scheme, c.GenericConfig.RESTOptionsGetter)\n\t\tstorage[\"customresourcedefinitions\"] = customResourceDefintionStorage\n\t\tstorage[\"customresourcedefinitions/status\"] = customresourcedefinition.NewStatusREST(Scheme, customResourceDefintionStorage)\n\n\t\tapiGroupInfo.VersionedResourcesStorageMap[v1beta1.SchemeGroupVersion.Version] = storage\n\t}\n\tif apiResourceConfig.VersionEnabled(v1.SchemeGroupVersion) {\n\t\tstorage := map[string]rest.Storage{}\n\t\t// customresourcedefinitions\n\t\tcustomResourceDefintionStorage := customresourcedefinition.NewREST(Scheme, c.GenericConfig.RESTOptionsGetter)\n\t\tstorage[\"customresourcedefinitions\"] = customResourceDefintionStorage\n\t\tstorage[\"customresourcedefinitions/status\"] = customresourcedefinition.NewStatusREST(Scheme, 
customResourceDefintionStorage)\n\n\t\tapiGroupInfo.VersionedResourcesStorageMap[v1.SchemeGroupVersion.Version] = storage\n\t}\n\t\n\t\n// APIGroupInfo结构体如下所示：\t\n// Info about an API group.\ntype APIGroupInfo struct {\n\tPrioritizedVersions []schema.GroupVersion\n\t// Info about the resources in this group. It's a map from version to resource to the storage.\n  // 存储资源与资源存储对象的对应关系，用于installRest的时候用\n\tVersionedResourcesStorageMap map[string]map[string]rest.Storage    \n\t// OptionsExternalVersion controls the APIVersion used for common objects in the\n\t// schema like api.Status, api.DeleteOptions, and metav1.ListOptions. Other implementors may\n\t// define a version \"v1beta1\" but want to use the Kubernetes \"v1\" internal objects.\n\t// If nil, defaults to groupMeta.GroupVersion.\n\t// TODO: Remove this when https://github.com/kubernetes/kubernetes/issues/19018 is fixed.\n\tOptionsExternalVersion *schema.GroupVersion\n\t// MetaGroupVersion defaults to \"meta.k8s.io/v1\" and is the scheme group version used to decode\n\t// common API implementations like ListOptions. 
Future changes will allow this to vary by group\n\t// version (for when the inevitable meta/v2 group emerges).\n\tMetaGroupVersion *schema.GroupVersion\n\n\t// Scheme includes all of the types used by this group and how to convert between them (or\n\t// to convert objects from outside of this group that are accepted in this API).\n\t// TODO: replace with interfaces\n\tScheme *runtime.Scheme\n\t// NegotiatedSerializer controls how this group encodes and decodes data\n\tNegotiatedSerializer runtime.NegotiatedSerializer\n\t// ParameterCodec performs conversions for query parameters passed to API calls\n\tParameterCodec runtime.ParameterCodec\n\n\t// StaticOpenAPISpec is the spec derived from the definitions of all resources installed together.\n\t// It is set during InstallAPIGroups, InstallAPIGroup, and InstallLegacyAPIGroup.\n\tStaticOpenAPISpec *spec.Swagger\n}\n```\n\nAPIGroupInfo对象用于描述资源组信息，其中该对象的VersionedResourcesStorageMap字段用于存储资源与资源存储对象的对应关系，其表现形式为map[string]map[string]rest.Storage（即<资源版本>/<资源>/<资源存储对象>）。\n\n例如CustomResourceDefinitions资源与资源存储对象的映射关系是v1beta1/customresourcedefinitions/customResourceDefintionStorage。\n\n<br>\n\n在实例化APIGroupInfo对象后，完成其资源与资源存储对象的映射，APIExtensionsServer会先判断apiextensions.k8s.io/v1beta1资源组/资源版本是否已启用，如果其已启用，则将该资源组、资源版本下的资源与资源存储对象进行映射并存储至APIGroupInfo对象的VersionedResourcesStorageMap字段中。每个资源（包括子资源）都通过类似于NewREST的函数创建资源存储对象（即RESTStorage）。kube-apiserver将RESTStorage封装成HTTP Handler函数，资源存储对象以RESTful的方式运行，一个RESTStorage对象负责一个资源的增、删、改、查操作。当操作CustomResourceDefinitions资源数据时，通过对应的RESTStorage资源存储对象与genericregistry.Store进行交互。\n\n<br>\n\n**提示：** 一个资源组对应一个APIGroupInfo对象，每个资源（包括子资源）对应一个资源存储对象。\n\nadmissionregistration.k8s.io就是一个group。\n\n```\n[root@k8s-master ~]# kubectl api-versions    
表现形式为<group>/<version>.\nadmissionregistration.k8s.io/v1beta1\napiextensions.k8s.io/v1beta1\napiregistration.k8s.io/v1\napiregistration.k8s.io/v1beta1\napps/v1\napps/v1beta1\napps/v1beta2\nauthentication.k8s.io/v1\nauthentication.k8s.io/v1beta1\nauthorization.k8s.io/v1\nauthorization.k8s.io/v1beta1\nautoscaling/v1\nautoscaling/v2beta1\nautoscaling/v2beta2\nbatch/v1\nbatch/v1beta1\ncertificates.k8s.io/v1beta1\ncoordination.k8s.io/v1beta1\nevents.k8s.io/v1beta1\nextensions/v1beta1\nnetworking.k8s.io/v1\npolicy/v1beta1\nrbac.authorization.k8s.io/v1\nrbac.authorization.k8s.io/v1beta1\nscheduling.k8s.io/v1beta1\nstorage.k8s.io/v1\nstorage.k8s.io/v1beta1\nv1\n```\n\n<br>\n\n#### 3.4 InstallAPIGroup注册APIGroup\n\n```\n  // 4. InstallAPIGroup注册APIGroup\n\tif err := s.GenericAPIServer.InstallAPIGroup(&apiGroupInfo); err != nil {\n\t\treturn nil, err\n\t}\n```\n\nInstallAPIGroup注册APIGroupInfo的过程非常重要，将APIGroupInfo对象中的<资源组>/<资源版本>/<资源>/<子资源>（包括资源存储对象）注册到APIExtensionsServer Handler函数。其过程是遍历APIGroupInfo，将<资源组>/<资源版本>/<资源名称>映射到HTTP PATH请求路径，通过InstallREST函数将资源存储对象作为资源的Handlers方法，最后使用go-restful的ws.Route将定义好的请求路径和Handlers方法添加路由到go-restful中。整个过程为InstallAPIGroup→s.installAPIResources→InstallREST，代码示例如下：\n\n```\n\n// InstallREST registers the REST handlers (storage, watch, proxy and redirect) into a restful Container.\n// It is expected that the provided path root prefix will serve all operations. 
Root MUST NOT end\n// in a slash.\nfunc (g *APIGroupVersion) InstallREST(container *restful.Container) error {\n\tprefix := path.Join(g.Root, g.GroupVersion.Group, g.GroupVersion.Version)\n\tinstaller := &APIInstaller{\n\t\tgroup:             g,\n\t\tprefix:            prefix,\n\t\tminRequestTimeout: g.MinRequestTimeout,\n\t}\n\n\tapiResources, ws, registrationErrors := installer.Install()\n\tversionDiscoveryHandler := discovery.NewAPIVersionHandler(g.Serializer, g.GroupVersion, staticLister{apiResources})\n\tversionDiscoveryHandler.AddToWebService(ws)\n\tcontainer.Add(ws)\n\treturn utilerrors.NewAggregate(registrationErrors)\n}\n```\n\n<br>\n\nInstallREST函数接收restful.Container指针对象。安装过程分为4步，分别介绍如下。\n\n（1）prefix定义了HTTP PATH请求路径，其表现形式为<apiPrefix>/<group>/<version>（即/apis/apiextensions.k8s.io/v1beta1）。\n\n（2）实例化APIInstaller安装器。\n\n（3）在installer.Install安装器内部创建一个go-restful WebService，然后通过a.registerResourceHandlers函数，为资源注册对应的Handlers方法（即资源存储对象Resource Storage），完成资源与资源Handlers方法的绑定并为go-restful WebService添加该路由。\n\n（4）最后通过container.Add函数将WebService添加到go-restful Container中。\n\nAPIExtensionsServer负责管理apiextensions.k8s.io资源组下的所有资源，该资源组有v1beta1、v1两个版本。通过访问http://127.0.0.1:8080/apis/apiextensions.k8s.io/v1获得该资源/子资源的详细信息，命令示例如下：\n\n```\n# curl http://127.0.0.1:8080/apis/apiextensions.k8s.io/v1  ##需要替换成实际的ip,port\n{\n  \"kind\": \"APIResourceList\",\n  \"apiVersion\": \"v1\",\n  \"groupVersion\": \"apiextensions.k8s.io/v1\",\n  \"resources\": [\n    {\n      \"name\": \"customresourcedefinitions\",\n      \"singularName\": \"\",\n      \"namespaced\": false,\n      \"kind\": \"CustomResourceDefinition\",\n      \"verbs\": [\n        \"create\",\n        \"delete\",\n        \"deletecollection\",\n        \"get\",\n        \"list\",\n        \"patch\",\n        \"update\",\n        \"watch\"\n      ],\n      \"shortNames\": [\n        \"crd\",\n        \"crds\"\n      ],\n      \"storageVersionHash\": \"jfWCUB31mvA=\"\n    },\n    {\n      \"name\": 
\"customresourcedefinitions/status\",\n      \"singularName\": \"\",\n      \"namespaced\": false,\n      \"kind\": \"CustomResourceDefinition\",\n      \"verbs\": [\n        \"get\",\n        \"patch\",\n        \"update\"\n      ]\n    }\n  ]\n}\n```\n\n<br>\n\n```\n查看具体某个crd的信息。 khchecks.comcast.gihub.io就是一个crd\n#curl http://127.0.0.1:8080/apis/apiextensions.k8s.io/v1/customresourcedefinitions/khchecks.comcast.gihub.io\n{\n  \"kind\": \"CustomResourceDefinition\",\n  \"apiVersion\": \"apiextensions.k8s.io/v1\",\n  \"metadata\": {\n    \"name\": \"khchecks.comcast.github.io\",\n    \"selfLink\": \"/apis/apiextensions.k8s.io/v1/customresourcedefinitions/khchecks.comcast.github.io\",\n    \"uid\": \"e115a743-5c79-4aa9-8b16-016b852e9c46\",\n    \"resourceVersion\": \"197092612\",\n    \"generation\": 1,\n    \"creationTimestamp\": \"2021-04-13T10:00:14Z\"\n  },\n  \"spec\": {\n    \"group\": \"comcast.github.io\",\n    \"names\": {\n      \"plural\": \"khchecks\",\n      \"singular\": \"khcheck\",\n      \"shortNames\": [\n        \"khc\"\n      ],\n      \"kind\": \"KuberhealthyCheck\",\n      \"listKind\": \"KuberhealthyCheckList\"\n    },\n    \"scope\": \"Namespaced\",\n    \"versions\": [\n      {\n        \"name\": \"v1\",\n        \"served\": true,\n        \"storage\": true\n      }\n    ],\n    \"conversion\": {\n      \"strategy\": \"None\"\n    },\n    \"preserveUnknownFields\": true\n  },\n  \"status\": {\n    \"conditions\": [\n      {\n        \"type\": \"NamesAccepted\",\n        \"status\": \"True\",\n        \"lastTransitionTime\": \"2021-04-13T10:00:14Z\",\n        \"reason\": \"NoConflicts\",\n        \"message\": \"no conflicts found\"\n      },\n      {\n        \"type\": \"Established\",\n        \"status\": \"True\",\n        \"lastTransitionTime\": \"2021-04-13T10:00:19Z\",\n        \"reason\": \"InitialNamesAccepted\",\n        \"message\": \"the initial names have been accepted\"\n      }\n    ],\n    \"acceptedNames\": {\n      \"plural\": 
\"khchecks\",\n      \"singular\": \"khcheck\",\n      \"shortNames\": [\n        \"khc\"\n      ],\n      \"kind\": \"KuberhealthyCheck\",\n      \"listKind\": \"KuberhealthyCheckList\"\n    },\n    \"storedVersions\": [\n      \"v1\"\n    ]\n  }\n}\n```\n\n<br>\n\n#### 3.5 启动crdController\n\n```\ngo crdController.Run(context.StopCh)\n```\n\n这里和kcm中的大部分控制器一样了，runWorker -> processNextWorkItem ->  sync\n\n这里就是当k8s运行起来后，对一个个新增的crd进行处理。\n\n```\nfunc (c *DiscoveryController) sync(version schema.GroupVersion) error {\n\n\tapiVersionsForDiscovery := []metav1.GroupVersionForDiscovery{}\n\tapiResourcesForDiscovery := []metav1.APIResource{}\n\tversionsForDiscoveryMap := map[metav1.GroupVersion]bool{}\n\n\tcrds, err := c.crdLister.List(labels.Everything())\n\tif err != nil {\n\t\treturn err\n\t}\n\tfoundVersion := false\n\tfoundGroup := false\n\tfor _, crd := range crds {\n\t\tif !apiextensions.IsCRDConditionTrue(crd, apiextensions.Established) {\n\t\t\tcontinue\n\t\t}\n\n\t\tif crd.Spec.Group != version.Group {\n\t\t\tcontinue\n\t\t}\n\n\t\tfoundThisVersion := false\n\t\tvar storageVersionHash string\n\t\tfor _, v := range crd.Spec.Versions {\n\t\t\tif !v.Served {\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\t// If there is any Served version, that means the group should show up in discovery\n\t\t\tfoundGroup = true\n\n\t\t\tgv := metav1.GroupVersion{Group: crd.Spec.Group, Version: v.Name}\n\t\t\tif !versionsForDiscoveryMap[gv] {\n\t\t\t\tversionsForDiscoveryMap[gv] = true\n\t\t\t\tapiVersionsForDiscovery = append(apiVersionsForDiscovery, metav1.GroupVersionForDiscovery{\n\t\t\t\t\tGroupVersion: crd.Spec.Group + \"/\" + v.Name,\n\t\t\t\t\tVersion:      v.Name,\n\t\t\t\t})\n\t\t\t}\n\t\t\tif v.Name == version.Version {\n\t\t\t\tfoundThisVersion = true\n\t\t\t}\n\t\t\tif v.Storage {\n\t\t\t\tstorageVersionHash = discovery.StorageVersionHash(gv.Group, gv.Version, crd.Spec.Names.Kind)\n\t\t\t}\n\t\t}\n\n\t\tif !foundThisVersion {\n\t\t\tcontinue\n\t\t}\n\t\tfoundVersion = true\n\n\t\tverbs 
:= metav1.Verbs([]string{\"delete\", \"deletecollection\", \"get\", \"list\", \"patch\", \"create\", \"update\", \"watch\"})\n\t\t// if we're terminating we don't allow some verbs\n\t\tif apiextensions.IsCRDConditionTrue(crd, apiextensions.Terminating) {\n\t\t\tverbs = metav1.Verbs([]string{\"delete\", \"deletecollection\", \"get\", \"list\", \"watch\"})\n\t\t}\n\n\t\tapiResourcesForDiscovery = append(apiResourcesForDiscovery, metav1.APIResource{\n\t\t\tName:               crd.Status.AcceptedNames.Plural,\n\t\t\tSingularName:       crd.Status.AcceptedNames.Singular,\n\t\t\tNamespaced:         crd.Spec.Scope == apiextensions.NamespaceScoped,\n\t\t\tKind:               crd.Status.AcceptedNames.Kind,\n\t\t\tVerbs:              verbs,\n\t\t\tShortNames:         crd.Status.AcceptedNames.ShortNames,\n\t\t\tCategories:         crd.Status.AcceptedNames.Categories,\n\t\t\tStorageVersionHash: storageVersionHash,\n\t\t})\n\n\t\tsubresources, err := apiextensions.GetSubresourcesForVersion(crd, version.Version)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tif subresources != nil && subresources.Status != nil {\n\t\t\tapiResourcesForDiscovery = append(apiResourcesForDiscovery, metav1.APIResource{\n\t\t\t\tName:       crd.Status.AcceptedNames.Plural + \"/status\",\n\t\t\t\tNamespaced: crd.Spec.Scope == apiextensions.NamespaceScoped,\n\t\t\t\tKind:       crd.Status.AcceptedNames.Kind,\n\t\t\t\tVerbs:      metav1.Verbs([]string{\"get\", \"patch\", \"update\"}),\n\t\t\t})\n\t\t}\n\n\t\tif subresources != nil && subresources.Scale != nil {\n\t\t\tapiResourcesForDiscovery = append(apiResourcesForDiscovery, metav1.APIResource{\n\t\t\t\tGroup:      autoscaling.GroupName,\n\t\t\t\tVersion:    \"v1\",\n\t\t\t\tKind:       \"Scale\",\n\t\t\t\tName:       crd.Status.AcceptedNames.Plural + \"/scale\",\n\t\t\t\tNamespaced: crd.Spec.Scope == apiextensions.NamespaceScoped,\n\t\t\t\tVerbs:      metav1.Verbs([]string{\"get\", \"patch\", \"update\"}),\n\t\t\t})\n\t\t}\n\t}\n\n\tif !foundGroup 
{\n\t\tc.groupHandler.unsetDiscovery(version.Group)\n\t\tc.versionHandler.unsetDiscovery(version)\n\t\treturn nil\n\t}\n\n\tsortGroupDiscoveryByKubeAwareVersion(apiVersionsForDiscovery)\n\n\tapiGroup := metav1.APIGroup{\n\t\tName:     version.Group,\n\t\tVersions: apiVersionsForDiscovery,\n\t\t// the preferred versions for a group is the first item in\n\t\t// apiVersionsForDiscovery after it put in the right ordered\n\t\tPreferredVersion: apiVersionsForDiscovery[0],\n\t}\n\tc.groupHandler.setDiscovery(version.Group, discovery.NewAPIGroupHandler(Codecs, apiGroup))\n\n\tif !foundVersion {\n\t\tc.versionHandler.unsetDiscovery(version)\n\t\treturn nil\n\t}\n\tc.versionHandler.setDiscovery(version, discovery.NewAPIVersionHandler(Codecs, version, discovery.APIResourceListerFunc(func() []metav1.APIResource {\n\t\treturn apiResourcesForDiscovery\n\t})))\n\n\treturn nil\n}\n```\n\n<br>\n\n创建流程如下：\n\n（1）创建GenericAPIServer\n\n（2）实例化CustomResourceDefinitions\n\n（3）实例化APIGroupInfo，将资源版本、资源、资源存储对象进行相互映射\n\n（4）InstallAPIGroup注册APIGroup（apiextensions.k8s.io）\n\n（5）启动crdController\n\n### 4. 总结\n\n（1）可以看出APIExtensionsServer就是负责CRD资源请求的处理\n\n（2）具体做法是，实例化APIGroupInfo，将资源版本、资源、资源存储对象进行相互映射。这里实际就是将 url 和handler函数关联起来，这个后面kube-apiserver实例化的时候再分析\n\n<br>\n\n### 5. 参考链接\n\nKubernetes源码剖析： https://weread.qq.com/web/reader/f1e3207071eeeefaf1e138akb5332110237b53b3a3d68d2\n\n博客： https://juejin.cn/post/6844903801934069774"
  },
  {
    "path": "k8s/kube-apiserver/9-kube-apiserver 创建KubeAPIServer.md",
    "content": "* [Table of Contents](#table-of-contents)\n    * [1\\. 背景回顾](#1-背景回顾)\n    * [2\\. 创建KubeAPIServer](#2-创建kubeapiserver)\n      * [2\\.1 创建GenericAPIServer\\-进行底层router实现](#21-创建genericapiserver-进行底层router实现)\n      * [2\\.2 实例化Master\\-资源注册](#22-实例化master-资源注册)\n      * [2\\.3 InstallLegacyAPI注册/api资源](#23-installlegacyapi注册api资源)\n        * [2\\.3\\.1 new bootstrap\\-controller](#231-new-bootstrap-controller)\n        * [2\\.3\\.2 注册core v1 路由](#232-注册core-v1-路由)\n      * [2\\.4 InstallAPIs注册/apis资源](#24-installapis注册apis资源)\n      * [2\\.5 路由注册总结](#25-路由注册总结)\n    * [3\\. 总结](#3-总结)\n    * [4\\. 参考链接](#4-参考链接)\n\n**本章重点：**分析第五个流程，创建kubeapiserver\n\n kube-apiserver整体启动流程如下：\n\n（1）资源注册。\n\n（2）Cobra命令行参数解析\n\n（3）创建APIServer通用配置\n\n（4）创建APIExtensionsServer\n\n（5）创建KubeAPIServer\n\n（6）创建AggregatorServer\n\n（7）启动HTTP服务。\n\n（8）启动HTTPS服务\n\n<br>\n\n### 1. 背景回顾\n\n再次回到 CreateServerChain。在生成配置参数后，创建APIExtensionsServer之后，就是创建kubeapiserver。这里核心就是**CreateKubeAPIServer函数**\n\n```go\n// CreateServerChain creates the apiservers connected via delegation.\nfunc CreateServerChain(completedOptions completedServerRunOptions, stopCh <-chan struct{}) (*genericapiserver.GenericAPIServer, error) {\n    \n    // 1.创建到节点拨号连接,目的为了和节点交互。在云平台中，则需要安装本机的SSH Key到Kubernetes集群中所有节点上，可通过用户名和私钥，SSH到node节点\n\tnodeTunneler, proxyTransport, err := CreateNodeDialer(completedOptions)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n    // 2. 配置API Server的Config。\n\tkubeAPIServerConfig, insecureServingInfo, serviceResolver, pluginInitializer, admissionPostStartHook, err := CreateKubeAPIServerConfig(completedOptions, nodeTunneler, proxyTransport)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n    // 3.这里同时还配置了Extension API Server的Config，用于配置用户自己编写的API Server。\n\t// If additional API servers are added, they should be gated.  
从这里深入挖下去\n\tapiExtensionsConfig, err := createAPIExtensionsConfig(*kubeAPIServerConfig.GenericConfig, kubeAPIServerConfig.ExtraConfig.VersionedInformers, pluginInitializer, completedOptions.ServerRunOptions, completedOptions.MasterCount)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\t// 4.创建APIExtensionsServer\n\tapiExtensionsServer, err := createAPIExtensionsServer(apiExtensionsConfig, genericapiserver.NewEmptyDelegate())\n\tif err != nil {\n\t\treturn nil, err\n\t}\n    \n    // 5.创建kubeapiserver，这里就是定义了 /apis/groups等这些api。\n\tkubeAPIServer, err := CreateKubeAPIServer(kubeAPIServerConfig, apiExtensionsServer.GenericAPIServer, admissionPostStartHook)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\t// otherwise go down the normal path of standing the aggregator up in front of the API server\n\t// this wires up openapi\n\t\n\t// 6. kubeAPIServer prepareRun\n\tkubeAPIServer.GenericAPIServer.PrepareRun()\n\n    // 7. apiExtensionsServer prepareRun\n\t// This will wire up openapi for extension api server\n\tapiExtensionsServer.GenericAPIServer.PrepareRun()\n\n    // 8. 配置AA config，然后创建AA server。\n\t// aggregator comes last in the chain\n\taggregatorConfig, err := createAggregatorConfig(*kubeAPIServerConfig.GenericConfig, completedOptions.ServerRunOptions, kubeAPIServerConfig.ExtraConfig.VersionedInformers, serviceResolver, proxyTransport, pluginInitializer)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\t\n\t// 9.创建AA server.这里传入了参数 kube-apiserver, apiExtensionServer。\n\taggregatorServer, err := createAggregatorServer(aggregatorConfig, kubeAPIServer.GenericAPIServer, apiExtensionsServer.Informers)\n\tif err != nil {\n\t\t// we don't need special handling for innerStopCh because the aggregator server doesn't create any go routines\n\t\treturn nil, err\n\t}\n  \n  // 10. 
启动http服务\n\tif insecureServingInfo != nil {\n\t\tinsecureHandlerChain := kubeserver.BuildInsecureHandlerChain(aggregatorServer.GenericAPIServer.UnprotectedHandler(), kubeAPIServerConfig.GenericConfig)\n\t\tif err := insecureServingInfo.Serve(insecureHandlerChain, kubeAPIServerConfig.GenericConfig.RequestTimeout, stopCh); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n\n\treturn aggregatorServer.GenericAPIServer, nil\n}\n```\n\n<br>\n\n### 2. 创建KubeAPIServer\n\n创建KubeAPIServer的流程与创建APIExtensionsServer的流程类似，其原理都是将<资源组>/<资源版本>/<资源>与资源存储对象进行映射并将其存储至APIGroupInfo对象的VersionedResourcesStorageMap字段中。通过installer.Install安装器为资源注册对应的Handlers方法（即资源存储对象ResourceStorage），完成资源与资源Handlers方法的绑定并为go-restful WebService添加该路由。最后将WebService添加到go-restful Container中。创建KubeAPIServer的流程如下所示：\n\n（1）创建GenericAPIServer\n\n（2）实例化Master\n\n（3）InstallLegacyAPI注册/api资源\n\n（4）InstallAPIs注册/apis资源\n\n<br>\n\n```\n\tkubeAPIServer, err := CreateKubeAPIServer(kubeAPIServerConfig, apiExtensionsServer.GenericAPIServer)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\t\n\t\n\t// CreateKubeAPIServer creates and wires a workable kube-apiserver\nfunc CreateKubeAPIServer(kubeAPIServerConfig *master.Config, delegateAPIServer genericapiserver.DelegationTarget) (*master.Master, error) {\n\tkubeAPIServer, err := kubeAPIServerConfig.Complete().New(delegateAPIServer)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\treturn kubeAPIServer, nil\n}\n\n\n\n// New returns a new instance of Master from the given config.\n// Certain config fields will be set to a default value if unset.\n// Certain config fields must be specified, including:\n//   KubeletClientConfig\nfunc (c completedConfig) New(delegationTarget genericapiserver.DelegationTarget) (*Master, error) {\n\tif reflect.DeepEqual(c.ExtraConfig.KubeletClientConfig, kubeletclient.KubeletClientConfig{}) {\n\t\treturn nil, fmt.Errorf(\"Master.New() called with empty config.KubeletClientConfig\")\n\t}\n\n  // 1 创建GenericAPIServer\n\ts, err := 
c.GenericConfig.New(\"kube-apiserver\", delegationTarget)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\tif c.ExtraConfig.EnableLogsSupport {\n\t\troutes.Logs{}.Install(s.Handler.GoRestfulContainer)\n\t}\n\n  // 2.实例化Master。KubeAPIServer（API核心服务）通过Master对象进行管理，实例化该对象后才能注册KubeAPIServer下的资源。\n\tm := &Master{\n\t\tGenericAPIServer:          s,\n\t\tClusterAuthenticationInfo: c.ExtraConfig.ClusterAuthenticationInfo,\n\t}\n\n  // 3.InstallLegacyAPI注册/api资源\n\t// install legacy rest storage\n\tif c.ExtraConfig.APIResourceConfigSource.VersionEnabled(apiv1.SchemeGroupVersion) {\n\t\tlegacyRESTStorageProvider := corerest.LegacyRESTStorageProvider{\n\t\t\tStorageFactory:              c.ExtraConfig.StorageFactory,\n\t\t\tProxyTransport:              c.ExtraConfig.ProxyTransport,\n\t\t\tKubeletClientConfig:         c.ExtraConfig.KubeletClientConfig,\n\t\t\tEventTTL:                    c.ExtraConfig.EventTTL,\n\t\t\tServiceIPRange:              c.ExtraConfig.ServiceIPRange,\n\t\t\tSecondaryServiceIPRange:     c.ExtraConfig.SecondaryServiceIPRange,\n\t\t\tServiceNodePortRange:        c.ExtraConfig.ServiceNodePortRange,\n\t\t\tLoopbackClientConfig:        c.GenericConfig.LoopbackClientConfig,\n\t\t\tServiceAccountIssuer:        c.ExtraConfig.ServiceAccountIssuer,\n\t\t\tServiceAccountMaxExpiration: c.ExtraConfig.ServiceAccountMaxExpiration,\n\t\t\tAPIAudiences:                c.GenericConfig.Authentication.APIAudiences,\n\t\t}\n\t\tif err := m.InstallLegacyAPI(&c, c.GenericConfig.RESTOptionsGetter, legacyRESTStorageProvider); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n\n  // 4.InstallAPIs注册/apis资源\n\t// The order here is preserved in discovery.\n\t// If resources with identical names exist in more than one of these groups (e.g. \"deployments.apps\"\" and \"deployments.extensions\"),\n\t// the order of this list determines which group an unqualified resource name (e.g. 
\"deployments\") should prefer.\n\t// This priority order is used for local discovery, but it ends up aggregated in `k8s.io/kubernetes/cmd/kube-apiserver/app/aggregator.go\n\t// with specific priorities.\n\t// TODO: describe the priority all the way down in the RESTStorageProviders and plumb it back through the various discovery\n\t// handlers that we have.\n\trestStorageProviders := []RESTStorageProvider{\n\t\tauditregistrationrest.RESTStorageProvider{},\n\t\tauthenticationrest.RESTStorageProvider{Authenticator: c.GenericConfig.Authentication.Authenticator, APIAudiences: c.GenericConfig.Authentication.APIAudiences},\n\t\tauthorizationrest.RESTStorageProvider{Authorizer: c.GenericConfig.Authorization.Authorizer, RuleResolver: c.GenericConfig.RuleResolver},\n\t\tautoscalingrest.RESTStorageProvider{},\n\t\tbatchrest.RESTStorageProvider{},\n\t\tcertificatesrest.RESTStorageProvider{},\n\t\tcoordinationrest.RESTStorageProvider{},\n\t\tdiscoveryrest.StorageProvider{},\n\t\textensionsrest.RESTStorageProvider{},\n\t\tnetworkingrest.RESTStorageProvider{},\n\t\tnoderest.RESTStorageProvider{},\n\t\tpolicyrest.RESTStorageProvider{},\n\t\trbacrest.RESTStorageProvider{Authorizer: c.GenericConfig.Authorization.Authorizer},\n\t\tschedulingrest.RESTStorageProvider{},\n\t\tsettingsrest.RESTStorageProvider{},\n\t\tstoragerest.RESTStorageProvider{},\n\t\tflowcontrolrest.RESTStorageProvider{},\n\t\t// keep apps after extensions so legacy clients resolve the extensions versions of shared resource names.\n\t\t// See https://github.com/kubernetes/kubernetes/issues/42392\n\t\tappsrest.RESTStorageProvider{},\n\t\tadmissionregistrationrest.RESTStorageProvider{},\n\t\teventsrest.RESTStorageProvider{TTL: c.ExtraConfig.EventTTL},\n\t}\n\tif err := m.InstallAPIs(c.ExtraConfig.APIResourceConfigSource, c.GenericConfig.RESTOptionsGetter, restStorageProviders...); err != nil {\n\t\treturn nil, err\n\t}\n\n\tif c.ExtraConfig.Tunneler != nil {\n\t\tm.installTunneler(c.ExtraConfig.Tunneler, 
corev1client.NewForConfigOrDie(c.GenericConfig.LoopbackClientConfig).Nodes())\n\t}\n\n\tm.GenericAPIServer.AddPostStartHookOrDie(\"start-cluster-authentication-info-controller\", func(hookContext genericapiserver.PostStartHookContext) error {\n\t\tkubeClient, err := kubernetes.NewForConfig(hookContext.LoopbackClientConfig)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tcontroller := clusterauthenticationtrust.NewClusterAuthenticationTrustController(m.ClusterAuthenticationInfo, kubeClient)\n\n\t\t// prime values and start listeners\n\t\tif m.ClusterAuthenticationInfo.ClientCA != nil {\n\t\t\tif notifier, ok := m.ClusterAuthenticationInfo.ClientCA.(dynamiccertificates.Notifier); ok {\n\t\t\t\tnotifier.AddListener(controller)\n\t\t\t}\n\t\t\tif controller, ok := m.ClusterAuthenticationInfo.ClientCA.(dynamiccertificates.ControllerRunner); ok {\n\t\t\t\t// runonce to be sure that we have a value.\n\t\t\t\tif err := controller.RunOnce(); err != nil {\n\t\t\t\t\truntime.HandleError(err)\n\t\t\t\t}\n\t\t\t\tgo controller.Run(1, hookContext.StopCh)\n\t\t\t}\n\t\t}\n\t\tif m.ClusterAuthenticationInfo.RequestHeaderCA != nil {\n\t\t\tif notifier, ok := m.ClusterAuthenticationInfo.RequestHeaderCA.(dynamiccertificates.Notifier); ok {\n\t\t\t\tnotifier.AddListener(controller)\n\t\t\t}\n\t\t\tif controller, ok := m.ClusterAuthenticationInfo.RequestHeaderCA.(dynamiccertificates.ControllerRunner); ok {\n\t\t\t\t// runonce to be sure that we have a value.\n\t\t\t\tif err := controller.RunOnce(); err != nil {\n\t\t\t\t\truntime.HandleError(err)\n\t\t\t\t}\n\t\t\t\tgo controller.Run(1, hookContext.StopCh)\n\t\t\t}\n\t\t}\n\n\t\tgo controller.Run(1, hookContext.StopCh)\n\t\treturn nil\n\t})\n\n\treturn m, nil\n}\n```\n\n<br>\n\n#### 2.1 创建GenericAPIServer-进行底层router实现\n\n无论创建APIExtensionsServer、KubeAPIServer，还是AggregatorServer，它们在底层都依赖于GenericAPIServer。通过GenericAPIServer将Kubernetes资源与REST API进行映射。\n\n例如：\n\n```\n  // 1. 创建GenericAPIServer.  
APIExtensionsServer的运行依赖于GenericAPIServer，通过c.GenericConfig.New函数创建名为apiextensions-apiserver的服务。\n\tgenericServer, err := c.GenericConfig.New(\"apiextensions-apiserver\", delegationTarget)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n```\n\n通过c.GenericConfig.New函数创建GenericAPIServer。在NewAPIServerHandler函数的内部，通过restful.NewContainer创建restful \n\nContainer实例，并设置Router路由。代码示例如下：\n\n```\n\tapiServerHandler := NewAPIServerHandler(name, c.Serializer, handlerChainBuilder, delegationTarget.UnprotectedHandler())\n\n\nfunc NewAPIServerHandler(name string, s runtime.NegotiatedSerializer, handlerChainBuilder HandlerChainBuilderFn, notFoundHandler http.Handler) *APIServerHandler {\n\tnonGoRestfulMux := mux.NewPathRecorderMux(name)\n\tif notFoundHandler != nil {\n\t\tnonGoRestfulMux.NotFoundHandler(notFoundHandler)\n\t}\n\n\tgorestfulContainer := restful.NewContainer()\n\tgorestfulContainer.ServeMux = http.NewServeMux()\n\tgorestfulContainer.Router(restful.CurlyRouter{}) // e.g. for proxy/{kind}/{name}/{*}\n\tgorestfulContainer.RecoverHandler(func(panicReason interface{}, httpWriter http.ResponseWriter) {\n\t\tlogStackOnRecover(s, panicReason, httpWriter)\n\t})\n\tgorestfulContainer.ServiceErrorHandler(func(serviceErr restful.ServiceError, request *restful.Request, response *restful.Response) {\n\t\tserviceErrorHandler(s, serviceErr, request, response)\n\t})\n\n\tdirector := director{\n\t\tname:               name,\n\t\tgoRestfulContainer: gorestfulContainer,\n\t\tnonGoRestfulMux:    nonGoRestfulMux,\n\t}\n\n\treturn &APIServerHandler{\n\t\tFullHandlerChain:   handlerChainBuilder(director),\n\t\tGoRestfulContainer: gorestfulContainer,\n\t\tNonGoRestfulMux:    nonGoRestfulMux,\n\t\tDirector:           director,\n\t}\n}\n```\n\ninstallAPI通过routes注册GenericAPIServer的相关API。\n● routes.Index：用于获取index索引页面。\n● routes.Profiling：用于分析性能的可视化页面。\n● routes.MetricsWithReset：用于获取metrics指标信息，一般用于Prometheus指标采集。\n● routes.Version：用于获取Kubernetes系统版本信息。\n\n<br>\n\n#### 2.2 实例化Master-资源注册\n\n```\n\tm 
:= &Master{\n\t\tGenericAPIServer:          s,\n\t\tClusterAuthenticationInfo: c.ExtraConfig.ClusterAuthenticationInfo,\n\t}\n\t\n\t// Master contains state for a Kubernetes cluster master/api server.\ntype Master struct {\n\tGenericAPIServer *genericapiserver.GenericAPIServer\n\n\tClusterAuthenticationInfo clusterauthenticationtrust.ClusterAuthenticationInfo\n}\n```\n\nKubeAPIServer（API核心服务）通过Master对象进行管理，实例化该对象后才能注册KubeAPIServer下的资源。\n\n后面的InstallLegacyAPI，InstallAPIs都需要实例化之后才能调用。（可以对应资源注册那一节）\n\n<br>\n\n在当前的Kubernetes系统中，支持两类资源组，分别是拥有组名的资源组和没有组名的资源组。KubeAPIServer通过InstallLegacyAPI函数将没有组名的资源组注册到/api前缀的路径下，其表现形式为/api/<version>/<resource>，例如 http://localhost:8080/api/v1/pods。\n\nKubeAPIServer通过InstallAPIs函数将拥有组名的资源组注册到/apis前缀的路径下，其表现形式为/apis/<group>/<version>/<resource>，例如http://localhost:8080/apis/apps/v1/deployments。\n\n#### 2.3 InstallLegacyAPI注册/api资源\n\nInstallLegacyAPI 核心做了两件事：\n\n（1）将bootstrap-controller的启停添加到apiserver的postStartHook和preShutdownHook中\n\n（2）调用InstallLegacyAPIGroup 注册核心Core v1下面的restful api\n\n代码路径：pkg/master/master.go\n\n```\n// install legacy rest storage\n\tif c.ExtraConfig.APIResourceConfigSource.VersionEnabled(apiv1.SchemeGroupVersion) {\n\t\tlegacyRESTStorageProvider := corerest.LegacyRESTStorageProvider{\n\t\t\tStorageFactory:              c.ExtraConfig.StorageFactory,\n\t\t\tProxyTransport:              c.ExtraConfig.ProxyTransport,\n\t\t\tKubeletClientConfig:         c.ExtraConfig.KubeletClientConfig,\n\t\t\tEventTTL:                    c.ExtraConfig.EventTTL,\n\t\t\tServiceIPRange:              c.ExtraConfig.ServiceIPRange,\n\t\t\tSecondaryServiceIPRange:     c.ExtraConfig.SecondaryServiceIPRange,\n\t\t\tServiceNodePortRange:        c.ExtraConfig.ServiceNodePortRange,\n\t\t\tLoopbackClientConfig:        c.GenericConfig.LoopbackClientConfig,\n\t\t\tServiceAccountIssuer:        c.ExtraConfig.ServiceAccountIssuer,\n\t\t\tServiceAccountMaxExpiration: c.ExtraConfig.ServiceAccountMaxExpiration,\n\t\t\tAPIAudiences:                
c.GenericConfig.Authentication.APIAudiences,\n\t\t}\n\t\tif err := m.InstallLegacyAPI(&c, c.GenericConfig.RESTOptionsGetter, legacyRESTStorageProvider); err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n\t\n\n// InstallLegacyAPI will install the legacy APIs for the restStorageProviders if they are enabled.\nfunc (m *Master) InstallLegacyAPI(c *completedConfig, restOptionsGetter generic.RESTOptionsGetter, legacyRESTStorageProvider corerest.LegacyRESTStorageProvider) error {\n\tlegacyRESTStorage, apiGroupInfo, err := legacyRESTStorageProvider.NewLegacyRESTStorage(restOptionsGetter)\n\tif err != nil {\n\t\treturn fmt.Errorf(\"Error building core storage: %v\", err)\n\t}\n  \n  // 1.将bootstrap-controller的启停添加到apiserver的postStartHook和preShutdownHook中\n\tcontrollerName := \"bootstrap-controller\"\n\tcoreClient := corev1client.NewForConfigOrDie(c.GenericConfig.LoopbackClientConfig)\n\tbootstrapController := c.NewBootstrapController(legacyRESTStorage, coreClient, coreClient, coreClient, coreClient.RESTClient())\n\tm.GenericAPIServer.AddPostStartHookOrDie(controllerName, bootstrapController.PostStartHook)\n\tm.GenericAPIServer.AddPreShutdownHookOrDie(controllerName, bootstrapController.PreShutdownHook)\n  \n  // 2. 
调用InstallLegacyAPIGroup 注册核心Core v1下面的restful api\n\tif err := m.GenericAPIServer.InstallLegacyAPIGroup(genericapiserver.DefaultLegacyAPIPrefix, &apiGroupInfo); err != nil {\n\t\treturn fmt.Errorf(\"Error in registering group versions: %v\", err)\n\t}\n\treturn nil\n}\n\n```\n\n##### 2.3.1 new bootstrap-controller\n\n这里参考kube-apiserver概述章节。可以知道bootstrap-controller有什么作用，是怎样启动的。\n\n##### 2.3.2 注册core v1 路由\n\nKubeAPIServer会先判断Core Groups/v1（即核心资源组/资源版本）是否已启用，如果其已启用，则通过m.InstallLegacyAPI函数将Core Groups/v1注册到KubeAPIServer的/api/v1下。\n\n可以通过访问http://127.0.0.1:8080/api/v1获得Core Groups/v1下的资源与子资源信息。\n\n<br>\n\nInstallLegacyAPI函数的执行过程分为两步，分别介绍如下。\n\n第1步，通过legacyRESTStorageProvider.NewLegacyRESTStorage函数实例化APIGroupInfo，APIGroupInfo对象用于描述资源组信息，该对象的VersionedResourcesStorageMap字段用于存储资源与资源存储对象的映射关系，其表现形式为map[string]map[string]rest.Storage （即<资源版本>/<资源>/<资源存储对象>），例如Pod资源与资源存储对象的映射关系是v1/pods/PodStorage。使Core Groups/v1下的资源与资源存储对象相互映射，代码示例如下。代码路径：pkg/registry/core/rest/storage_core.go\n\n这里就是将核心的资源，pod, svc等，设置好url和处理函数。\n\n```go\nfunc (c LegacyRESTStorageProvider) NewLegacyRESTStorage(restOptionsGetter generic.RESTOptionsGetter) (LegacyRESTStorage, genericapiserver.APIGroupInfo, error) {\n\tapiGroupInfo := genericapiserver.APIGroupInfo{\n\t\tPrioritizedVersions:          legacyscheme.Scheme.PrioritizedVersionsForGroup(\"\"),\n\t\tVersionedResourcesStorageMap: map[string]map[string]rest.Storage{},\n\t\tScheme:                       legacyscheme.Scheme,\n\t\tParameterCodec:               legacyscheme.ParameterCodec,\n\t\tNegotiatedSerializer:         legacyscheme.Codecs,\n\t}\n\n\tvar podDisruptionClient policyclient.PodDisruptionBudgetsGetter\n\tif policyGroupVersion := (schema.GroupVersion{Group: \"policy\", Version: \"v1beta1\"}); legacyscheme.Scheme.IsVersionRegistered(policyGroupVersion) {\n\t\tvar err error\n\t\tpodDisruptionClient, err = policyclient.NewForConfig(c.LoopbackClientConfig)\n\t\tif err != nil {\n\t\t\treturn LegacyRESTStorage{}, 
genericapiserver.APIGroupInfo{}, err\n\t\t}\n\t}\n\trestStorage := LegacyRESTStorage{}\n\n\tpodTemplateStorage, err := podtemplatestore.NewREST(restOptionsGetter)\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err\n\t}\n\n\teventStorage, err := eventstore.NewREST(restOptionsGetter, uint64(c.EventTTL.Seconds()))\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err\n\t}\n\tlimitRangeStorage, err := limitrangestore.NewREST(restOptionsGetter)\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err\n\t}\n\n\tresourceQuotaStorage, resourceQuotaStatusStorage, err := resourcequotastore.NewREST(restOptionsGetter)\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err\n\t}\n\tsecretStorage, err := secretstore.NewREST(restOptionsGetter)\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err\n\t}\n\tpersistentVolumeStorage, persistentVolumeStatusStorage, err := pvstore.NewREST(restOptionsGetter)\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err\n\t}\n\tpersistentVolumeClaimStorage, persistentVolumeClaimStatusStorage, err := pvcstore.NewREST(restOptionsGetter)\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err\n\t}\n\tconfigMapStorage, err := configmapstore.NewREST(restOptionsGetter)\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err\n\t}\n\n\tnamespaceStorage, namespaceStatusStorage, namespaceFinalizeStorage, err := namespacestore.NewREST(restOptionsGetter)\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err\n\t}\n\n\tendpointsStorage, err := endpointsstore.NewREST(restOptionsGetter)\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err\n\t}\n\n\tnodeStorage, err := nodestore.NewStorage(restOptionsGetter, 
c.KubeletClientConfig, c.ProxyTransport)\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err\n\t}\n\n\tpodStorage, err := podstore.NewStorage(\n\t\trestOptionsGetter,\n\t\tnodeStorage.KubeletConnectionInfo,\n\t\tc.ProxyTransport,\n\t\tpodDisruptionClient,\n\t)\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err\n\t}\n\n\tvar serviceAccountStorage *serviceaccountstore.REST\n\tif c.ServiceAccountIssuer != nil && utilfeature.DefaultFeatureGate.Enabled(features.TokenRequest) {\n\t\tserviceAccountStorage, err = serviceaccountstore.NewREST(restOptionsGetter, c.ServiceAccountIssuer, c.APIAudiences, c.ServiceAccountMaxExpiration, podStorage.Pod.Store, secretStorage.Store)\n\t} else {\n\t\tserviceAccountStorage, err = serviceaccountstore.NewREST(restOptionsGetter, nil, nil, 0, nil, nil)\n\t}\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err\n\t}\n\n\tserviceRESTStorage, serviceStatusStorage, err := servicestore.NewGenericREST(restOptionsGetter)\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err\n\t}\n\n\tvar serviceClusterIPRegistry rangeallocation.RangeRegistry\n\tserviceClusterIPRange := c.ServiceIPRange\n\tif serviceClusterIPRange.IP == nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, fmt.Errorf(\"service clusterIPRange is missing\")\n\t}\n\n\tserviceStorageConfig, err := c.StorageFactory.NewConfig(api.Resource(\"services\"))\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err\n\t}\n\n\tserviceClusterIPAllocator, err := ipallocator.NewAllocatorCIDRRange(&serviceClusterIPRange, func(max int, rangeSpec string) (allocator.Interface, error) {\n\t\tmem := allocator.NewAllocationMap(max, rangeSpec)\n\t\t// TODO etcdallocator package to return a storage interface via the storageFactory\n\t\tetcd, err := serviceallocator.NewEtcd(mem, \"/ranges/serviceips\", 
api.Resource(\"serviceipallocations\"), serviceStorageConfig)\n\t\tif err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t\tserviceClusterIPRegistry = etcd\n\t\treturn etcd, nil\n\t})\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, fmt.Errorf(\"cannot create cluster IP allocator: %v\", err)\n\t}\n\trestStorage.ServiceClusterIPAllocator = serviceClusterIPRegistry\n\n\t// allocator for secondary service ip range\n\tvar secondaryServiceClusterIPAllocator ipallocator.Interface\n\tif utilfeature.DefaultFeatureGate.Enabled(features.IPv6DualStack) && c.SecondaryServiceIPRange.IP != nil {\n\t\tvar secondaryServiceClusterIPRegistry rangeallocation.RangeRegistry\n\t\tsecondaryServiceClusterIPAllocator, err = ipallocator.NewAllocatorCIDRRange(&c.SecondaryServiceIPRange, func(max int, rangeSpec string) (allocator.Interface, error) {\n\t\t\tmem := allocator.NewAllocationMap(max, rangeSpec)\n\t\t\t// TODO etcdallocator package to return a storage interface via the storageFactory\n\t\t\tetcd, err := serviceallocator.NewEtcd(mem, \"/ranges/secondaryserviceips\", api.Resource(\"serviceipallocations\"), serviceStorageConfig)\n\t\t\tif err != nil {\n\t\t\t\treturn nil, err\n\t\t\t}\n\t\t\tsecondaryServiceClusterIPRegistry = etcd\n\t\t\treturn etcd, nil\n\t\t})\n\t\tif err != nil {\n\t\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, fmt.Errorf(\"cannot create cluster secondary IP allocator: %v\", err)\n\t\t}\n\t\trestStorage.SecondaryServiceClusterIPAllocator = secondaryServiceClusterIPRegistry\n\t}\n\n\tvar serviceNodePortRegistry rangeallocation.RangeRegistry\n\tserviceNodePortAllocator, err := portallocator.NewPortAllocatorCustom(c.ServiceNodePortRange, func(max int, rangeSpec string) (allocator.Interface, error) {\n\t\tmem := allocator.NewAllocationMap(max, rangeSpec)\n\t\t// TODO etcdallocator package to return a storage interface via the storageFactory\n\t\tetcd, err := serviceallocator.NewEtcd(mem, \"/ranges/servicenodeports\", 
api.Resource(\"servicenodeportallocations\"), serviceStorageConfig)\n\t\tif err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t\tserviceNodePortRegistry = etcd\n\t\treturn etcd, nil\n\t})\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, fmt.Errorf(\"cannot create cluster port allocator: %v\", err)\n\t}\n\trestStorage.ServiceNodePortAllocator = serviceNodePortRegistry\n\n\tcontrollerStorage, err := controllerstore.NewStorage(restOptionsGetter)\n\tif err != nil {\n\t\treturn LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err\n\t}\n\n\tserviceRest, serviceRestProxy := servicestore.NewREST(serviceRESTStorage,\n\t\tendpointsStorage,\n\t\tpodStorage.Pod,\n\t\tserviceClusterIPAllocator,\n\t\tsecondaryServiceClusterIPAllocator,\n\t\tserviceNodePortAllocator,\n\t\tc.ProxyTransport)\n \n  // storage就是将url和处理函数进行了绑定\n\trestStorageMap := map[string]rest.Storage{\n\t\t\"pods\":             podStorage.Pod,\n\t\t\"pods/attach\":      podStorage.Attach,\n\t\t\"pods/status\":      podStorage.Status,\n\t\t\"pods/log\":         podStorage.Log,\n\t\t\"pods/exec\":        podStorage.Exec,\n\t\t\"pods/portforward\": podStorage.PortForward,\n\t\t\"pods/proxy\":       podStorage.Proxy,\n\t\t\"pods/binding\":     podStorage.Binding,\n\t\t\"bindings\":         podStorage.LegacyBinding,\n\n\t\t\"podTemplates\": podTemplateStorage,\n\n\t\t\"replicationControllers\":        controllerStorage.Controller,\n\t\t\"replicationControllers/status\": controllerStorage.Status,\n\n\t\t\"services\":        serviceRest,\n\t\t\"services/proxy\":  serviceRestProxy,\n\t\t\"services/status\": serviceStatusStorage,\n\n\t\t\"endpoints\": endpointsStorage,\n\n\t\t\"nodes\":        nodeStorage.Node,\n\t\t\"nodes/status\": nodeStorage.Status,\n\t\t\"nodes/proxy\":  nodeStorage.Proxy,\n\n\t\t\"events\": eventStorage,\n\n\t\t\"limitRanges\":                   limitRangeStorage,\n\t\t\"resourceQuotas\":                resourceQuotaStorage,\n\t\t\"resourceQuotas/status\":         
resourceQuotaStatusStorage,\n\t\t\"namespaces\":                    namespaceStorage,\n\t\t\"namespaces/status\":             namespaceStatusStorage,\n\t\t\"namespaces/finalize\":           namespaceFinalizeStorage,\n\t\t\"secrets\":                       secretStorage,\n\t\t\"serviceAccounts\":               serviceAccountStorage,\n\t\t\"persistentVolumes\":             persistentVolumeStorage,\n\t\t\"persistentVolumes/status\":      persistentVolumeStatusStorage,\n\t\t\"persistentVolumeClaims\":        persistentVolumeClaimStorage,\n\t\t\"persistentVolumeClaims/status\": persistentVolumeClaimStatusStorage,\n\t\t\"configMaps\":                    configMapStorage,\n\n\t\t\"componentStatuses\": componentstatus.NewStorage(componentStatusStorage{c.StorageFactory}.serversToValidate),\n\t}\n\tif legacyscheme.Scheme.IsVersionRegistered(schema.GroupVersion{Group: \"autoscaling\", Version: \"v1\"}) {\n\t\trestStorageMap[\"replicationControllers/scale\"] = controllerStorage.Scale\n\t}\n\tif legacyscheme.Scheme.IsVersionRegistered(schema.GroupVersion{Group: \"policy\", Version: \"v1beta1\"}) {\n\t\trestStorageMap[\"pods/eviction\"] = podStorage.Eviction\n\t}\n\tif serviceAccountStorage.Token != nil {\n\t\trestStorageMap[\"serviceaccounts/token\"] = serviceAccountStorage.Token\n\t}\n\tif utilfeature.DefaultFeatureGate.Enabled(features.EphemeralContainers) {\n\t\trestStorageMap[\"pods/ephemeralcontainers\"] = podStorage.EphemeralContainers\n\t}\n\tapiGroupInfo.VersionedResourcesStorageMap[\"v1\"] = restStorageMap\n\n\treturn restStorage, apiGroupInfo, nil\n}\n```\n\n每个资源（包括子资源）都通过类似于NewREST的函数创建资源存储对象（即RESTStorage）。kube-apiserver将RESTStorage封装成\n\nHTTP Handler函数，资源存储对象以RESTful的方式运行，一个RESTStorage对象负责一个资源的增、删、改、查操作。当操作\n\nCustomResourceDefinitions资源数据时，通过对应的RESTStorage资源存储对象与genericregistry.Store进行交互。\n\n<br>\n\n第2步，通过m.GenericAPIServer.InstallLegacyAPIGroup函数将APIGroupInfo对象中的<资源组>/<资源版本>/<资源>/<子资源>（包括\n\n资源存储对象）注册到KubeAPIServer 
Handlers方法。其过程是遍历APIGroupInfo，将<资源组>/<资源版本>/<资源名称>映射到HTTP PATH请求路径，通过InstallREST函数将资源存储对象作为资源的Handlers方法。最后使用go-restful的ws.Route将定义好的请求路径和Handlers方法添加路由到go-restful中。整个过程为InstallLegacyAPIGroup→s.installAPIResources→InstallREST，该过程与APIExtensionsServer注册APIGroupInfo的过程类似，故不再赘述。\n\n<br>\n\n#### 2.4 InstallAPIs注册/apis资源\n\n```\n\t// The order here is preserved in discovery.\n\t// If resources with identical names exist in more than one of these groups (e.g. \"deployments.apps\"\" and \"deployments.extensions\"),\n\t// the order of this list determines which group an unqualified resource name (e.g. \"deployments\") should prefer.\n\t// This priority order is used for local discovery, but it ends up aggregated in `k8s.io/kubernetes/cmd/kube-apiserver/app/aggregator.go\n\t// with specific priorities.\n\t// TODO: describe the priority all the way down in the RESTStorageProviders and plumb it back through the various discovery\n\t// handlers that we have.\n\trestStorageProviders := []RESTStorageProvider{\n\t  ...\n\t}\n\tif err := m.InstallAPIs(c.ExtraConfig.APIResourceConfigSource, c.GenericConfig.RESTOptionsGetter, restStorageProviders...); err != nil {\n\t\treturn nil, err\n\t}\n```\n\n<br>\n\n通过m.InstallAPIs函数将拥有组名的资源组注册到KubeAPIServer的/apis下。可以通过访问http://localhost:8080/apis/apps/v1/deployments获得其下的资源与子资源信息。\n\nInstallAPIs函数的执行过程分为两步，分别介绍如下。\n\n第1步，实例化所有已启用的资源组的APIGroupInfo，APIGroupInfo对象用于描述资源组信息，该对象的VersionedResourcesStorageMap 字段用于存储资源与资源存储对象的映射关系，其表现形式为map[string]map[string]rest.Storage（即<资源版本>/<资源>/<资源存储对象>），例如Deployment资源与资源存储对象的映射关系是v1/deployments/deploymentStorage。通过restStorageBuilder.NewRESTStorage→v1Storage函数可实现apps资源组下的资源与资源存储对象的映射，代码此处不再展示。\n\n每个资源（包括子资源）都通过类似于NewStorage的函数创建资源存储对象（即RESTStorage）。kube-apiserver将RESTStorage封装成HTTP Handler函数，资源存储对象以RESTful的方式运行，一个RESTStorage对象负责一个资源的增、删、改、查操作。当操作资源数据时，通过对应的RESTStorage资源存储对象与genericregistry.Store进行交互。\n\n<br>\n\n第2步，通过 
m.GenericAPIServer.InstallAPIGroups 函数将APIGroupInfo对象中的<资源组>/<资源版本>/<资源>/<子资源>（包括资源存储对象）注册到KubeAPIServer Handlers方法。其过程是遍历APIGroupInfo，将<资源组>/<资源版本>/<资源名称>映射到HTTP PATH请求路径，通过InstallREST函数将资源存储对象作为资源的Handlers方法。最后使用go-restful的ws.Route将定义好的请求路径和Handlers方法添加路由到go-restful中。整个过程为InstallAPIGroups→s.installAPIResources→InstallREST，该过程与APIExtensionsServer注册APIGroupInfo的过程类似，故不再赘述。KubeAPIServer负责管理众多资源组，以apps资源组为例，通过访问http://127.0.0.1:8080/apis/apps/v1可以获得该资源/子资源的详细信息。\n\n<br>\n\n#### 2.5 路由注册总结\n\napi开头的路由通过`InstallLegacyAPI`方法添加。进入`InstallLegacyAPI`方法，通过`NewLegacyRESTStorage`方法创建各个资源的**RESTStorage**。RESTStorage是一个结构体，具体的定义在`vendor/k8s.io/apiserver/pkg/registry/generic/registry/store.go`下，结构体内主要包含`NewFunc`返回特定资源信息、`NewListFunc`返回特定资源列表、`CreateStrategy`特定资源创建时的策略、`UpdateStrategy`更新时的策略以及`DeleteStrategy`删除时的策略等重要方法。\n\n```\n// TODO: make the default exposed methods exactly match a generic RESTStorage\ntype Store struct {\n\t// NewFunc returns a new instance of the type this registry returns for a\n\t// GET of a single object, e.g.:\n\t//\n\t// curl GET /apis/group/version/namespaces/my-ns/myresource/name-of-object\n\tNewFunc func() runtime.Object\n\n\t// NewListFunc returns a new list of the type this registry; it is the\n\t// type returned when the resource is listed, e.g.:\n\t//\n\t// curl GET /apis/group/version/namespaces/my-ns/myresource\n\tNewListFunc func() runtime.Object\n\n\t// DefaultQualifiedResource is the pluralized name of the resource.\n\t// This field is used if there is no request info present in the context.\n\t// See qualifiedResourceFromContext for details.\n\tDefaultQualifiedResource schema.GroupResource\n\n\t// KeyRootFunc returns the root etcd key for this resource; should not\n\t// include trailing \"/\".  
This is used for operations that work on the\n\t// entire collection (listing and watching).\n\t//\n\t// KeyRootFunc and KeyFunc must be supplied together or not at all.\n\tKeyRootFunc func(ctx context.Context) string\n\n\t// KeyFunc returns the key for a specific object in the collection.\n\t// KeyFunc is called for Create/Update/Get/Delete. Note that 'namespace'\n\t// can be gotten from ctx.\n\t//\n\t// KeyFunc and KeyRootFunc must be supplied together or not at all.\n\tKeyFunc func(ctx context.Context, name string) (string, error)\n\n\t// ObjectNameFunc returns the name of an object or an error.\n\tObjectNameFunc func(obj runtime.Object) (string, error)\n\n\t// TTLFunc returns the TTL (time to live) that objects should be persisted\n\t// with. The existing parameter is the current TTL or the default for this\n\t// operation. The update parameter indicates whether this is an operation\n\t// against an existing object.\n\t//\n\t// Objects that are persisted with a TTL are evicted once the TTL expires.\n\tTTLFunc func(obj runtime.Object, existing uint64, update bool) (uint64, error)\n\n\t// PredicateFunc returns a matcher corresponding to the provided labels\n\t// and fields. The SelectionPredicate returned should return true if the\n\t// object matches the given field and label selectors.\n\tPredicateFunc func(label labels.Selector, field fields.Selector) storage.SelectionPredicate\n\n\t// EnableGarbageCollection affects the handling of Update and Delete\n\t// requests. Enabling garbage collection allows finalizers to do work to\n\t// finalize this object before the store deletes it.\n\t//\n\t// If any store has garbage collection enabled, it must also be enabled in\n\t// the kube-controller-manager.\n\tEnableGarbageCollection bool\n\n\t// DeleteCollectionWorkers is the maximum number of workers in a single\n\t// DeleteCollection call. 
Delete requests for the items in a collection\n\t// are issued in parallel.\n\tDeleteCollectionWorkers int\n\n\t// Decorator is an optional exit hook on an object returned from the\n\t// underlying storage. The returned object could be an individual object\n\t// (e.g. Pod) or a list type (e.g. PodList). Decorator is intended for\n\t// integrations that are above storage and should only be used for\n\t// specific cases where storage of the value is not appropriate, since\n\t// they cannot be watched.\n\tDecorator ObjectFunc\n\t// CreateStrategy implements resource-specific behavior during creation.\n\tCreateStrategy rest.RESTCreateStrategy\n\t// AfterCreate implements a further operation to run after a resource is\n\t// created and before it is decorated, optional.\n\tAfterCreate ObjectFunc\n\n\t// UpdateStrategy implements resource-specific behavior during updates.\n\tUpdateStrategy rest.RESTUpdateStrategy\n\t// AfterUpdate implements a further operation to run after a resource is\n\t// updated and before it is decorated, optional.\n\tAfterUpdate ObjectFunc\n\n\t// DeleteStrategy implements resource-specific behavior during deletion.\n\tDeleteStrategy rest.RESTDeleteStrategy\n\t// AfterDelete implements a further operation to run after a resource is\n\t// deleted and before it is decorated, optional.\n\tAfterDelete ObjectFunc\n\t// ReturnDeletedObject determines whether the Store returns the object\n\t// that was deleted. Otherwise, return a generic success status response.\n\tReturnDeletedObject bool\n\t// ShouldDeleteDuringUpdate is an optional function to determine whether\n\t// an update from existing to obj should result in a delete.\n\t// If specified, this is checked in addition to standard finalizer,\n\t// deletionTimestamp, and deletionGracePeriodSeconds checks.\n\tShouldDeleteDuringUpdate func(ctx context.Context, key string, obj, existing runtime.Object) bool\n\t// ExportStrategy implements resource-specific behavior during export,\n\t// optional. 
Exported objects are not decorated.\n\tExportStrategy rest.RESTExportStrategy\n\t// TableConvertor is an optional interface for transforming items or lists\n\t// of items into tabular output. If unset, the default will be used.\n\tTableConvertor rest.TableConvertor\n\n\t// Storage is the interface for the underlying storage for the\n\t// resource. It is wrapped into a \"DryRunnableStorage\" that will\n\t// either pass-through or simply dry-run.\n\tStorage DryRunnableStorage\n\t// StorageVersioner outputs the <group/version/kind> an object will be\n\t// converted to before persisted in etcd, given a list of possible\n\t// kinds of the object.\n\t// If the StorageVersioner is nil, apiserver will leave the\n\t// storageVersionHash as empty in the discovery document.\n\tStorageVersioner runtime.GroupVersioner\n\t// Called to cleanup clients used by the underlying Storage; optional.\n\tDestroyFunc func()\n}\n```\n\n 在`NewLegacyRESTStorage`内部，可以看到创建了多种资源的RESTStorage。常见的像event、secret、namespace、endpoints等，统一调用`NewREST`方法构造相应的资源。待所有资源的store创建完成之后，使用`restStorageMap`的Map类型将每个资源的路由和对应的store对应起来，方便后续去做路由的统一规划，代码如下：\n\n```\n  // storage就是将url和处理函数进行了绑定\n\trestStorageMap := map[string]rest.Storage{\n\t\t\"pods\":             podStorage.Pod,\n\t\t\"pods/attach\":      podStorage.Attach,\n\t\t....\n\t}\n```\n\n最终完成以api开头的所有资源的RESTStorage操作。\n 创建完之后，则开始进行路由的安装，执行`InstallLegacyAPIGroup`方法，主要调用链为`InstallLegacyAPIGroup-->installAPIResources-->InstallREST-->Install-->registerResourceHandlers`，最终核心的路由构造在`registerResourceHandlers`方法内。\n\n```\nvendor/k8s.io/apiserver/pkg/endpoints/installer.go\n\nfunc (a *APIInstaller) registerResourceHandlers(path string, storage rest.Storage, ws *restful.WebService) (*metav1.APIResource, error) {\n    \n　　　　...\n\n　　　　creater, isCreater := storage.(rest.Creater)\n　　　　namedCreater, isNamedCreater := storage.(rest.NamedCreater)\n　　　　lister, isLister := storage.(rest.Lister)\n　　　　getter, isGetter := storage.(rest.Getter)\n　　　　getterWithOptions, 
isGetterWithOptions := storage.(rest.GetterWithOptions)\n　　　　gracefulDeleter, isGracefulDeleter := storage.(rest.GracefulDeleter)\n　　　　collectionDeleter, isCollectionDeleter := storage.(rest.CollectionDeleter)\n　　　　updater, isUpdater := storage.(rest.Updater)\n　　　　patcher, isPatcher := storage.(rest.Patcher)\n　　　　watcher, isWatcher := storage.(rest.Watcher)\n　　　　connecter, isConnecter := storage.(rest.Connecter)\n　　　　storageMeta, isMetadata := storage.(rest.StorageMetadata)\n\n　　　　if !isMetadata {\n   　　　　　　storageMeta = defaultStorageMetadata{}\n　　　　}\n　　　　...\n\n\n　　　　// Handler for standard REST verbs (GET, PUT, POST and DELETE).\n　　　　// Add actions at the resource path: /api/apiVersion/resource\n　　　　actions = appendIf(actions, action{\"LIST\", resourcePath, resourceParams, namer, false}, isLister)\n　　　　actions = appendIf(actions, action{\"POST\", resourcePath, resourceParams, namer, false}, isCreater)\n　　　　actions = appendIf(actions, action{\"DELETECOLLECTION\", resourcePath, resourceParams, namer, false}, isCollectionDeleter)\n　　　　\n　　　　// Add actions at the item path: /api/apiVersion/resource/{name}\n　　　　actions = appendIf(actions, action{\"GET\", itemPath, nameParams, namer, false}, isGetter)\n　　　　if getSubpath {\n   　　　　　　actions = appendIf(actions, action{\"GET\", itemPath + \"/{path:*}\", proxyParams, namer, false}, isGetter)\n　　　　}\n　　　　actions = appendIf(actions, action{\"PUT\", itemPath, nameParams, namer, false}, isUpdater)\n　　　　actions = appendIf(actions, action{\"PATCH\", itemPath, nameParams, namer, false}, isPatcher)\n　　　　actions = appendIf(actions, action{\"DELETE\", itemPath, nameParams, namer, false}, isGracefulDeleter)\n\n　　　　actions = appendIf(actions, action{\"CONNECT\", itemPath, nameParams, namer, false}, isConnecter)\n　　　　actions = appendIf(actions, action{\"CONNECT\", itemPath + \"/{path:*}\", proxyParams, namer, false}, isConnecter && connectSubpath)\n\n　　　　...\n　　　　\n　　　　routes := []*restful.RouteBuilder{}\n\n　　　　case \"GET\": // Get a 
resource.\n　　　　　　　　var handler restful.RouteFunction\n　　　　　　　　if isGetterWithOptions {\n   　　　　　　　　　　handler = restfulGetResourceWithOptions(getterWithOptions, reqScope, isSubresource)\n　　　　　　　　} else {\n   　　　　　　　　　　handler = restfulGetResource(getter, exporter, reqScope)\n　　　　　　　　}\n\n　　　　　　　　if needOverride {\n   　　　　　　　　　　// need change the reported verb\n　　　　　　　　　　　　handler = metrics.InstrumentRouteFunc(verbOverrider.OverrideMetricsVerb(action.Verb), group, version, resource, subresource, requestScope, metrics.APIServerComponent, handler)\n　　　　　　　　} else {\n　　　　　　　　　　　　handler = metrics.InstrumentRouteFunc(action.Verb, group, version, resource, subresource, requestScope, metrics.APIServerComponent, handler)\n　　　　　　　　}\n\n　　　　　　　　if a.enableAPIResponseCompression {\n   　　　　　　　　　　handler = genericfilters.RestfulWithCompression(handler)\n　　　　　　　　}\n　　　　　　　　doc := \"read the specified \" + kind\n　　　　　　　　if isSubresource {\n   　　　　　　　　　　doc = \"read \" + subresource + \" of the specified \" + kind\n　　　　　　　　}\n　　　　　　　　route := ws.GET(action.Path).To(handler).Doc(doc).\n　　　　　　　　Param(ws.QueryParameter(\"pretty\", \"If 'true', then the output is pretty printed.\")).\n　　　　　　　　Operation(\"read\"+namespaced+kind+strings.Title(subresource)+operationSuffix).\n　　　　　　　　Produces(append(storageMeta.ProducesMIMETypes(action.Verb), mediaTypes...)...).\n　　　　　　　　Returns(http.StatusOK, \"OK\", producedObject).Writes(producedObject)\n　　　　　　　　...\n　　　　　　　　routes = append(routes, route)\n\n　　　　...\n\n　　　　\n　　　　for _, route := range routes {\n   　　　　　　route.Metadata(ROUTE_META_GVK, metav1.GroupVersionKind{\n      　　　　　　　　Group:   reqScope.Kind.Group,\n      　　　　　　　　Version: reqScope.Kind.Version,\n      　　　　　　　　Kind:    reqScope.Kind.Kind,\n   　　　　　　})\n   　　　　　　route.Metadata(ROUTE_META_ACTION, strings.ToLower(action.Verb))\n   　　　　　　ws.Route(route)\n　　　　}\n　　　　...\n　　　　return &apiResource, 
nil\n}\n```\n\n**registerResourceHandlers**处理逻辑如下：\n\n1.判断storage是否支持create、list、get等方法，并对所有支持的方法进行进一步的处理，如上文贴出的if !isMetadata这一块，内容过多，不一一贴出；\n\n2.将所有支持的方法存入actions数组中；\n\n3.遍历actions数组，在一个switch语句中，为所有元素定义路由。如贴出的case \"GET\"这一块，首先创建并包装一个handler对象，然后调用WebService的一系列方法，创建一个route对象，将handler绑定到这个route上。后面还有case \"PUT\"、case \"DELETE\"等一系列case，不一一贴出。最后，将route加入routes数组中。\n\n4.遍历routes数组，将route加入WebService中。\n\n5.最后，返回一个APIResource结构体。\n\n这样，Install方法就通过调用registerResourceHandlers方法，完成了WebService与APIResource的绑定。\n\n至此，InstallLegacyAPI方法的逻辑就分析完了。总的来说，这个方法遵循了go-restful的设计模式，在/api路径下注册了WebService，并将WebService加入Container中。\n\n<br>\n\n这是一个非常复杂的方法，整个方法的代码在700行左右。方法的主要功能是通过上一步骤构造的RESTStorage判断该资源可以执行哪些操作（如create、update等），将其对应的操作存入到action，每一个action对应一个标准的rest操作，如create对应的action操作为POST、update对应的action操作为PUT。最终根据actions数组依次遍历，对每一个操作添加一个handler方法，注册到route中去，route注册到webservice中去，完美匹配go-restful的设计模式。\n\n<br>\n\napi开头的路由主要是对基础资源的路由实现，而对于其他附加的资源，如认证相关、网络相关等各种扩展的api资源，统一以apis开头命名，实现入口为`InstallAPIs`。\n `InstallAPIs`与`InstallLegacyAPIGroup`主要的区别是获取RESTStorage的方式。对于api开头的路由来说，都是/api/v1这种统一的格式；而对于apis开头路由则不一样，它包含了多种不同的格式（Kubernetes代码内叫groupName），如/apis/apps、/apis/certificates.k8s.io等各种无规律的groupName。为此，kubernetes提供了一种`RESTStorageProvider`的工厂模式的接口：\n\n```\n// RESTStorageProvider is a factory type for REST storage.\ntype RESTStorageProvider interface {\n\tGroupName() string\n\tNewRESTStorage(apiResourceConfigSource serverstorage.APIResourceConfigSource, restOptionsGetter generic.RESTOptionsGetter) (genericapiserver.APIGroupInfo, bool)\n}\n```\n\n\n 所有以apis开头的路由的资源都需要实现该接口。GroupName()方法获取到的就是类似于/apis/apps、/apis/certificates.k8s.io这样的groupName，NewRESTStorage方法获取到的是相对应的RESTStorage封装后的信息。之后的步骤与api开头的路由一致。\n\n<br>\n\n**简要总结：**\n\n（1）所有资源都必须有 restStorage，其中包含`NewFunc`返回特定资源信息、`NewListFunc`返回特定资源列表、`CreateStrategy`特定资源创建时的策略、`UpdateStrategy`更新时的策略以及`DeleteStrategy`删除时的策略等重要方法。\n\n（2）有了restStorage，调用 `installAPIResources-->InstallREST-->Install-->registerResourceHandlers` 这一条链路，然后注册路由。\n\n### 3. 
总结\n\nKube-apiserver启动其实就是启动一个服务。这里的关键点就是在于如何添加路由。\n\n创建完之后，则开始进行路由的安装，执行`InstallLegacyAPIGroup`方法，主要调用链为`InstallLegacyAPIGroup-->installAPIResources-->InstallREST-->Install-->registerResourceHandlers`，最终核心的路由构造在`registerResourceHandlers`方法内。这是一个非常复杂的方法，整个方法的代码在700行左右。方法的主要功能是通过上一步骤构造的RESTStorage判断该资源可以执行哪些操作（如create、update等），将其对应的操作存入到action，每一个action对应一个标准的rest操作，如create对应的action操作为POST、update对应的action操作为PUT。最终根据actions数组依次遍历，对每一个操作添加一个handler方法，注册到route中去，route注册到webservice中去，完美匹配go-restful的设计模式。\n\n### 4. 参考链接\n\nkubernetes源码解剖： https://weread.qq.com/web/reader/f1e3207071eeeefaf1e138akb5332110237b53b3a3d68d2\n\n博客： https://juejin.cn/post/6844903801934069774"
  },
  {
    "path": "k8s/kube-scheduler/1. kube-scheduler简介.md",
    "content": "Table of Contents\n=================\n\n  * [1. kube-scheduler功能简介](#1-kube-scheduler功能简介)\n  * [2. kube-scheduler调度框架预览](#2-kube-scheduler调度框架预览)\n  * [3. 优选函数，预选函数和plugin的区别](#3-优选函数预选函数和plugin的区别)\n\n### 1. kube-scheduler功能简介\n\nkube-scheduler是kubernetes中的重要的一环，总的来说，它的功能就是：将一个未调度的pod，调度到合适的node节点上。\n\n下面以创建一个Pod为例，简要介绍kube-scheduler在整个过程发挥的作用：\n\n1. 用户通过命令行创建Pod\n2. kube-apiserver经过对象校验、admission、quota等准入操作，写入etcd， 此时Pod的 nodeip字段是空的\n3. kube-apiserver将结果返回给用户\n4. kube-scheduler一直监听节点、Pod（监听nodeip为空的pod），然后进行调度\n5. kubelet监听分配给自己的Pod，调用CRI接口进行Pod创建（该部分内容后续出系列，进行介绍）\n6. kubelet创建Pod后，更新Pod状态等信息，并向kube-apiserver上报\n7. kube-apiserver写入数据\n\n\n\n<br>\n\n### 2. kube-scheduler调度框架预览\n\nkube-scheduler总体框架如下，分为： 选择待调度pod ->  预选  -> 优选  -> 绑定。\n\n在整个的过程中，为了自定义开发，kube-scheduler还在每个部分定义了一些列插件。\n\n预选部分其实就是图中 Filter 插件负责的功能。\n\n优选部分就是 图中 score插件负责的功能。\n\n![scheduler-struct](../images/scheduler-struct.png)\n\n具体每个插件的作用如下：\n\n**preFilter:**  前置过滤插件用于预处理 Pod 的相关信息，或者检查集群或 Pod 必须满足的某些条件。 如果 PreFilter 插件返回错误，则调度周期将终止。\n\n**Filter:** 过滤插件用于过滤出不能运行该 Pod 的节点。对于每个节点， 调度器将按照其配置顺序调用这些过滤插件。如果任何过滤插件将节点标记为不可行， 则不会为该节点调用剩下的过滤插件。节点可以被同时进行评估。\n\n**PostFilter:** 这些插件在筛选阶段后调用，但仅在该 Pod 没有可行的节点时调用。 插件按其配置的顺序调用。如果任何后过滤器插件标记节点为“可调度”， 则其余的插件不会调用。典型的后筛选实现是抢占，试图通过抢占其他 Pod 的资源使该 Pod 可以调度。\n\n**preScore:** 前置评分插件用于执行 “前置评分” 工作，即生成一个可共享状态供评分插件使用。 如果 PreScore 插件返回错误，则调度周期将终止。\n\n**Score:** 评分插件用于对通过过滤阶段的节点进行排名。调度器将为每个节点调用每个评分插件。 将有一个定义明确的整数范围，代表最小和最大分数。 在[标准化评分](https://kubernetes.io/zh/docs/concepts/scheduling-eviction/scheduling-framework/#normalize-scoring)阶段之后，调度器将根据配置的插件权重 合并所有插件的节点分数。\n\n**Normalize socre:** 标准化评分插件用于在调度器计算节点的排名之前修改分数。 在此扩展点注册的插件将使用同一插件的[评分](https://kubernetes.io/zh/docs/concepts/scheduling-eviction/scheduling-framework/#scoring) 结果被调用。 每个插件在每个调度周期调用一次。\n\n**Reserve:** Reserve 是一个信息性的扩展点。 管理运行时状态的插件（也成为“有状态插件”）应该使用此扩展点，以便 调度器在节点给指定 Pod 预留了资源时能够通知该插件。 这是在调度器真正将 Pod 绑定到节点之前发生的，并且它存在是为了防止 在调度器等待绑定成功时发生竞争情况。\n\n**Permit**\n\n*Permit* 插件在每个 Pod 调度周期的最后调用，用于防止或延迟 Pod 的绑定。 
一个允许插件可以做以下三件事之一：\n\n1. **批准**\n   一旦所有 Permit 插件批准 Pod 后，该 Pod 将被发送以进行绑定。\n\n2. **拒绝**\n   如果任何 Permit 插件拒绝 Pod，则该 Pod 将被返回到调度队列。 这将触发[Unreserve](https://kubernetes.io/zh/docs/concepts/scheduling-eviction/scheduling-framework/#unreserve) 插件。\n\n3. **等待**（带有超时）\n   如果一个 Permit 插件返回 “等待” 结果，则 Pod 将保持在一个内部的 “等待中” 的 Pod 列表，同时该 Pod 的绑定周期启动时即直接阻塞直到得到 [批准](https://kubernetes.io/zh/docs/concepts/scheduling-eviction/scheduling-framework/#frameworkhandle)。如果超时发生，**等待** 变成 **拒绝**，并且 Pod 将返回调度队列，从而触发 [Unreserve](https://kubernetes.io/zh/docs/concepts/scheduling-eviction/scheduling-framework/#unreserve) 插件。\n\n**PreBind**：预绑定插件用于执行 Pod 绑定前所需的任何工作。 例如，一个预绑定插件可能需要提供网络卷并且在允许 Pod 运行在该节点之前 将其挂载到目标节点上。\n\n**Bind：**Bind 插件用于将 Pod 绑定到节点上。直到所有的 PreBind 插件都完成，Bind 插件才会被调用。 各绑定插件按照配置顺序被调用。绑定插件可以选择是否处理指定的 Pod。 如果绑定插件选择处理 Pod，**剩余的绑定插件将被跳过**。\n\n**PostBind:** 这是个信息性的扩展点。 绑定后插件在 Pod 成功绑定后被调用。这是绑定周期的结尾，可用于清理相关的资源。\n\n**Unreserve:** 这是个信息性的扩展点。 如果 Pod 被保留，然后在后面的阶段中被拒绝，则 Unreserve 插件将被通知。 Unreserve 插件应该清除保留 Pod 的相关状态。\n\n使用此扩展点的插件通常也使用[Reserve](https://kubernetes.io/zh/docs/concepts/scheduling-eviction/scheduling-framework/#reserve)。\n\n<br>\n\n### 3. 优选函数，预选函数和plugin的区别\n\n经常听到预选函数，优选函数，这些和上面的plugin有什么区别？\n\n个人认为：调度过程是按照上述plugins的顺序执行的。其实预选、优选只是plugins中的2个。\n\n但是为什么会额外有优选，预选函数？\n\n是因为这2个是核心操作。所以额外定义了一些函数，Filter、Score 插件实际上只是调用了预选和优选函数。"
  },
  {
    "path": "k8s/kube-scheduler/2-kube-scheduler源码分析.md",
    "content": "Table of Contents\n=================\n\n   * [1. 源码阅读背景](#1-源码阅读背景)\n   * [2. Kube-scheduler启动过程-源码分析](#2-kube-scheduler启动过程-源码分析)\n      * [2.1 main](#21-main)\n      * [2.2 runCommand](#22-runcommand)\n         * [2.2.1 ApplyFeatureGates](#221-applyfeaturegates)\n         * [2.2 run](#22-run)\n            * [2.2.1 scheduler.New](#221-schedulernew)\n               * [2.2.2.1 默认配置](#2221-默认配置)\n               * [2.2.2.2 schedulerCache](#2222-schedulercache)\n               * [2.2.2.3  注册插件，这是NewDefaultRegistry就注册了所有的插件](#2223--注册插件这是newdefaultregistry就注册了所有的插件)\n               * [2.2.2.4 实例化Scheduler对象](#2224-实例化scheduler对象)\n   * [3 kube-scheduler 调度过程源码分析](#3-kube-scheduler-调度过程源码分析)\n      * [3.1 补充知识](#31-补充知识)\n      * [3.2 scheduler.Run-开始调度](#32-schedulerrun-开始调度)\n      * [3.3 scheduleOne](#33-scheduleone)\n      * [3.4 sched.Algorithm.Schedule](#34-schedalgorithmschedule)\n      * [3.5 podFitsOnNode](#35-podfitsonnode)\n         * [2.5.1 为什么执行2次for循环](#251-为什么执行2次for循环)\n         * [2.5.2 预选函数优先级的定义](#252-预选函数优先级的定义)\n   * [4.有意思的知识点](#4有意思的知识点)\n      * [4.1 这个就是判断是否有实现了接口](#41-这个就是判断是否有实现了接口)\n\n\n\n## 1. 源码阅读背景\n\n因工作需要，需要增加额外的scheduler插件，所以正好趁这个机会，对kube-scheduler组件源码进行分析\n\n## 2. Kube-scheduler启动过程-源码分析\n\n### 2.1 main\n\nMain -> runCommand -> Run\n\n```\nfunc main() {\n\trand.Seed(time.Now().UnixNano())\n\n\tcommand := app.NewSchedulerCommand()     //没有带任何参数\n\n\t// TODO: once we switch everything over to Cobra commands, we can go back to calling\n\t// utilflag.InitFlags() (by removing its pflag.Parse() call). 
For now, we have to set the\n\t// normalize func and add the go flag set by hand.\n\tpflag.CommandLine.SetNormalizeFunc(cliflag.WordSepNormalizeFunc)\n\t// utilflag.InitFlags()\n\tlogs.InitLogs()\n\tdefer logs.FlushLogs()\n\n\tif err := command.Execute(); err != nil {\n\t\tos.Exit(1)\n\t}\n}\n```\n\n### 2.2 runCommand\n\nrunCommand逻辑如下：\n\n（1）补全config\n\n（2）ApplyFeatureGates根据输入参数，设置Predicates(预选), Priorities（优选）参数\n\n（3）往下执行run函数\n\n```\n// runCommand runs the scheduler.\nfunc runCommand(cmd *cobra.Command, args []string, opts *options.Options, registryOptions ...Option) error {\n\t\n\t// 1.补全config\n\t// Get the completed config\n\tcc := c.Complete()\n \n  // 2. ApplyFeatureGates根据输入参数，设置Predicates(预选), Priorities（优选）参数\n\t// Apply algorithms based on feature gates.\n\t// TODO: make configurable?\n\talgorithmprovider.ApplyFeatureGates()\n\n\t// Configz registration.\n\tif cz, err := configz.New(\"componentconfig\"); err == nil {\n\t\tcz.Set(cc.ComponentConfig)\n\t} else {\n\t\treturn fmt.Errorf(\"unable to register configz: %s\", err)\n\t}\n\n\tctx, cancel := context.WithCancel(context.Background())\n\tdefer cancel()\n  \n  // 3. 
RUN\n\treturn Run(ctx, cc, registryOptions...)\n}\n```\n\n#### 2.2.1 ApplyFeatureGates\n\nApplyFeatureGates函数的逻辑就是:\n\n（1）如果开启了 EvenPodsSpread这个FeatureGate (拓扑调度相关), 就会多注册（所有的provider都会注册）EvenPodsSpreadPred这个预选函数，和 EvenPodsSpreadPriority这个优选函数\n（2）如果开启了ResourceLimitsPriorityFunction这个FeatureGate, 就会多注册ResourceLimitsPriority这个优选函数\n\n```\npkg/scheduler/algorithmprovider/plugins.go\n// ApplyFeatureGates applies algorithm by feature gates.\nfunc ApplyFeatureGates() func() {\n\treturn defaults.ApplyFeatureGates()\n}\n\n\n// ApplyFeatureGates函数的逻辑就是\n// 如果开启了 EvenPodsSpread这个FeatureGate (拓扑调度相关), 就会多注册EvenPodsSpreadPred这个预选函数，和 EvenPodsSpreadPriority这个优选函数\n// 如果开启了ResourceLimitsPriorityFunction这个FeatureGate, 就会多注册ResourceLimitsPriority这个优选函数\npkg/scheduler/algorithmprovider/defaults/defaults.go\nfunc ApplyFeatureGates() (restore func()) {\n\tsnapshot := scheduler.RegisteredPredicatesAndPrioritiesSnapshot()\n\t\n\t// Only register EvenPodsSpread predicate & priority if the feature is enabled\n\tif utilfeature.DefaultFeatureGate.Enabled(features.EvenPodsSpread) {\n\t\tklog.Infof(\"Registering EvenPodsSpread predicate and priority function\")\n\t\t// register predicate\n\t\tscheduler.InsertPredicateKeyToAlgorithmProviderMap(predicates.EvenPodsSpreadPred)\n\t\tscheduler.RegisterFitPredicate(predicates.EvenPodsSpreadPred, predicates.EvenPodsSpreadPredicate)\n\t\t// register priority\n\t\tscheduler.InsertPriorityKeyToAlgorithmProviderMap(priorities.EvenPodsSpreadPriority)\n\t\tscheduler.RegisterPriorityMapReduceFunction(\n\t\t\tpriorities.EvenPodsSpreadPriority,\n\t\t\tpriorities.CalculateEvenPodsSpreadPriorityMap,\n\t\t\tpriorities.CalculateEvenPodsSpreadPriorityReduce,\n\t\t\t1,\n\t\t)\n\t}\n\n\t// Prioritizes nodes that satisfy pod's resource limits\n\tif utilfeature.DefaultFeatureGate.Enabled(features.ResourceLimitsPriorityFunction) {\n\t\tklog.Infof(\"Registering resourcelimits priority 
function\")\n\t\tscheduler.RegisterPriorityMapReduceFunction(priorities.ResourceLimitsPriority, priorities.ResourceLimitsPriorityMap, nil, 1)\n\t\t// Register the priority function to specific provider too.\n\t\tscheduler.InsertPriorityKeyToAlgorithmProviderMap(scheduler.RegisterPriorityMapReduceFunction(priorities.ResourceLimitsPriority, priorities.ResourceLimitsPriorityMap, nil, 1))\n\t}\n\n\trestore = func() {\n\t\tscheduler.ApplyPredicatesAndPriorities(snapshot)\n\t}\n\treturn\n}\n\n\n// 每个Provider都注册了\n// InsertPredicateKeyToAlgorithmProviderMap insert a fit predicate key to all algorithmProviders which in algorithmProviderMap.\nfunc InsertPredicateKeyToAlgorithmProviderMap(key string) {\n\tschedulerFactoryMutex.Lock()\n\tdefer schedulerFactoryMutex.Unlock()\n\n\tfor _, provider := range algorithmProviderMap {\n\t\tprovider.FitPredicateKeys.Insert(key)\n\t}\n}\n```\n\n<br>\n\nApplyFeatureGates所在的包有一个init函数, 这个注册了默认的优选，预选函数。\n\n会往DefaultProvider, ClusterAutoscalerProvider注册\n\n```\npkg/scheduler/algorithmprovider/defaults/defaults.go\n\nfunc init() {\n\tregisterAlgorithmProvider(defaultPredicates(), defaultPriorities())\n}\n\nconst (\n\t// DefaultProvider defines the default algorithm provider name.\n\tDefaultProvider = \"DefaultProvider\"\n\t\n\t// ClusterAutoscalerProvider defines the default autoscaler provider\n\tClusterAutoscalerProvider = \"ClusterAutoscalerProvider\"\n)\n\n\n\n注册就是往默认的DefaultProvider，ClusterAutoscalerProvider注册这些函数\nfunc registerAlgorithmProvider(predSet, priSet sets.String) {\n\t// Registers algorithm providers. 
By default we use 'DefaultProvider', but user can specify one to be used\n\t// by specifying flag.\n\tscheduler.RegisterAlgorithmProvider(scheduler.DefaultProvider, predSet, priSet)\n\t// Cluster autoscaler friendly scheduling algorithm.\n\tscheduler.RegisterAlgorithmProvider(ClusterAutoscalerProvider, predSet,\n\t\tcopyAndReplace(priSet, priorities.LeastRequestedPriority, priorities.MostRequestedPriority))\n}\n\n\n// RegisterFitPredicateFactory registers a fit predicate factory with the\n// algorithm registry. Returns the name with which the predicate was registered.\nfunc RegisterFitPredicateFactory(name string, predicateFactory FitPredicateFactory) string {\n\tschedulerFactoryMutex.Lock()\n\tdefer schedulerFactoryMutex.Unlock()\n\tvalidateAlgorithmNameOrDie(name)\n\tfitPredicateMap[name] = predicateFactory\n\treturn name\n}\n\n\n\n// 默认的11个预选函数\nfunc defaultPredicates() sets.String {\n\treturn sets.NewString(\n\t\tpredicates.NoVolumeZoneConflictPred,\n\t\tpredicates.MaxEBSVolumeCountPred,\n\t\tpredicates.MaxGCEPDVolumeCountPred,\n\t\tpredicates.MaxAzureDiskVolumeCountPred,\n\t\tpredicates.MaxCSIVolumeCountPred,\n\t\tpredicates.MatchInterPodAffinityPred,\n\t\tpredicates.NoDiskConflictPred,\n\t\tpredicates.GeneralPred,\n\t\tpredicates.PodToleratesNodeTaintsPred,\n\t\tpredicates.CheckVolumeBindingPred,\n\t\tpredicates.CheckNodeUnschedulablePred,\n\t)\n}\n\n// 默认的8个优选函数\nfunc defaultPriorities() sets.String {\n\treturn sets.NewString(\n\t\tpriorities.SelectorSpreadPriority,\n\t\tpriorities.InterPodAffinityPriority,\n\t\tpriorities.LeastRequestedPriority,\n\t\tpriorities.BalancedResourceAllocation,\n\t\tpriorities.NodePreferAvoidPodsPriority,\n\t\tpriorities.NodeAffinityPriority,\n\t\tpriorities.TaintTolerationPriority,\n\t\tpriorities.ImageLocalityPriority,\n\t)\n}\n```\n\n#### 2.2 run\n\n（1）准备event-client，用于上报event\n\n（2）初始化scheduler. 
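这里 WithName、WithAlgorithmSource 等 With 开头的函数用的是 Go 里常见的 functional options 模式。下面给出一个脱离 k8s 依赖的简化示例（其中 schedulerOptions、WithPreemptionDisabled 等均为仿照源码简化的假设定义，并非真实代码）：

```go
package main

import "fmt"

// schedulerOptions 模拟 scheduler 的可选配置（假设的简化结构）
type schedulerOptions struct {
	schedulerName     string
	disablePreemption bool
}

// Option 就是一个修改配置的函数
type Option func(*schedulerOptions)

// WithName 返回一个 Option，调用时把 schedulerName 写入配置
func WithName(name string) Option {
	return func(o *schedulerOptions) { o.schedulerName = name }
}

func WithPreemptionDisabled(disabled bool) Option {
	return func(o *schedulerOptions) { o.disablePreemption = disabled }
}

// New 先取默认配置，再依次应用每个 Option，
// 对应源码中 options := defaultSchedulerOptions 加 for 循环应用 opts 的写法
func New(opts ...Option) schedulerOptions {
	options := schedulerOptions{schedulerName: "default-scheduler"} // 默认值
	for _, opt := range opts {
		opt(&options)
	}
	return options
}

func main() {
	fmt.Println(New().schedulerName) // default-scheduler
	o := New(WithName("my-scheduler"), WithPreemptionDisabled(true))
	fmt.Println(o.schedulerName, o.disablePreemption) // my-scheduler true
}
```

这种写法的好处是：新增配置项只需要加一个 With 函数，New 的签名保持不变。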
WithName,  With开头的函数就是一个Option,结合WithName来看，这里实际就是将默认的 schedulerName赋值给 cc\n\n（3）start 所有informers\n\n（4）运行所有的sched.Run函数\n\n这里核心分析 初始化scheduler做了什么\n\n```\nfunc Run(ctx context.Context, cc schedulerserverconfig.CompletedConfig, outOfTreeRegistryOptions ...Option) error {\n\t// To help debugging, immediately log version\n\tklog.V(1).Infof(\"Starting Kubernetes Scheduler version %+v\", version.Get())\n  \n  // outOfTreeRegistryOptions参数是看起来是NewSchedulerCommand带过来的，为空的，这个先不管\n  // outOfTreeRegistry也是make一个空的map\n\toutOfTreeRegistry := make(framework.Registry)\n\tfor _, option := range outOfTreeRegistryOptions {\n\t\tif err := option(outOfTreeRegistry); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n  \n  // 1. 准备event-client，用于上报event\n  // Prepare event clients.\n\tif _, err := cc.Client.Discovery().ServerResourcesForGroupVersion(eventsv1beta1.SchemeGroupVersion.String()); err == nil {\n\t\tcc.Broadcaster = events.NewBroadcaster(&events.EventSinkImpl{Interface: cc.EventClient.Events(\"\")})\n\t\tcc.Recorder = cc.Broadcaster.NewRecorder(scheme.Scheme, cc.ComponentConfig.SchedulerName)\n\t} else {\n\t\trecorder := cc.CoreBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: cc.ComponentConfig.SchedulerName})\n\t\tcc.Recorder = record.NewEventRecorderAdapter(recorder)\n\t}\n\t\n\t\n  // 2.初始化scheduler.\n  // WithName,With...等等 其实就是一个Option,结合new来看，这里实际就是将默认的 schedulerName赋值给 cc\n  \n\t// Create the scheduler.\n\tsched, err := 
scheduler.New(cc.Client,\n\t\tcc.InformerFactory,\n\t\tcc.PodInformer,\n\t\tcc.Recorder,\n\t\tctx.Done(),\n\t\tscheduler.WithName(cc.ComponentConfig.SchedulerName),\n\t\tscheduler.WithAlgorithmSource(cc.ComponentConfig.AlgorithmSource),\n\t\tscheduler.WithHardPodAffinitySymmetricWeight(cc.ComponentConfig.HardPodAffinitySymmetricWeight),\n\t\tscheduler.WithPreemptionDisabled(cc.ComponentConfig.DisablePreemption),\n\t\tscheduler.WithPercentageOfNodesToScore(cc.ComponentConfig.PercentageOfNodesToScore),\n\t\tscheduler.WithBindTimeoutSeconds(cc.ComponentConfig.BindTimeoutSeconds),\n\t\tscheduler.WithFrameworkOutOfTreeRegistry(outOfTreeRegistry),\n\t\tscheduler.WithFrameworkPlugins(cc.ComponentConfig.Plugins),\n\t\tscheduler.WithFrameworkPluginConfig(cc.ComponentConfig.PluginConfig),\n\t\tscheduler.WithPodMaxBackoffSeconds(cc.ComponentConfig.PodMaxBackoffSeconds),\n\t\tscheduler.WithPodInitialBackoffSeconds(cc.ComponentConfig.PodInitialBackoffSeconds),\n\t)\n\tif err != nil {\n\t\treturn err\n\t}\n\n\t// Prepare the event broadcaster.\n\tif cc.Broadcaster != nil && cc.EventClient != nil {\n\t\tcc.Broadcaster.StartRecordingToSink(ctx.Done())\n\t}\n\tif cc.CoreBroadcaster != nil && cc.CoreEventClient != nil {\n\t\tcc.CoreBroadcaster.StartRecordingToSink(&corev1.EventSinkImpl{Interface: cc.CoreEventClient.Events(\"\")})\n\t}\n\t// Setup healthz checks.\n\tvar checks []healthz.HealthChecker\n\tif cc.ComponentConfig.LeaderElection.LeaderElect {\n\t\tchecks = append(checks, cc.LeaderElection.WatchDog)\n\t}\n\n\t// Start up the healthz server.\n\tif cc.InsecureServing != nil {\n\t\tseparateMetrics := cc.InsecureMetricsServing != nil\n\t\thandler := buildHandlerChain(newHealthzHandler(&cc.ComponentConfig, separateMetrics, checks...), nil, nil)\n\t\tif err := cc.InsecureServing.Serve(handler, 0, ctx.Done()); err != nil {\n\t\t\treturn fmt.Errorf(\"failed to start healthz server: %v\", err)\n\t\t}\n\t}\n\tif cc.InsecureMetricsServing != nil {\n\t\thandler := 
buildHandlerChain(newMetricsHandler(&cc.ComponentConfig), nil, nil)\n\t\tif err := cc.InsecureMetricsServing.Serve(handler, 0, ctx.Done()); err != nil {\n\t\t\treturn fmt.Errorf(\"failed to start metrics server: %v\", err)\n\t\t}\n\t}\n\tif cc.SecureServing != nil {\n\t\thandler := buildHandlerChain(newHealthzHandler(&cc.ComponentConfig, false, checks...), cc.Authentication.Authenticator, cc.Authorization.Authorizer)\n\t\t// TODO: handle stoppedCh returned by c.SecureServing.Serve\n\t\tif _, err := cc.SecureServing.Serve(handler, 0, ctx.Done()); err != nil {\n\t\t\t// fail early for secure handlers, removing the old error loop from above\n\t\t\treturn fmt.Errorf(\"failed to start secure server: %v\", err)\n\t\t}\n\t}\n  \n  // 3. start 所有informers\n\t// Start all informers.\n\tgo cc.PodInformer.Informer().Run(ctx.Done())\n\tcc.InformerFactory.Start(ctx.Done())\n\n\t// Wait for all caches to sync before scheduling.\n\tcc.InformerFactory.WaitForCacheSync(ctx.Done())\n  \n  \n  // 4. 运行所有的sched.Run函数\n\t// If leader election is enabled, runCommand via LeaderElector until done and exit.\n\tif cc.LeaderElection != nil {\n\t\tcc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{\n\t\t\tOnStartedLeading: sched.Run,\n\t\t\tOnStoppedLeading: func() {\n\t\t\t\tklog.Fatalf(\"leaderelection lost\")\n\t\t\t},\n\t\t}\n\t\tleaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection)\n\t\tif err != nil {\n\t\t\treturn fmt.Errorf(\"couldn't create leader elector: %v\", err)\n\t\t}\n\n\t\tleaderElector.Run(ctx)\n\n\t\treturn fmt.Errorf(\"lost lease\")\n\t}\n\n\t// Leader election is disabled, so runCommand inline until done.\n\tsched.Run(ctx)\n\treturn fmt.Errorf(\"finished without leader elect\")\n}\n```\n\n##### 2.2.1 scheduler.New\n\n（1）创建默认配置，是否开启驱逐等等配置\n\n（2）schedulerCache.new，主要功能就是清理过期的cache\n\n（3）注册插件，这是NewDefaultRegistry就注册了所有的插件\n\n（4）实例化Scheduler对象\n\n```\nWithName 其实就是一个Option,结合new来看，这里实际就是将默认的 schedulerName赋值给 cc\n// WithName sets schedulerName for 
Scheduler, the default schedulerName is default-scheduler\nfunc WithName(schedulerName string) Option {\n\treturn func(o *schedulerOptions) {\n\t\to.schedulerName = schedulerName\n\t}\n}\n\n// New returns a Scheduler\nfunc New(client clientset.Interface,\n\tinformerFactory informers.SharedInformerFactory,\n\tpodInformer coreinformers.PodInformer,\n\trecorder events.EventRecorder,\n\tstopCh <-chan struct{},\n\topts ...Option) (*Scheduler, error) {\n\n\tstopEverything := stopCh\n\tif stopEverything == nil {\n\t\tstopEverything = wait.NeverStop\n\t}\n  \n  // 2.3.1 创建默认配置\n\toptions := defaultSchedulerOptions\n\tfor _, opt := range opts {\n\t\topt(&options)\n\t}\n  \n  // 2.3.2. schedulerCache，主要功能就是清理过期的pod\n\tschedulerCache := internalcache.New(30*time.Second, stopEverything)\n\tvolumeBinder := volumebinder.NewVolumeBinder(\n\t\tclient,\n\t\tinformerFactory.Core().V1().Nodes(),\n\t\tinformerFactory.Storage().V1().CSINodes(),\n\t\tinformerFactory.Core().V1().PersistentVolumeClaims(),\n\t\tinformerFactory.Core().V1().PersistentVolumes(),\n\t\tinformerFactory.Storage().V1().StorageClasses(),\n\t\ttime.Duration(options.bindTimeoutSeconds)*time.Second,\n\t)\n\n  // 2.3.3 注册插件，这是NewDefaultRegistry就注册了所有的插件\n  // 默认的scheduler插件是空的，所以会调用NewDefaultRegistry函数\n\tregistry := options.frameworkDefaultRegistry\n\tif registry == nil {\n\t\tregistry = frameworkplugins.NewDefaultRegistry(&frameworkplugins.RegistryArgs{\n\t\t\tVolumeBinder: volumeBinder,\n\t\t})\n\t}\n\t// frameworkOutOfTreeRegistry指定的时候也是空的\n\tregistry.Merge(options.frameworkOutOfTreeRegistry)\n\n\n\tsnapshot := nodeinfosnapshot.NewEmptySnapshot()\n\n\tconfigurator := &Configurator{\n\t\tclient:                         client,\n\t\tinformerFactory:                informerFactory,\n\t\tpodInformer:                    podInformer,\n\t\tvolumeBinder:                   volumeBinder,\n\t\tschedulerCache:                 schedulerCache,\n\t\tStopEverything:                 
stopEverything,\n\t\thardPodAffinitySymmetricWeight: options.hardPodAffinitySymmetricWeight,\n\t\tdisablePreemption:              options.disablePreemption,\n\t\tpercentageOfNodesToScore:       options.percentageOfNodesToScore,\n\t\tbindTimeoutSeconds:             options.bindTimeoutSeconds,\n\t\tpodInitialBackoffSeconds:       options.podInitialBackoffSeconds,\n\t\tpodMaxBackoffSeconds:           options.podMaxBackoffSeconds,\n\t\tenableNonPreempting:            utilfeature.DefaultFeatureGate.Enabled(kubefeatures.NonPreemptingPriority),\n\t\tregistry:                       registry,\n\t\tplugins:                        options.frameworkPlugins,\n\t\tpluginConfig:                   options.frameworkPluginConfig,\n\t\tpluginConfigProducerRegistry:   options.frameworkConfigProducerRegistry,\n\t\tnodeInfoSnapshot:               snapshot,\n\t\talgorithmFactoryArgs: AlgorithmFactoryArgs{\n\t\t\tSharedLister:                   snapshot,\n\t\t\tInformerFactory:                informerFactory,\n\t\t\tVolumeBinder:                   volumeBinder,\n\t\t\tHardPodAffinitySymmetricWeight: options.hardPodAffinitySymmetricWeight,\n\t\t},\n\t\tconfigProducerArgs: &frameworkplugins.ConfigProducerArgs{},\n\t}\n\n\tmetrics.Register()\n  \n  // 2.3.4 实例化Scheduler对象\n\tvar sched *Scheduler\n\tsource := options.schedulerAlgorithmSource\n\tswitch {\n\tcase source.Provider != nil:\n\t\t// Create the config from a named algorithm provider.\n\t\tsc, err := configurator.CreateFromProvider(*source.Provider)\n\t\tif err != nil {\n\t\t\treturn nil, fmt.Errorf(\"couldn't create scheduler using provider %q: %v\", *source.Provider, err)\n\t\t}\n\t\tsched = sc\n\tcase source.Policy != nil:\n\t\t// Create the config from a user specified policy source.\n\t\tpolicy := &schedulerapi.Policy{}\n\t\tswitch {\n\t\tcase source.Policy.File != nil:\n\t\t\tif err := initPolicyFromFile(source.Policy.File.Path, policy); err != nil {\n\t\t\t\treturn nil, err\n\t\t\t}\n\t\tcase source.Policy.ConfigMap != 
nil:\n\t\t\tif err := initPolicyFromConfigMap(client, source.Policy.ConfigMap, policy); err != nil {\n\t\t\t\treturn nil, err\n\t\t\t}\n\t\t}\n\t\tsc, err := configurator.CreateFromConfig(*policy)\n\t\tif err != nil {\n\t\t\treturn nil, fmt.Errorf(\"couldn't create scheduler from policy: %v\", err)\n\t\t}\n\t\tsched = sc\n\tdefault:\n\t\treturn nil, fmt.Errorf(\"unsupported algorithm source: %v\", source)\n\t}\n\t// Additional tweaks to the config produced by the configurator.\n\tsched.Recorder = recorder\n\tsched.DisablePreemption = options.disablePreemption\n\tsched.StopEverything = stopEverything\n\tsched.podConditionUpdater = &podConditionUpdaterImpl{client}\n\tsched.podPreemptor = &podPreemptorImpl{client}\n\tsched.scheduledPodsHasSynced = podInformer.Informer().HasSynced\n\n\tAddAllEventHandlers(sched, options.schedulerName, informerFactory, podInformer)\n\treturn sched, nil\n}\n```\n\n###### 2.2.2.1 默认配置\n\n（1）默认 scheduler 名字为 default-scheduler\n\n（2）默认不禁用抢占（disablePreemption: false，即默认开启抢占）\n\n（3）默认的绑定超时、退避时间等\n\n（4）这里没有默认的 frameworkPlugins，因为 predicate/priority-mapped plugins 后面会通过 pluginConfigProducerRegistry 注册\n\n```\nvar defaultSchedulerOptions = schedulerOptions{\n   schedulerName: v1.DefaultSchedulerName,\n   schedulerAlgorithmSource: schedulerapi.SchedulerAlgorithmSource{\n      Provider: defaultAlgorithmSourceProviderName(),\n   },\n   hardPodAffinitySymmetricWeight:  v1.DefaultHardPodAffinitySymmetricWeight,\n   disablePreemption:               false,\n   percentageOfNodesToScore:        schedulerapi.DefaultPercentageOfNodesToScore,\n   bindTimeoutSeconds:              BindTimeoutSeconds,\n   podInitialBackoffSeconds:        int64(internalqueue.DefaultPodInitialBackoffDuration.Seconds()),\n   podMaxBackoffSeconds:            int64(internalqueue.DefaultPodMaxBackoffDuration.Seconds()),\n   frameworkConfigProducerRegistry: frameworkplugins.NewDefaultConfigProducerRegistry(),\n   // The plugins and pluginConfig options are currently nil because we currently don't have\n   // \"default\" plugins. 
All plugins that we run through the framework currently come from two\n   // sources: 1) specified in component config, in which case those two options should be\n   // set using their corresponding With* functions, 2) predicate/priority-mapped plugins, which\n   // pluginConfigProducerRegistry contains a mapping for and produces their configurations.\n   // TODO(ahg-g) Once predicates and priorities are migrated to natively run as plugins, the\n   // below two parameters will be populated accordingly.\n   frameworkPlugins:      nil,\n   frameworkPluginConfig: nil,\n}\n```\n\n###### 2.2.2.2 schedulerCache\n\n主要功能就是清理过期的cache\n\n```\nschedulerCache := internalcache.New(30*time.Second, stopEverything)\n\n// New returns a Cache implementation.\n// It automatically starts a go routine that manages expiration of assumed pods.\n// \"ttl\" is how long the assumed pod will get expired.\n// \"stop\" is the channel that would close the background goroutine.\nfunc New(ttl time.Duration, stop <-chan struct{}) Cache {\n\tcache := newSchedulerCache(ttl, cleanAssumedPeriod, stop)\n\tcache.run()\n\treturn cache\n}\n\n\nfunc (cache *schedulerCache) run() {\n\tgo wait.Until(cache.cleanupExpiredAssumedPods, cache.period, cache.stop)\n}\n\nfunc (cache *schedulerCache) cleanupExpiredAssumedPods() {\n\tcache.cleanupAssumedPods(time.Now())\n}\n\n// cleanupAssumedPods exists for making test deterministic by taking time as input argument.\n// It also reports metrics on the cache size for nodes, pods, and assumed pods.\nfunc (cache *schedulerCache) cleanupAssumedPods(now time.Time) {\n\tcache.mu.Lock()\n\tdefer cache.mu.Unlock()\n\tdefer cache.updateMetrics()\n\n\t// The size of assumedPods should be small\n\tfor key := range cache.assumedPods {\n\t\tps, ok := cache.podStates[key]\n\t\tif !ok {\n\t\t\tpanic(\"Key found in assumed set but not in podStates. Potentially a logical error.\")\n\t\t}\n\t\tif !ps.bindingFinished {\n\t\t\tklog.V(3).Infof(\"Couldn't expire cache for pod %v/%v. 
Binding is still in progress.\",\n\t\t\t\tps.pod.Namespace, ps.pod.Name)\n\t\t\tcontinue\n\t\t}\n\t\tif now.After(*ps.deadline) {\n\t\t\tklog.Warningf(\"Pod %s/%s expired\", ps.pod.Namespace, ps.pod.Name)\n\t\t\tif err := cache.expirePod(key, ps); err != nil {\n\t\t\t\tklog.Errorf(\"ExpirePod failed for %s: %v\", key, err)\n\t\t\t}\n\t\t}\n\t}\n}\n```\n\n###### 2.2.2.3  注册插件，这是NewDefaultRegistry就注册了所有的插件\n\n```\n// NewDefaultRegistry builds the default registry with all the in-tree plugins.\n// This is the registry that Kubernetes default scheduler uses. A scheduler that runs out of tree\n// plugins can register additional plugins through the WithFrameworkOutOfTreeRegistry option.\nfunc NewDefaultRegistry(args *RegistryArgs) framework.Registry {\n\treturn framework.Registry{\n\t\tdefaultpodtopologyspread.Name:        defaultpodtopologyspread.New,\n\t\timagelocality.Name:                   imagelocality.New,\n\t\ttainttoleration.Name:                 tainttoleration.New,\n\t\tnodename.Name:                        nodename.New,\n\t\tnodeports.Name:                       nodeports.New,\n\t\tnodepreferavoidpods.Name:             nodepreferavoidpods.New,\n\t\tnodeaffinity.Name:                    nodeaffinity.New,\n\t\tpodtopologyspread.Name:               podtopologyspread.New,\n\t\tnodeunschedulable.Name:               nodeunschedulable.New,\n\t\tnoderesources.FitName:                noderesources.NewFit,\n\t\tnoderesources.BalancedAllocationName: noderesources.NewBalancedAllocation,\n\t\tnoderesources.MostAllocatedName:      noderesources.NewMostAllocated,\n\t\tnoderesources.LeastAllocatedName:     noderesources.NewLeastAllocated,\n\t\tvolumebinding.Name: func(_ *runtime.Unknown, _ framework.FrameworkHandle) (framework.Plugin, error) {\n\t\t\treturn volumebinding.NewFromVolumeBinder(args.VolumeBinder), nil\n\t\t},\n\t\tvolumerestrictions.Name:        volumerestrictions.New,\n\t\tvolumezone.Name:                volumezone.New,\n\t\tnodevolumelimits.CSIName:       
nodevolumelimits.NewCSI,\n\t\tnodevolumelimits.EBSName:       nodevolumelimits.NewEBS,\n\t\tnodevolumelimits.GCEPDName:     nodevolumelimits.NewGCEPD,\n\t\tnodevolumelimits.AzureDiskName: nodevolumelimits.NewAzureDisk,\n\t\tnodevolumelimits.CinderName:    nodevolumelimits.NewCinder,\n\t\tinterpodaffinity.Name:          interpodaffinity.New,\n\t\tnodelabel.Name:                 nodelabel.New,\n\t\trequestedtocapacityratio.Name:  requestedtocapacityratio.New,\n\t\tserviceaffinity.Name:           serviceaffinity.New,\n\t}\n}\n```\n\n随便找一个插件，比如node_label\n\npkg/scheduler/framework/plugins/nodelabel/node_label.go\n\n```\n\nvar _ framework.FilterPlugin = &NodeLabel{}\nvar _ framework.ScorePlugin = &NodeLabel{}\n\n// Name returns name of the plugin. It is used in logs, etc.\nfunc (pl *NodeLabel) Name() string {\n\treturn Name\n}\n\n// Filter invoked at the filter extension point.\nfunc (pl *NodeLabel) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *nodeinfo.NodeInfo) *framework.Status {\n\t// Note that NodeLabelPredicate doesn't use predicate metadata, hence passing nil here.\n\t_, reasons, err := pl.predicate(pod, nil, nodeInfo)\n\treturn migration.PredicateResultToFrameworkStatus(reasons, err)\n}\n\n// Score invoked at the score extension point.\nfunc (pl *NodeLabel) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {\n\tnodeInfo, err := pl.handle.SnapshotSharedLister().NodeInfos().Get(nodeName)\n\tif err != nil {\n\t\treturn 0, framework.NewStatus(framework.Error, fmt.Sprintf(\"getting node %q from Snapshot: %v\", nodeName, err))\n\t}\n\t// Note that node label priority function doesn't use metadata, hence passing nil here.\n\ts, err := pl.prioritize(pod, nil, nodeInfo)\n\treturn s.Score, migration.ErrorToFrameworkStatus(err)\n}\n\n// ScoreExtensions of the Score plugin.\nfunc (pl *NodeLabel) ScoreExtensions() framework.ScoreExtensions {\n\treturn 
nil\n}\n\n```\n\n**补充说明**：ScoreExtensions函数的作用，就是对score进行归一化，或者再处理。\n\n可以认为：ScoreExtensions是对 score的再次处理，参考这个issue:\n\nhttps://blog.csdn.net/weixin_42663840/article/details/114791229\n\n###### 2.2.2.4 实例化Scheduler对象\n\n这里根据不同的配置生成对应的`sched`，总共有两种方式初始化，第一种是默认的`DefaultProvider`，第二种`Policy`，policy有两种形式加载，包括从文件和ConfigMap。这里先分析默认的方式。\n\n```\n// CreateFromProvider creates a scheduler from the name of a registered algorithm provider.\nfunc (c *Configurator) CreateFromProvider(providerName string) (*Scheduler, error) {\n\tklog.V(2).Infof(\"Creating scheduler from algorithm provider '%v'\", providerName)\n\tprovider, err := GetAlgorithmProvider(providerName)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\treturn c.CreateFromKeys(provider.FitPredicateKeys, provider.PriorityFunctionKeys, []algorithm.SchedulerExtender{})\n}\n\n\n// CreateFromKeys creates a scheduler from a set of registered fit predicate keys and priority keys.\nfunc (c *Configurator) CreateFromKeys(predicateKeys, priorityKeys sets.String, extenders []algorithm.SchedulerExtender) (*Scheduler, error) {\n\tklog.V(2).Infof(\"Creating scheduler with fit predicates '%v' and priority functions '%v'\", predicateKeys, priorityKeys)\n\n\tif c.GetHardPodAffinitySymmetricWeight() < 1 || c.GetHardPodAffinitySymmetricWeight() > 100 {\n\t\treturn nil, fmt.Errorf(\"invalid hardPodAffinitySymmetricWeight: %d, must be in the range 1-100\", c.GetHardPodAffinitySymmetricWeight())\n\t}\n\n\tpredicateFuncs, pluginsForPredicates, pluginConfigForPredicates, err := c.getPredicateConfigs(predicateKeys)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\tpriorityConfigs, pluginsForPriorities, pluginConfigForPriorities, err := c.getPriorityConfigs(priorityKeys)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\tpriorityMetaProducer, err := getPriorityMetadataProducer(c.algorithmFactoryArgs)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\tpredicateMetaProducer, err := getPredicateMetadataProducer(c.algorithmFactoryArgs)\n\tif err != nil 
{\n\t\treturn nil, err\n\t}\n\n\t// Combine all framework configurations. If this results in any duplication, framework\n\t// instantiation should fail.\n\tvar plugins schedulerapi.Plugins\n\tplugins.Append(pluginsForPredicates)\n\tplugins.Append(pluginsForPriorities)\n\tplugins.Append(c.plugins)\n\tvar pluginConfig []schedulerapi.PluginConfig\n\tpluginConfig = append(pluginConfig, pluginConfigForPredicates...)\n\tpluginConfig = append(pluginConfig, pluginConfigForPriorities...)\n\tpluginConfig = append(pluginConfig, c.pluginConfig...)\n\n\tframework, err := framework.NewFramework(\n\t\tc.registry,\n\t\t&plugins,\n\t\tpluginConfig,\n\t\tframework.WithClientSet(c.client),\n\t\tframework.WithInformerFactory(c.informerFactory),\n\t\tframework.WithSnapshotSharedLister(c.nodeInfoSnapshot),\n\t)\n\tif err != nil {\n\t\tklog.Fatalf(\"error initializing the scheduling framework: %v\", err)\n\t}\n\n\tpodQueue := internalqueue.NewSchedulingQueue(\n\t\tc.StopEverything,\n\t\tframework,\n\t\tinternalqueue.WithPodInitialBackoffDuration(time.Duration(c.podInitialBackoffSeconds)*time.Second),\n\t\tinternalqueue.WithPodMaxBackoffDuration(time.Duration(c.podMaxBackoffSeconds)*time.Second),\n\t)\n\n\t// Setup cache debugger.\n\tdebugger := cachedebugger.New(\n\t\tc.informerFactory.Core().V1().Nodes().Lister(),\n\t\tc.podInformer.Lister(),\n\t\tc.schedulerCache,\n\t\tpodQueue,\n\t)\n\tdebugger.ListenForSignal(c.StopEverything)\n\n\tgo func() {\n\t\t<-c.StopEverything\n\t\tpodQueue.Close()\n\t}()\n\n\talgo := 
core.NewGenericScheduler(\n\t\tc.schedulerCache,\n\t\tpodQueue,\n\t\tpredicateFuncs,\n\t\tpredicateMetaProducer,\n\t\tpriorityConfigs,\n\t\tpriorityMetaProducer,\n\t\tc.nodeInfoSnapshot,\n\t\tframework,\n\t\textenders,\n\t\tc.volumeBinder,\n\t\tc.informerFactory.Core().V1().PersistentVolumeClaims().Lister(),\n\t\tGetPodDisruptionBudgetLister(c.informerFactory),\n\t\tc.alwaysCheckAllPredicates,\n\t\tc.disablePreemption,\n\t\tc.percentageOfNodesToScore,\n\t\tc.enableNonPreempting,\n\t)\n\n\treturn &Scheduler{\n\t\tSchedulerCache:  c.schedulerCache,\n\t\tAlgorithm:       algo,\n\t\tGetBinder:       getBinderFunc(c.client, extenders),\n\t\tFramework:       framework,\n\t\tNextPod:         internalqueue.MakeNextPodFunc(podQueue),\n\t\tError:           MakeDefaultErrorFunc(c.client, podQueue, c.schedulerCache),\n\t\tStopEverything:  c.StopEverything,\n\t\tVolumeBinder:    c.volumeBinder,\n\t\tSchedulingQueue: podQueue,\n\t\tPlugins:         plugins,\n\t\tPluginConfig:    pluginConfig,\n\t}, nil\n}\n```\n\n<br>\n\n## 3 kube-scheduler 调度过程源码分析\n\n### 3.1 补充知识\n\n参考：https://kubernetes.io/zh/docs/concepts/scheduling-eviction/scheduling-framework/\n\n一个 pod 的调度流程如下所示：\n\n调度过程（预选 -> 优选） -> 绑定过程\n\n**PreFilter:** 前置过滤插件用于预处理 Pod 的相关信息，或者检查集群或 Pod 必须满足的某些条件。 如果 PreFilter 插件返回错误，则调度周期将终止。\n\n**Filter:** 过滤插件用于过滤出不能运行该 Pod 的节点。对于每个节点， 调度器将按照其配置顺序调用这些过滤插件。如果任何过滤插件将节点标记为不可行， 则不会为该节点调用剩下的过滤插件。节点可以被同时进行评估。\n\n**PostFilter:** 这些插件在筛选阶段后调用，但仅在该 Pod 没有可行的节点时调用。 插件按其配置的顺序调用。如果任何后过滤器插件标记节点为“可调度”， 则其余的插件不会调用。典型的后筛选实现是抢占，试图通过抢占其他 Pod 的资源使该 Pod 可以调度。\n\n**PreScore:** 前置评分插件用于执行 “前置评分” 工作，即生成一个可共享状态供评分插件使用。 如果 PreScore 插件返回错误，则调度周期将终止。\n\n**Score:** 评分插件用于对通过过滤阶段的节点进行排名。调度器将为每个节点调用每个评分插件。 将有一个定义明确的整数范围，代表最小和最大分数。 在[标准化评分](https://kubernetes.io/zh/docs/concepts/scheduling-eviction/scheduling-framework/#normalize-scoring)阶段之后，调度器将根据配置的插件权重 合并所有插件的节点分数。\n\n**Normalize Score:** 标准化评分插件用于在调度器计算节点的排名之前修改分数。 
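**补充示例**：为了直观理解“标准化评分”在做什么，下面给出一段独立的 Go 示意代码（非 k8s 源码，`normalizeScores`、`maxNodeScore` 等名字为示例自拟）：把某个插件打出的任意范围原始分，线性缩放到统一的 [0, 100] 区间，之后调度器才能按权重合并各插件的分数。

```go
package main

import "fmt"

// maxNodeScore 对应 k8s 调度框架中节点得分的上限（100）。
const maxNodeScore = 100

// normalizeScores 把一组原始分数按最大值线性缩放到 [0, maxNodeScore]。
// 这只是示意：真实插件在 NormalizeScore 扩展点里做类似的换算。
func normalizeScores(scores []int64) []int64 {
	var max int64
	for _, s := range scores {
		if s > max {
			max = s
		}
	}
	out := make([]int64, len(scores))
	if max == 0 {
		// 全为 0 时无需缩放，直接返回全 0
		return out
	}
	for i, s := range scores {
		out[i] = s * maxNodeScore / max
	}
	return out
}

func main() {
	raw := []int64{20, 50, 10} // 三个节点的原始分
	fmt.Println(normalizeScores(raw)) // 输出: [40 100 20]，最高分被缩放为 100
}
```

真实插件是通过实现 `ScoreExtensions` 接口的 `NormalizeScore` 方法来完成这类再处理的（上文 node_label 插件的 `ScoreExtensions()` 返回 nil，表示它不做归一化）。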
在此扩展点注册的插件将使用同一插件的[评分](https://kubernetes.io/zh/docs/concepts/scheduling-eviction/scheduling-framework/#scoring) 结果被调用。 每个插件在每个调度周期调用一次。\n\n**Reserve:** Reserve 是一个信息性的扩展点。 管理运行时状态的插件（也称为“有状态插件”）应该使用此扩展点，以便 调度器在节点给指定 Pod 预留了资源时能够通知该插件。 这是在调度器真正将 Pod 绑定到节点之前发生的，并且它存在是为了防止 在调度器等待绑定成功时发生竞争情况。\n\n**Permit**\n\n*Permit* 插件在每个 Pod 调度周期的最后调用，用于防止或延迟 Pod 的绑定。 一个 Permit 插件可以做以下三件事之一：\n\n1. **批准**\n   一旦所有 Permit 插件批准 Pod 后，该 Pod 将被发送以进行绑定。\n\n1. **拒绝**\n   如果任何 Permit 插件拒绝 Pod，则该 Pod 将被返回到调度队列。 这将触发[Unreserve](https://kubernetes.io/zh/docs/concepts/scheduling-eviction/scheduling-framework/#unreserve) 插件。\n\n1. **等待**（带有超时）\n   如果一个 Permit 插件返回 “等待” 结果，则 Pod 将保持在一个内部的 “等待中” 的 Pod 列表，同时该 Pod 的绑定周期启动时即直接阻塞直到得到 [批准](https://kubernetes.io/zh/docs/concepts/scheduling-eviction/scheduling-framework/#frameworkhandle)。如果超时发生，**等待** 变成 **拒绝**，并且 Pod 将返回调度队列，从而触发 [Unreserve](https://kubernetes.io/zh/docs/concepts/scheduling-eviction/scheduling-framework/#unreserve) 插件。\n\n**PreBind**：预绑定插件用于执行 Pod 绑定前所需的任何工作。 例如，一个预绑定插件可能需要提供网络卷并且在允许 Pod 运行在该节点之前 将其挂载到目标节点上。\n\n**Bind：**Bind 插件用于将 Pod 绑定到节点上。直到所有的 PreBind 插件都完成，Bind 插件才会被调用。 各绑定插件按照配置顺序被调用。绑定插件可以选择是否处理指定的 Pod。 如果绑定插件选择处理 Pod，**剩余的绑定插件将被跳过**。\n\n**PostBind:** 这是个信息性的扩展点。 绑定后插件在 Pod 成功绑定后被调用。这是绑定周期的结尾，可用于清理相关的资源。\n\n**Unreserve:** 这是个信息性的扩展点。 如果 Pod 被保留，然后在后面的阶段中被拒绝，则 Unreserve 插件将被通知。 Unreserve 插件应该清除被保留 Pod 的相关状态。\n\n使用此扩展点的插件通常也使用[Reserve](https://kubernetes.io/zh/docs/concepts/scheduling-eviction/scheduling-framework/#reserve)。\n\n![scheduler-struct](../images/scheduler-struct.png)\n\n### 3.2 scheduler.Run-开始调度\n\n等待缓存同步完成后，循环调用 scheduleOne 函数\n\n```\n// Run begins watching and scheduling. 
It waits for cache to be synced, then starts scheduling and blocked until the context is done.\nfunc (sched *Scheduler) Run(ctx context.Context) {\n\tif !cache.WaitForCacheSync(ctx.Done(), sched.scheduledPodsHasSynced) {\n\t\treturn\n\t}\n\n\twait.UntilWithContext(ctx, sched.scheduleOne, 0)\n}\n```\n\n<br>\n\n### 3.3 scheduleOne\n\n（1）取出 1 个 pod，如果 pod.DeletionTimestamp != nil，直接跳过\n\n（2）调用 sched.Algorithm.Schedule 算法，开始调度这个 pod\n\n（3）为 pod 绑定 Volumes\n\n（4）进入 reserve 步骤\n\n（5）开启协程进行绑定；之所以用协程，是为了把调度和绑定分离：调度完成后该函数即可返回、继续调度下一个 pod，协程接着完成 bind，提高吞吐。\n\n```\n// scheduleOne does the entire scheduling workflow for a single pod.  It is serialized on the scheduling algorithm's host fitting.\nfunc (sched *Scheduler) scheduleOne(ctx context.Context) {\n\tfwk := sched.Framework\n  \n  // 1.取出1个pod, 过滤掉DeletionTimestamp不为空的pod\n\tpodInfo := sched.NextPod()\n\t// pod could be nil when schedulerQueue is closed\n\tif podInfo == nil || podInfo.Pod == nil {\n\t\treturn\n\t}\n\tpod := podInfo.Pod\n\tif pod.DeletionTimestamp != nil {\n\t\tsched.Recorder.Eventf(pod, nil, v1.EventTypeWarning, \"FailedScheduling\", \"Scheduling\", \"skip schedule deleting pod: %v/%v\", pod.Namespace, pod.Name)\n\t\tklog.V(3).Infof(\"Skip schedule deleting pod: %v/%v\", pod.Namespace, pod.Name)\n\t\treturn\n\t}\n\n\tklog.V(3).Infof(\"Attempting to schedule pod: %v/%v\", pod.Namespace, pod.Name)\n\n\t// Synchronously attempt to find a fit for the pod.\n\tstart := time.Now()\n\tstate := framework.NewCycleState()\n\tstate.SetRecordFrameworkMetrics(rand.Intn(100) < frameworkMetricsSamplePercent)\n\tschedulingCycleCtx, cancel := context.WithCancel(ctx)\n\tdefer cancel()\n\t\n\t// 2.开始调度这个pod，sched.Algorithm.Schedule\n\tscheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, state, pod)\n\tif err != nil {\n\t\tsched.recordSchedulingFailure(podInfo.DeepCopy(), err, v1.PodReasonUnschedulable, err.Error())\n\t\t// Schedule() may have failed because the pod would not fit on any host, so we try to\n\t\t// preempt, with the 
expectation that the next time the pod is tried for scheduling it\n\t\t// will fit due to the preemption. It is also possible that a different pod will schedule\n\t\t// into the resources that were preempted, but this is harmless.\n\t\t// 2.1 如果调度失败，并且开启了抢占，就抢占\n\t\tif fitError, ok := err.(*core.FitError); ok {\n\t\t\tif sched.DisablePreemption {\n\t\t\t\tklog.V(3).Infof(\"Pod priority feature is not enabled or preemption is disabled by scheduler configuration.\" +\n\t\t\t\t\t\" No preemption is performed.\")\n\t\t\t} else {\n\t\t\t\tpreemptionStartTime := time.Now()\n\t\t\t\tsched.preempt(schedulingCycleCtx, state, fwk, pod, fitError)\n\t\t\t\tmetrics.PreemptionAttempts.Inc()\n\t\t\t\tmetrics.SchedulingAlgorithmPreemptionEvaluationDuration.Observe(metrics.SinceInSeconds(preemptionStartTime))\n\t\t\t\tmetrics.DeprecatedSchedulingAlgorithmPreemptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))\n\t\t\t\tmetrics.SchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))\n\t\t\t\tmetrics.DeprecatedSchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))\n\t\t\t}\n\t\t\t// Pod did not fit anywhere, so it is counted as a failure. If preemption\n\t\t\t// succeeds, the pod should get counted as a success the next time we try to\n\t\t\t// schedule it. 
(hopefully)\n\t\t\tmetrics.PodScheduleFailures.Inc()\n\t\t} else {\n\t\t\tklog.Errorf(\"error selecting node for pod: %v\", err)\n\t\t\tmetrics.PodScheduleErrors.Inc()\n\t\t}\n\t\treturn\n\t}\n\tmetrics.SchedulingAlgorithmLatency.Observe(metrics.SinceInSeconds(start))\n\tmetrics.DeprecatedSchedulingAlgorithmLatency.Observe(metrics.SinceInMicroseconds(start))\n\t// Tell the cache to assume that a pod now is running on a given node, even though it hasn't been bound yet.\n\t// This allows us to keep scheduling without waiting on binding to occur.\n\tassumedPodInfo := podInfo.DeepCopy()\n\tassumedPod := assumedPodInfo.Pod\n\n\t// Assume volumes first before assuming the pod.\n\t//\n\t// If all volumes are completely bound, then allBound is true and binding will be skipped.\n\t//\n\t// Otherwise, binding of volumes is started after the pod is assumed, but before pod binding.\n\t//\n\t\n\t// 3.为pod绑定Volumes\n\t// This function modifies 'assumedPod' if volume binding is required.\n\tallBound, err := sched.VolumeBinder.Binder.AssumePodVolumes(assumedPod, scheduleResult.SuggestedHost)\n\tif err != nil {\n\t\tsched.recordSchedulingFailure(assumedPodInfo, err, SchedulerError,\n\t\t\tfmt.Sprintf(\"AssumePodVolumes failed: %v\", err))\n\t\tmetrics.PodScheduleErrors.Inc()\n\t\treturn\n\t}\n   \n  // 4.进入reserve步骤\n\t// Run \"reserve\" plugins.\n\tif sts := fwk.RunReservePlugins(schedulingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost); !sts.IsSuccess() {\n\t\tsched.recordSchedulingFailure(assumedPodInfo, sts.AsError(), SchedulerError, sts.Message())\n\t\tmetrics.PodScheduleErrors.Inc()\n\t\treturn\n\t}\n\n\t// assume modifies `assumedPod` by setting NodeName=scheduleResult.SuggestedHost\n\terr = sched.assume(assumedPod, scheduleResult.SuggestedHost)\n\tif err != nil {\n\t\t// This is most probably result of a BUG in retrying logic.\n\t\t// We report an error here so that pod scheduling can be retried.\n\t\t// This relies on the fact that Error will check if the pod 
has been bound\n\t\t// to a node and if so will not add it back to the unscheduled pods queue\n\t\t// (otherwise this would cause an infinite loop).\n\t\tsched.recordSchedulingFailure(assumedPodInfo, err, SchedulerError, fmt.Sprintf(\"AssumePod failed: %v\", err))\n\t\tmetrics.PodScheduleErrors.Inc()\n\t\t// trigger un-reserve plugins to clean up state associated with the reserved Pod\n\t\tfwk.RunUnreservePlugins(schedulingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)\n\t\treturn\n\t}\n\t// 5. 开启协程进行绑定\n\t// bind the pod to its host asynchronously (we can do this b/c of the assumption step above).\n\tgo func() {\n\t\tbindingCycleCtx, cancel := context.WithCancel(ctx)\n\t\tdefer cancel()\n\t\tmetrics.SchedulerGoroutines.WithLabelValues(\"binding\").Inc()\n\t\tdefer metrics.SchedulerGoroutines.WithLabelValues(\"binding\").Dec()\n\n\t\t// Run \"permit\" plugins.\n\t\tpermitStatus := fwk.RunPermitPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)\n\t\tif !permitStatus.IsSuccess() {\n\t\t\tvar reason string\n\t\t\tif permitStatus.IsUnschedulable() {\n\t\t\t\tmetrics.PodScheduleFailures.Inc()\n\t\t\t\treason = v1.PodReasonUnschedulable\n\t\t\t} else {\n\t\t\t\tmetrics.PodScheduleErrors.Inc()\n\t\t\t\treason = SchedulerError\n\t\t\t}\n\t\t\tif forgetErr := sched.Cache().ForgetPod(assumedPod); forgetErr != nil {\n\t\t\t\tklog.Errorf(\"scheduler cache ForgetPod failed: %v\", forgetErr)\n\t\t\t}\n\t\t\t// trigger un-reserve plugins to clean up state associated with the reserved Pod\n\t\t\tfwk.RunUnreservePlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)\n\t\t\tsched.recordSchedulingFailure(assumedPodInfo, permitStatus.AsError(), reason, permitStatus.Message())\n\t\t\treturn\n\t\t}\n\n\t\t// Bind volumes first before Pod\n\t\tif !allBound {\n\t\t\terr := sched.bindVolumes(assumedPod)\n\t\t\tif err != nil {\n\t\t\t\tsched.recordSchedulingFailure(assumedPodInfo, err, \"VolumeBindingFailed\", 
err.Error())\n\t\t\t\tmetrics.PodScheduleErrors.Inc()\n\t\t\t\t// trigger un-reserve plugins to clean up state associated with the reserved Pod\n\t\t\t\tfwk.RunUnreservePlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\n\t\t// Run \"prebind\" plugins.\n\t\tpreBindStatus := fwk.RunPreBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)\n\t\tif !preBindStatus.IsSuccess() {\n\t\t\tvar reason string\n\t\t\tmetrics.PodScheduleErrors.Inc()\n\t\t\treason = SchedulerError\n\t\t\tif forgetErr := sched.Cache().ForgetPod(assumedPod); forgetErr != nil {\n\t\t\t\tklog.Errorf(\"scheduler cache ForgetPod failed: %v\", forgetErr)\n\t\t\t}\n\t\t\t// trigger un-reserve plugins to clean up state associated with the reserved Pod\n\t\t\tfwk.RunUnreservePlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)\n\t\t\tsched.recordSchedulingFailure(assumedPodInfo, preBindStatus.AsError(), reason, preBindStatus.Message())\n\t\t\treturn\n\t\t}\n\n\t\terr := sched.bind(bindingCycleCtx, assumedPod, scheduleResult.SuggestedHost, state)\n\t\tmetrics.E2eSchedulingLatency.Observe(metrics.SinceInSeconds(start))\n\t\tmetrics.DeprecatedE2eSchedulingLatency.Observe(metrics.SinceInMicroseconds(start))\n\t\tif err != nil {\n\t\t\tmetrics.PodScheduleErrors.Inc()\n\t\t\t// trigger un-reserve plugins to clean up state associated with the reserved Pod\n\t\t\tfwk.RunUnreservePlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)\n\t\t\tsched.recordSchedulingFailure(assumedPodInfo, err, SchedulerError, fmt.Sprintf(\"Binding rejected: %v\", err))\n\t\t} else {\n\t\t\t// Calculating nodeResourceString can be heavy. 
Avoid it if klog verbosity is below 2.\n\t\t\tif klog.V(2) {\n\t\t\t\tklog.Infof(\"pod %v/%v is bound successfully on node %q, %d nodes evaluated, %d nodes were found feasible.\", assumedPod.Namespace, assumedPod.Name, scheduleResult.SuggestedHost, scheduleResult.EvaluatedNodes, scheduleResult.FeasibleNodes)\n\t\t\t}\n\n\t\t\tmetrics.PodScheduleSuccesses.Inc()\n\t\t\tmetrics.PodSchedulingAttempts.Observe(float64(podInfo.Attempts))\n\t\t\tmetrics.PodSchedulingDuration.Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp))\n\n\t\t\t// Run \"postbind\" plugins.\n\t\t\tfwk.RunPostBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)\n\t\t}\n\t}()\n}\n```\n\n### 3.4 sched.Algorithm.Schedule\n\n逻辑如下：\n\n（1）执行podPassesBasicChecks，如果pod使用的pvc不存在，或者pvc正在删除中，返回错误\n\n（2）执行 prefilter plugins\n\n（3）执行 filter plugins, 这里就是findNodesThatFit函数\n\n（4）执行优选\n\n（5）返回结果\n\n```\n// ScheduleResult represents the result of one pod scheduled. It will contain\n// the final selected Node, along with the selected intermediate information.\ntype ScheduleResult struct {\n\t// Name of the scheduler suggest host\n\tSuggestedHost string                //最终优选出来的节点\n\t// Number of nodes scheduler evaluated on one pod scheduled\n\tEvaluatedNodes int                 //共有多少个节点参与了评选\n\t// Number of feasible nodes on one pod scheduled\n\tFeasibleNodes int                  //共有多少个节点符合pod调度要求\n}\n\n\n// Schedule tries to schedule the given pod to one of the nodes in the node list.\n// If it succeeds, it will return the name of the node.\n// If it fails, it will return a FitError error with reasons.\nfunc (g *genericScheduler) Schedule(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (result ScheduleResult, err error) {\n\ttrace := utiltrace.New(\"Scheduling\", utiltrace.Field{Key: \"namespace\", Value: pod.Namespace}, utiltrace.Field{Key: \"name\", Value: pod.Name})\n\tdefer trace.LogIfLong(100 * time.Millisecond)\n   \n  // 
1.执行podPassesBasicChecks，如果pod使用的pvc不存在，或者pvc正在删除中，返回错误\n\tif err := podPassesBasicChecks(pod, g.pvcLister); err != nil {\n\t\treturn result, err\n\t}\n\ttrace.Step(\"Basic checks done\")\n\n\tif err := g.snapshot(); err != nil {\n\t\treturn result, err\n\t}\n\ttrace.Step(\"Snapshoting scheduler cache and node infos done\")\n\n\tif len(g.nodeInfoSnapshot.NodeInfoList) == 0 {\n\t\treturn result, ErrNoNodesAvailable\n\t}\n  \n  // 2.执行 prefilter plugins\n\t// Run \"prefilter\" plugins.\n\tpreFilterStatus := g.framework.RunPreFilterPlugins(ctx, state, pod)\n\tif !preFilterStatus.IsSuccess() {\n\t\treturn result, preFilterStatus.AsError()\n\t}\n\ttrace.Step(\"Running prefilter plugins done\")\n   \n   \n  // 3.执行 filter plugins, 这里就是findNodesThatFit函数\n\tstartPredicateEvalTime := time.Now()\n\tfilteredNodes, failedPredicateMap, filteredNodesStatuses, err := g.findNodesThatFit(ctx, state, pod)\n\tif err != nil {\n\t\treturn result, err\n\t}\n\ttrace.Step(\"Computing predicates done\")\n  \n  // 执行 postfilter plugins（上面的步骤列表未单列这一步）\n\t// Run \"postfilter\" plugins.\n\tpostfilterStatus := g.framework.RunPostFilterPlugins(ctx, state, pod, filteredNodes, filteredNodesStatuses)\n\tif !postfilterStatus.IsSuccess() {\n\t\treturn result, postfilterStatus.AsError()\n\t}\n\n\tif len(filteredNodes) == 0 {\n\t\treturn result, &FitError{\n\t\t\tPod:                   pod,\n\t\t\tNumAllNodes:           len(g.nodeInfoSnapshot.NodeInfoList),\n\t\t\tFailedPredicates:      failedPredicateMap,\n\t\t\tFilteredNodesStatuses: filteredNodesStatuses,\n\t\t}\n\t}\n\ttrace.Step(\"Running postfilter plugins 
done\")\n\tmetrics.SchedulingAlgorithmPredicateEvaluationDuration.Observe(metrics.SinceInSeconds(startPredicateEvalTime))\n\tmetrics.DeprecatedSchedulingAlgorithmPredicateEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPredicateEvalTime))\n\tmetrics.SchedulingLatency.WithLabelValues(metrics.PredicateEvaluation).Observe(metrics.SinceInSeconds(startPredicateEvalTime))\n\tmetrics.DeprecatedSchedulingLatency.WithLabelValues(metrics.PredicateEvaluation).Observe(metrics.SinceInSeconds(startPredicateEvalTime))\n\n\tstartPriorityEvalTime := time.Now()\n\t// When only one node after predicate, just use it.\n\tif len(filteredNodes) == 1 {\n\t\tmetrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInSeconds(startPriorityEvalTime))\n\t\tmetrics.DeprecatedSchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))\n\t\treturn ScheduleResult{\n\t\t\tSuggestedHost:  filteredNodes[0].Name,\n\t\t\tEvaluatedNodes: 1 + len(failedPredicateMap) + len(filteredNodesStatuses),\n\t\t\tFeasibleNodes:  1,\n\t\t}, nil\n\t}\n  \n  \n  // 4.执行优选\n\tmetaPrioritiesInterface := g.priorityMetaProducer(pod, filteredNodes, g.nodeInfoSnapshot)\n\tpriorityList, err := g.prioritizeNodes(ctx, state, pod, metaPrioritiesInterface, filteredNodes)\n\tif err != nil {\n\t\treturn result, err\n\t}\n \n\tmetrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInSeconds(startPriorityEvalTime))\n\tmetrics.DeprecatedSchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))\n\tmetrics.SchedulingLatency.WithLabelValues(metrics.PriorityEvaluation).Observe(metrics.SinceInSeconds(startPriorityEvalTime))\n\tmetrics.DeprecatedSchedulingLatency.WithLabelValues(metrics.PriorityEvaluation).Observe(metrics.SinceInSeconds(startPriorityEvalTime))\n\n\thost, err := g.selectHost(priorityList)\n\ttrace.Step(\"Prioritizing done\")\n  \n  // 5.返回结果\n\treturn 
ScheduleResult{\n\t\tSuggestedHost:  host,\n\t\tEvaluatedNodes: len(filteredNodes) + len(failedPredicateMap) + len(filteredNodesStatuses),\n\t\tFeasibleNodes:  len(filteredNodes),\n\t}, err\n}\n```\n\n### 3.5 podFitsOnNode\n\n预选过程的核心就是：podFitsOnNode，podFitsOnNode的核心就是一个最多执行2次的for循环，每次循环做两件事：\n\n（1）遍历所有的 predicates函数，然后执行predicates函数（之前定义的预选函数）【predicates函数是有顺序的】\n\n（2）执行framework的RunFilterPlugins函数\n\n```\n// podFitsOnNode checks whether a node given by NodeInfo satisfies the given predicate functions.\n// For given pod, podFitsOnNode will check if any equivalent pod exists and try to reuse its cached\n// predicate results as possible.\n// This function is called from two different places: Schedule and Preempt.\n// When it is called from Schedule, we want to test whether the pod is schedulable\n// on the node with all the existing pods on the node plus higher and equal priority\n// pods nominated to run on the node.\n// When it is called from Preempt, we should remove the victims of preemption and\n// add the nominated pods. Removal of the victims is done by SelectVictimsOnNode().\n// It removes victims from meta and NodeInfo before calling this function.\nfunc (g *genericScheduler) podFitsOnNode(\n\tctx context.Context,\n\tstate *framework.CycleState,\n\tpod *v1.Pod,\n\tmeta predicates.Metadata,\n\tinfo *schedulernodeinfo.NodeInfo,\n\talwaysCheckAllPredicates bool,\n) (bool, []predicates.PredicateFailureReason, *framework.Status, error) {\n\tvar failedPredicates []predicates.PredicateFailureReason\n\tvar status *framework.Status\n\n\tpodsAdded := false\n\t// We run predicates twice in some cases. If the node has greater or equal priority\n\t// nominated pods, we run them when those pods are added to meta and nodeInfo.\n\t// If all predicates succeed in this pass, we run them again when these\n\t// nominated pods are not added. 
This second pass is necessary because some\n\t// predicates such as inter-pod affinity may not pass without the nominated pods.\n\t// If there are no nominated pods for the node or if the first run of the\n\t// predicates fail, we don't run the second pass.\n\t// We consider only equal or higher priority pods in the first pass, because\n\t// those are the current \"pod\" must yield to them and not take a space opened\n\t// for running them. It is ok if the current \"pod\" take resources freed for\n\t// lower priority pods.\n\t// Requiring that the new pod is schedulable in both circumstances ensures that\n\t// we are making a conservative decision: predicates like resources and inter-pod\n\t// anti-affinity are more likely to fail when the nominated pods are treated\n\t// as running, while predicates like pod affinity are more likely to fail when\n\t// the nominated pods are treated as not running. We can't just assume the\n\t// nominated pods are running because they are not running right now and in fact,\n\t// they may end up getting scheduled to a different node.\n\tfor i := 0; i < 2; i++ {\n\t\tmetaToUse := meta\n\t\tstateToUse := state\n\t\tnodeInfoToUse := info\n\t\tif i == 0 {\n\t\t\tvar err error\n\t\t\tpodsAdded, metaToUse, stateToUse, nodeInfoToUse, err = g.addNominatedPods(ctx, pod, meta, state, info)\n\t\t\tif err != nil {\n\t\t\t\treturn false, []predicates.PredicateFailureReason{}, nil, err\n\t\t\t}\n\t\t} else if !podsAdded || len(failedPredicates) != 0 || !status.IsSuccess() {\n\t\t\tbreak\n\t\t}\n    \n    // 1.遍历所有的 predicates函数，然后执行predicates函数（之前定义的预选函数）\n\t\tfor _, predicateKey := range predicates.Ordering() {\n\t\t\tvar (\n\t\t\t\tfit     bool\n\t\t\t\treasons []predicates.PredicateFailureReason\n\t\t\t\terr     error\n\t\t\t)\n     \n\t\t\tif predicate, exist := g.predicates[predicateKey]; exist {\n\t\t\t\tfit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)\n\t\t\t\tif err != nil {\n\t\t\t\t\treturn false, 
[]predicates.PredicateFailureReason{}, nil, err\n\t\t\t\t}\n\n\t\t\t\tif !fit {\n\t\t\t\t\t// eCache is available and valid, and predicates result is unfit, record the fail reasons\n\t\t\t\t\tfailedPredicates = append(failedPredicates, reasons...)\n\t\t\t\t\t// if alwaysCheckAllPredicates is false, short circuit all predicates when one predicate fails.\n\t\t\t\t\tif !alwaysCheckAllPredicates {\n\t\t\t\t\t\tklog.V(5).Infoln(\"since alwaysCheckAllPredicates has not been set, the predicate \" +\n\t\t\t\t\t\t\t\"evaluation is short circuited and there are chances \" +\n\t\t\t\t\t\t\t\"of other predicates failing as well.\")\n\t\t\t\t\t\tbreak\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t}\n    \n    // 2.执行framework的RunFilterPlugins函数\n\t\tstatus = g.framework.RunFilterPlugins(ctx, stateToUse, pod, nodeInfoToUse)\n\t\tif !status.IsSuccess() && !status.IsUnschedulable() {\n\t\t\treturn false, failedPredicates, status, status.AsError()\n\t\t}\n\t}\n\n\treturn len(failedPredicates) == 0 && status.IsSuccess(), failedPredicates, status, nil\n}\n```\n\n\n\n#### 3.5.1 为什么for循环执行2次\n\n如上，在i为0时调用到了addNominatedPods函数，这个函数把更高或者相同优先级的pod（待运行到本node上的）信息增加到meta和nodeInfo中，也就是对应考虑这些nominated pods都Running的场景；后面i为1对应的就是不考虑这些pod的场景。对于这个2遍过程，注释里是这样解释的：\n\n如果这个node上“指定”了更高或者相等优先级的pods（也就是优先级不低于本pod的一群pods将要跑在这个node上），我们先在这些pods信息全被加到meta和nodeInfo中的情况下运行一遍predicates过程。如果这一遍所有的predicates都成功了，我们再在这些pods信息没有被加到meta和nodeInfo的情况下运行一遍。第二遍是必要的，因为一些pod间的亲和性策略可能在nominated pods不存在时通不过（这些计划要跑的pods还没有跑，亲和性可能被破坏）。这里其实基于2点考虑：\n\n1、有亲和性要求的pod如果认为这些nominated pods在，则在这些nominated pods不在的情况下会异常；\n\n2、有反亲和性要求的pod如果认为这些nominated pods不在，则在这些nominated pods在的情况下会异常。\n\n#### 3.5.2 预选函数优先级的定义\n\n预选函数的定义在：pkg/scheduler/algorithm/predicates/predicates.go\n\n预选函数的注册在：pkg/scheduler/factory.go\n\n这里先不分析了\n\n<br>\n\n## 4.有意思的知识点\n\n### 4.1 这个就是判断是否实现了接口\n\nhttp://soiiy.com/index.php/go/13207.html\n\n```\n这个语句的作用就是判断  StatelessPreBindExample是否实现了 framework.PreBindPlugin接口\n如果实现了就会正常编译，没实现就会编译报错\nvar _ framework.PreBindPlugin = 
StatelessPreBindExample{}\n```\n\n\n"
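上面 4.1 节提到的编译期接口断言，可以用一个自包含的小例子演示（下面的 Greeter 接口和 EnglishGreeter 类型是为演示假设的，并非 k8s 源码中的类型）：

```go
package main

import "fmt"

// Greeter 是一个演示用的接口（假设的示例类型）
type Greeter interface {
	Greet() string
}

// EnglishGreeter 实现了 Greeter 接口
type EnglishGreeter struct{}

func (EnglishGreeter) Greet() string { return "hello" }

// 编译期断言：如果 EnglishGreeter 没有实现 Greeter，这一行会编译报错；
// 写法上和 var _ framework.PreBindPlugin = StatelessPreBindExample{} 是同一种技巧
var _ Greeter = EnglishGreeter{}

func main() {
	fmt.Println(EnglishGreeter{}.Greet())
}
```

这种写法只产生一个被丢弃的包级变量，不占用运行期开销，常用于在编译期尽早发现"某类型忘了实现某接口方法"的问题。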
  },
  {
    "path": "k8s/kube-scheduler/3-如何编写一个scheduler plugin.md",
    "content": "Table of Contents\n=================\n\n  * [0. 背景](#0-背景)\n  * [1. 实现testPlugin](#1-实现testplugin)\n  * [2. 注册testPlugin](#2-注册testplugin)\n  * [3 结果验证](#3-结果验证)\n\n### 0. 背景\n\n这里就是构造了一个简单的案例。如果pod包含了test-plugin这个annotations, 就不让pod调度，一直处于pending状态。\n\n**具体步骤如下：**\n\n### 1. 实现testPlugin\n\n<br>\n\n只需要增加对应文件即可：\n\npkg/scheduler/framework/plugins/testplugin/test_plugin.go\n\n```\n/*\nCopyright 2019 The Kubernetes Authors.\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n    http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n*/\n\npackage testplugin\n\nimport (\n\t\"context\"\n\t\"k8s.io/klog\"\n\n\tv1 \"k8s.io/api/core/v1\"\n\t\"k8s.io/apimachinery/pkg/runtime\"\n\t\"k8s.io/kubernetes/pkg/scheduler/framework/plugins/migration\"\n\tframework \"k8s.io/kubernetes/pkg/scheduler/framework/v1alpha1\"\n\t\"k8s.io/kubernetes/pkg/scheduler/nodeinfo\"\n)\n\n// NodeName is a plugin that checks if a pod spec node name matches the current node.\ntype TestPlugin struct{}\n\nvar _ framework.FilterPlugin = &TestPlugin{}\n\n// Name is the name of the plugin used in the plugin registry and configurations.\nconst Name = \"TestPlugin\"\n\n// Name returns name of the plugin. 
It is used in logs, etc.\nfunc (pl *TestPlugin) Name() string {\n\treturn Name\n}\n\n// Filter invoked at the filter extension point.\nfunc (pl *TestPlugin) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *nodeinfo.NodeInfo) *framework.Status {\n\t_, exist := pod.GetAnnotations()[\"test-plugin\"]\n\tklog.Infof(\"[zoux] testPlugin for pod %s\", pod.Name)\n\tif !exist {\n\t\treturn migration.PredicateResultToFrameworkStatus(nil, nil)\n\t}\n\n\treturn framework.NewStatus(framework.Unschedulable, \"testPlugin\")\n\n}\n\n// New initializes a new plugin and returns it.\nfunc New(_ *runtime.Unknown, _ framework.FrameworkHandle) (framework.Plugin, error) {\n\treturn &TestPlugin{}, nil\n}\n\n```\n\n### 2. 注册testPlugin\n\npkg/scheduler/framework/plugins/default_registry.go\n\n(1) 在NewDefaultRegistry进行赋值\n\n```\nfunc NewDefaultRegistry(args *RegistryArgs) framework.Registry {\n\treturn framework.Registry{\n\t\tdefaultpodtopologyspread.Name:        defaultpodtopologyspread.New,\n\t\timagelocality.Name:                   imagelocality.New,\n\t\ttainttoleration.Name:                 tainttoleration.New,\n\t\tnodename.Name:                        nodename.New,\n\t\ttestplugin.Name:                      testplugin.New,\n\t\tnodeports.Name:                       nodeports.New,\n\t\tnodepreferavoidpods.Name:             nodepreferavoidpods.New,\n\t\tnodeaffinity.Name:                    nodeaffinity.New,\n\t\tpodtopologyspread.Name:               podtopologyspread.New,\n\t\t。。。\n}\n```\n\n（2）在NewDefaultConfigProducerRegistry进行注册\n\n```\n// NewDefaultConfigProducerRegistry creates a new producer registry.\nfunc NewDefaultConfigProducerRegistry() *ConfigProducerRegistry {\n\tregistry := &ConfigProducerRegistry{\n\t\tPredicateToConfigProducer: make(map[string]ConfigProducer),\n\t\tPriorityToConfigProducer:  make(map[string]ConfigProducer),\n\t}\n\t// Register Predicates.\n\tregistry.RegisterPredicate(predicates.GeneralPred,\n\t\tfunc(_ ConfigProducerArgs) 
(plugins config.Plugins, pluginConfig []config.PluginConfig) {\n\t\t\t// GeneralPredicate is a combination of predicates.\n\t\t\tplugins.Filter = appendToPluginSet(plugins.Filter, noderesources.FitName, nil)\n\t\t\tplugins.Filter = appendToPluginSet(plugins.Filter, nodename.Name, nil)\n\t\t\tplugins.Filter = appendToPluginSet(plugins.Filter, nodeports.Name, nil)\n\t\t\tplugins.Filter = appendToPluginSet(plugins.Filter, nodeaffinity.Name, nil)\n\t\t\tplugins.Filter = appendToPluginSet(plugins.Filter, testplugin.Name, nil)\n\t\t\treturn\n\t\t})\n\n```\n\n### 3 结果验证\n\n编译后，验证即可。日志忘保存，不贴出来了。。"
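TestPlugin 的 Filter 本质上是一个"检查 annotation 并返回状态"的纯函数。下面是一个脱离 scheduler framework 的简化示意（其中 Code、Success、Unschedulable 都是为演示假设的简化类型，真实的 Status 类型在 framework 包里）：

```go
package main

import "fmt"

// Code 是演示用的简化状态码（假设的类型，非 framework 真实定义）
type Code int

const (
	Success Code = iota
	Unschedulable
)

// filter 模拟 TestPlugin.Filter 的判断逻辑：
// 带有 test-plugin annotation 的 pod 返回 Unschedulable（不可调度），否则返回 Success
func filter(annotations map[string]string) Code {
	if _, exist := annotations["test-plugin"]; exist {
		return Unschedulable
	}
	return Success
}

func main() {
	fmt.Println(filter(map[string]string{"test-plugin": "true"}) == Unschedulable)
	fmt.Println(filter(map[string]string{"app": "nginx"}) == Success)
}
```

可以看到，返回 Unschedulable 后该 node 会被过滤掉；当所有 node 都被过滤掉时，pod 就会一直处于 pending 状态，这正是本文案例想要的效果。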
  },
  {
    "path": "k8s/kubectl/0-ReadMe.md",
    "content": "本章主要通过 kubectl create -f pod.yaml 为主线, 研究kubectl 的实现机制，特别是对Factory, Builder, visitor有所了解。"
  },
  {
    "path": "k8s/kubectl/1-kubectl 整体流程分析.md",
"content": "Table of Contents\n=================\n\n  * [1. cmd/kubectl/kubectl.go](#1-cmdkubectlkubectlgo)\n  * [2. NewDefaultKubectlCommand](#2-newdefaultkubectlcommand)\n     * [2.1 如何设置kubectl自动补全](#21-如何设置kubectl自动补全)\n     * [2.2 kubectl config](#22-kubectl-config)\n     * [2.3 kubectl  api-versions](#23-kubectl--api-versions)\n     * [2.4 kubectl api-resources](#24-kubectl-api-resources)\n     * [2.5 kubectl options](#25-kubectl-options)\n  * [3. 总结](#3-总结)\n\nk8s源码的一般目录结构是：cmd/kubectl 下是启动的主函数\n\n然后进入 pkg/kubectl  运行着真正的逻辑\n\n### 1. cmd/kubectl/kubectl.go\n\n```\ncmd/kubectl/kubectl.go\nfunc main() {\n\trand.Seed(time.Now().UnixNano())\n    \n    \n    // 1. 主要的函数就是 NewDefaultKubectlCommand\n\tcommand := cmd.NewDefaultKubectlCommand()\n\n\t// TODO: once we switch everything over to Cobra commands, we can go back to calling\n\t// cliflag.InitFlags() (by removing its pflag.Parse() call). For now, we have to set the\n\t// normalize func and add the go flag set by hand.\n\t// 1.设置统一化函数\n\tpflag.CommandLine.SetNormalizeFunc(cliflag.WordSepNormalizeFunc)\n\t// 2.同时使用pflag和flag\n\tpflag.CommandLine.AddGoFlagSet(goflag.CommandLine)\n\t// cliflag.InitFlags()\n\tlogs.InitLogs()\n\tdefer logs.FlushLogs()\n\n\tif err := command.Execute(); err != nil {\n\t\tos.Exit(1)\n\t}\n}\n```\n\nSetNormalizeFunc的作用是：\n\n如果我们创建了名称为 --des-detail 的参数，但是用户却在传参时写成了 --des_detail 或 --des.detail 会怎么样？默认情况下程序会报错退出，但是我们可以通过 pflag 提供的 SetNormalizeFunc 功能轻松地解决这个问题。\n\n<br>\n\n### 2. NewDefaultKubectlCommand\n\npkg/kubectl/cmd/cmd.go \n\n这里调用链路如下：\n\nNewDefaultKubectlCommand -> NewDefaultKubectlCommandWithArgs -> NewKubectlCommand \n\nNewKubectlCommand 只是注册一些命令。具体步骤如下：\n\n1. 定义kubectl 命令。从这里可以看出来，针对kubectl而言不会有任何参数，只会输出kubectl的使用帮助\n2. 设置kubeconfigflags，用于连接apiserver\n3. 利用kubeconfigflags生成一个Factory f, 这个f包含了与apiserver操作的client，每个子命令都利用这个f进行后续的操作。(接下来单独分析这一步)\n4. 
kubectl 子命令分为了以下这几类: Basic Commands (Beginner), Basic Commands (Intermediate), Deploy Commands, Cluster Management Commands, Troubleshooting and Debugging Commands, Advanced Commands , Settings Commands\n5. 注册上述的command\n6. 注册没有分类的一些命令，例如，kubectl pulgin ， kubectl version等等。\n\n到这里，kubectl的大体逻辑就是定义好一堆的子命令。接下来具体看一个子命令，kubectl create -f 命令。\n\n```\n// NewKubectlCommand creates the `kubectl` command and its nested children.\nfunc NewKubectlCommand(in io.Reader, out, err io.Writer) *cobra.Command {\n\n     // 1.定义kubectl 命令。从这里可以看出来，针对kubectl而言不会有任何参数，只会输出kubectl的使用帮助\n\t// Parent command to which all subcommands are added.\n\tcmds := &cobra.Command{\n\t\tUse:   \"kubectl\",\n\t\tShort: i18n.T(\"kubectl controls the Kubernetes cluster manager\"),\n\t\tLong: templates.LongDesc(`\n      kubectl controls the Kubernetes cluster manager.\n\n      Find more information at:\n            https://kubernetes.io/docs/reference/kubectl/overview/`),\n\t\tRun: runHelp,\n\t\t// Hook before and after Run initialize and write profiles to disk,\n\t\t// respectively.\n\t\tPersistentPreRunE: func(*cobra.Command, []string) error {\n\t\t\treturn initProfiling()\n\t\t},\n\t\tPersistentPostRunE: func(*cobra.Command, []string) error {\n\t\t\treturn flushProfiling()\n\t\t},\n\t\tBashCompletionFunction: bashCompletionFunc,\n\t}\n \n\n  flags := cmds.PersistentFlags()\n\tflags.SetNormalizeFunc(cliflag.WarnWordSepNormalizeFunc) // Warn for \"_\" flags\n  // Normalize all flags that are coming from other packages or pre-configurations\n\t// a.k.a. change all \"_\" to \"-\". e.g. 
glog package\n\tflags.SetNormalizeFunc(cliflag.WordSepNormalizeFunc)\n\taddProfilingFlags(flags)\n\n  // 2.设置kubeconfigflags，用于连接apiserver\n\tkubeConfigFlags := genericclioptions.NewConfigFlags(true).WithDeprecatedPasswordFlag()\n\tkubeConfigFlags.AddFlags(flags)\n\n  // 3.利用kubeconfigflags生成了, 一个Factory f, 这个f包含了与apiserver操作的client，每个子命令都利用这个f进行后续的操作。\n\tmatchVersionKubeConfigFlags := cmdutil.NewMatchVersionFlags(kubeConfigFlags)\n\tmatchVersionKubeConfigFlags.AddFlags(cmds.PersistentFlags())\n\tcmds.PersistentFlags().AddGoFlagSet(flag.CommandLine)\n\n\tf := cmdutil.NewFactory(matchVersionKubeConfigFlags)\n     \n     // 4. kubectl子命令分为了以下这几类\n\t \tgroups := templates.CommandGroups{\n\t\t{\n\t\t\tMessage: \"Basic Commands (Beginner):\",\n\t\t\tCommands: []*cobra.Command{\n\t\t\t\tcreate.NewCmdCreate(f, ioStreams),\n\t\t\t\texpose.NewCmdExposeService(f, ioStreams),\n\t\t\t\trun.NewCmdRun(f, ioStreams),\n\t\t\t\tset.NewCmdSet(f, ioStreams),\n\t\t\t},\n\t\t},\n\t\t{\n\t\t\tMessage: \"Basic Commands (Intermediate):\",\n\t\t\tCommands: []*cobra.Command{\n\t\t\t\texplain.NewCmdExplain(\"kubectl\", f, ioStreams),\n\t\t\t\tget.NewCmdGet(\"kubectl\", f, ioStreams),\n\t\t\t\tedit.NewCmdEdit(f, ioStreams),\n\t\t\t\tdelete.NewCmdDelete(f, ioStreams),\n\t\t\t},\n\t\t},\n\t\t{\n\t\t\tMessage: \"Deploy Commands:\",\n\t\t\tCommands: []*cobra.Command{\n\t\t\t\trollout.NewCmdRollout(f, ioStreams),\n\t\t\t\trollingupdate.NewCmdRollingUpdate(f, ioStreams),\n\t\t\t\tscale.NewCmdScale(f, ioStreams),\n\t\t\t\tautoscale.NewCmdAutoscale(f, ioStreams),\n\t\t\t},\n\t\t},\n\t\t{\n\t\t\tMessage: \"Cluster Management Commands:\",\n\t\t\tCommands: []*cobra.Command{\n\t\t\t\tcertificates.NewCmdCertificate(f, ioStreams),\n\t\t\t\tclusterinfo.NewCmdClusterInfo(f, ioStreams),\n\t\t\t\ttop.NewCmdTop(f, ioStreams),\n\t\t\t\tdrain.NewCmdCordon(f, ioStreams),\n\t\t\t\tdrain.NewCmdUncordon(f, ioStreams),\n\t\t\t\tdrain.NewCmdDrain(f, ioStreams),\n\t\t\t\ttaint.NewCmdTaint(f, 
ioStreams),\n\t\t\t},\n\t\t},\n\t\t{\n\t\t\tMessage: \"Troubleshooting and Debugging Commands:\",\n\t\t\tCommands: []*cobra.Command{\n\t\t\t\tdescribe.NewCmdDescribe(\"kubectl\", f, ioStreams),\n\t\t\t\tlogs.NewCmdLogs(f, ioStreams),\n\t\t\t\tattach.NewCmdAttach(f, ioStreams),\n\t\t\t\tcmdexec.NewCmdExec(f, ioStreams),\n\t\t\t\tportforward.NewCmdPortForward(f, ioStreams),\n\t\t\t\tproxy.NewCmdProxy(f, ioStreams),\n\t\t\t\tcp.NewCmdCp(f, ioStreams),\n\t\t\t\tauth.NewCmdAuth(f, ioStreams),\n\t\t\t},\n\t\t},\n\t\t{\n\t\t\tMessage: \"Advanced Commands:\",\n\t\t\tCommands: []*cobra.Command{\n\t\t\t\tdiff.NewCmdDiff(f, ioStreams),\n\t\t\t\tapply.NewCmdApply(\"kubectl\", f, ioStreams),\n\t\t\t\tpatch.NewCmdPatch(f, ioStreams),\n\t\t\t\treplace.NewCmdReplace(f, ioStreams),\n\t\t\t\twait.NewCmdWait(f, ioStreams),\n\t\t\t\tconvert.NewCmdConvert(f, ioStreams),\n\t\t\t\tkustomize.NewCmdKustomize(ioStreams),\n\t\t\t},\n\t\t},\n\t\t{\n\t\t\tMessage: \"Settings Commands:\",\n\t\t\tCommands: []*cobra.Command{\n\t\t\t\tlabel.NewCmdLabel(f, ioStreams),\n\t\t\t\tannotate.NewCmdAnnotate(\"kubectl\", f, ioStreams),\n\t\t\t\tcompletion.NewCmdCompletion(ioStreams.Out, \"\"),\n\t\t\t},\n\t\t},\n\t}\n\t\n    // 4.1.注册上面的子命令\n\tgroups.Add(cmds)\n\n\tfilters := []string{\"options\"}\n    \n    // 4.2 直接使用kubectl alpha 命令可以查看有哪些命令是alpha 阶段的。\n\t// Hide the \"alpha\" subcommand if there are no alpha commands in this build.\n\talpha := cmdpkg.NewCmdAlpha(f, ioStreams)\n\tif !alpha.HasSubCommands() {\n\t\tfilters = append(filters, alpha.Name())\n\t}\n\n\ttemplates.ActsAsRootCommand(cmds, filters, groups...)\n    \n    // 4.3 kubectl代码自动补全\n\tfor name, completion := range bashCompletionFlags {\n\t\tif cmds.Flag(name) != nil {\n\t\t\tif cmds.Flag(name).Annotations == nil {\n\t\t\t\tcmds.Flag(name).Annotations = map[string][]string{}\n\t\t\t}\n\t\t\tcmds.Flag(name).Annotations[cobra.BashCompCustom] = 
append(\n\t\t\t\tcmds.Flag(name).Annotations[cobra.BashCompCustom],\n\t\t\t\tcompletion,\n\t\t\t)\n\t\t}\n\t}\n    \n    // 4.4 添加一些不在默认分组的子命令\n    // 1.alpha命令\n\tcmds.AddCommand(alpha)\n\t// kubectl config命令，见2.2\n\tcmds.AddCommand(cmdconfig.NewCmdConfig(f, clientcmd.NewDefaultPathOptions(), ioStreams))\n\t// 配置kubectl 插件。 例如kubectl debug就是一个插件。  详见：https://kubernetes.io/zh/docs/tasks/extend-kubectl/kubectl-plugins/\n\tcmds.AddCommand(plugin.NewCmdPlugin(f, ioStreams))\n\t\n    // kubectl version输出版本信息\n\tcmds.AddCommand(version.NewCmdVersion(f, ioStreams))\n\t\n\t// 实现 kubectl api-versions, 详见2.3\n\tcmds.AddCommand(apiresources.NewCmdAPIVersions(f, ioStreams))\n\t\n\t// 实现 kubectl api-resources, 详见2.3\n\tcmds.AddCommand(apiresources.NewCmdAPIResources(f, ioStreams))\n\t\n\t// 实现kubectl options， 可以查看子命令可以带哪些options，例如所有命令都会带-v 查看完整日志\n\tcmds.AddCommand(options.NewCmdOptions(ioStreams.Out))\n\n\treturn cmds\n}\n\nfunc runHelp(cmd *cobra.Command, args []string) {\n\tcmd.Help()\n}\n```\n\n#### 2.1 如何设置kubectl自动补全\n\n```\n# linux\nsource <(kubectl completion bash)\necho \"source <(kubectl completion bash)\" >> ~/.bashrc\n#或者\necho \"source <(kubectl completion bash)\" >> /etc/profile\n\n# mac \nsource <(kubectl completion bash)\n# 测试\n[root@node02 ~]# kubectl tab键\nannotate       apply          autoscale      completion     cordon         delete         drain          explain        kustomize      options        port-forward   rollout        set            uncordon\napi-resources  attach         certificate    config         cp             describe       edit           expose         label          patch          proxy          run            taint          version\napi-versions   auth           cluster-info   convert        create         diff           exec           get            logs           plugin         replace        scale          top            wait\n```\n\n#### 2.2 kubectl config\n\n从kubectl config命令可以看出来，config优先级为： \n\n（1）--kubeconfig 
指定的config\n\n（2）KUBECONFIG环境变量\n\n（3）默认的 ${HOME}/.kube/config 文件\n\n```\nroot@k8s-master:~# kubectl config -h\nModify kubeconfig files using subcommands like \"kubectl config set current-context my-context\"\n\n The loading order follows these rules:\n\n  1.  If the --kubeconfig flag is set, then only that file is loaded. The flag may only be set once and no merging takes\nplace.\n  2.  If $KUBECONFIG environment variable is set, then it is used as a list of paths (normal path delimiting rules for\nyour system). These paths are merged. When a value is modified, it is modified in the file that defines the stanza. When\na value is created, it is created in the first file that exists. If no files in the chain exist, then it creates the\nlast file in the list.\n  3.  Otherwise, ${HOME}/.kube/config is used and no merging takes place.\n\nAvailable Commands:\n  current-context Displays the current-context\n  delete-cluster  Delete the specified cluster from the kubeconfig\n  delete-context  Delete the specified context from the kubeconfig\n  get-clusters    Display clusters defined in the kubeconfig\n  get-contexts    Describe one or many contexts\n  rename-context  Renames a context from the kubeconfig file.\n  set             Sets an individual value in a kubeconfig file\n  set-cluster     Sets a cluster entry in kubeconfig\n  set-context     Sets a context entry in kubeconfig\n  set-credentials Sets a user entry in kubeconfig\n  unset           Unsets an individual value in a kubeconfig file\n  use-context     Sets the current-context in a kubeconfig file\n  view            Display merged kubeconfig settings or a specified kubeconfig file\n\nUsage:\n  kubectl config SUBCOMMAND [options]\n\nUse \"kubectl <command> --help\" for more information about a given command.\nUse \"kubectl options\" for a list of global command-line options (applies to all commands).\n```\n\n#### 2.3 kubectl  api-versions\n\n```\nroot@k8s-master:~# kubectl api-versions  
-h\nPrint the supported API versions on the server, in the form of \"group/version\"\n\nExamples:\n  # Print the supported API versions\n  kubectl api-versions\n\nUsage:\n  kubectl api-versions [flags] [options]\n\nUse \"kubectl options\" for a list of global command-line options (applies to all commands).\n```\n\n#### 2.4 kubectl api-resources\n\n```\nroot@k8s-master:~# kubectl api-resources  -h\nPrint the supported API resources on the server\n\nExamples:\n  # Print the supported API Resources\n  kubectl api-resources\n  \n  # Print the supported API Resources with more information\n  kubectl api-resources -o wide\n  \n  # Print the supported API Resources sorted by a column\n  kubectl api-resources --sort-by=name\n  \n  # Print the supported namespaced resources\n  kubectl api-resources --namespaced=true\n  \n  # Print the supported non-namespaced resources\n  kubectl api-resources --namespaced=false\n  \n  # Print the supported API Resources with specific APIGroup\n  kubectl api-resources --api-group=extensions\n\nOptions:\n      --api-group='': Limit to resources in the specified API group.\n      --cached=false: Use the cached list of resources if available.\n      --namespaced=true: If false, non-namespaced resources will be returned, otherwise returning namespaced resources\nby default.\n      --no-headers=false: When using the default or custom-column output format, don't print headers (default print\nheaders).\n  -o, --output='': Output format. One of: wide|name.\n      --sort-by='': If non-empty, sort nodes list using specified field. 
The field can be either 'name' or 'kind'.\n      --verbs=[]: Limit to resources that support the specified verbs.\n\nUsage:\n  kubectl api-resources [flags] [options]\n\nUse \"kubectl options\" for a list of global command-line options (applies to all commands).\n```\n\n#### 2.5 kubectl options\n\n```\nroot@k8s-master:~# kubectl options\nThe following options can be passed to any command:\n\n      --add-dir-header=false: If true, adds the file directory to the header\n      --alsologtostderr=false: log to standard error as well as files\n      --as='': Username to impersonate for the operation\n      --as-group=[]: Group to impersonate for the operation, this flag can be repeated to specify multiple groups.\n      --cache-dir='/root/.kube/http-cache': Default HTTP cache directory\n      --certificate-authority='': Path to a cert file for the certificate authority\n      --client-certificate='': Path to a client certificate file for TLS\n      --client-key='': Path to a client key file for TLS\n      --cluster='': The name of the kubeconfig cluster to use\n      --context='': The name of the kubeconfig context to use\n      --insecure-skip-tls-verify=false: If true, the server's certificate will not be checked for validity. This will\nmake your HTTPS connections insecure\n      --kubeconfig='': Path to the kubeconfig file to use for CLI requests.\n      --log-backtrace-at=:0: when logging hits line file:N, emit a stack trace\n      --log-dir='': If non-empty, write log files in this directory\n      --log-file='': If non-empty, use this log file\n      --log-file-max-size=1800: Defines the maximum size a log file can grow to. Unit is megabytes. 
If the value is 0,\nthe maximum file size is unlimited.\n      --log-flush-frequency=5s: Maximum number of seconds between log flushes\n      --logtostderr=true: log to standard error instead of files\n      --match-server-version=false: Require server version to match client version\n  -n, --namespace='': If present, the namespace scope for this CLI request\n      --password='': Password for basic authentication to the API server\n      --profile='none': Name of profile to capture. One of (none|cpu|heap|goroutine|threadcreate|block|mutex)\n      --profile-output='profile.pprof': Name of the file to write the profile to\n      --request-timeout='0': The length of time to wait before giving up on a single server request. Non-zero values\nshould contain a corresponding time unit (e.g. 1s, 2m, 3h). A value of zero means don't timeout requests.\n  -s, --server='': The address and port of the Kubernetes API server\n      --skip-headers=false: If true, avoid header prefixes in the log messages\n      --skip-log-headers=false: If true, avoid headers when opening log files\n      --stderrthreshold=2: logs at or above this threshold go to stderr\n      --token='': Bearer token for authentication to the API server\n      --user='': The name of the kubeconfig user to use\n      --username='': Username for basic authentication to the API server\n  -v, --v=0: number for the log level verbosity\n      --vmodule=: comma-separated list of pattern=N settings for file-filtered logging\n```\n\n<br>\n\n### 3. 总结\n\nkubectl 代码结构非常清楚。主要就是：\n\n（1）定义kubectl 命令\n\n（2）配置kubeconfig, 并且生成一个Factory f。这一步很重要，接下来单独分析这个过程\n\n（3）定义kubectl 各种子命令"
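前文提到的 SetNormalizeFunc 归一化效果（把 `_` 和 `.` 统一成 `-`），可以用标准库模拟出来。下面是一个不依赖 pflag 的示意：normalizeFlagName 是仿写的演示函数；pflag 里真实的 normalize 函数签名是 func(f *pflag.FlagSet, name string) pflag.NormalizedName，具体替换哪些分隔符以实际实现为准：

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeFlagName 模仿 WordSepNormalizeFunc 的效果：
// 把 flag 名里的 "_" 和 "." 统一替换成 "-"（演示用的仿写，非 pflag 真实实现）
func normalizeFlagName(name string) string {
	return strings.NewReplacer("_", "-", ".", "-").Replace(name)
}

func main() {
	// 用户写成 --des_detail 或 --des.detail，归一化后都等价于 --des-detail
	fmt.Println(normalizeFlagName("des_detail"))
	fmt.Println(normalizeFlagName("des.detail"))
}
```

这样用户无论用哪种分隔符传参，flag 查找时都会命中同一个规范名，也就不会因为写法差异而报错退出。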
  },
  {
    "path": "k8s/kubectl/2-client-go中连接apiserver的4种client介绍.md",
    "content": "Table of Contents\r\n=================\r\n\r\n  * [1. client-go 中4种连接apiserver的客户端](#1-client-go-中4种连接apiserver的客户端)\r\n     * [1.1 restClient客户端](#11-restclient客户端)\r\n     * [1.2 clientSet客户端](#12-clientset客户端)\r\n     * [1.3 DynamicClient客户端](#13-dynamicclient客户端)\r\n     * [1.4 DiscoveryClient客户端](#14-discoveryclient客户端)\r\n\r\n### 1. client-go 中4种连接apiserver的客户端\r\n\r\nclient-go的客户端对象有4个，作用各有不同：\r\n\r\n- RESTClient： 是对HTTP Request进行了封装，实现了RESTful风格的API。其他客户端都是在RESTClient基础上的实现。可与用于k8s内置资源和CRD资源\r\n- ClientSet:是对k8s内置资源对象的客户端的集合，默认情况下，不能操作CRD资源，但是通过client-gen代码生成的话，也是可以操作CRD资源的。\r\n- DynamicClient:不仅能对K8S内置资源进行处理，还可以对CRD资源进行处理，不需要client-gen生成代码即可实现。\r\n- DiscoveryClient：用于发现kube-apiserver所支持的资源组、资源版本、资源信息（即Group、Version、Resources）。\r\n\r\n![client](../images/client.png)\r\n\r\n\r\n\r\nRESTClient是最基础的客户端。RESTClient对HTTP Request进行了封装，实现了RESTful风格的API。ClientSet、DynamicClient及DiscoveryClient客户端都是基于RESTClient实现的。\r\n\r\n\r\n\r\nClientSet在RESTClient的基础上封装了对Resource和Version的管理方法。每一个Resource可以理解为一个客户端，而ClientSet则是多个客户端的集合，每一个Resource和Version都以函数的方式暴露给开发者。ClientSet只能够处理Kubernetes内置资源，它是通过client-gen代码生成器自动生成的。\r\n\r\n\r\n\r\nDynamicClient与ClientSet最大的不同之处是，ClientSet仅能访问Kubernetes自带的资源（即Client集合内的资源），不能直接访问CRD自定义资源。DynamicClient能够处理Kubernetes中的所有资源对象，包括Kubernetes内置资源与CRD自定义资源。\r\n\r\nDiscoveryClient发现客户端，用于发现kube-apiserver所支持的资源组、资源版本、资源信息（即Group、Versions、Resources）。以上4种客户端：RESTClient、ClientSet、DynamicClient、DiscoveryClient都可以通过kubeconfig配置信息连接到指定的KubernetesAPI Server。\r\n\r\n**总结下**：RESTCLient、ClientSet和DynamicClient都可以对K8S内置资源和CRD资源进行操作。只是clientSet需要生成代码才能操作CRD资源。\r\n\r\n而clientSet 和dynamicClient不同在于，dynamicClient可以操作任意的对象，clientset初始化是只能指定一种对象操作。\r\n\r\n<br>\r\n\r\n<br>\r\n\r\n\r\n\r\n#### 1.1 restClient客户端\r\n\r\nrest.RESTClientFor函数通过kubeconfig配置信息实例化RESTClient对象，RESTClient对象构建HTTP请求参数，例如Get函数设置请求方法为get操作，它还支持Post、Put、Delete、Patch等请求方法。\r\n\r\n如下的例子可见，restful需要自己确定url，访问资源。并且 restclient核心是通过RESTClientFor函数实例化的。\r\n\r\n```\r\npackage main\r\n\r\nimport 
(\r\n    \"fmt\"\r\n\r\n    corev1 \"k8s.io/api/core/v1\"\r\n    metav1 \"k8s.io/apimachinery/pkg/apis/meta/v1\"\r\n    \"k8s.io/client-go/kubernetes/scheme\"\r\n    \"k8s.io/client-go/rest\"\r\n    \"k8s.io/client-go/tools/clientcmd\"\r\n)\r\n\r\nfunc main() {\r\n    // 加载kubeconfig文件，生成config对象\r\n    config, err := clientcmd.BuildConfigFromFlags(\"\", \"D:\\\\coding\\\\config\")\r\n\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n    // 配置API路径和请求的资源组/资源版本信息\r\n    config.APIPath = \"api\"\r\n    config.GroupVersion = &corev1.SchemeGroupVersion\r\n    config.NegotiatedSerializer = scheme.Codecs\r\n\r\n    // 通过rest.RESTClientFor()生成RESTClient对象\r\n    restClient, err := rest.RESTClientFor(config)\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    // 通过RESTClient构建请求参数，查询default空间下所有pod资源\r\n    result := &corev1.PodList{}\r\n    err = restClient.Get().\r\n        Namespace(\"default\").\r\n        Resource(\"pods\").\r\n        VersionedParams(&metav1.ListOptions{Limit: 500}, scheme.ParameterCodec).\r\n        Do().\r\n        Into(result)\r\n\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    for _, d := range result.Items {\r\n        fmt.Printf(\"NAMESPACE:%v \\t NAME: %v \\t STATUS: %v\\n\", d.Namespace, d.Name, d.Status.Phase)\r\n    }\r\n}\r\n\r\n// 测试\r\ngo run .\\restClient-example.go\r\nNAMESPACE:default        NAME: nginx-deployment-6b474476c4-lpld7         STATUS: Running\r\nNAMESPACE:default        NAME: nginx-deployment-6b474476c4-t6xl4         STATUS: Running\r\n```\r\n\r\n<br>\r\n\r\n#### 1.2 clientSet客户端\r\n\r\nRESTClient是一种最基础的客户端，使用时需要指定Resource和Version等信息，编写代码时需要提前知道Resource所在的Group和对应的Version信息。相比RESTClient，ClientSet使用起来更加便捷，一般情况下，开发者对Kubernetes进行二次开发时通常使用ClientSet。\r\n\r\n如下的例子可见，clientSet通过 NewForConfig 实现一个客户端。用起来也方便很多。\r\n\r\n```\r\npackage main\r\n\r\nimport (\r\n    \"fmt\"\r\n\r\n    apiv1 \"k8s.io/api/core/v1\"\r\n    metav1 \"k8s.io/apimachinery/pkg/apis/meta/v1\"\r\n    
\"k8s.io/client-go/kubernetes\"\r\n    \"k8s.io/client-go/tools/clientcmd\"\r\n)\r\n\r\nfunc main() {\r\n    // 加载kubeconfig文件，生成config对象\r\n    config, err := clientcmd.BuildConfigFromFlags(\"\", \"D:\\\\coding\\\\config\")\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    // kubernetes.NewForConfig通过config实例化ClientSet对象\r\n    clientset, err := kubernetes.NewForConfig(config)\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    //请求core核心资源组v1资源版本下的Pods资源对象\r\n    podClient := clientset.CoreV1().Pods(apiv1.NamespaceDefault)\r\n    // 设置选项\r\n    list, err := podClient.List(metav1.ListOptions{Limit: 500})\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    for _, d := range list.Items {\r\n        fmt.Printf(\"NAMESPACE: %v \\t NAME:%v \\t STATUS: %+v\\n\", d.Namespace, d.Name, d.Status.Phase)\r\n    }\r\n}\r\n\r\n// 测试\r\ngo run .\\clientSet-example.go\r\n\r\nNAMESPACE: default       NAME:nginx-deployment-6b474476c4-lpld7          STATUS: Running\r\nNAMESPACE: default       NAME:nginx-deployment-6b474476c4-t6xl4          STATUS: Running\r\n```\r\n\r\n<br>\r\n\r\n#### 1.3 DynamicClient客户端\r\n\r\nDynamicClient是一种动态客户端，它可以对任意Kubernetes资源进行RESTful操作，包括CRD自定义资源。DynamicClient与ClientSet操作类似，同样封装了RESTClient，同样提供了Create、Update、Delete、Get、List、Watch、Patch等方法。DynamicClient与ClientSet最大的不同之处是，ClientSet仅能访问Kubernetes自带的资源（即客户端集合内的资源），不能直接访问CRD自定义资源。ClientSet需要预先实现每种Resource和Version的操作，其内部的数据都是结构化数据（即已知数据结构）。而DynamicClient内部实现了Unstructured，用于处理非结构化数据结构（即无法提前预知数据结构），这也是DynamicClient能够处理CRD自定义资源的关键。\r\n\r\n**注意：**\r\n\r\n* DynamicClient获得的数据都是一个object类型。存的时候是 unstructured\r\n* DynamicClient不是类型安全的，因此在访问CRD自定义资源时需要特别注意。例如，在操作指针不当的情况下可能会导致程序崩溃。\r\n\r\n```\r\npackage main\r\n\r\nimport (\r\n    \"fmt\"\r\n\r\n    apiv1 \"k8s.io/api/core/v1\"\r\n    corev1 \"k8s.io/api/core/v1\"\r\n    metav1 \"k8s.io/apimachinery/pkg/apis/meta/v1\"\r\n\r\n    \"k8s.io/apimachinery/pkg/runtime\"\r\n    \"k8s.io/apimachinery/pkg/runtime/schema\"\r\n    
\"k8s.io/client-go/dynamic\"\r\n    \"k8s.io/client-go/tools/clientcmd\"\r\n)\r\n\r\nfunc main() {\r\n    // Load the kubeconfig file and build a config object\r\n    config, err := clientcmd.BuildConfigFromFlags(\"\", \"D:\\\\coding\\\\config\")\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    // dynamic.NewForConfig instantiates a dynamicClient object from the config\r\n    dynamicClient, err := dynamic.NewForConfig(config)\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    // schema.GroupVersionResource sets the group/version/resource to request; with the namespace and list options, this returns the PodList as an unstructured.UnstructuredList pointer\r\n    gvr := schema.GroupVersionResource{Version: \"v1\", Resource: \"pods\"}\r\n    unstructObj, err := dynamicClient.Resource(gvr).Namespace(apiv1.NamespaceDefault).List(metav1.ListOptions{Limit: 500})\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    // runtime.DefaultUnstructuredConverter converts the unstructured.UnstructuredList into a PodList\r\n    podList := &corev1.PodList{}\r\n    err = runtime.DefaultUnstructuredConverter.FromUnstructured(unstructObj.UnstructuredContent(), podList)\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    for _, d := range podList.Items {\r\n        fmt.Printf(\"NAMESPACE: %v NAME:%v \\t STATUS: %+v\\n\", d.Namespace, d.Name, d.Status.Phase)\r\n    }\r\n}\r\n\r\n// Test\r\ngo run .\\dynamicClient-example.go\r\nNAMESPACE: default NAME:nginx-deployment-6b474476c4-lpld7        STATUS: Running\r\nNAMESPACE: default NAME:nginx-deployment-6b474476c4-t6xl4        STATUS: Running\r\n```\r\n\r\n<br>\r\n\r\n#### 1.4 DiscoveryClient\r\n\r\nDiscoveryClient is a discovery client: its main job is to discover the resource groups, versions and resources that the Kubernetes API Server supports. The API Server supports a great many groups, versions and resources, and it is hard for a developer to remember them all; the DiscoveryClient can be used to look them up. The output of kubectl's api-versions and api-resources commands is also produced via the DiscoveryClient. Like the other clients, it is a wrapper around RESTClient. Besides discovering what the API Server supports, the DiscoveryClient can also store that information locally as a cache to reduce the load on the Kubernetes API Server. On machines that run Kubernetes components, the cache is stored under ~/.kube/cache and ~/.kube/http-cache by default.\r\n\r\n```\r\npackage main\r\n\r\nimport (\r\n    \"fmt\"\r\n\r\n    \"k8s.io/apimachinery/pkg/runtime/schema\"\r\n    \"k8s.io/client-go/discovery\"\r\n    \"k8s.io/client-go/tools/clientcmd\"\r\n)\r\n\r\nfunc main() {\r\n    // Load the kubeconfig file and build a config object\r\n    config, err := clientcmd.BuildConfigFromFlags(\"\", \"D:\\\\coding\\\\config\")\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    // discovery.NewDiscoveryClientForConfig instantiates a discoveryClient object from the config\r\n    discoveryClient, err := discovery.NewDiscoveryClientForConfig(config)\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    // discoveryClient.ServerGroupsAndResources returns the groups, versions and resources supported by the API Server\r\n    _, APIResourceList, err := discoveryClient.ServerGroupsAndResources()\r\n    if err != nil {\r\n        panic(err)\r\n    }\r\n\r\n    // Print all resources\r\n    for _, list := range APIResourceList {\r\n        gv, err := schema.ParseGroupVersion(list.GroupVersion)\r\n        if err != nil {\r\n            panic(err)\r\n        }\r\n\r\n        for _, resource := range list.APIResources {\r\n            fmt.Printf(\"NAME: %v, GROUP: %v, VERSION: %v \\n\", resource.Name, gv.Group, gv.Version)\r\n        }\r\n    }\r\n}\r\n\r\n\r\n// Test\r\ngo run .\\discoveryClient-example.go\r\nNAME: bindings, GROUP: , VERSION: v1 \r\nNAME: componentstatuses, GROUP: , VERSION: v1 \r\nNAME: configmaps, GROUP: , VERSION: v1\r\nNAME: endpoints, GROUP: , VERSION: v1\r\nNAME: events, GROUP: , VERSION: v1\r\nNAME: limitranges, GROUP: , VERSION: v1\r\nNAME: namespaces, GROUP: , VERSION: v1\r\nNAME: namespaces/finalize, GROUP: , VERSION: v1\r\nNAME: namespaces/status, GROUP: , VERSION: v1\r\nNAME: nodes, GROUP: , VERSION: v1\r\nNAME: nodes/proxy, GROUP: , VERSION: v1\r\nNAME: nodes/status, GROUP: , VERSION: v1\r\nNAME: persistentvolumeclaims, GROUP: , VERSION: v1\r\nNAME: persistentvolumeclaims/status, GROUP: , VERSION: v1\r\nNAME: 
persistentvolumes, GROUP: , VERSION: v1\r\nNAME: persistentvolumes/status, GROUP: , VERSION: v1\r\nNAME: pods, GROUP: , VERSION: v1\r\nNAME: pods/attach, GROUP: , VERSION: v1\r\nNAME: pods/binding, GROUP: , VERSION: v1\r\nNAME: pods/eviction, GROUP: , VERSION: v1\r\nNAME: pods/exec, GROUP: , VERSION: v1\r\nNAME: pods/log, GROUP: , VERSION: v1\r\nNAME: pods/portforward, GROUP: , VERSION: v1\r\nNAME: pods/proxy, GROUP: , VERSION: v1\r\nNAME: pods/status, GROUP: , VERSION: v1\r\nNAME: podtemplates, GROUP: , VERSION: v1\r\nNAME: replicationcontrollers, GROUP: , VERSION: v1\r\nNAME: replicationcontrollers/scale, GROUP: , VERSION: v1\r\nNAME: replicationcontrollers/status, GROUP: , VERSION: v1\r\nNAME: resourcequotas, GROUP: , VERSION: v1\r\nNAME: resourcequotas/status, GROUP: , VERSION: v1\r\nNAME: secrets, GROUP: , VERSION: v1\r\nNAME: serviceaccounts, GROUP: , VERSION: v1\r\nNAME: services, GROUP: , VERSION: v1\r\nNAME: services/proxy, GROUP: , VERSION: v1\r\nNAME: services/status, GROUP: , VERSION: v1\r\nNAME: apiservices, GROUP: apiregistration.k8s.io, VERSION: v1\r\nNAME: apiservices/status, GROUP: apiregistration.k8s.io, VERSION: v1\r\nNAME: apiservices, GROUP: apiregistration.k8s.io, VERSION: v1beta1 \r\nNAME: apiservices/status, GROUP: apiregistration.k8s.io, VERSION: v1beta1\r\nNAME: ingresses, GROUP: extensions, VERSION: v1beta1\r\nNAME: ingresses/status, GROUP: extensions, VERSION: v1beta1\r\nNAME: controllerrevisions, GROUP: apps, VERSION: v1\r\nNAME: daemonsets, GROUP: apps, VERSION: v1\r\nNAME: daemonsets/status, GROUP: apps, VERSION: v1\r\nNAME: deployments, GROUP: apps, VERSION: v1\r\nNAME: deployments/scale, GROUP: apps, VERSION: v1\r\nNAME: deployments/status, GROUP: apps, VERSION: v1\r\nNAME: replicasets, GROUP: apps, VERSION: v1\r\nNAME: replicasets/scale, GROUP: apps, VERSION: v1\r\nNAME: replicasets/status, GROUP: apps, VERSION: v1\r\nNAME: statefulsets, GROUP: apps, VERSION: v1\r\nNAME: statefulsets/scale, GROUP: apps, VERSION: v1\r\nNAME: 
statefulsets/status, GROUP: apps, VERSION: v1\r\nNAME: events, GROUP: events.k8s.io, VERSION: v1beta1\r\nNAME: tokenreviews, GROUP: authentication.k8s.io, VERSION: v1\r\nNAME: tokenreviews, GROUP: authentication.k8s.io, VERSION: v1beta1\r\nNAME: localsubjectaccessreviews, GROUP: authorization.k8s.io, VERSION: v1\r\nNAME: selfsubjectaccessreviews, GROUP: authorization.k8s.io, VERSION: v1\r\nNAME: selfsubjectrulesreviews, GROUP: authorization.k8s.io, VERSION: v1\r\nNAME: subjectaccessreviews, GROUP: authorization.k8s.io, VERSION: v1\r\nNAME: localsubjectaccessreviews, GROUP: authorization.k8s.io, VERSION: v1beta1\r\nNAME: selfsubjectaccessreviews, GROUP: authorization.k8s.io, VERSION: v1beta1\r\nNAME: selfsubjectrulesreviews, GROUP: authorization.k8s.io, VERSION: v1beta1\r\nNAME: subjectaccessreviews, GROUP: authorization.k8s.io, VERSION: v1beta1\r\nNAME: horizontalpodautoscalers, GROUP: autoscaling, VERSION: v1\r\nNAME: horizontalpodautoscalers/status, GROUP: autoscaling, VERSION: v1\r\nNAME: horizontalpodautoscalers, GROUP: autoscaling, VERSION: v2beta1\r\nNAME: horizontalpodautoscalers/status, GROUP: autoscaling, VERSION: v2beta1\r\nNAME: horizontalpodautoscalers, GROUP: autoscaling, VERSION: v2beta2\r\nNAME: horizontalpodautoscalers/status, GROUP: autoscaling, VERSION: v2beta2\r\nNAME: jobs, GROUP: batch, VERSION: v1\r\nNAME: jobs/status, GROUP: batch, VERSION: v1\r\nNAME: cronjobs, GROUP: batch, VERSION: v1beta1\r\nNAME: cronjobs/status, GROUP: batch, VERSION: v1beta1\r\nNAME: certificatesigningrequests, GROUP: certificates.k8s.io, VERSION: v1beta1\r\nNAME: certificatesigningrequests/approval, GROUP: certificates.k8s.io, VERSION: v1beta1\r\nNAME: certificatesigningrequests/status, GROUP: certificates.k8s.io, VERSION: v1beta1\r\nNAME: networkpolicies, GROUP: networking.k8s.io, VERSION: v1\r\nNAME: ingressclasses, GROUP: networking.k8s.io, VERSION: v1beta1\r\nNAME: ingresses, GROUP: networking.k8s.io, VERSION: v1beta1\r\nNAME: ingresses/status, GROUP: 
networking.k8s.io, VERSION: v1beta1\r\nNAME: poddisruptionbudgets, GROUP: policy, VERSION: v1beta1\r\nNAME: poddisruptionbudgets/status, GROUP: policy, VERSION: v1beta1\r\nNAME: podsecuritypolicies, GROUP: policy, VERSION: v1beta1\r\nNAME: clusterrolebindings, GROUP: rbac.authorization.k8s.io, VERSION: v1\r\nNAME: clusterroles, GROUP: rbac.authorization.k8s.io, VERSION: v1\r\nNAME: rolebindings, GROUP: rbac.authorization.k8s.io, VERSION: v1\r\nNAME: roles, GROUP: rbac.authorization.k8s.io, VERSION: v1\r\nNAME: clusterrolebindings, GROUP: rbac.authorization.k8s.io, VERSION: v1beta1\r\nNAME: clusterroles, GROUP: rbac.authorization.k8s.io, VERSION: v1beta1\r\nNAME: rolebindings, GROUP: rbac.authorization.k8s.io, VERSION: v1beta1\r\nNAME: roles, GROUP: rbac.authorization.k8s.io, VERSION: v1beta1\r\nNAME: csidrivers, GROUP: storage.k8s.io, VERSION: v1\r\nNAME: csinodes, GROUP: storage.k8s.io, VERSION: v1\r\nNAME: storageclasses, GROUP: storage.k8s.io, VERSION: v1\r\nNAME: volumeattachments, GROUP: storage.k8s.io, VERSION: v1\r\nNAME: volumeattachments/status, GROUP: storage.k8s.io, VERSION: v1 \r\nNAME: csidrivers, GROUP: storage.k8s.io, VERSION: v1beta1\r\nNAME: csinodes, GROUP: storage.k8s.io, VERSION: v1beta1\r\nNAME: storageclasses, GROUP: storage.k8s.io, VERSION: v1beta1\r\nNAME: volumeattachments, GROUP: storage.k8s.io, VERSION: v1beta1\r\nNAME: mutatingwebhookconfigurations, GROUP: admissionregistration.k8s.io, VERSION: v1\r\nNAME: validatingwebhookconfigurations, GROUP: admissionregistration.k8s.io, VERSION: v1\r\nNAME: mutatingwebhookconfigurations, GROUP: admissionregistration.k8s.io, VERSION: v1beta1\r\nNAME: validatingwebhookconfigurations, GROUP: admissionregistration.k8s.io, VERSION: v1beta1\r\nNAME: customresourcedefinitions, GROUP: apiextensions.k8s.io, VERSION: v1\r\nNAME: customresourcedefinitions/status, GROUP: apiextensions.k8s.io, VERSION: v1\r\nNAME: customresourcedefinitions, GROUP: apiextensions.k8s.io, VERSION: v1beta1\r\nNAME: 
customresourcedefinitions/status, GROUP: apiextensions.k8s.io, VERSION: v1beta1\r\nNAME: priorityclasses, GROUP: scheduling.k8s.io, VERSION: v1\r\nNAME: priorityclasses, GROUP: scheduling.k8s.io, VERSION: v1beta1\r\nNAME: leases, GROUP: coordination.k8s.io, VERSION: v1\r\nNAME: leases, GROUP: coordination.k8s.io, VERSION: v1beta1\r\nNAME: runtimeclasses, GROUP: node.k8s.io, VERSION: v1beta1\r\nNAME: endpointslices, GROUP: discovery.k8s.io, VERSION: v1beta1\r\n```\r\n\r\n"
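Many rows in the listing above have an empty GROUP: those resources belong to the core group, whose GroupVersion string contains no group name. The `schema.ParseGroupVersion` call in the example essentially splits the GroupVersion string on `/`; the sketch below (standard library only, an illustration rather than the actual client-go implementation) shows the idea:

```go
package main

import (
	"fmt"
	"strings"
)

// parseGroupVersion splits a GroupVersion string on "/":
// "apps/v1" -> group "apps", version "v1".
// Core-group strings like "v1" carry no group name, so the group is empty.
func parseGroupVersion(gv string) (group, version string) {
	if i := strings.Index(gv, "/"); i >= 0 {
		return gv[:i], gv[i+1:]
	}
	return "", gv
}

func main() {
	for _, gv := range []string{"v1", "apps/v1", "rbac.authorization.k8s.io/v1"} {
		g, v := parseGroupVersion(gv)
		fmt.Printf("GROUP: %v, VERSION: %v\n", g, v)
	}
}
```

This is why the DiscoveryClient example prints `GROUP: , VERSION: v1` for pods, services, and the other core resources.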
  },
  {
    "path": "k8s/kubectl/3-kubectl Factory机制-上.md",
"content": "Table of Contents\n=================\n\n  * [1. Background](#1-background)\n  * [2 kubectl's kubeconfig](#2-kubectls-kubeconfig)\n     * [2.1 kubeConfigFlags](#21-kubeconfigflags)\n        * [2.1.1 ToRESTConfig](#211-torestconfig)\n        * [2.1.2 ToDiscoveryClient](#212-todiscoveryclient)\n        * [2.1.3 ToRESTMapper](#213-torestmapper)\n        * [2.1.4 How discovery and the RESTMapper are used when creating and deleting resources](#214-how-discovery-and-the-restmapper-are-used-when-creating-and-deleting-resources)\n  * [3. Summary](#3-summary)\n\n### 1. Background\n\nAs the first article on kubectl's overall flow showed, kubectl does two things before defining its subcommands; together they form what is commonly called the Factory mechanism:\n\n(1) Set up kubeConfigFlags, used to connect to the apiserver\n\n(2) Use kubeConfigFlags to build a Factory f\n\nThe previous article added some background and introduced the four client-go clients for connecting to the apiserver. This note walks through the Factory mechanism at the source level.\n\n```\n  // 2. Set up kubeConfigFlags, used to connect to the apiserver\n\tkubeConfigFlags := genericclioptions.NewConfigFlags(true).WithDeprecatedPasswordFlag()\n\tkubeConfigFlags.AddFlags(flags)\n\n  // 3. Use kubeConfigFlags to build a Factory f. f holds the clients that talk to the apiserver, and every subcommand uses this f for its subsequent operations.\n\tmatchVersionKubeConfigFlags := cmdutil.NewMatchVersionFlags(kubeConfigFlags)\n\tmatchVersionKubeConfigFlags.AddFlags(cmds.PersistentFlags())\n\tcmds.PersistentFlags().AddGoFlagSet(flag.CommandLine)\n\n\tf := cmdutil.NewFactory(matchVersionKubeConfigFlags)\n```\n\n### 2 kubectl's kubeconfig\n\n#### 2.1 kubeConfigFlags\n\nConfigFlags is the key to building the Factory, so let's look at it first.\n\n**(1) Data structure**\n\n```\n// ConfigFlags composes the set of values necessary\n// for obtaining a REST client config\ntype ConfigFlags struct {\n\tCacheDir   *string\n\tKubeConfig *string\n\n\t// config flags\n\tClusterName      *string\n\tAuthInfoName     *string\n\tContext          *string\n\tNamespace        *string\n\tAPIServer        *string\n\tInsecure         *bool\n\tCertFile         *string\n\tKeyFile          *string\n\tCAFile           *string\n\tBearerToken      *string\n\tImpersonate      *string\n\tImpersonateGroup *[]string\n\tUsername         *string\n\tPassword         *string\n\tTimeout          *string\n\n\tclientConfig clientcmd.ClientConfig\n\tlock       
  sync.Mutex\n\t// If set to true, will use persistent client config and\n\t// propagate the config to the places that need it, rather than\n\t// loading the config multiple times\n\tusePersistentConfig bool\n}\n```\n\n<br>\n\nBy adding print statements, we can see that when kubectl runs, almost all of these fields are empty by default (a few, such as CacheDir, have default values).\n\n```\n  // Set up kubeConfigFlags, used to connect to the apiserver\n\tkubeConfigFlags := genericclioptions.NewConfigFlags(true).WithDeprecatedPasswordFlag()\n\tkubeConfigFlags.AddFlags(flags)\n\t\n\t// Add print statements\n\tklog.Errorf(\"zoux Namespace is: %v,APIServer is %v,AuthInfoName is %v,BearerToken is %v,CacheDir is %v,\", *kubeConfigFlags.Namespace, *kubeConfigFlags.APIServer,*kubeConfigFlags.AuthInfoName, *kubeConfigFlags.BearerToken,*kubeConfigFlags.CacheDir)\n\tklog.Errorf(\"zoux CAFile is: %v,CertFile is %v,ClusterName is %v,Context is %v,Impersonate is %v,\", \t*kubeConfigFlags.CAFile,*kubeConfigFlags.CertFile,*kubeConfigFlags.ClusterName,*kubeConfigFlags.Context,*kubeConfigFlags.Impersonate)\n\tklog.Errorf(\"zoux Insecure is: %v,KeyFile is %v,KubeConfig is %v,Password is %v,Timeout is %v,Username is %v\", \t*kubeConfigFlags.Insecure,*kubeConfigFlags.KeyFile,*kubeConfigFlags.KubeConfig,*kubeConfigFlags.Password,*kubeConfigFlags.Timeout,*kubeConfigFlags.Username)\n\n\t\n## Default output for kubectl create\nE1105 16:47:01.248983   13836 cmd.go:470] zoux Namespace is: ,APIServer is ,AuthInfoName is ,BearerToken is ,CacheDir is /root/.kube/http-cache,\nE1105 16:47:01.249066   13836 cmd.go:471] zoux CAFile is: ,CertFile is ,ClusterName is ,Context is ,Impersonate is ,\nE1105 16:47:01.249070   13836 cmd.go:472] zoux Insecure is: false,KeyFile is ,KubeConfig is ,Password is ,Timeout is 0,Username is\n```\n\n**(2) Implements the RESTClientGetter interface**\n\n```\ntype RESTClientGetter interface {\n\t// ToRESTConfig returns restconfig\n\tToRESTConfig() (*rest.Config, error)        \n\t// ToDiscoveryClient returns discovery client\n\tToDiscoveryClient() (discovery.CachedDiscoveryInterface, error)\n\t// ToRESTMapper returns a restmapper\n\tToRESTMapper() (meta.RESTMapper, 
error)\n\t// ToRawKubeConfigLoader return kubeconfig loader as-is\n\t// intermediate helper for loading kubeconfig; ToRESTConfig calls it, so it is not analyzed separately below\n\tToRawKubeConfigLoader() clientcmd.ClientConfig    \n}\n```\n\n<br>\n\n##### 2.1.1 ToRESTConfig\n\nToRESTConfig returns a rest.Config. From the code we can see:\n\n(1) If --kubeconfig is passed to kubectl, it takes precedence\n\n(2) Otherwise the files listed in the KUBECONFIG environment variable are used\n\n(3) Otherwise the default config /root/.kube/config is used\n\n(4) The remaining settings are then filled in\n\n**After parsing the file, it determines the current context to use, the cluster it currently points to, and all the authentication information associated with the current user.** If the user supplies extra flags (such as `--username`), those values take precedence and override what the kubeconfig specifies.\n\nOnce it has this information, kubectl populates the client configuration so that it can decorate HTTP requests appropriately:\n\n- x509 certificates are sent using `tls.TLSConfig` (including the CA certificate);\n- bearer tokens are sent in the Authorization HTTP request header;\n- username and password are sent via HTTP basic authentication;\n- the OpenID flow is handled manually by the user in advance, producing a token that is sent like a bearer token.\n\n```\n// ToRESTConfig implements RESTClientGetter.\n// Returns a REST client configuration based on a provided path\n// to a .kubeconfig file, loading rules, and config flag overrides.\n// Expects the AddFlags method to have been called.\nfunc (f *ConfigFlags) ToRESTConfig() (*rest.Config, error) {\n\treturn f.ToRawKubeConfigLoader().ClientConfig()\n}\n\n// ToRawKubeConfigLoader binds config flag values to config overrides\n// Returns an interactive clientConfig if the password flag is enabled,\n// or a non-interactive clientConfig otherwise.\nfunc (f *ConfigFlags) ToRawKubeConfigLoader() clientcmd.ClientConfig {\n\tif f.usePersistentConfig {\n\t\treturn f.toRawKubePersistentConfigLoader()\n\t}\n\treturn f.toRawKubeConfigLoader()\n}\n\nfunc (f *ConfigFlags) toRawKubeConfigLoader() clientcmd.ClientConfig {\n  // 1. Default kubeconfig loading rules\n\tloadingRules := clientcmd.NewDefaultClientConfigLoadingRules()\n\t// use the standard defaults for this client command\n\t// DEPRECATED: remove and replace with something more accurate\n\tloadingRules.DefaultClientConfig = &clientcmd.DefaultClientConfig\n\n\tif f.KubeConfig != nil {\n\t\tloadingRules.ExplicitPath = *f.KubeConfig\n\t}\n\n  // 2. Apply command-line overrides; a --kubeconfig path set here takes precedence over the default rules\n\toverrides := 
&clientcmd.ConfigOverrides{ClusterDefaults: clientcmd.ClusterDefaults}\n\n\t...\n\n    // 3. Fill in the remaining settings\n  \t// bind auth info flag values to overrides\n\tif f.CertFile != nil {\n\t\toverrides.AuthInfo.ClientCertificate = *f.CertFile\n\t}\n\tif f.KeyFile != nil {\n\t\toverrides.AuthInfo.ClientKey = *f.KeyFile\n\t}\n\n\treturn clientConfig\n}\n\n\n// The default rule: files from the KUBECONFIG environment variable take precedence over the default config file /root/.kube/config\n// NewDefaultClientConfigLoadingRules returns a ClientConfigLoadingRules object with default fields filled in.  You are not required to\n// use this constructor\nfunc NewDefaultClientConfigLoadingRules() *ClientConfigLoadingRules {\n\tchain := []string{}\n\twarnIfAllMissing := false\n\n\tenvVarFiles := os.Getenv(RecommendedConfigPathEnvVar)\n\tif len(envVarFiles) != 0 {\n\t\tfileList := filepath.SplitList(envVarFiles)\n\t\t// prevent the same path load multiple times\n\t\tchain = append(chain, deduplicate(fileList)...)\n\t\twarnIfAllMissing = true\n\n\t} else {\n\t\tchain = append(chain, RecommendedHomeFile)\n\t}\n\n\treturn &ClientConfigLoadingRules{\n\t\tPrecedence:       chain,\n\t\tMigrationRules:   currentMigrationRules(),\n\t\tWarnIfAllMissing: warnIfAllMissing,\n\t}\n}\n // legacy default paths used by the migration rules ($HOME/.kube/.kubeconfig)\n\toldRecommendedHomeFile := path.Join(os.Getenv(\"HOME\"), \"/.kube/.kubeconfig\")\n\toldRecommendedWindowsHomeFile := path.Join(os.Getenv(\"HOME\"), RecommendedHomeDir, RecommendedFileName)\n```\n\n##### 2.1.2 ToDiscoveryClient\n\nBased on the kubeconfig above, this returns a DiscoveryClient.\n\nDiscoveryClient is a discovery client: its main job is to discover the resource groups, versions and resources that the Kubernetes API Server supports. The API Server supports a great many groups, versions and resources, and it is hard for a developer to remember them all; the DiscoveryClient can be used to look them up. The output of kubectl's api-versions and api-resources commands is also produced via the DiscoveryClient. Like the other clients, it is a wrapper around RESTClient. Besides discovering what the API Server supports, the DiscoveryClient can also store that information locally as a cache to reduce the load on the Kubernetes API Server. On machines that run Kubernetes components, the cache is stored under ~/.kube/cache and ~/.kube/http-cache by default.\n\nSee the client-go chapter for details on what the DiscoveryClient does and how to use it.\n\n```\n// ToDiscoveryClient implements RESTClientGetter.\n// Expects the AddFlags method to have been called.\n// Returns a CachedDiscoveryInterface using a computed RESTConfig.\nfunc (f *ConfigFlags) ToDiscoveryClient() (discovery.CachedDiscoveryInterface, error) {\n\tconfig, err := f.ToRESTConfig()\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\t// The more groups you have, the more discovery requests you need to make.\n\t// given 25 groups (our groups + a few custom resources) with one-ish version each, discovery needs to make 50 requests\n\t// double it just so we don't end up here again for a while.  This config is only used for discovery.\n\tconfig.Burst = 100\n\n\t// retrieve a user-provided value for the \"cache-dir\"\n\t// defaulting to ~/.kube/http-cache if no user-value is given.\n\thttpCacheDir := defaultCacheDir\n\tif f.CacheDir != nil {\n\t\thttpCacheDir = *f.CacheDir\n\t}\n\n\tdiscoveryCacheDir := computeDiscoverCacheDir(filepath.Join(homedir.HomeDir(), \".kube\", \"cache\", \"discovery\"), config.Host)\n\treturn diskcached.NewCachedDiscoveryClientForConfig(config, discoveryCacheDir, httpCacheDir, time.Duration(10*time.Minute))\n}\n```\n\n**kubectl's two caches**\n\nK8s manages resource APIs with API groups. This is a different management style from a monolithic API (where all APIs are flat).\n\nConcretely, the different versions of a resource's API live in one group. For example, the API group of the Deployment resource is named apps, and its latest version is v1. This is why we have to specify apiVersion: apps/v1 in the yaml when creating a Deployment.\n\nFor performance, kubectl caches this discovery information under ~/.kube/cache/discovery. To watch the API discovery process, delete that directory and run any kubectl command with a high enough log level (e.g. kubectl get ds -v 10); kubectl will then re-fetch and re-cache the information.\n\n```\nroot@k8s-master:~/.kube/cache/discovery/localhost_8080# ls\nadmissionregistration.k8s.io  batch\t\t   node.k8s.io\napiextensions.k8s.io\t      certificates.k8s.io  policy\napiregistration.k8s.io\t      coordination.k8s.io  rbac.authorization.k8s.io\napps\t\t\t      
discovery.k8s.io\t   scheduling.k8s.io\nauthentication.k8s.io\t      events.k8s.io\t   servergroups.json\nauthorization.k8s.io\t      extensions\t   storage.k8s.io\nautoscaling\t\t      networking.k8s.io    v1\nroot@k8s-master:~/.kube/cache/discovery/localhost_8080#\nroot@k8s-master:~/.kube/cache/discovery/localhost_8080# pwd\n/root/.kube/cache/discovery/localhost_8080\n\n\n// 缓存了resource\nroot@k8s-master:~/.kube/cache/discovery/localhost_8080# cd v1/\nroot@k8s-master:~/.kube/cache/discovery/localhost_8080/v1# cat serverresources.json\n{\n\t\"kind\": \"APIResourceList\",\n\t\"apiVersion\": \"v1\",\n\t\"groupVersion\": \"v1\",\n\t\"resources\": [{\n\t\t\"name\": \"bindings\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"Binding\",\n\t\t\"verbs\": [\"create\"]\n\t}, {\n\t\t\"name\": \"componentstatuses\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": false,\n\t\t\"kind\": \"ComponentStatus\",\n\t\t\"verbs\": [\"get\", \"list\"],\n\t\t\"shortNames\": [\"cs\"]\n\t}, {\n\t\t\"name\": \"configmaps\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"ConfigMap\",\n\t\t\"verbs\": [\"create\", \"delete\", \"deletecollection\", \"get\", \"list\", \"patch\", \"update\", \"watch\"],\n\t\t\"shortNames\": [\"cm\"],\n\t\t\"storageVersionHash\": \"qFsyl6wFWjQ=\"\n\t}, {\n\t\t\"name\": \"endpoints\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"Endpoints\",\n\t\t\"verbs\": [\"create\", \"delete\", \"deletecollection\", \"get\", \"list\", \"patch\", \"update\", \"watch\"],\n\t\t\"shortNames\": [\"ep\"],\n\t\t\"storageVersionHash\": \"fWeeMqaN/OA=\"\n\t}, {\n\t\t\"name\": \"events\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"Event\",\n\t\t\"verbs\": [\"create\", \"delete\", \"deletecollection\", \"get\", \"list\", \"patch\", \"update\", \"watch\"],\n\t\t\"shortNames\": [\"ev\"],\n\t\t\"storageVersionHash\": \"r2yiGXH7wu8=\"\n\t}, {\n\t\t\"name\": 
\"limitranges\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"LimitRange\",\n\t\t\"verbs\": [\"create\", \"delete\", \"deletecollection\", \"get\", \"list\", \"patch\", \"update\", \"watch\"],\n\t\t\"shortNames\": [\"limits\"],\n\t\t\"storageVersionHash\": \"EBKMFVe6cwo=\"\n\t}, {\n\t\t\"name\": \"namespaces\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": false,\n\t\t\"kind\": \"Namespace\",\n\t\t\"verbs\": [\"create\", \"delete\", \"get\", \"list\", \"patch\", \"update\", \"watch\"],\n\t\t\"shortNames\": [\"ns\"],\n\t\t\"storageVersionHash\": \"Q3oi5N2YM8M=\"\n\t}, {\n\t\t\"name\": \"namespaces/finalize\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": false,\n\t\t\"kind\": \"Namespace\",\n\t\t\"verbs\": [\"update\"]\n\t}, {\n\t\t\"name\": \"namespaces/status\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": false,\n\t\t\"kind\": \"Namespace\",\n\t\t\"verbs\": [\"get\", \"patch\", \"update\"]\n\t}, {\n\t\t\"name\": \"nodes\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": false,\n\t\t\"kind\": \"Node\",\n\t\t\"verbs\": [\"create\", \"delete\", \"deletecollection\", \"get\", \"list\", \"patch\", \"update\", \"watch\"],\n\t\t\"shortNames\": [\"no\"],\n\t\t\"storageVersionHash\": \"XwShjMxG9Fs=\"\n\t}, {\n\t\t\"name\": \"nodes/proxy\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": false,\n\t\t\"kind\": \"NodeProxyOptions\",\n\t\t\"verbs\": [\"create\", \"delete\", \"get\", \"patch\", \"update\"]\n\t}, {\n\t\t\"name\": \"nodes/status\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": false,\n\t\t\"kind\": \"Node\",\n\t\t\"verbs\": [\"get\", \"patch\", \"update\"]\n\t}, {\n\t\t\"name\": \"persistentvolumeclaims\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"PersistentVolumeClaim\",\n\t\t\"verbs\": [\"create\", \"delete\", \"deletecollection\", \"get\", \"list\", \"patch\", \"update\", \"watch\"],\n\t\t\"shortNames\": [\"pvc\"],\n\t\t\"storageVersionHash\": \"QWTyNDq0dC4=\"\n\t}, 
{\n\t\t\"name\": \"persistentvolumeclaims/status\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"PersistentVolumeClaim\",\n\t\t\"verbs\": [\"get\", \"patch\", \"update\"]\n\t}, {\n\t\t\"name\": \"persistentvolumes\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": false,\n\t\t\"kind\": \"PersistentVolume\",\n\t\t\"verbs\": [\"create\", \"delete\", \"deletecollection\", \"get\", \"list\", \"patch\", \"update\", \"watch\"],\n\t\t\"shortNames\": [\"pv\"],\n\t\t\"storageVersionHash\": \"HN/zwEC+JgM=\"\n\t}, {\n\t\t\"name\": \"persistentvolumes/status\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": false,\n\t\t\"kind\": \"PersistentVolume\",\n\t\t\"verbs\": [\"get\", \"patch\", \"update\"]\n\t}, {\n\t\t\"name\": \"pods\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"Pod\",\n\t\t\"verbs\": [\"create\", \"delete\", \"deletecollection\", \"get\", \"list\", \"patch\", \"update\", \"watch\"],\n\t\t\"shortNames\": [\"po\"],\n\t\t\"categories\": [\"all\"],\n\t\t\"storageVersionHash\": \"xPOwRZ+Yhw8=\"\n\t}, {\n\t\t\"name\": \"pods/attach\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"PodAttachOptions\",\n\t\t\"verbs\": [\"create\", \"get\"]\n\t}, {\n\t\t\"name\": \"pods/binding\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"Binding\",\n\t\t\"verbs\": [\"create\"]\n\t}, {\n\t\t\"name\": \"pods/eviction\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"group\": \"policy\",\n\t\t\"version\": \"v1beta1\",\n\t\t\"kind\": \"Eviction\",\n\t\t\"verbs\": [\"create\"]\n\t}, {\n\t\t\"name\": \"pods/exec\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"PodExecOptions\",\n\t\t\"verbs\": [\"create\", \"get\"]\n\t}, {\n\t\t\"name\": \"pods/log\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"Pod\",\n\t\t\"verbs\": [\"get\"]\n\t}, {\n\t\t\"name\": \"pods/portforward\",\n\t\t\"singularName\": 
\"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"PodPortForwardOptions\",\n\t\t\"verbs\": [\"create\", \"get\"]\n\t}, {\n\t\t\"name\": \"pods/proxy\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"PodProxyOptions\",\n\t\t\"verbs\": [\"create\", \"delete\", \"get\", \"patch\", \"update\"]\n\t}, {\n\t\t\"name\": \"pods/status\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"Pod\",\n\t\t\"verbs\": [\"get\", \"patch\", \"update\"]\n\t}, {\n\t\t\"name\": \"podtemplates\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"PodTemplate\",\n\t\t\"verbs\": [\"create\", \"delete\", \"deletecollection\", \"get\", \"list\", \"patch\", \"update\", \"watch\"],\n\t\t\"storageVersionHash\": \"LIXB2x4IFpk=\"\n\t}, {\n\t\t\"name\": \"replicationcontrollers\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"ReplicationController\",\n\t\t\"verbs\": [\"create\", \"delete\", \"deletecollection\", \"get\", \"list\", \"patch\", \"update\", \"watch\"],\n\t\t\"shortNames\": [\"rc\"],\n\t\t\"categories\": [\"all\"],\n\t\t\"storageVersionHash\": \"Jond2If31h0=\"\n\t}, {\n\t\t\"name\": \"replicationcontrollers/scale\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"group\": \"autoscaling\",\n\t\t\"version\": \"v1\",\n\t\t\"kind\": \"Scale\",\n\t\t\"verbs\": [\"get\", \"patch\", \"update\"]\n\t}, {\n\t\t\"name\": \"replicationcontrollers/status\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"ReplicationController\",\n\t\t\"verbs\": [\"get\", \"patch\", \"update\"]\n\t}, {\n\t\t\"name\": \"resourcequotas\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"ResourceQuota\",\n\t\t\"verbs\": [\"create\", \"delete\", \"deletecollection\", \"get\", \"list\", \"patch\", \"update\", \"watch\"],\n\t\t\"shortNames\": [\"quota\"],\n\t\t\"storageVersionHash\": \"8uhSgffRX6w=\"\n\t}, {\n\t\t\"name\": 
\"resourcequotas/status\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"ResourceQuota\",\n\t\t\"verbs\": [\"get\", \"patch\", \"update\"]\n\t}, {\n\t\t\"name\": \"secrets\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"Secret\",\n\t\t\"verbs\": [\"create\", \"delete\", \"deletecollection\", \"get\", \"list\", \"patch\", \"update\", \"watch\"],\n\t\t\"storageVersionHash\": \"S6u1pOWzb84=\"\n\t}, {\n\t\t\"name\": \"serviceaccounts\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"ServiceAccount\",\n\t\t\"verbs\": [\"create\", \"delete\", \"deletecollection\", \"get\", \"list\", \"patch\", \"update\", \"watch\"],\n\t\t\"shortNames\": [\"sa\"],\n\t\t\"storageVersionHash\": \"pbx9ZvyFpBE=\"\n\t}, {\n\t\t\"name\": \"services\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"Service\",\n\t\t\"verbs\": [\"create\", \"delete\", \"get\", \"list\", \"patch\", \"update\", \"watch\"],\n\t\t\"shortNames\": [\"svc\"],\n\t\t\"categories\": [\"all\"],\n\t\t\"storageVersionHash\": \"0/CO1lhkEBI=\"\n\t}, {\n\t\t\"name\": \"services/proxy\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"ServiceProxyOptions\",\n\t\t\"verbs\": [\"create\", \"delete\", \"get\", \"patch\", \"update\"]\n\t}, {\n\t\t\"name\": \"services/status\",\n\t\t\"singularName\": \"\",\n\t\t\"namespaced\": true,\n\t\t\"kind\": \"Service\",\n\t\t\"verbs\": [\"get\", \"patch\", \"update\"]\n\t}]\n}\n```\n\n##### 2.1.3 ToRESTMapper\n\nRESTMapper用于管理所有对象的信息。外部要获取的话，直接通过version，group获取到RESTMapper，然后通过kind类型可以获取到相对应的信息。\n\n```\n// ToRESTMapper returns a mapper.\nfunc (f *ConfigFlags) ToRESTMapper() (meta.RESTMapper, error) {\n   discoveryClient, err := f.ToDiscoveryClient()\n   if err != nil {\n      return nil, err\n   }\n\n   mapper := restmapper.NewDeferredDiscoveryRESTMapper(discoveryClient)\n   expander := restmapper.NewShortcutExpander(mapper, discoveryClient)\n   return expander, 
nil\n}\n\n// NewShortcutExpander returns a ShortcutExpander, a RESTMapper that can be used for Kubernetes resources.\n// It simply wraps userResources, the mapper and the discoveryClient into a ShortcutExpander struct.\nfunc NewShortcutExpander(delegate meta.RESTMapper, client discovery.DiscoveryInterface) ShortcutExpander {\n\treturn ShortcutExpander{All: userResources, RESTMapper: delegate, discoveryClient: client}\n}\n```\n\n<br>\n\nThe core of the **RESTMapper** is methods such as KindFor and ResourceFor, which convert between GVK and GVR. The benefit is that the apiVersion and kind in a yaml file are enough to tell which resource to create.\n\nThe GVK of each resource is already known when the resource is registered with the apiserver.\n\n```\n// RESTMapper allows clients to map resources to kind, and map kind and version\n// to interfaces for manipulating those objects. It is primarily intended for\n// consumers of Kubernetes compatible REST APIs as defined in docs/devel/api-conventions.md.\n//\n// The Kubernetes API provides versioned resources and object kinds which are scoped\n// to API groups. In other words, kinds and resources should not be assumed to be\n// unique across groups.\n//\n// TODO: split into sub-interfaces\ntype RESTMapper interface {\n\t// KindFor takes a partial resource and returns the single match.  Returns an error if there are multiple matches\n\tKindFor(resource schema.GroupVersionResource) (schema.GroupVersionKind, error)\n\n\t// KindsFor takes a partial resource and returns the list of potential kinds in priority order\n\tKindsFor(resource schema.GroupVersionResource) ([]schema.GroupVersionKind, error)\n\n\t// ResourceFor takes a partial resource and returns the single match.  
Returns an error if there are multiple matches\n\tResourceFor(input schema.GroupVersionResource) (schema.GroupVersionResource, error)\n\n\t// ResourcesFor takes a partial resource and returns the list of potential resource in priority order\n\tResourcesFor(input schema.GroupVersionResource) ([]schema.GroupVersionResource, error)\n\n\t// RESTMapping identifies a preferred resource mapping for the provided group kind.\n\tRESTMapping(gk schema.GroupKind, versions ...string) (*RESTMapping, error)\n\t// RESTMappings returns all resource mappings for the provided group kind if no\n\t// version search is provided. Otherwise identifies a preferred resource mapping for\n\t// the provided version(s).\n\tRESTMappings(gk schema.GroupKind, versions ...string) ([]*RESTMapping, error)\n\n\tResourceSingularizer(resource string) (singular string, err error)\n}\n\n// RESTMapping contains the information needed to deal with objects of a specific\n// resource and kind in a RESTful manner.\ntype RESTMapping struct {\n\t// Resource is the GroupVersionResource (location) for this endpoint\n\tResource schema.GroupVersionResource\n\n\t// GroupVersionKind is the GroupVersionKind (data format) to submit to this endpoint\n\tGroupVersionKind schema.GroupVersionKind\n\n\t// Scope contains the information needed to deal with REST Resources that are in a resource hierarchy\n\tScope RESTScope\n}\n```\n\n##### 2.1.4 How discovery and the RESTMapper are used when creating and deleting resources\n\n**Operating on K8S resources from Go code**\n\nThe function below takes the action given by operating and applies it to the object described by data, where data can be thought of as serialized yaml.\n\nThe core steps are:\n\n(1) Build a DiscoveryClient, which provides the group, version, kind and resource information\n\n(2) Build a RESTMapper from the DiscoveryClient\n\n(3) data carries the GVK; given the GVK, the RESTMapper produces a mapping object, which yields the GVR\n\n(4) The GVR gives the RESTful API path; combined with the unstructuredObj decoded from data, the create/delete request can be sent directly\n\n```\npackage kube\n\nimport (\n\t\"context\"\n\t\"k8s.io/apimachinery/pkg/api/meta\"\n\tmetav1 
\"k8s.io/apimachinery/pkg/apis/meta/v1\"\n\t\"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured\"\n\t\"k8s.io/apimachinery/pkg/runtime/serializer/yaml\"\n\t\"k8s.io/client-go/discovery\"\n\t\"k8s.io/client-go/discovery/cached/memory\"\n\t\"k8s.io/client-go/dynamic\"\n\t\"k8s.io/client-go/rest\"\n\t\"k8s.io/client-go/restmapper\"\n)\n\nfunc DynamicK8s(operating string, data []byte) error {\n\t// creates the in-cluster config\n\tconfig, err := rest.InClusterConfig()\n\tif err != nil {\n\t\treturn err\n\t}\n\n\t// 1. Prepare a RESTMapper to find GVR\n\tdc, err := discovery.NewDiscoveryClientForConfig(config)\n\tif err != nil {\n\t\treturn err\n\t}\n\tmapper := restmapper.NewDeferredDiscoveryRESTMapper(memory.NewMemCacheClient(dc))\n\t// 2. Prepare the dynamic client\n\tdynamicClient, err := dynamic.NewForConfig(config)\n\tif err != nil {\n\t\treturn err\n\t}\n\t// 3. Decode YAML manifest into unstructured.Unstructured\n\truntimeObject, gvk, err :=\n\t\tyaml.\n\t\t\tNewDecodingSerializer(unstructured.UnstructuredJSONScheme).\n\t\t\tDecode(data, nil, nil)\n\tif err != nil {\n\t\treturn err\n\t}\n\tunstructuredObj := runtimeObject.(*unstructured.Unstructured)\n\t// 4. Find GVR\n\tmapping, err := mapper.RESTMapping(gvk.GroupKind(), gvk.Version)\n\tif err != nil {\n\t\treturn err\n\t}\n\t// 5. 
Obtain REST interface for the GVR\n\tvar resourceREST dynamic.ResourceInterface\n\tif mapping.Scope.Name() == meta.RESTScopeNameNamespace {\n\t\t// namespaced resources should specify the namespace\n\t\tresourceREST = dynamicClient.Resource(mapping.Resource).Namespace(unstructuredObj.GetNamespace())\n\t} else {\n\t\t// for cluster-wide resources\n\t\tresourceREST = dynamicClient.Resource(mapping.Resource)\n\t}\n\tswitch operating {\n\tcase \"create\":\n\t\t_, err = resourceREST.Create(context.TODO(), unstructuredObj, metav1.CreateOptions{})\n\tcase \"delete\":\n\t\tdeletePolicy := metav1.DeletePropagationForeground\n\t\tdeleteOptions := metav1.DeleteOptions{\n\t\t\tPropagationPolicy: &deletePolicy,\n\t\t}\n\t\terr = resourceREST.Delete(context.TODO(), unstructuredObj.GetName(), deleteOptions)\n\t}\n\n\treturn err\n}\n```\n\n### 3.总结\n\n（1）了解到kubectl 加载kubectl的配置的优先级\n\n（2）了解了kubectl操作 yaml中对象的大致原理\n\n"
  },
  {
    "path": "k8s/kubectl/4-kubectl Factor机制-下.md",
    "content": "Table of Contents\r\n=================\r\n\r\n  * [1. 背景](#1-背景)\r\n     * [1.1 MatchVersionFlags](#11-matchversionflags)\r\n  * [2. NewFactory](#2-newfactory)\r\n     * [2.1 Factory接口](#21-factory接口)\r\n     * [2.2 factoryImpl实现的函数](#22-factoryimpl实现的函数)\r\n     * [2.3 NewBuilder](#23-newbuilder)\r\n        * [2.3.1 builder的常见用法](#231-builder的常见用法)\r\n        * [2.3.2 builder 功能说明](#232-builder-功能说明)\r\n        * [2.3.3 visitorResult](#233-visitorresult)\r\n  * [3 总结](#3-总结)\r\n\r\n### 1. 背景\r\n\r\n在上文了解到了kubeconfigFlags(ConfigFlags)的作用，接下来往下继续分析。\r\n\r\n```\r\n\t// 1. configFlags\r\n\tkubeConfigFlags := genericclioptions.NewConfigFlags(true).WithDeprecatedPasswordFlag()\r\n\t\r\n\t// 2.生成matchVersionKubeConfigFlags对象。\r\n\t// --match-server-version=false: Require server version to match client version\r\n\tmatchVersionKubeConfigFlags := cmdutil.NewMatchVersionFlags(kubeConfigFlags)\r\n\tmatchVersionKubeConfigFlags.AddFlags(cmds.PersistentFlags())\r\n    \r\n    // 3.persistent意思是说这个flag能任何命令下均可使用，适合全局flag：\r\n\tcmds.PersistentFlags().AddGoFlagSet(flag.CommandLine)\r\n    \r\n    // 4.生成Factory\r\n\tf := cmdutil.NewFactory(matchVersionKubeConfigFlags)\r\n```\r\n\r\n#### 1.1 MatchVersionFlags\r\n\r\n可以看出来MatchVersionFlags和configflags的区别就是多了一个checkMatchingServerVersion函数。\r\n\r\nToDiscoveryClient还是复用configflags的。\r\n\r\n在kubectl中有一个option就是 --match-server-version=false: Require server version to match client version\r\n\r\n可以要求kubectl 和 apiserver的版本一致。\r\n\r\n```\r\nfunc (f *MatchVersionFlags) checkMatchingServerVersion() error {\r\n\tf.checkServerVersion.Do(func() {\r\n\t\tif !f.RequireMatchedServerVersion {\r\n\t\t\treturn\r\n\t\t}\r\n\t\tdiscoveryClient, err := f.Delegate.ToDiscoveryClient()\r\n\t\tif err != nil {\r\n\t\t\tf.matchesServerVersionErr = err\r\n\t\t\treturn\r\n\t\t}\r\n\t\tf.matchesServerVersionErr = discovery.MatchesServerVersion(version.Get(), discoveryClient)\r\n\t})\r\n\r\n\treturn f.matchesServerVersionErr\r\n}\r\n\r\n// 
ToRESTConfig implements RESTClientGetter.\r\n// Returns a REST client configuration based on a provided path\r\n// to a .kubeconfig file, loading rules, and config flag overrides.\r\n// Expects the AddFlags method to have been called.\r\nfunc (f *MatchVersionFlags) ToRESTConfig() (*rest.Config, error) {\r\n\tif err := f.checkMatchingServerVersion(); err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\tclientConfig, err := f.Delegate.ToRESTConfig()\r\n\tif err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\t// TODO we should not have to do this.  It smacks of something going wrong.\r\n\tsetKubernetesDefaults(clientConfig)\r\n\treturn clientConfig, nil\r\n}\r\n\r\nfunc (f *MatchVersionFlags) ToRawKubeConfigLoader() clientcmd.ClientConfig {\r\n\treturn f.Delegate.ToRawKubeConfigLoader()\r\n}\r\n\r\nfunc (f *MatchVersionFlags) ToDiscoveryClient() (discovery.CachedDiscoveryInterface, error) {\r\n\tif err := f.checkMatchingServerVersion(); err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\treturn f.Delegate.ToDiscoveryClient()\r\n}\r\n\r\n// ToRESTMapper returns a mapper.\r\nfunc (f *MatchVersionFlags) ToRESTMapper() (meta.RESTMapper, error) {\r\n\tif err := f.checkMatchingServerVersion(); err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\treturn f.Delegate.ToRESTMapper()\r\n}\r\n```\r\n\r\n### 2. 
NewFactory\r\n\r\nNewFactory其实就是在上面configflags的基础上做的进一步封装。factoryImpl实现了Factory的接口，所以是Factory类型。\r\n\r\n```\r\nf := cmdutil.NewFactory(matchVersionKubeConfigFlags)\r\n\r\n\r\nfunc NewFactory(clientGetter genericclioptions.RESTClientGetter) Factory {\r\n\tif clientGetter == nil {\r\n\t\tpanic(\"attempt to instantiate client_access_factory with nil clientGetter\")\r\n\t}\r\n\r\n\tf := &factoryImpl{\r\n\t\tclientGetter: clientGetter,\r\n\t}\r\n\r\n\treturn f\r\n}\r\n```\r\n\r\n#### 2.1 Factory接口\r\n\r\n```\r\ntype Factory interface {\r\n\tgenericclioptions.RESTClientGetter   \r\n\r\n\t// DynamicClient returns a dynamic client ready for use\r\n\tDynamicClient() (dynamic.Interface, error)\r\n\r\n\t// KubernetesClientSet gives you back an external clientset\r\n\tKubernetesClientSet() (*kubernetes.Clientset, error)\r\n\r\n\t// Returns a RESTClient for accessing Kubernetes resources or an error.\r\n\tRESTClient() (*restclient.RESTClient, error)\r\n\r\n\t// NewBuilder returns an object that assists in loading objects from both disk and the server\r\n\t// and which implements the common patterns for CLI interactions with generic resources.\r\n\tNewBuilder() *resource.Builder\r\n\r\n\t// Returns a RESTClient for working with the specified RESTMapping or an error. 
This is intended\r\n\t// for working with arbitrary resources and is not guaranteed to point to a Kubernetes APIServer.\r\n\tClientForMapping(mapping *meta.RESTMapping) (resource.RESTClient, error)\r\n\t// Returns a RESTClient for working with Unstructured objects.\r\n\tUnstructuredClientForMapping(mapping *meta.RESTMapping) (resource.RESTClient, error)\r\n\r\n\t// Returns a schema that can validate objects stored on disk.\r\n\tValidator(validate bool) (validation.Schema, error)\r\n\t// OpenAPISchema returns the schema openapi schema definition\r\n\tOpenAPISchema() (openapi.Resources, error)\r\n}\r\n```\r\n\r\n<br>\r\n\r\n#### 2.2 factoryImpl实现的函数\r\n\r\n之前介绍了, client-go中有四种类型的客户端\r\n\r\n- RESTClient： 是对HTTP Request进行了封装，实现了RESTful风格的API。其他客户端都是在RESTClient基础上的实现。可用于k8s内置资源和CRD资源\r\n- ClientSet:是对k8s内置资源对象的客户端的集合，默认情况下，不能操作CRD资源，但是通过client-gen代码生成的话，也是可以操作CRD资源的。\r\n- DynamicClient:不仅能对K8S内置资源进行处理，还可以对CRD资源进行处理，不需要client-gen生成代码即可实现。DynamicClient内部实现了Unstructured，用于处理非结构化数据结构（即无法提前预知数据结构），这也是DynamicClient能够处理CRD自定义资源的关键。\r\n- DiscoveryClient：用于发现kube-apiserver所支持的资源组、资源版本、资源信息（即Group、Version、Resources）。\r\n\r\n<br>\r\n\r\n可以看出来 factoryImpl 除了继承了configflags的函数：ToRESTConfig， ToRESTMapper， ToDiscoveryClient，ToRawKubeConfigLoader外。还有如下的函数：\r\n\r\n KubernetesClientSet() ：生成了KubernetesClientSet，其实也是一个 Clientset客户端。第二种类型\r\n\r\nDynamicClient()：生成了DynamicClient，第三种类型\r\n\r\nRESTClient()：生成了restclient。 第一种类型\r\n\r\nClientForMapping():  针对结构化对象，根据mapping的gvk，生成 rest路径\r\n\r\nUnstructuredClientForMapping():  根据mapping的gvk，生成 rest路径, 和上面不同的就是，上面的是结构化的，这个是非结构化的。对应DynamicClient\r\n\r\nOpenAPISchema():  通过discoveryClient获取k8s对象信息\r\n\r\nValidator():  如果指定validate, 根据OpenAPISchema获得的信息，生成一个validation.schema用于进行验证\r\n\r\nNewBuilder(): 生成一个Builder, 这里对builder还有点陌生，下一节看看Builder到底是什么。\r\n\r\n```\r\n// 1.复用了之前configflags的函数\r\nfunc (f *factoryImpl) ToRESTConfig() (*restclient.Config, error) {\r\n\treturn f.clientGetter.ToRESTConfig()\r\n}\r\n\r\nfunc (f *factoryImpl) ToRESTMapper() 
(meta.RESTMapper, error) {\r\n\treturn f.clientGetter.ToRESTMapper()\r\n}\r\n\r\nfunc (f *factoryImpl) ToDiscoveryClient() (discovery.CachedDiscoveryInterface, error) {\r\n\treturn f.clientGetter.ToDiscoveryClient()\r\n}\r\n\r\nfunc (f *factoryImpl) ToRawKubeConfigLoader() clientcmd.ClientConfig {\r\n\treturn f.clientGetter.ToRawKubeConfigLoader()\r\n}\r\n\r\n// 2.生成了KubernetesClientSet，其实也是一个 Clientset客户端。第二种类型\r\nfunc (f *factoryImpl) KubernetesClientSet() (*kubernetes.Clientset, error) {\r\n\tclientConfig, err := f.ToRESTConfig()\r\n\tif err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\treturn kubernetes.NewForConfig(clientConfig)\r\n}\r\n\r\n//  3.生成了DynamicClient，第三种类型\r\nfunc (f *factoryImpl) DynamicClient() (dynamic.Interface, error) {\r\n\tclientConfig, err := f.ToRESTConfig()\r\n\tif err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\treturn dynamic.NewForConfig(clientConfig)\r\n}\r\n\r\n// 4.生成了一个builder\r\n// NewBuilder returns a new resource builder for structured api objects.\r\nfunc (f *factoryImpl) NewBuilder() *resource.Builder {\r\n\treturn resource.NewBuilder(f.clientGetter)\r\n}\r\n\r\n// 5.生成了restclient。 第一种类型\r\nfunc (f *factoryImpl) RESTClient() (*restclient.RESTClient, error) {\r\n\tclientConfig, err := f.ToRESTConfig()\r\n\tif err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\tsetKubernetesDefaults(clientConfig)\r\n\treturn restclient.RESTClientFor(clientConfig)\r\n}\r\n\r\n// 6.根据mapping的gvk，生成 rest路径\r\nfunc (f *factoryImpl) ClientForMapping(mapping *meta.RESTMapping) (resource.RESTClient, error) {\r\n\tcfg, err := f.clientGetter.ToRESTConfig()\r\n\tif err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\tif err := setKubernetesDefaults(cfg); err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\tgvk := mapping.GroupVersionKind\r\n\tswitch gvk.Group {\r\n\tcase corev1.GroupName:\r\n\t\tcfg.APIPath = \"/api\"\r\n\tdefault:\r\n\t\tcfg.APIPath = \"/apis\"\r\n\t}\r\n\tgv := gvk.GroupVersion()\r\n\tcfg.GroupVersion = &gv\r\n\treturn 
restclient.RESTClientFor(cfg)\r\n}\r\n\r\n// 7. 根据mapping的gvk，生成 rest路径, 和上面不同的就是，上面的是结构化的，这个是非结构化的。对应DynamicClient\r\nfunc (f *factoryImpl) UnstructuredClientForMapping(mapping *meta.RESTMapping) (resource.RESTClient, error) {\r\n\tcfg, err := f.clientGetter.ToRESTConfig()\r\n\tif err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\tif err := restclient.SetKubernetesDefaults(cfg); err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\tcfg.APIPath = \"/apis\"\r\n\tif mapping.GroupVersionKind.Group == corev1.GroupName {\r\n\t\tcfg.APIPath = \"/api\"\r\n\t}\r\n\tgv := mapping.GroupVersionKind.GroupVersion()\r\n\tcfg.ContentConfig = resource.UnstructuredPlusDefaultContentConfig()\r\n\tcfg.GroupVersion = &gv\r\n\treturn restclient.RESTClientFor(cfg)\r\n}\r\n\r\n// 8.如果指定validate, 根据下面函数获得的信息，生成一个validation.schema进行验证\r\nfunc (f *factoryImpl) Validator(validate bool) (validation.Schema, error) {\r\n\tif !validate {\r\n\t\treturn validation.NullSchema{}, nil\r\n\t}\r\n\r\n\tresources, err := f.OpenAPISchema()\r\n\tif err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\r\n\treturn validation.ConjunctiveSchema{\r\n\t\topenapivalidation.NewSchemaValidation(resources),\r\n\t\tvalidation.NoDoubleKeySchema{},\r\n\t}, nil\r\n}\r\n\r\n// 9. 
通过discoveryClient获取k8s对象信息\r\n// OpenAPISchema returns metadata and structural information about Kubernetes object definitions.\r\nfunc (f *factoryImpl) OpenAPISchema() (openapi.Resources, error) {\r\n\tdiscovery, err := f.clientGetter.ToDiscoveryClient()\r\n\tif err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\r\n\t// Lazily initialize the OpenAPIGetter once\r\n\tf.openAPIGetter.once.Do(func() {\r\n\t\t// Create the caching OpenAPIGetter\r\n\t\tf.openAPIGetter.getter = openapi.NewOpenAPIGetter(discovery)\r\n\t})\r\n\r\n\t// Delegate to the OpenAPIGetter\r\n\treturn f.openAPIGetter.getter.Get()\r\n}\r\n```\r\n\r\n#### 2.3 NewBuilder\r\n\r\n```\r\n// NewBuilder returns a new resource builder for structured api objects.\r\nfunc (f *factoryImpl) NewBuilder() *resource.Builder {\r\n\treturn resource.NewBuilder(f.clientGetter)\r\n}\r\n```\r\n\r\n##### 2.3.1 builder的常见用法\r\n\r\n以kubectl create命令为例， Builder大多方法支持链式调用，最后的Do()返回一个type Result struct。这里一系列链式调用大部分都是在根据传入的Cmd来设置新建Builder的属性值。\r\n\r\n这和REST HTTP请求的写法也一样：各个函数通过.(点)依次链式调用，最后调用一个Do()发送请求。\r\n\r\n```\r\nr := f.NewBuilder().\r\n\t\tUnstructured().\r\n\t\tSchema(schema).\r\n\t\tContinueOnError().\r\n\t\tNamespaceParam(cmdNamespace).DefaultNamespace().\r\n\t\tFilenameParam(enforceNamespace, &o.FilenameOptions).\r\n\t\tLabelSelectorParam(o.Selector).\r\n\t\tFlatten().\r\n\t\tDo()\r\n\t\t\r\n\r\nVisitor 的构建。\r\nr 的构建跟builder属性相关，依次过一下处理Builder实例的函数流\r\n\r\nUnstructured() : 对b.mapper 赋值，b.mapper = unstructured\r\nSchema()： 对b.schema 赋值，b.schema = schema\r\nContinueOnError(): b.continueOnError 置为true。 意为遇到有错误的资源也不马上返回，跳过错误，继续处理下一个资源\r\nNamespaceParam(cmdNamespace)：b.namespace = cmdNamespace\r\nDefaultNamespace(): b.defaultNamespace 置为true\r\nFilenameParam(enforceNamespace, &options.FilenameOptions) : 后边详细说\r\nLabelSelectorParam(options.Selector)：对 b.labelSelector进行赋值\r\nFlatten(): b.flatten 置为true,\r\nDo()：后边详细说\r\n```\r\n\r\n##### 2.3.2 builder 功能说明\r\n\r\nBuilder是Kubectl命令行信息的内部载体，可以通过Builder生成Result对象。Builder 
结构体保存了从命令行获取的各种参数，以及它实现了各种函数用于处理这些参数 并将其转换为一系列的resources，最终用Visitor 方法迭代处理resource。可以看出来builder的成员变量非常多。\r\n\r\n**成员变量**\r\n\r\n```\r\n// Builder provides convenience functions for taking arguments and parameters\r\n// from the command line and converting them to a list of resources to iterate\r\n// over using the Visitor interface.\r\ntype Builder struct {\r\n\tcategoryExpanderFn CategoryExpanderFunc\r\n\r\n\t// mapper is set explicitly by resource builders\r\n\tmapper *mapper\r\n\r\n\t// clientConfigFn is a function to produce a client, *if* you need one\r\n\tclientConfigFn ClientConfigFunc\r\n\r\n\trestMapperFn RESTMapperFunc\r\n\r\n\t// objectTyper is statically determinant per-command invocation based on your internal or unstructured choice\r\n\t// it does not ever need to rely upon discovery.\r\n\tobjectTyper runtime.ObjectTyper\r\n\r\n\t// codecFactory describes which codecs you want to use\r\n\tnegotiatedSerializer runtime.NegotiatedSerializer\r\n\r\n\t// local indicates that we cannot make server calls\r\n\tlocal bool\r\n\r\n\terrs []error\r\n\r\n\tpaths  []Visitor\r\n\tstream bool\r\n\tdir    bool\r\n\r\n\tlabelSelector     *string\r\n\tfieldSelector     *string\r\n\tselectAll         bool\r\n\tlimitChunks       int64\r\n\trequestTransforms []RequestTransform\r\n\r\n\tresources []string\r\n\r\n\tnamespace    string\r\n\tallNamespace bool\r\n\tnames        []string\r\n\r\n\tresourceTuples []resourceTuple\r\n\r\n\tdefaultNamespace bool\r\n\trequireNamespace bool\r\n\r\n\tflatten bool\r\n\tlatest  bool\r\n\r\n\trequireObject bool\r\n\r\n\tsingleResourceType bool\r\n\tcontinueOnError    bool\r\n\r\n\tsingleItemImplied bool\r\n\r\n\texport bool\r\n\r\n\tschema ContentValidator\r\n\r\n\t// fakeClientFn is used for testing\r\n\tfakeClientFn FakeClientFunc\r\n}\r\n```\r\n\r\n**函数说明：**\r\n\r\nbuilder中大部分成员函数都是对builder进行赋值操作。这里介绍一些重要的函数。\r\n\r\n```\r\nk8s.io/cli-runtime/pkg/resource/builder.go\r\n\r\n// NamespaceParam accepts the namespace that these resources 
should be\r\n// considered under from - used by DefaultNamespace() and RequireNamespace()\r\n/*\r\n\tfunc (b *Builder) NamespaceParam 设置b *Builder的namespace属性，\r\n\t会被 DefaultNamespace() and RequireNamespace()使用\r\n*/\r\nfunc (b *Builder) NamespaceParam(namespace string) *Builder {\r\n\tb.namespace = namespace\r\n\treturn b\r\n}\r\n\r\n// DefaultNamespace instructs the builder to set the namespace value for any object found\r\n// to NamespaceParam() if empty.\r\n/*\r\n\t让builder在namespace为空的时候，找到namespace的值\r\n*/\r\nfunc (b *Builder) DefaultNamespace() *Builder {\r\n\tb.defaultNamespace = true\r\n\treturn b\r\n}\r\n\r\n// AllNamespaces instructs the builder to use NamespaceAll as a namespace to request resources\r\n// across all of the namespace. This overrides the namespace set by NamespaceParam().\r\n/*\r\n\tfunc AllNamespaces 让builder使用NamespaceAll作为cmd的namespace，向所有的namespace请求resources。\r\n\t将重写由func (b *Builder) NamespaceParam(namespace string)设置的属性namespace\r\n*/\r\nfunc (b *Builder) AllNamespaces(allNamespace bool) *Builder {\r\n\t/*\r\n\t\t如果入参allNamespace bool＝true,那么重写b *Builder的namespace和allNamespace属性\r\n\t\t\tapi.NamespaceAll定义在 pkg/api/v1/types.go\r\n\t\t\t\t==>NamespaceAll string = \"\"\r\n\t*/\r\n\tif allNamespace {\r\n\t\tb.namespace = api.NamespaceAll\r\n\t}\r\n\tb.allNamespace = allNamespace\r\n\treturn b\r\n}\r\n\r\n\r\n// kubectl create就用到了这个，赋值了mapper和其他的函数，比如decoder\r\n// Unstructured updates the builder so that it will request and send unstructured\r\n// objects. Unstructured objects preserve all fields sent by the server in a map format\r\n// based on the object's JSON structure which means no data is lost when the client\r\n// reads and then writes an object. 
Use this mode in preference to Internal unless you\r\n// are working with Go types directly.\r\nfunc (b *Builder) Unstructured() *Builder {\r\n\tif b.mapper != nil {\r\n\t\tb.errs = append(b.errs, fmt.Errorf(\"another mapper was already selected, cannot use unstructured types\"))\r\n\t\treturn b\r\n\t}\r\n\tb.objectTyper = unstructuredscheme.NewUnstructuredObjectTyper()\r\n\tb.mapper = &mapper{\r\n\t\tlocalFn:      b.isLocal,\r\n\t\trestMapperFn: b.restMapperFn,\r\n\t\tclientFn:     b.getClient,\r\n\t\tdecoder:      &metadataValidatingDecoder{unstructured.UnstructuredJSONScheme},\r\n\t}\r\n\r\n\treturn b\r\n}\r\n\r\n\r\n// FilenameParam groups input in two categories: URLs and files (files, directories, STDIN)\r\n// If enforceNamespace is false, namespaces in the specs will be allowed to\r\n// override the default namespace. If it is true, namespaces that don't match\r\n// will cause an error.\r\n// If ContinueOnError() is set prior to this method, objects on the path that are not\r\n// recognized will be ignored (but logged at V(2)).\r\n\r\n译：func (b *Builder) FilenameParam以URLs and files (files, directories, STDIN)两种形式来传入参数。\r\n\t\t如果enforceNamespace＝false，specs中声明的namespaces将允许被重写为default namespace。\r\n\t\t如果enforceNamespace＝true，不匹配的namespaces将导致error。\r\n\t\t如果在此方法之前设置了ContinueOnError()，则路径上无法识别的objects将被忽略（记录在 V(2)级别的log中）。\r\n\r\nfunc (b *Builder) FilenameParam(enforceNamespace bool, filenameOptions *FilenameOptions) *Builder {\r\n\tif errs := filenameOptions.validate(); len(errs) > 0 {\r\n\t\tb.errs = append(b.errs, errs...)\r\n\t\treturn b\r\n\t}\r\n\trecursive := filenameOptions.Recursive\r\n\tpaths := filenameOptions.Filenames\r\n\tfor _, s := range paths {\r\n\t\tswitch {\r\n\t\tcase s == \"-\":\r\n\t\t\tb.Stdin()\r\n\t\tcase strings.Index(s, \"http://\") == 0 || strings.Index(s, \"https://\") == 0:\r\n\t\t\turl, err := url.Parse(s)\r\n\t\t\tif err != nil {\r\n\t\t\t\tb.errs = append(b.errs, fmt.Errorf(\"the URL passed to filename %q is not valid: %v\", 
s, err))\r\n\t\t\t\tcontinue\r\n\t\t\t}\r\n\t\t\tb.URL(defaultHttpGetAttempts, url)\r\n\t\tdefault:\r\n\t\t\tif !recursive {\r\n\t\t\t\tb.singleItemImplied = true\r\n\t\t\t}\r\n\t\t\tb.Path(recursive, s)\r\n\t\t}\r\n\t}\r\n\tif filenameOptions.Kustomize != \"\" {\r\n\t\tb.paths = append(b.paths, &KustomizeVisitor{filenameOptions.Kustomize,\r\n\t\t\tNewStreamVisitor(nil, b.mapper, filenameOptions.Kustomize, b.schema)})\r\n\t}\r\n\r\n\tif enforceNamespace {\r\n\t\tb.RequireNamespace()\r\n\t}\r\n\r\n\treturn b\r\n}\r\n\r\n\r\nlabelselector 和 Fieldselector\r\n// LabelSelectorParam defines a selector that should be applied to the object types to load.\r\n// This will not affect files loaded from disk or URL. If the parameter is empty it is\r\n// a no-op - to select all resources invoke `b.LabelSelector(labels.Everything.String)`.\r\nfunc (b *Builder) LabelSelectorParam(s string) *Builder {\r\n\tselector := strings.TrimSpace(s)\r\n\tif len(selector) == 0 {\r\n\t\treturn b\r\n\t}\r\n\tif b.selectAll {\r\n\t\tb.errs = append(b.errs, fmt.Errorf(\"found non-empty label selector %q with previously set 'all' parameter. \", s))\r\n\t\treturn b\r\n\t}\r\n\treturn b.LabelSelector(selector)\r\n}\r\n\r\n// LabelSelector accepts a selector directly and will filter the resulting list by that object.\r\n// Use LabelSelectorParam instead for user input.\r\nfunc (b *Builder) LabelSelector(selector string) *Builder {\r\n\tif len(selector) == 0 {\r\n\t\treturn b\r\n\t}\r\n\r\n\tb.labelSelector = &selector\r\n\treturn b\r\n}\r\n\r\n// FieldSelectorParam defines a selector that should be applied to the object types to load.\r\n// This will not affect files loaded from disk or URL. 
If the parameter is empty it is\r\n// a no-op - to select all resources.\r\nfunc (b *Builder) FieldSelectorParam(s string) *Builder {\r\n\ts = strings.TrimSpace(s)\r\n\tif len(s) == 0 {\r\n\t\treturn b\r\n\t}\r\n\tif b.selectAll {\r\n\t\tb.errs = append(b.errs, fmt.Errorf(\"found non-empty field selector %q with previously set 'all' parameter. \", s))\r\n\t\treturn b\r\n\t}\r\n\tb.fieldSelector = &s\r\n\treturn b\r\n}\r\n\r\n\r\n// ContinueOnError will attempt to load and visit as many objects as possible, even if some visits\r\n// return errors or some objects cannot be loaded. The default behavior is to terminate after\r\n// the first error is returned from a VisitorFunc.\r\n/*\r\n\tContinueOnError将尝试加载并访问尽可能多的对象，即使某些访问返回错误或某些对象无法加载。\r\n\t默认行为是在 ‘VisitorFunc返回第一个错误之后’ 终止。\r\n*/\r\nfunc (b *Builder) ContinueOnError() *Builder {\r\n\tb.continueOnError = true\r\n\treturn b\r\n}\r\n\r\n// Latest will fetch the latest copy of any objects loaded from URLs or files from the server.\r\n/*\r\n\t译：func (b *Builder) Latest() 将从server端获取该URL或文件加载objects的最新副本。\r\n*/\r\nfunc (b *Builder) Latest() *Builder {\r\n\tb.latest = true\r\n\treturn b\r\n}\r\n\r\n// Flatten will convert any objects with a field named \"Items\" that is an array of runtime.Object\r\n// compatible types into individual entries and give them their own items. The original object\r\n// is not passed to any visitors.\r\n/*\r\n\t译：Flatten将使用一个名为“Items”的字段将任何对象转换为一个runtime.Object兼容类型的数组，并将它们分配给各自的items。\r\n\t\t 原始对象不会传递给任何访问者。\r\n*/\r\nfunc (b *Builder) Flatten() *Builder {\r\n\tb.flatten = true\r\n\treturn b\r\n}\r\n```\r\n\r\n<br>\r\n\r\ndo基本都是builder这一套中最后使用的一个函数。 \r\n\r\n```\r\nDo() 返回一个type Result struct，该type Result struct中含有一个visitor Visitor，visitor 能访问在Builder中定义的resources。The visitor将遵守由ContinueOnError指定的错误行为。\r\n\r\n// Do returns a Result object with a Visitor for the resources identified by the Builder.\r\n// The visitor will respect the error behavior specified by ContinueOnError. 
Note that stream\r\n// inputs are consumed by the first execution - use Infos() or Object() on the Result to capture a list\r\n// for further iteration.\r\nfunc (b *Builder) Do() *Result {\r\n    r := b.visitorResult()  // 第一次生成 result 实例，初始化r.visitor的值\r\n    r.mapper = b.Mapper()\r\n    if r.err != nil {\r\n        return r    // 第一处return，出错时，从此处返回\r\n    }\r\n    if b.flatten {  // 默认为true, 在b.Flatten() 中赋值\r\n        r.visitor = NewFlattenListVisitor(r.visitor, b.mapper) //第一次修改r.visitor \r\n    }\r\n    helpers := []VisitorFunc{}\r\n    if b.defaultNamespace {// 默认为true, 在b.DefaultNamespace() 赋值\r\n        helpers = append(helpers, SetNamespace(b.namespace))\r\n    }\r\n    if b.requireNamespace { // 在FilenameParam() 赋值为true\r\n        /* b.namespace 即RunCreate() 中的cmdNamespace，cmdNamespace 的值为 kubectl 命令中-- \r\n          namespace 参数的值*/\r\n        helpers = append(helpers, RequireNamespace(b.namespace))\r\n    }\r\n    helpers = append(helpers, FilterNamespace)\r\n    if b.requireObject {  // 默认为true，在builder 结构体初始化中赋值\r\n        helpers = append(helpers, RetrieveLazy)\r\n    }\r\n    r.visitor = NewDecoratedVisitor(r.visitor, helpers...)  
//  第二次修改r.visitor\r\n    if b.continueOnError {\r\n        r.visitor = ContinueOnErrorVisitor{r.visitor}\r\n    }\r\n    return r  //第二处return，一般从此处返回。\r\n}\r\n```\r\n\r\n<br>\r\n\r\nResult结构如下：\r\n\r\n```\r\n// Result contains helper methods for dealing with the outcome of a Builder.\r\ntype Result struct {\r\n\terr     error\r\n\tvisitor Visitor\r\n\r\n\tsources            []Visitor\r\n\tsingleItemImplied  bool\r\n\ttargetsSingleItems bool\r\n\r\n\tmapper       *mapper\r\n\tignoreErrors []utilerrors.Matcher\r\n\r\n\t// populated by a call to Infos\r\n\tinfo []*Info\r\n} \r\n```\r\n\r\nInfo结构如下：\r\n\r\n```\r\n// Info contains temporary info to execute a REST call, or show the results\r\n// of an already completed REST call.\r\ntype Info struct {\r\n\t// Client will only be present if this builder was not local\r\n\tClient RESTClient\r\n\t// Mapping will only be present if this builder was not local\r\n\tMapping *meta.RESTMapping\r\n\r\n\t// Namespace will be set if the object is namespaced and has a specified value.\r\n\tNamespace string\r\n\tName      string\r\n\r\n\t// Optional, Source is the filename or URL to template file (.json or .yaml),\r\n\t// or stdin to use to handle the resource\r\n\tSource string\r\n\t\r\n\t\r\n\t// 这个就是server端返回的对象\r\n\t// Optional, this is the most recent value returned by the server if available. It will\r\n\t// typically be in unstructured or internal forms, depending on how the Builder was\r\n\t// defined. If retrieved from the server, the Builder expects the mapping client to\r\n\t// decide the final form. 
Use the AsVersioned, AsUnstructured, and AsInternal helpers\r\n\t// to alter the object versions.\r\n\tObject runtime.Object\r\n\t\r\n\t/* 译：可选，这是server端知道的此类resource的最新resource version。\r\n\t\t\t它可能与该object的resource version 不匹配，\r\n\t\t\t但如果设置它应该等于或新于对象的资源版本（但服务器定义资源版本）。\r\n\r\n\t\t简单来说，ResourceVersion的值是etcd中全局最新的Index\r\n\t*/\r\n\t// Optional, this is the most recent resource version the server knows about for\r\n\t// this type of resource. It may not match the resource version of the object,\r\n\t// but if set it should be equal to or newer than the resource version of the\r\n\t// object (however the server defines resource version).\r\n\tResourceVersion string\r\n\t// Optional, should this resource be exported, stripped of cluster-specific and instance specific fields\r\n\tExport bool\r\n}\r\n```\r\n\r\n##### 2.3.3 visitorResult\r\n\r\n再看看Do调用的visitorResult函数，可以看出其返回值是一个type Result struct指针。 根据前面设置的参数值（或者说是命令行cmd的参数）来选择相应的Visitor\r\n\r\n```\r\nfunc (b *Builder) visitorResult() *Result {\r\n    // 第一处返回，错误返回，b.errs 为一列表结构，列表长度大于0，说明之前有错误发生，在此返回。\r\n\tif len(b.errs) > 0 {\r\n\t\treturn &Result{err: utilerrors.NewAggregate(b.errs)}\r\n\t}\r\n    \r\n\tif b.selectAll {\r\n\t\tselector := labels.Everything().String()\r\n\t\tb.labelSelector = &selector\r\n\t}\r\n    \r\n    // create 命令进入此分支，例如 kubectl create -f xxx.yaml(或者URL/xxx.yaml)\r\n\t// visit items specified by paths\r\n\tif len(b.paths) != 0 {\r\n\t\treturn b.visitByPaths()\r\n\t}\r\n    \r\n    // \r\n\t// visit selectors\r\n\tif b.labelSelector != nil || b.fieldSelector != nil {\r\n\t\treturn b.visitBySelector()\r\n\t}\r\n    \r\n    // get 某一个指定资源对象时，进入此分支\r\n\t// visit items specified by resource and name\r\n\tif len(b.resourceTuples) != 0 {\r\n\t\treturn b.visitByResource()\r\n\t}\r\n\r\n\t// visit items specified by name\r\n\tif len(b.names) != 0 {\r\n\t\treturn b.visitByName()\r\n\t}\r\n\r\n\tif len(b.resources) != 0 {\r\n\t\tfor _, r := range b.resources {\r\n\t\t\t_, err := 
b.mappingFor(r)\r\n\t\t\tif err != nil {\r\n\t\t\t\treturn &Result{err: err}\r\n\t\t\t}\r\n\t\t}\r\n\t\treturn &Result{err: fmt.Errorf(\"resource(s) were provided, but no name, label selector, or --all flag specified\")}\r\n\t}\r\n\treturn &Result{err: missingResourceError}\r\n}\r\n```\r\n\r\n综上，可以看出Do()函数主要的根据Builder设置的属性值来获取一个type Result struct。 type Result struct中最重要的数据结构是visitor Visitor和info []*Info。 然后create函数中可以看到调用createAndRefresh。\r\n\r\n最终info就是从apiserver接受到的对象。\r\n\r\n```\r\n// createAndRefresh creates an object from input info and refreshes info with that object\r\nfunc createAndRefresh(info *resource.Info) error {\r\n\tobj, err := resource.NewHelper(info.Client, info.Mapping).Create(info.Namespace, true, info.Object, nil)\r\n\tif err != nil {\r\n\t\treturn err\r\n\t}\r\n\tinfo.Refresh(obj, true)\r\n\treturn nil\r\n}\r\n\r\nfunc (m *Helper) Create(namespace string, modify bool, obj runtime.Object, options *metav1.CreateOptions) (runtime.Object, error) {\r\n\tif options == nil {\r\n\t\toptions = &metav1.CreateOptions{}\r\n\t}\r\n\tif modify {\r\n\t\t// Attempt to version the object based on client logic.\r\n\t\tversion, err := metadataAccessor.ResourceVersion(obj)\r\n\t\tif err != nil {\r\n\t\t\t// We don't know how to clear the version on this object, so send it to the server as is\r\n\t\t\treturn m.createResource(m.RESTClient, m.Resource, namespace, obj, options)\r\n\t\t}\r\n\t\tif version != \"\" {\r\n\t\t\tif err := metadataAccessor.SetResourceVersion(obj, \"\"); err != nil {\r\n\t\t\t\treturn nil, err\r\n\t\t\t}\r\n\t\t}\r\n\t}\r\n\r\n\treturn m.createResource(m.RESTClient, m.Resource, namespace, obj, options)\r\n}\r\n\r\nfunc (m *Helper) createResource(c RESTClient, resource, namespace string, obj runtime.Object, options *metav1.CreateOptions) (runtime.Object, error) {\r\n\treturn c.Post().\r\n\t\tNamespaceIfScoped(namespace, m.NamespaceScoped).\r\n\t\tResource(resource).\r\n\t\tVersionedParams(options, 
metav1.ParameterCodec).\r\n\tBody(obj).\r\n\tDo().\r\n\tGet()\r\n}\r\n```\r\n\r\n### 3 总结\r\n\r\n（1）factory其实就是在之前的configflags的基础上做了一些扩展\r\n\r\n（2）factory最重要的两个成员对象就是builder 和 visitor (builder包含了visitor)\r\n\r\n（3）其实configflags本身就可以实现往server端发送请求的功能了，但是kubectl功能强大，所以利用了builder+visitor机制做了封装优化。\r\n\r\n（4）builder的具体功能就是包含了cmd和默认的所有配置。在使用的时候就是 f.NewBuilder().xx.xx.xx.Do()。在xx.xx的过程中不仅根据配置实例化了一个builder，还生成了一个与之对应的visitor\r\n\r\n（5）visitor就是一种设计模式。kubectl中的visitor有两种，一种是产生info的visitor，另一种是处理Info的visitor。产生info的visitor在发送请求到apiserver之前对info增加某些关键字段（name, kind, spec的内容等等）。\r\n\r\n 处理info的visitor在apiserver返回后，再对info进行处理。目前对visitor细节理解还不够，接下来进一步分析visitor。\r\n\r\n"
  },
  {
    "path": "k8s/kubectl/5 visitor机制.md",
    "content": "Table of Contents\r\n=================\r\n\r\n  * [1. 背景](#1-背景)\r\n  * [2. visitor机制](#2-visitor机制)\r\n     * [2.1 举例说明](#21-举例说明)\r\n  * [3 总结](#3-总结)\r\n\r\n### 1. 背景\r\n\r\n之前分析到kubectl 的Factory机制时，牵扯到了 visitor机制，这里参考了皓叔的 GO 编程模式：K8S VISITOR 模式[https://coolshell.cn/articles/21263.html], 做的相关笔记，以加深对visitor机制的了解，方便后面的源码分析。\r\n\r\n<br>\r\n\r\n### 2. visitor机制\r\n\r\nvisitor机制看起来是更高级的装饰器模式。它的核心就是这个模式是一种将算法与操作对象的结构分离的一种方法。这种分离的实际结果是能够在不修改结构的情况下向现有对象结构添加新操作，是遵循开放/封闭原则的一种方法。\r\n\r\n#### 2.1 举例说明\r\n\r\n```\r\npackage main\r\n\r\nimport \"fmt\"\r\n\r\ntype VisitorFunc func(*Info, error) error\r\n\r\ntype Visitor interface {\r\n\tVisit(VisitorFunc) error\r\n}\r\n\r\ntype Info struct {\r\n\tNamespace   string\r\n\tName        string\r\n\tOtherThings string\r\n}\r\nfunc (info *Info) Visit(fn VisitorFunc) error {\r\n\treturn fn(info, nil)\r\n}\r\n\r\ntype NameVisitor struct {\r\n\tvisitor Visitor\r\n}\r\n\r\nfunc (v NameVisitor) Visit(fn VisitorFunc) error {\r\n\treturn v.visitor.Visit(func(info *Info, err error) error {\r\n\t\tfmt.Println(\"NameVisitor() before call function\")\r\n\t\terr = fn(info, err)\r\n\t\tif err == nil {\r\n\t\t\tfmt.Printf(\"==> Name=%s, NameSpace=%s\\n\", info.Name, info.Namespace)\r\n\t\t}\r\n\t\tfmt.Println(\"NameVisitor() after call function\")\r\n\t\treturn err\r\n\t})\r\n}\r\n\r\n\r\ntype OtherThingsVisitor struct {\r\n\tvisitor Visitor\r\n}\r\n\r\nfunc (v OtherThingsVisitor) Visit(fn VisitorFunc) error {\r\n\treturn v.visitor.Visit(func(info *Info, err error) error {\r\n\t\tfmt.Println(\"OtherThingsVisitor() before call function\")\r\n\t\terr = fn(info, err)\r\n\t\tif err == nil {\r\n\t\t\tfmt.Printf(\"==> OtherThings=%s\\n\", info.OtherThings)\r\n\t\t}\r\n\t\tfmt.Println(\"OtherThingsVisitor() after call function\")\r\n\t\treturn err\r\n\t})\r\n}\r\n\r\ntype LogVisitor struct {\r\n\tvisitor Visitor\r\n}\r\n\r\nfunc (v LogVisitor) Visit(fn VisitorFunc) error {\r\n\treturn v.visitor.Visit(func(info *Info, err error) error 
{\r\n\t\tfmt.Println(\"LogVisitor() before call function\")\r\n\t\terr = fn(info, err)\r\n\t\tfmt.Println(\"LogVisitor() after call function\")\r\n\t\treturn err\r\n\t})\r\n}\r\n\r\nfunc main() {\r\n\tinfo := Info{}\r\n\tvar v Visitor = &info\r\n\tfmt.Printf(\"v is  %+v\\n\", v)\r\n\r\n\tv = LogVisitor{v}\r\n\r\n\tfmt.Printf(\"logvistor is  %+v\\n\", v)\r\n\r\n\tv = NameVisitor{v}\r\n\r\n\tfmt.Printf(\"namevistor is  %+v\\n\", v)\r\n\tv = OtherThingsVisitor{v}\r\n\r\n\tfmt.Printf(\"oth is  %+v\\n\", v)\r\n\r\n\tloadFile := func(info *Info, err error) error {\r\n\t\tinfo.Name = \"Hao Chen\"\r\n\t\tinfo.Namespace = \"MegaEase\"\r\n\t\tinfo.OtherThings = \"We are running as remote team.\"\r\n\t\treturn nil\r\n\t}\r\n\tv.Visit(loadFile)\r\n}\r\n```\r\n\r\n**上述代码的输出为：**\r\n\r\n```\r\nv is  &{Namespace: Name: OtherThings:}\r\nlogvistor is  {visitor:0xc000070480}\r\nnamevistor is  {visitor:{visitor:0xc000070480}}\r\noth is  {visitor:{visitor:{visitor:0xc000070480}}}\r\nLogVisitor() before call function\r\nNameVisitor() before call function\r\nOtherThingsVisitor() before call function\r\n==> OtherThings=We are running as remote team.\r\nOtherThingsVisitor() after call function\r\n==> Name=Hao Chen, NameSpace=MegaEase\r\nNameVisitor() after call function\r\nLogVisitor() after call function\r\n\r\nProcess finished with the exit code 0\r\n\r\n```\r\n\r\n<br>\r\n\r\n从上述代码可以看出来，visitor机制的原理在于：\r\n\r\n* 定义一个基础的visitor接口，并且规定，visitor接口对应的函数必须是对info对象操作的函数fn(info, err)。\r\n* 定义一个数据结构info，并实现visitor接口\r\n* 其他的想要对info进行处理的。实现visitor接口，在对应实现的函数里，实现自己对info的处理。同时为了能链起来。所以还要调用 fn(info, err)函数\r\n* 根据main函数那样定义的串起来。这样v.Visit(loadFile)就可以 递归似的执行visitor函数\r\n\r\n### 3 总结\r\n\r\n（1）kubectl 根据各种的options对资源对象会做各种处理，利用visitor机制就可以一个一个的处理\r\n\r\n（2）但是有一个疑问就是，为啥不用简单的装饰器模式更方便理解。类似这种，for循环一个个处理？\r\n\r\nA: 这样其实就是for循环，起不到 v1-v2-v3-v2-v1这样递归的效果\r\n\r\n```\r\n// Visit implements Visitor\r\nfunc (v DecoratedVisitor) Visit(fn VisitorFunc) error {\r\n  return v.visitor.Visit(func(info *Info, err error) 
error {\r\n    if err != nil {\r\n      return err\r\n    }\r\n    if err := fn(info, nil); err != nil {\r\n      return err\r\n    }\r\n    for i := range v.decorators {\r\n      if err := v.decorators[i](info, nil); err != nil {\r\n        return err\r\n      }\r\n    }\r\n    return nil\r\n  })\r\n}\r\n```\r\n\r\n"
  },
  {
    "path": "k8s/kubectl/6-kubectl中的所有visitor.md",
    "content": "Table of Contents\n=================\n\n  * [1. Visitor 接口](#1-visitor-接口)\n  * [2. visitor种类](#2-visitor种类)\n     * [2.1 StreamVisitor](#21-streamvisitor)\n     * [2.2 FileVisitor](#22-filevisitor)\n     * [2.3 URLvisitor](#23-urlvisitor)\n     * [2.4 KustomizeVisitor](#24-kustomizevisitor)\n     * [2.5 Selector](#25-selector)\n     * [2.5 InfoListVisitor](#25-infolistvisitor)\n     * [2.6 FilteredVisitor](#26-filteredvisitor)\n     * [2.7 DecoratedVisitor](#27-decoratedvisitor)\n     * [2.8 ContinueOnErrorVisitor](#28-continueonerrorvisitor)\n     * [2.9 FlattenListVisitor](#29-flattenlistvisitor)\n     * [2.10 EagerVisitorList](#210-eagervisitorlist)\n     * [2.11 VisitorList](#211-visitorlist)\n  * [3 总结](#3-总结)\n\n### 1. Visitor 接口\n\nvisitor接口和上文描述的一致。一个visit函数，函数参数为VisitorFunc\n\n```\n// Visitor lets clients walk a list of resources.\ntype Visitor interface {\n\tVisit(VisitorFunc) error\n}\n\n// VisitorFunc implements the Visitor interface for a matching function.\n// If there was a problem walking a list of resources, the incoming error\n// will describe the problem and the function can decide how to handle that error.\n// A nil returned indicates to accept an error to continue loops even when errors happen.\n// This is useful for ignoring certain kinds of errors or aggregating errors in some way.\ntype VisitorFunc func(*Info, error) error\n```\n\n### 2. visitor种类\n\n#### 2.1 StreamVisitor\n\nStreamVisitor就是根据json, 或者yaml中的内容，生成Info信息。\n\n```\n// StreamVisitor reads objects from an io.Reader and walks them. 
A stream visitor can only be\n// visited once.\n// TODO: depends on objects being in JSON format before being passed to decode - need to implement\n// a stream decoder method on runtime.Codec to properly handle this.\ntype StreamVisitor struct {\n\tio.Reader\n\t*mapper\n\n\tSource string   // Source records where the objects come from (the yaml/json input)\n\tSchema ContentValidator    \n}\n\n// NewStreamVisitor is a helper function that is useful when we want to change the fields of the struct but keep calls the same.\nfunc NewStreamVisitor(r io.Reader, mapper *mapper, source string, schema ContentValidator) *StreamVisitor {\n\treturn &StreamVisitor{\n\t\tReader: r,\n\t\tmapper: mapper,\n\t\tSource: source,\n\t\tSchema: schema,\n\t}\n}\n\n\n// Visit implements Visitor over a stream. StreamVisitor is able to distinct multiple resources in one stream.\nfunc (v *StreamVisitor) Visit(fn VisitorFunc) error {\n  // 1. As seen here, only the yaml and json formats are supported\n\td := yaml.NewYAMLOrJSONDecoder(v.Reader, 4096)\n\tfor {\n\t\text := runtime.RawExtension{}\n\t\tif err := d.Decode(&ext); err != nil {\n\t\t\tif err == io.EOF {\n\t\t\t\treturn nil\n\t\t\t}\n\t\t\treturn fmt.Errorf(\"error parsing %s: %v\", v.Source, err)\n\t\t}\n\t\t// TODO: This needs to be able to handle object in other encodings and schemas.\n\t\text.Raw = bytes.TrimSpace(ext.Raw)\n\t\tif len(ext.Raw) == 0 || bytes.Equal(ext.Raw, []byte(\"null\")) {\n\t\t\tcontinue\n\t\t}\n\t\t// 2. Validate using the Factory's validator. staging/src/k8s.io/kubectl/pkg/validation/schema.go\n\t\tif err := ValidateSchema(ext.Raw, v.Schema); err != nil {\n\t\t\treturn fmt.Errorf(\"error validating %q: %v\", v.Source, err)\n\t\t}\n\t\t\n\t\t// 3. infoForData builds an Info object from the given data, converting the json object into the corresponding struct type\n\t\tinfo, err := v.infoForData(ext.Raw, v.Source)\n\t\tif err != nil {\n\t\t\tif fnErr := fn(info, err); fnErr != nil {\n\t\t\t\treturn fnErr\n\t\t\t}\n\t\t\tcontinue\n\t\t}\n\t\t// 4. Call fn: the StreamVisitor runs its own logic first, then the other visitors' logic\n\t\tif err := fn(info, nil); err != nil {\n\t\t\treturn 
err\n\t\t}\n\t}\n}\n\n\n// InfoForData用传入的数据生成一个Info object。会把json对象转换成对应的struct类型\n// InfoForData creates an Info object for the given data. An error is returned\n// if any of the decoding or client lookup steps fail. Name and namespace will be\n// set into Info if the mapping's MetadataAccessor can retrieve them.\nfunc (m *mapper) infoForData(data []byte, source string) (*Info, error) {\n\tobj, gvk, err := m.decoder.Decode(data, nil, nil)\n\tif err != nil {\n\t\treturn nil, fmt.Errorf(\"unable to decode %q: %v\", source, err)\n\t}\n\n\tname, _ := metadataAccessor.Name(obj)\n\tnamespace, _ := metadataAccessor.Namespace(obj)\n\tresourceVersion, _ := metadataAccessor.ResourceVersion(obj)\n\n\tret := &Info{\n\t\tSource:          source,\n\t\tNamespace:       namespace,\n\t\tName:            name,\n\t\tResourceVersion: resourceVersion,\n\n\t\tObject: obj,\n\t}\n\n\tif m.localFn == nil || !m.localFn() {\n\t\trestMapper, err := m.restMapperFn()\n\t\tif err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t\tmapping, err := restMapper.RESTMapping(gvk.GroupKind(), gvk.Version)\n\t\tif err != nil {\n\t\t\treturn nil, fmt.Errorf(\"unable to recognize %q: %v\", source, err)\n\t\t}\n\t\tret.Mapping = mapping\n\n\t\tclient, err := m.clientFn(gvk.GroupVersion())\n\t\tif err != nil {\n\t\t\treturn nil, fmt.Errorf(\"unable to connect to a server to handle %q: %v\", mapping.Resource, err)\n\t\t}\n\t\tret.Client = client\n\t}\n\n\treturn ret, nil\n}\n```\n\n#### 2.2 FileVisitor\n\nFileVisitor封装了一个type StreamVisitor struct，用于处理open/close files\n\n```\n// FileVisitor is wrapping around a StreamVisitor, to handle open/close files\ntype FileVisitor struct {\n\tPath string\n\t*StreamVisitor\n}\n\n// Visit in a FileVisitor is just taking care of opening/closing files\nfunc (v *FileVisitor) Visit(fn VisitorFunc) error {\n\tvar f *os.File\n\tif v.Path == constSTDINstr {\n\t\tf = os.Stdin\n\t} else {\n\t\tvar err error\n\t\tf, err = os.Open(v.Path)\n\t\tif err != nil {\n\t\t\treturn 
err\n\t\t}\n\t\tdefer f.Close()\n\t}\n\n\t// TODO: Consider adding a flag to force to UTF16, apparently some\n\t// Windows tools don't write the BOM\n\tutf16bom := unicode.BOMOverride(unicode.UTF8.NewDecoder())\n\tv.StreamVisitor.Reader = transform.NewReader(f, utf16bom)\n\n\treturn v.StreamVisitor.Visit(fn)\n}\n```\n\n#### 2.3 URLvisitor\n\nURLVisitor下载URL的内容，如果成功，返回一个表示info object代表URL的信息。封装了一个StreamVisitor\n\n```\n// URLVisitor downloads the contents of a URL, and if successful, returns\n// an info object representing the downloaded object.\ntype URLVisitor struct {\n   URL *url.URL\n   *StreamVisitor\n   HttpAttemptCount int\n}\n\nfunc (v *URLVisitor) Visit(fn VisitorFunc) error {\n   body, err := readHttpWithRetries(httpgetImpl, time.Second, v.URL.String(), v.HttpAttemptCount)\n   if err != nil {\n      return err\n   }\n   defer body.Close()\n   v.StreamVisitor.Reader = body\n   return v.StreamVisitor.Visit(fn)\n}\n```\n\n#### 2.4 KustomizeVisitor\n\n这个和file, url一样，也是一种输入的格式。例如：\n\n通过  `kubectl apply -k <kustomization_directory>` 创建应用。\n\n详见：https://kubernetes.io/zh/docs/tasks/manage-kubernetes-objects/kustomization/\n\n这个最终也是调用了StreamVisitor进行了处理。\n\n```\n// KustomizeVisitor is wrapper around a StreamVisitor, to handle Kustomization directories\ntype KustomizeVisitor struct {\n   Path string\n   *StreamVisitor\n}\n\n// Visit in a KustomizeVisitor gets the output of Kustomize build and save it in the Streamvisitor\nfunc (v *KustomizeVisitor) Visit(fn VisitorFunc) error {\n   fSys := fs.MakeRealFS()\n   var out bytes.Buffer\n   err := kustomize.RunKustomizeBuild(&out, fSys, v.Path)\n   if err != nil {\n      return err\n   }\n   v.StreamVisitor.Reader = bytes.NewReader(out.Bytes())\n   return v.StreamVisitor.Visit(fn)\n}\n```\n\n\n\n\n\n\n\n#### 2.5 Selector\n\nSelector是一个resources的Visitor，实现了label selector。\n\n该selector就是填充info的 ListOptions，然后再执行其他的selector\n\n```\n// Selector is a Visitor for resources that match a label selector.\ntype Selector struct 
{\n\tClient        RESTClient\n\tMapping       *meta.RESTMapping\n\tNamespace     string\n\tLabelSelector string\n\tFieldSelector string\n\tExport        bool\n\tLimitChunks   int64\n}\n\n// NewSelector创建一个资源选择器，它隐藏由标签选择器获取项目的细节。\n// NewSelector creates a resource selector which hides details of getting items by their label selector.\nfunc NewSelector(client RESTClient, mapping *meta.RESTMapping, namespace, labelSelector, fieldSelector string, export bool, limitChunks int64) *Selector {\n\treturn &Selector{\n\t\tClient:        client,\n\t\tMapping:       mapping,\n\t\tNamespace:     namespace,\n\t\tLabelSelector: labelSelector,\n\t\tFieldSelector: fieldSelector,\n\t\tExport:        export,\n\t\tLimitChunks:   limitChunks,\n\t}\n}\n\n// Visit implements Visitor and uses request chunking by default.\nfunc (r *Selector) Visit(fn VisitorFunc) error {\n\tvar continueToken string\n\tfor {\n\t\tlist, err := NewHelper(r.Client, r.Mapping).List(\n\t\t\tr.Namespace,\n\t\t\tr.ResourceMapping().GroupVersionKind.GroupVersion().String(),\n\t\t\tr.Export,\n\t\t\t&metav1.ListOptions{\n\t\t\t\tLabelSelector: r.LabelSelector,\n\t\t\t\tFieldSelector: r.FieldSelector,\n\t\t\t\tLimit:         r.LimitChunks,\n\t\t\t\tContinue:      continueToken,\n\t\t\t},\n\t\t)\n\t\tif err != nil {\n\t\t\tif errors.IsResourceExpired(err) {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tif errors.IsBadRequest(err) || errors.IsNotFound(err) {\n\t\t\t\tif se, ok := err.(*errors.StatusError); ok {\n\t\t\t\t\t// modify the message without hiding this is an API error\n\t\t\t\t\tif len(r.LabelSelector) == 0 && len(r.FieldSelector) == 0 {\n\t\t\t\t\t\tse.ErrStatus.Message = fmt.Sprintf(\"Unable to list %q: %v\", r.Mapping.Resource, se.ErrStatus.Message)\n\t\t\t\t\t} else {\n\t\t\t\t\t\tse.ErrStatus.Message = fmt.Sprintf(\"Unable to find %q that match label selector %q, field selector %q: %v\", r.Mapping.Resource, r.LabelSelector, r.FieldSelector, se.ErrStatus.Message)\n\t\t\t\t\t}\n\t\t\t\t\treturn 
se\n\t\t\t\t}\n\t\t\t\tif len(r.LabelSelector) == 0 && len(r.FieldSelector) == 0 {\n\t\t\t\t\treturn fmt.Errorf(\"Unable to list %q: %v\", r.Mapping.Resource, err)\n\t\t\t\t}\n\t\t\t\treturn fmt.Errorf(\"Unable to find %q that match label selector %q, field selector %q: %v\", r.Mapping.Resource, r.LabelSelector, r.FieldSelector, err)\n\t\t\t}\n\t\t\treturn err\n\t\t}\n\t\tresourceVersion, _ := metadataAccessor.ResourceVersion(list)\n\t\tnextContinueToken, _ := metadataAccessor.Continue(list)\n\t\tinfo := &Info{\n\t\t\tClient:  r.Client,\n\t\t\tMapping: r.Mapping,\n\n\t\t\tNamespace:       r.Namespace,\n\t\t\tResourceVersion: resourceVersion,\n\n\t\t\tObject: list,\n\t\t}\n    \n    // 调用fn函数，可以看出来是先执行该selector的处理逻辑，在执行其他visitor的逻辑\n\t\tif err := fn(info, nil); err != nil {\n\t\t\treturn err\n\t\t}\n\t\tif len(nextContinueToken) == 0 {\n\t\t\treturn nil\n\t\t}\n\t\tcontinueToken = nextContinueToken\n\t}\n}\n```\n\n#### 2.5 InfoListVisitor\n\nInfoListVisitor就是多个info对对象的集合。就是该visitor同时对多个info进行处理。例如kubectl create -f yaml中。yaml定义了多个资源对象的情况\n\n```\ntype InfoListVisitor []*Info\n\nfunc (infos InfoListVisitor) Visit(fn VisitorFunc) error {\n\tvar err error\n\tfor _, i := range infos {\n\t\terr = fn(i, err)\n\t}\n\treturn err\n}\n\n```\n\n<br>\n\n#### 2.6 FilteredVisitor\n\nFilteredVisitor可以检查info是否满足某些条件。如果满足条件，则往下执行，否则返回err。\n\nFilterFunc函数在初始化FilteredVisitor的时候定义好。\n\n```\ntype FilterFunc func(info *Info, err error) (bool, error)\n\ntype FilteredVisitor struct {\n\tvisitor Visitor\n\tfilters []FilterFunc\n}\n\nfunc NewFilteredVisitor(v Visitor, fn ...FilterFunc) Visitor {\n\tif len(fn) == 0 {\n\t\treturn v\n\t}\n\treturn FilteredVisitor{v, fn}\n}\n\nfunc (v FilteredVisitor) Visit(fn VisitorFunc) error {\n\treturn v.visitor.Visit(func(info *Info, err error) error {\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tfor _, filter := range v.filters {\n\t\t\tok, err := filter(info, nil)\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tif !ok {\n\t\t\t\treturn 
nil\n\t\t\t}\n\t\t}\n\t\t// 最后在调用fn\n\t\treturn fn(info, nil)\n\t})\n}\n```\n\n#### 2.7 DecoratedVisitor\n\n在调用visitor function之前，DecoratedVisitor将调用decorators。错误将终止visit函数。\n\nNewDecoratedVisitor将在调用用户提供的visitor function之前，创建一个visitor来调用入参visitor functions，让他们有机会改变 Info对象或提前return error。\n\n```\n// DecoratedVisitor will invoke the decorators in order prior to invoking the visitor function\n// passed to Visit. An error will terminate the visit.\ntype DecoratedVisitor struct {\n\tvisitor    Visitor\n\tdecorators []VisitorFunc\n}\n\n// NewDecoratedVisitor will create a visitor that invokes the provided visitor functions before\n// the user supplied visitor function is invoked, giving them the opportunity to mutate the Info\n// object or terminate early with an error.\nfunc NewDecoratedVisitor(v Visitor, fn ...VisitorFunc) Visitor {\n\tif len(fn) == 0 {\n\t\treturn v\n\t}\n\treturn DecoratedVisitor{v, fn}\n}\n\n// Visit implements Visitor\nfunc (v DecoratedVisitor) Visit(fn VisitorFunc) error {\n\treturn v.visitor.Visit(func(info *Info, err error) error {\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tfor i := range v.decorators {\n\t\t\tif err := v.decorators[i](info, nil); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t}\n\t\t// 也是最后才执行fn\n\t\treturn fn(info, nil)\n\t})\n}\n```\n\n#### 2.8 ContinueOnErrorVisitor\n\nContinueOnErrorVisitor访问每个item，如果任何一个item发生错误，则在访问所有item后返回一个聚合错误。\n\n如果遍历期间没有发生错误，func (v ContinueOnErrorVisitor) Visit返回nil。\n\t\t如果发生错误，或者发生多个错误，则返回聚合错误。\n\t\t如果指定的visitor在任何单独的item上失败，它不会阻止其余的item被访问。\n\t\tvisitor直接返回error，可能会导致一个items没有被访问到。\n\t收集子Visitor产生的错误，并返回。\n\n```\n// ContinueOnErrorVisitor visits each item and, if an error occurs on\n// any individual item, returns an aggregate error after all items\n// are visited.\ntype ContinueOnErrorVisitor struct {\n\tVisitor\n}\n\n// Visit returns nil if no error occurs during traversal, a regular\n// error if one occurs, or if multiple errors occur, an aggregate\n// error.  
If the provided visitor fails on any individual item it\n// will not prevent the remaining items from being visited. An error\n// returned by the visitor directly may still result in some items\n// not being visited.\nfunc (v ContinueOnErrorVisitor) Visit(fn VisitorFunc) error {\n\terrs := []error{}\n\t\n\t// Note: even when one of the wrapped visitors reports an error, nil is returned here so iteration continues\n\terr := v.Visitor.Visit(func(info *Info, err error) error {\n\t\tif err != nil {\n\t\t\terrs = append(errs, err)\n\t\t\treturn nil\n\t\t}\n\t\t// fn runs first; its error is collected rather than propagated\n\t\tif err := fn(info, nil); err != nil {\n\t\t\terrs = append(errs, err)\n\t\t}\n\t\treturn nil\n\t})\n\t\n\t\n\tif err != nil {\n\t\terrs = append(errs, err)\n\t}\n\tif len(errs) == 1 {\n\t\treturn errs[0]\n\t}\n\treturn utilerrors.NewAggregate(errs)\n}\n```\n\n#### 2.9 FlattenListVisitor\n\nFlattenListVisitor turns anything runtime.ExtractList recognizes as a list (an object with a public \"Items\" field that is a slice of runtime.Object) into individual Infos. An error on any sub-item (for example, a list containing an object with no registered client or resource) terminates FlattenListVisitor's Visit.\n\n```\n// FlattenListVisitor flattens any objects that runtime.ExtractList recognizes as a list\n// - has an \"Items\" public field that is a slice of runtime.Objects or objects satisfying\n// that interface - into multiple Infos. 
Returns nil in the case of no errors.\n// When an error is hit on sub items (for instance, if a List contains an object that does\n// not have a registered client or resource), returns an aggregate error.\ntype FlattenListVisitor struct {\n   visitor Visitor\n   typer   runtime.ObjectTyper\n   mapper  *mapper\n}\n\nNewFlattenListVisitor创建一个visitor，它将list样式的runtime.Objects扩展成单独的items，然后单独访问它们。\n// NewFlattenListVisitor creates a visitor that will expand list style runtime.Objects\n// into individual items and then visit them individually.\nfunc NewFlattenListVisitor(v Visitor, typer runtime.ObjectTyper, mapper *mapper) Visitor {\n   return FlattenListVisitor{v, typer, mapper}\n}\n\nfunc (v FlattenListVisitor) Visit(fn VisitorFunc) error {\n   return v.visitor.Visit(func(info *Info, err error) error {\n      if err != nil {\n         return err\n      }\n      if info.Object == nil {\n         return fn(info, nil)\n      }\n      if !meta.IsListType(info.Object) {\n         return fn(info, nil)\n      }\n\n      items := []runtime.Object{}\n      itemsToProcess := []runtime.Object{info.Object}\n\n      for i := 0; i < len(itemsToProcess); i++ {\n         currObj := itemsToProcess[i]\n         if !meta.IsListType(currObj) {\n            items = append(items, currObj)\n            continue\n         }\n\n         currItems, err := meta.ExtractList(currObj)\n         if err != nil {\n            return err\n         }\n         if errs := runtime.DecodeList(currItems, v.mapper.decoder); len(errs) > 0 {\n            return utilerrors.NewAggregate(errs)\n         }\n         itemsToProcess = append(itemsToProcess, currItems...)\n      }\n\n      // If we have a GroupVersionKind on the list, prioritize that when asking for info on the objects contained in the list\n      var preferredGVKs []schema.GroupVersionKind\n      if info.Mapping != nil && !info.Mapping.GroupVersionKind.Empty() {\n         preferredGVKs = append(preferredGVKs, info.Mapping.GroupVersionKind)\n      
}\n      errs := []error{}\n      for i := range items {\n         item, err := v.mapper.infoForObject(items[i], v.typer, preferredGVKs)\n         if err != nil {\n            errs = append(errs, err)\n            continue\n         }\n         if len(info.ResourceVersion) != 0 {\n            item.ResourceVersion = info.ResourceVersion\n         }\n         if err := fn(item, nil); err != nil {\n            errs = append(errs, err)\n         }\n      }\n      return utilerrors.NewAggregate(errs)\n\n   })\n}\n```\n\n#### 2.10 EagerVisitorList\n\nEagerVisitorList 实现其包含的子Visitor的Visit方法。在遍历其子Visitor的过程中，所有的error会被收集起来，在迭代结束后一起return\n\n```\n// EagerVisitorList implements Visit for the sub visitors it contains. All errors\n// will be captured and returned at the end of iteration.\ntype EagerVisitorList []Visitor\n\n// Visit implements Visitor, and gathers errors that occur during processing until\n// all sub visitors have been visited.\nfunc (l EagerVisitorList) Visit(fn VisitorFunc) error {\n\terrs := []error(nil)\n\tfor i := range l {\n\t\tif err := l[i].Visit(func(info *Info, err error) error {\n\t\t\tif err != nil {\n\t\t\t\terrs = append(errs, err)\n\t\t\t\treturn nil\n\t\t\t}\n\t\t\tif err := fn(info, nil); err != nil {\n\t\t\t\terrs = append(errs, err)\n\t\t\t}\n\t\t\treturn nil\n\t\t}); err != nil {\n\t\t\terrs = append(errs, err)\n\t\t}\n\t}\n\treturn utilerrors.NewAggregate(errs)\n}\n```\n\n#### 2.11 VisitorList\n\nVisitorList 实现其包含的子Visitor的Visit方法。在遍历其子Visitor的过程中，只要出现error，VisitorList的Visit立刻return\n\n```\n// VisitorList implements Visit for the sub visitors it contains. 
The first error\n// returned from a child Visitor will terminate iteration.\ntype VisitorList []Visitor\n\n// Visit implements Visitor\nfunc (l VisitorList) Visit(fn VisitorFunc) error {\n\tfor i := range l {\n\t\tif err := l[i].Visit(fn); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\treturn nil\n}\n```\n\n### 3 总结\n\nBuilder's Do() function returns a Result, and the Visitor is the most important data structure inside that Result. This article walked through all of the visitors kubectl defines. At this point the main logic of the kubectl factory is basically clear:\n\n(1) The factory mainly contains the builder struct, which holds all of the configuration. The builder contains the visitor, and f.NewBuilder().xx.xx.xx.Do() sets up all of the visitors.\n\n(2) Finally, calling r.Visit drives the whole chain of visitors.\n\n```\nerr = r.Visit(func(info *resource.Info, err error) error {\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tif err := util.CreateOrUpdateAnnotation(cmdutil.GetFlagBool(cmd, cmdutil.ApplyAnnotationsFlag), info.Object, scheme.DefaultJSONEncoder()); err != nil {\n\t\t\treturn cmdutil.AddSourceToErr(\"creating\", info.Source, err)\n\t\t}\n\n\t\tif err := o.Recorder.Record(info.Object); err != nil {\n\t\t\tklog.V(4).Infof(\"error recording current command: %v\", err)\n\t\t}\n\n\t\tif !o.DryRun {\n\t\t\tif err := createAndRefresh(info); err != nil {\n\t\t\t\treturn cmdutil.AddSourceToErr(\"creating\", info.Source, err)\n\t\t\t}\n\t\t}\n\n\t\tcount++\n\n\t\treturn o.PrintObj(info.Object)\n\t})\n```\n\n"
  },
  {
    "path": "k8s/kubectl/7-kubectl create使用到的visitor.md",
    "content": "Table of Contents\n=================\n\n  * [1. 背景说明](#1-背景说明)\n     * [1.1 get](#11-get)\n     * [1.2 delete](#12-delete)\n     * [1.3 create](#13-create)\n     * [1.4 apply](#14-apply)\n  * [2. kubectl create -f pod.yaml](#2-kubectl-create--f-podyaml)\n     * [2.1 kubectl代码中定义visitor](#21-kubectl代码中定义visitor)\n     * [2.2 再次看kubectl create的输出结果](#22-再次看kubectl-create的输出结果)\n  * [3. 总结](#3-总结)\n\n### 1. 背景说明\n\n为了了解不同的kubectl操作，使用了哪些visitor, 给每个visitor打印了如下的日志。然后使用各种kubectl 命令进行操作。\n\n```\nfunc (v *URLVisitor) Visit(fn VisitorFunc) error {\n\tklog.Errorf(\"in URLVisitor\")\n\tdefer klog.Errorf(\"after URLVisitor\")\n```\n\n#### 1.1 get \n\n可以看出来最简单的get 也用到了selector，应该是默认加了-n default的缘故\n\n```\nroot@k8s-master:~# ./kubectl get pods \nE1108 15:07:03.521469    6873 visitor.go:335] in DecoratedVisitor\nE1108 15:07:03.521496    6873 visitor.go:364] in ContinueOnErrorVisitor\nE1108 15:07:03.521506    6873 visitor.go:404] in FlattenListVisitor\nE1108 15:07:03.521518    6873 visitor.go:216] in EagerVisitorList\nE1108 15:07:03.521530    6873 selector.go:55] in Selector\nE1108 15:07:03.525540    6873 selector.go:107] after Selector\nE1108 15:07:03.525564    6873 visitor.go:233] after EagerVisitorList\nE1108 15:07:03.525573    6873 visitor.go:406] after FlattenListVisitor\nE1108 15:07:03.525582    6873 visitor.go:383] after ContinueOnErrorVisitor\nE1108 15:07:03.525589    6873 visitor.go:337] after DecoratedVisitor\nNAME    READY   STATUS    RESTARTS   AGE\nnginx   1/1     Running   70         2d22h\n\n\nroot@k8s-master:~# ./kubectl get pods --field-selector status.phase=Running\nE1108 15:10:11.485035    8174 visitor.go:335] in DecoratedVisitor\nE1108 15:10:11.485076    8174 visitor.go:364] in ContinueOnErrorVisitor\nE1108 15:10:11.485081    8174 visitor.go:404] in FlattenListVisitor\nE1108 15:10:11.485087    8174 visitor.go:216] in EagerVisitorList\nE1108 15:10:11.485093    8174 selector.go:55] in Selector\nE1108 15:10:11.489887    8174 selector.go:107] 
after Selector\nE1108 15:10:11.489924    8174 visitor.go:233] after EagerVisitorList\nE1108 15:10:11.489937    8174 visitor.go:406] after FlattenListVisitor\nE1108 15:10:11.489947    8174 visitor.go:383] after ContinueOnErrorVisitor\nE1108 15:10:11.489954    8174 visitor.go:337] after DecoratedVisitor\nNAME    READY   STATUS    RESTARTS   AGE\nnginx   1/1     Running   71         2d23h\n```\n\n<br>\n\n#### 1.2 delete\n\n可以看到delete是先删除完后，再对对象进行处理\n\n```\nroot@k8s-master:~# ./kubectl delete  pods  nginx\nE1108 15:32:47.413503   17665 visitor.go:335] in DecoratedVisitor\nE1108 15:32:47.413531   17665 visitor.go:364] in ContinueOnErrorVisitor\nE1108 15:32:47.413539   17665 visitor.go:404] in FlattenListVisitor\nE1108 15:32:47.413544   17665 visitor.go:199] in VisitorList\nE1108 15:32:47.413554   17665 visitor.go:96] in Info\npod \"nginx\" deleted\nE1108 15:32:47.427400   17665 visitor.go:98] after Info\nE1108 15:32:47.427477   17665 visitor.go:206] after VisitorList\nE1108 15:32:47.427608   17665 visitor.go:406] after FlattenListVisitor\nE1108 15:32:47.427669   17665 visitor.go:383] after ContinueOnErrorVisitor\nE1108 15:32:47.427682   17665 visitor.go:337] after DecoratedVisitor\nE1108 15:32:47.427695   17665 visitor.go:780] in InfoListVisitor\nE1108 15:33:03.414515   17665 visitor.go:786] after InfoListVisitor\n```\n\n#### 1.3 create\n\n```\nroot@k8s-master:~# ./kubectl create -f pod.yaml -v 3\nE1108 15:34:00.125162   18176 visitor.go:335] in DecoratedVisitor\nE1108 15:34:00.125198   18176 visitor.go:364] in ContinueOnErrorVisitor\nE1108 15:34:00.125206   18176 visitor.go:404] in FlattenListVisitor\nE1108 15:34:00.125210   18176 visitor.go:404] in FlattenListVisitor\nE1108 15:34:00.125214   18176 visitor.go:216] in EagerVisitorList\nE1108 15:34:00.125226   18176 visitor.go:526] in FileVisitor\nE1108 15:34:00.125256   18176 visitor.go:592] in StreamVisitor\nI1108 15:34:00.131293   18176 create.go:272] post data to apiserver\npod/nginx created\nE1108 15:34:00.141895   
18176 visitor.go:599] after StreamVisitor\nE1108 15:34:00.141923   18176 visitor.go:545] after FileVisitor\nE1108 15:34:00.141934   18176 visitor.go:233] after EagerVisitorList\nE1108 15:34:00.141941   18176 visitor.go:406] after FlattenListVisitor\nE1108 15:34:00.141946   18176 visitor.go:406] after FlattenListVisitor\nE1108 15:34:00.141951   18176 visitor.go:383] after ContinueOnErrorVisitor\nE1108 15:34:00.141960   18176 visitor.go:337] after DecoratedVisitor\n```\n\n#### 1.4 apply\n\n```\nroot@k8s-master:~# ./kubectl apply -f pod.yaml -v 3\nE1108 15:34:33.097456   18406 visitor.go:335] in DecoratedVisitor\nE1108 15:34:33.097490   18406 visitor.go:364] in ContinueOnErrorVisitor\nE1108 15:34:33.097494   18406 visitor.go:404] in FlattenListVisitor\nE1108 15:34:33.097501   18406 visitor.go:404] in FlattenListVisitor\nE1108 15:34:33.097505   18406 visitor.go:216] in EagerVisitorList\nE1108 15:34:33.097516   18406 visitor.go:526] in FileVisitor\nE1108 15:34:33.097551   18406 visitor.go:592] in StreamVisitor\nWarning: kubectl apply should be used on resource created by either kubectl create --save-config or kubectl apply\npod/nginx configured\nE1108 15:34:33.110716   18406 visitor.go:599] after StreamVisitor\nE1108 15:34:33.110748   18406 visitor.go:545] after FileVisitor\nE1108 15:34:33.110756   18406 visitor.go:233] after EagerVisitorList\nE1108 15:34:33.110764   18406 visitor.go:406] after FlattenListVisitor\nE1108 15:34:33.110771   18406 visitor.go:406] after FlattenListVisitor\nE1108 15:34:33.110778   18406 visitor.go:383] after ContinueOnErrorVisitor\nE1108 15:34:33.110786   18406 visitor.go:337] after DecoratedVisitor\n```\n\n<br>\n\n### 2. kubectl create -f pod.yaml\n\nFrom the logs above, create used 7 visitors in total: DecoratedVisitor, ContinueOnErrorVisitor, FlattenListVisitor, FlattenListVisitor, EagerVisitorList, FileVisitor, StreamVisitor.\n\nNext, let's look at how this is implemented in the code, and why it is implemented this way.\n\n#### 2.1 kubectl代码中定义visitor\n\n```\n\tr := f.NewBuilder().       
// Unrelated to visitors; just assigns some variables, ToRESTConfig and the like\n\t\tUnstructured().          // Because this is create we don't know which kind of object will be created, so Unstructured is needed\n\t\tSchema(schema).          // Assigns the schema, for validation\n\t\tContinueOnError().       // Sets ContinueOnError=true\n\t\tNamespaceParam(cmdNamespace).DefaultNamespace().   // Sets the namespace\n\t\tFilenameParam(enforceNamespace, &o.FilenameOptions).    // Sets the path\n\t\tLabelSelectorParam(o.Selector).                         // Sets the label selector\n\t\tFlatten().                                              // Sets flatten=true\n\t\tDo()                             \n```\n\n<br>\n\n```\nfunc (b *Builder) Do() *Result {\n  // Initialize the visitors; for kubectl create -f these are: DecoratedVisitor, FlattenListVisitor, EagerVisitorList\n\tr := b.visitorResult()\n\tr.mapper = b.Mapper()\n\tif r.err != nil {\n\t\treturn r\n\t}\n\tif b.flatten {\n\t\tr.visitor = NewFlattenListVisitor(r.visitor, b.objectTyper, b.mapper)\n\t}\n\thelpers := []VisitorFunc{}\n\tif b.defaultNamespace {\n\t\thelpers = append(helpers, SetNamespace(b.namespace))\n\t}\n\tif b.requireNamespace {\n\t\thelpers = append(helpers, RequireNamespace(b.namespace))\n\t}\n\thelpers = append(helpers, FilterNamespace)\n\tif b.requireObject {\n\t\thelpers = append(helpers, RetrieveLazy)\n\t}\n\t// Adds ContinueOnErrorVisitor\n\tif b.continueOnError {\n\t\tr.visitor = NewDecoratedVisitor(ContinueOnErrorVisitor{r.visitor}, helpers...)\n\t} else {\n\t\tr.visitor = NewDecoratedVisitor(r.visitor, helpers...)\n\t}\n\treturn r\n}\n\n```\n\n\n\n```\nfunc (b *Builder) visitorResult() *Result {\n  ...\n\t// visit items specified by paths\n\t// create -f pod.yaml takes this path\n\tif len(b.paths) != 0 {\n\t\treturn b.visitByPaths()\n\t}\n  ...\n}\n\nfunc (b *Builder) visitByPaths() *Result {\n\t\n\n\tvar visitors Visitor\n\t// 1. Defines EagerVisitorList: errors are collected and returned together\n\tif b.continueOnError {\n\t\tvisitors = EagerVisitorList(b.paths)\n\t} else {\n\t\tvisitors = VisitorList(b.paths)\n\t}\n  \n  // 2. Defines FlattenListVisitor, which flattens anything runtime.ExtractList recognizes as a list with a public \"Items\" field\n  // 
if the create -f xx.yaml file contains multiple objects, the resulting object list will have this Items field\n\tif b.flatten {\n\t\tvisitors = NewFlattenListVisitor(visitors, b.objectTyper, b.mapper)\n\t}\n   \n  // 3. Defines DecoratedVisitor; if defaultNamespace is set, it sets the default namespace\n\t// only items from disk can be refetched\n\tif b.latest {\n\t\t// must set namespace prior to fetching\n\t\tif b.defaultNamespace {\n\t\t\tvisitors = NewDecoratedVisitor(visitors, SetNamespace(b.namespace))\n\t\t}\n\t\tvisitors = NewDecoratedVisitor(visitors, RetrieveLatest)\n\t}\n\t\n   // create has no selector, so this one is skipped\n\tif b.labelSelector != nil {\n\t\tselector, err := labels.Parse(*b.labelSelector)\n\t\tif err != nil {\n\t\t\treturn result.withError(fmt.Errorf(\"the provided selector %q is not valid: %v\", *b.labelSelector, err))\n\t\t}\n\t\tvisitors = NewFilteredVisitor(visitors, FilterByLabelSelector(selector))\n\t}\n\tresult.visitor = visitors\n\tresult.sources = b.paths\n\treturn result\n}\n```\n\nAs you can see, the code above defines DecoratedVisitor, ContinueOnErrorVisitor, FlattenListVisitor, FlattenListVisitor, and EagerVisitorList, but not FileVisitor or StreamVisitor.\n\nFileVisitor is defined at the very beginning, in FilenameParam. As can be seen there, the StreamVisitor is created first of all (the FileVisitor wraps it).\n\n```\n// FilenameParam groups input in two categories: URLs and files (files, directories, STDIN)\n// If enforceNamespace is false, namespaces in the specs will be allowed to\n// override the default namespace. If it is true, namespaces that don't match\n// will cause an error.\n// If ContinueOnError() is set prior to this method, objects on the path that are not\n// recognized will be ignored (but logged at V(2)).\nfunc (b *Builder) FilenameParam(enforceNamespace bool, filenameOptions *FilenameOptions) *Builder {\n\n   for _, s := range paths {\n      switch {\n      case s == \"-\":\n         // the stdin case\n         b.Stdin()\n}\n\n\n// Stdin will read objects from the standard input. 
If ContinueOnError() is set\n// prior to this method being called, objects in the stream that are unrecognized\n// will be ignored (but logged at V(2)).\nfunc (b *Builder) Stdin() *Builder {\n\tb.stream = true\n\tb.paths = append(b.paths, FileVisitorForSTDIN(b.mapper, b.schema))\n\treturn b\n}\n\n// FileVisitorForSTDIN return a special FileVisitor just for STDIN\nfunc FileVisitorForSTDIN(mapper *mapper, schema ContentValidator) Visitor {\n\treturn &FileVisitor{\n\t\tPath:          constSTDINstr,\n\t\tStreamVisitor: NewStreamVisitor(nil, mapper, constSTDINstr, schema),\n\t}\n}\n```\n\n#### 2.2 再次看kubectl create的输出结果\n\n```\nroot@k8s-master:~# ./kubectl create -f pod.yaml -v 3\nE1108 16:23:02.756637    6420 visitor.go:335] in DecoratedVisitor\nE1108 16:23:02.756683    6420 visitor.go:364] in ContinueOnErrorVisitor\nE1108 16:23:02.756688    6420 visitor.go:404] in FlattenListVisitor\nE1108 16:23:02.756695    6420 visitor.go:404] in FlattenListVisitor\nE1108 16:23:02.756698    6420 visitor.go:216] in EagerVisitorList\nE1108 16:23:02.756703    6420 visitor.go:526] in FileVisitor\nE1108 16:23:02.756740    6420 visitor.go:592] in StreamVisitor\nI1108 16:23:02.760038    6420 create.go:260] info is {Client:0xc00132e000 Mapping:0xc001326000 Namespace:default Name:nginx Source:pod.yaml Object:0xc000ea1b68 ResourceVersion: Export:false}:\nI1108 16:23:02.760354    6420 create.go:273] post data to apiserver\nI1108 16:23:02.760375    6420 create.go:274] info is {Client:0xc00132e000 Mapping:0xc001326000 Namespace:default Name:nginx Source:pod.yaml Object:0xc000ea1b68 ResourceVersion: Export:false}:\npod/nginx created\nE1108 16:23:02.772496    6420 visitor.go:599] after StreamVisitor\nE1108 16:23:02.772523    6420 visitor.go:545] after FileVisitor\nE1108 16:23:02.772531    6420 visitor.go:233] after EagerVisitorList\nE1108 16:23:02.772538    6420 visitor.go:406] after FlattenListVisitor\nE1108 16:23:02.772544    6420 visitor.go:406] after FlattenListVisitor\nE1108 16:23:02.772551   
 6420 visitor.go:383] after ContinueOnErrorVisitor\nE1108 16:23:02.772557    6420 visitor.go:337] after DecoratedVisitor\n```\n\n<br>\n\nDecoratedVisitor -> ContinueOnErrorVisitor -> FlattenListVisitor -> EagerVisitorList -> FileVisitor ->  StreamVisitor -> info(post data to apiserver)\n\n**The overall flow / processing order:**\n\n（1）DecoratedVisitor runs its decoration first; here it fills in the default namespace on each Info\n\n（2）ContinueOnErrorVisitor runs the inner visitor first, then handles the errors it returned\n\n（3）FlattenListVisitor uses runtime.ExtractList to split a list object (anything with the common \"Items\" field) into individual Infos; it does its own work before delegating\n\n（4）EagerVisitorList runs the inner visitors first and collects all of their errors in one place\n\n（5）FileVisitor does its own work first, then delegates to the inner visitor\n\n（6）StreamVisitor does its own work first, then delegates to the inner visitor\n\n（7）the object is posted to the apiserver\n\n### 3. 总结\n\n（1）The builder chain below is the key: it assembles the various visitors, which r.Visit then executes. If you want to build something similar yourself, first be clear about which visitors you need, and second think through their processing order.\n\n```\n\tr := f.NewBuilder().\n\t\tUnstructured().\n\t\tSchema(schema).\n\t\tContinueOnError().\n\t\tNamespaceParam(cmdNamespace).DefaultNamespace().\n\t\tFilenameParam(enforceNamespace, &o.FilenameOptions).\n\t\tLabelSelectorParam(o.Selector).\n\t\tFlatten().\n\t\tDo()\n```\n\n（2）So far, kubectl create only processes the Info objects before posting them to the apiserver; it does not do any additional processing on the data returned by the apiserver."
  },
  {
    "path": "k8s/kubectl/8- kubectl printer分析.md",
    "content": "Table of Contents\r\n=================\r\n\r\n  * [1. kubectl 强大的格式化输出](#1-kubectl-强大的格式化输出)\r\n     * [1.1 常见的用法： kubectl -o/--output  json, yaml， wide](#11-常见的用法-kubectl--o--output--json-yaml-wide)\r\n     * [1.2 custom-columns](#12-custom-columns)\r\n     * [1.3 go-template， go-template-file](#13-go-template-go-template-file)\r\n     * [1.4 jsonpath，jsonpath-file](#14-jsonpathjsonpath-file)\r\n  * [2. kubectl Printer源码分析](#2-kubectl-printer源码分析)\r\n     * [2.1 kubectl create定义printor的过程](#21-kubectl-create定义printor的过程)\r\n     * [2.2 各种printor的实现](#22-各种printor的实现)\r\n        * [2.2.1 JSONYamlPrint](#221-jsonyamlprint)\r\n        * [2.2.2 NamePrinter](#222-nameprinter)\r\n        * [2.2.3 GoTemplatePrinter](#223-gotemplateprinter)\r\n  * [3.总结](#3总结)\r\n\r\n### 1. kubectl 强大的格式化输出\r\n\r\nA Printer is the display format a kubectl command applies to its output, and kubectl supports a remarkably rich set of them. This article examines how kubectl implements its Printers.\r\n\r\nAs kubectl -h shows, kubectl offers the following customizable output formats:\r\n\r\n```\r\nUsage:\r\n  kubectl get\r\n[(-o|--output=)json|yaml|wide|custom-columns=...|custom-columns-file=...|go-template=...|go-template-file=...|jsonpath=...|jsonpath-file=...]\r\n(TYPE[.VERSION][.GROUP] [NAME | -l label] | TYPE[.VERSION][.GROUP]/NAME ...) 
[flags] [options]\r\n```\r\n\r\n#### 1.1 常见的用法： kubectl -o/--output  json, yaml， wide\r\n\r\n#### 1.2 custom-columns\r\n\r\nkubectl custom-columns, custom-columns-file (these two behave identically; the second simply reads the column spec from a file)\r\n\r\n```\r\nkubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[*].restartCount,CONATAINER_NAME:.spec.containers[*].name,READY:.status.containerStatuses[*].ready\r\nNAME     STATUS    RESTARTS   CONATAINER_NAME   READY\r\nnginx    Running   5          nginx             true\r\nnginx1   Running   0,0        nginx,nginx1      true,true\r\n```\r\n\r\nThis is a classic example: any field can be pulled into a custom column.\r\n\r\nFor array fields such as containerStatuses and containers, the [] is mandatory. `[*]` selects every element, as in the command above; without the `*`, only the first element is printed, as shown below:\r\n\r\n```\r\nroot@k8s-master:~# kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[].restartCount,CONATAINER_NAME:.spec.containers[].name,READY:.status.containerStatuses[].ready\r\nNAME     STATUS    RESTARTS   CONATAINER_NAME   READY\r\nnginx    Running   5          nginx             true\r\nnginx1   Running   0          nginx             true\r\n```\r\n\r\n#### 1.3 go-template， go-template-file\r\n\r\ngo-template and go-template-file behave identically.\r\n\r\nPart of go-template's power is that it also supports if/else statements.\r\n\r\n```\r\nroot@k8s-master:~# kubectl get pods -o go-template --template='{{range .items}}{{printf \"%s %s\\n\" .metadata.name .metadata.creationTimestamp}}{{end}}'\r\nnginx 2021-11-08T08:23:02Z\r\n```\r\n\r\n#### 1.4 jsonpath，jsonpath-file\r\n\r\nThese two likewise behave identically.\r\n\r\n```\r\nroot@k8s-master:~# kubectl get pods -o=jsonpath=\"{range .items[*]}{.metadata.name}{'\\t'}{.status.startTime}{'\\n'}{end}\"\r\nnginx   2021-11-08T08:23:02Z\r\n```\r\n\r\n<br>\r\n\r\n### 2. 
kubectl Printer源码分析\r\n\r\nkubectl create and kubectl get format their output in exactly the same way. The overall flow is:\r\n\r\n（1）obtain the printFlags from the cmd\r\n\r\n（2）select a Printer according to the printFlags\r\n\r\n（3）each Printer formats the info object in its own way\r\n\r\n<br>\r\n\r\nStarting again from the kubectl create command, let's see how this is implemented in the source.\r\n\r\n#### 2.1 kubectl create定义printor的过程\r\n\r\n（1）bind the PrintFlags flags\r\n\r\n（2）before RunCreate runs, the Complete function fills in CreateOptions\r\n\r\n（3）Complete instantiates the printer according to the flags\r\n\r\n（4）a different printer is chosen for each format specified by --output; these correspond exactly to the printer kinds listed above\r\n\r\n<br>\r\n\r\n（1）bind the PrintFlags flags\r\n\r\n```\r\n// CreateOptions is the commandline options for 'create' sub command\r\ntype CreateOptions struct {\r\n\tPrintFlags  *genericclioptions.PrintFlags\r\n}\r\n\r\n// PrintFlags composes common printer flag structs\r\n// used across all commands, and provides a method\r\n// of retrieving a known printer based on flag values provided.\r\ntype PrintFlags struct {\r\n\tJSONYamlPrintFlags   *JSONYamlPrintFlags\r\n\tNamePrintFlags       *NamePrintFlags\r\n\tTemplatePrinterFlags *KubeTemplatePrintFlags\r\n\r\n\tTypeSetterPrinter *printers.TypeSetterPrinter\r\n\r\n\tOutputFormat *string\r\n\r\n\t// OutputFlagSpecified indicates whether the user specifically requested a certain kind of output.\r\n\t// Using this function allows a sophisticated caller to change the flag binding logic if they so desire.\r\n\tOutputFlagSpecified func() bool\r\n}\r\n\r\nNewCmdCreate contains this statement:\r\no.PrintFlags.AddFlags(cmd)\r\n\r\n// this binds the flags directly\r\nfunc (f *PrintFlags) AddFlags(cmd *cobra.Command) {\r\n\tf.JSONYamlPrintFlags.AddFlags(cmd)\r\n\tf.NamePrintFlags.AddFlags(cmd)\r\n\tf.TemplatePrinterFlags.AddFlags(cmd)\r\n\r\n\tif f.OutputFormat != nil {\r\n\t\tcmd.Flags().StringVarP(f.OutputFormat, \"output\", \"o\", *f.OutputFormat, fmt.Sprintf(\"Output format. 
One of: %s.\", strings.Join(f.AllowedFormats(), \"|\")))\r\n\t\tif f.OutputFlagSpecified == nil {\r\n\t\t\tf.OutputFlagSpecified = func() bool {\r\n\t\t\t\treturn cmd.Flag(\"output\").Changed\r\n\t\t\t}\r\n\t\t}\r\n\t}\r\n}\r\n```\r\n\r\n（2）在运行RunCreate之前，通过Complete函数，补全了CreateOptions。\r\n\r\n```\r\n// NewCmdCreate returns new initialized instance of create sub command\r\nfunc NewCmdCreate(f cmdutil.Factory, ioStreams genericclioptions.IOStreams) *cobra.Command {\r\n\to := NewCreateOptions(ioStreams)\r\n\r\n\tcmd := &cobra.Command{\r\n\t\tUse:                   \"create -f FILENAME\",\r\n\t\tDisableFlagsInUseLine: true,\r\n\t\tShort:                 i18n.T(\"Create a resource from a file or from stdin.\"),\r\n\t\tLong:                  createLong,\r\n\t\tExample:               createExample,\r\n\t\tRun: func(cmd *cobra.Command, args []string) {\r\n\t\t\tif cmdutil.IsFilenameSliceEmpty(o.FilenameOptions.Filenames, o.FilenameOptions.Kustomize) {\r\n\t\t\t\tioStreams.ErrOut.Write([]byte(\"Error: must specify one of -f and -k\\n\\n\"))\r\n\t\t\t\tdefaultRunFunc := cmdutil.DefaultSubCommandRun(ioStreams.ErrOut)\r\n\t\t\t\tdefaultRunFunc(cmd, args)\r\n\t\t\t\treturn\r\n\t\t\t}\r\n\t\t\t// 调用了Complete函数，补全了CreateOptions\r\n\t\t\tcmdutil.CheckErr(o.Complete(f, cmd))\r\n\t\t\tcmdutil.CheckErr(o.ValidateArgs(cmd, args))\r\n\t\t\tcmdutil.CheckErr(o.RunCreate(f, cmd))\r\n\t\t},\r\n\t}\r\n```\r\n\r\n<br>\r\n\r\n（3）Complete函数根据参数，实例化printor\r\n\r\n```\r\n// Complete completes all the required options\r\nfunc (o *CreateOptions) Complete(f cmdutil.Factory, cmd *cobra.Command) error {\r\n\tvar err error\r\n\to.RecordFlags.Complete(cmd)\r\n\to.Recorder, err = o.RecordFlags.ToRecorder()\r\n\tif err != nil {\r\n\t\treturn err\r\n\t}\r\n\r\n\to.DryRun = cmdutil.GetDryRunFlag(cmd)\r\n\r\n\tif o.DryRun {\r\n\t\to.PrintFlags.Complete(\"%s (dry run)\")\r\n\t}\r\n\t// 根据参数，实例化printer\r\n\tprinter, err := o.PrintFlags.ToPrinter()\r\n\tif err != nil {\r\n\t\treturn 
err\r\n\t}\r\n\r\n\to.PrintObj = func(obj kruntime.Object) error {\r\n\t\treturn printer.PrintObj(obj, o.Out)\r\n\t}\r\n\r\n\treturn nil\r\n}\r\n```\r\n\r\n(4) A different printer is chosen according to the format specified by --output; as the code below shows, these correspond exactly to the printer kinds listed above.\r\n\r\n```\r\nfunc (f *PrintFlags) ToPrinter() (printers.ResourcePrinter, error) {\r\n\toutputFormat := \"\"\r\n\tif f.OutputFormat != nil {\r\n\t\toutputFormat = *f.OutputFormat\r\n\t}\r\n\t// For backwards compatibility we want to support a --template argument given, even when no --output format is provided.\r\n\t// If no explicit output format has been provided via the --output flag, fallback\r\n\t// to honoring the --template argument.\r\n\ttemplateFlagSpecified := f.TemplatePrinterFlags != nil &&\r\n\t\tf.TemplatePrinterFlags.TemplateArgument != nil &&\r\n\t\tlen(*f.TemplatePrinterFlags.TemplateArgument) > 0\r\n\toutputFlagSpecified := f.OutputFlagSpecified != nil && f.OutputFlagSpecified()\r\n\tif templateFlagSpecified && !outputFlagSpecified {\r\n\t\toutputFormat = \"go-template\"\r\n\t}\r\n\r\n\tif f.JSONYamlPrintFlags != nil {\r\n\t\tif p, err := f.JSONYamlPrintFlags.ToPrinter(outputFormat); !IsNoCompatiblePrinterError(err) {\r\n\t\t\treturn f.TypeSetterPrinter.WrapToPrinter(p, err)\r\n\t\t}\r\n\t}\r\n\r\n\tif f.NamePrintFlags != nil {\r\n\t\tif p, err := f.NamePrintFlags.ToPrinter(outputFormat); !IsNoCompatiblePrinterError(err) {\r\n\t\t\treturn f.TypeSetterPrinter.WrapToPrinter(p, err)\r\n\t\t}\r\n\t}\r\n\r\n\tif f.TemplatePrinterFlags != nil {\r\n\t\tif p, err := f.TemplatePrinterFlags.ToPrinter(outputFormat); !IsNoCompatiblePrinterError(err) {\r\n\t\t\treturn f.TypeSetterPrinter.WrapToPrinter(p, err)\r\n\t\t}\r\n\t}\r\n\r\n\treturn nil, NoCompatiblePrinterError{OutputFormat: f.OutputFormat, AllowedFormats: f.AllowedFormats()}\r\n}\r\n```\r\n\r\n#### 2.2 各种printor的实现\r\n\r\n##### 2.2.1 JSONYamlPrint\r\n\r\nBased on outputFormat, this is further split into JSONPrinter and YAMLPrinter.\r\n\r\n```\r\n// ToPrinter receives an outputFormat and returns a printer capable of\r\n// 
handling --output=(yaml|json) printing.\r\n// Returns false if the specified outputFormat does not match a supported format.\r\n// Supported Format types can be found in pkg/printers/printers.go\r\nfunc (f *JSONYamlPrintFlags) ToPrinter(outputFormat string) (printers.ResourcePrinter, error) {\r\n\tvar printer printers.ResourcePrinter\r\n\r\n\toutputFormat = strings.ToLower(outputFormat)\r\n\tswitch outputFormat {\r\n\tcase \"json\":\r\n\t\tprinter = &printers.JSONPrinter{}\r\n\tcase \"yaml\":\r\n\t\tprinter = &printers.YAMLPrinter{}\r\n\tdefault:\r\n\t\treturn nil, NoCompatiblePrinterError{OutputFormat: &outputFormat, AllowedFormats: f.AllowedFormats()}\r\n\t}\r\n\r\n\treturn printer, nil\r\n}\r\n```\r\n\r\n<br>\r\n\r\n**jsonPrinter**\r\n\r\n用法：kubectl get pod -o json\r\n\r\n```\r\n// PrintObj is an implementation of ResourcePrinter.PrintObj which simply writes the object to the Writer.\r\nfunc (p *JSONPrinter) PrintObj(obj runtime.Object, w io.Writer) error {\r\n\t// we use reflect.Indirect here in order to obtain the actual value from a pointer.\r\n\t// we need an actual value in order to retrieve the package path for an object.\r\n\t// using reflect.Indirect indiscriminately is valid here, as all runtime.Objects are supposed to be pointers.\r\n\tif InternalObjectPreventer.IsForbidden(reflect.Indirect(reflect.ValueOf(obj)).Type().PkgPath()) {\r\n\t\treturn fmt.Errorf(InternalObjectPrinterErr)\r\n\t}\r\n\r\n\tswitch obj := obj.(type) {\r\n\tcase *metav1.WatchEvent:\r\n\t\tif InternalObjectPreventer.IsForbidden(reflect.Indirect(reflect.ValueOf(obj.Object.Object)).Type().PkgPath()) {\r\n\t\t\treturn fmt.Errorf(InternalObjectPrinterErr)\r\n\t\t}\r\n\t\t// 调用\"encoding/json\" package对对象进行格式化\r\n\t\tdata, err := json.Marshal(obj)\r\n\t\tif err != nil {\r\n\t\t\treturn err\r\n\t\t}\r\n\t\t_, err = w.Write(data)\r\n\t\tif err != nil {\r\n\t\t\treturn err\r\n\t\t}\r\n\t\t_, err = w.Write([]byte{'\\n'})\r\n\t\treturn err\r\n\tcase *runtime.Unknown:\r\n\t\tvar buf 
bytes.Buffer\r\n\t\terr := json.Indent(&buf, obj.Raw, \"\", \"    \")\r\n\t\tif err != nil {\r\n\t\t\treturn err\r\n\t\t}\r\n\t\tbuf.WriteRune('\\n')\r\n\t\t_, err = buf.WriteTo(w)\r\n\t\treturn err\r\n\t}\r\n\r\n\tif obj.GetObjectKind().GroupVersionKind().Empty() {\r\n\t\treturn fmt.Errorf(\"missing apiVersion or kind; try GetObjectKind().SetGroupVersionKind() if you know the type\")\r\n\t}\r\n\r\n\tdata, err := json.MarshalIndent(obj, \"\", \"    \")\r\n\tif err != nil {\r\n\t\treturn err\r\n\t}\r\n\tdata = append(data, '\\n')\r\n\t_, err = w.Write(data)\r\n\treturn err\r\n}\r\n```\r\n\r\n**YAMLPrinter** first marshals the object to JSON and then converts the JSON to YAML (json.Marshal followed by yaml.JSONToYAML).\r\n\r\nUsage: kubectl get pod -o yaml\r\n\r\n```\r\n// PrintObj prints the data as YAML.\r\nfunc (p *YAMLPrinter) PrintObj(obj runtime.Object, w io.Writer) error {\r\n\t// we use reflect.Indirect here in order to obtain the actual value from a pointer.\r\n\t// we need an actual value in order to retrieve the package path for an object.\r\n\t// using reflect.Indirect indiscriminately is valid here, as all runtime.Objects are supposed to be pointers.\r\n\tif InternalObjectPreventer.IsForbidden(reflect.Indirect(reflect.ValueOf(obj)).Type().PkgPath()) {\r\n\t\treturn fmt.Errorf(InternalObjectPrinterErr)\r\n\t}\r\n\r\n\tcount := atomic.AddInt64(&p.printCount, 1)\r\n\tif count > 1 {\r\n\t\tif _, err := w.Write([]byte(\"---\\n\")); err != nil {\r\n\t\t\treturn err\r\n\t\t}\r\n\t}\r\n\r\n\tswitch obj := obj.(type) {\r\n\tcase *metav1.WatchEvent:\r\n\t\tif InternalObjectPreventer.IsForbidden(reflect.Indirect(reflect.ValueOf(obj.Object.Object)).Type().PkgPath()) {\r\n\t\t\treturn fmt.Errorf(InternalObjectPrinterErr)\r\n\t\t}\r\n\t\t// marshal to JSON first, then convert to YAML below\r\n\t\tdata, err := json.Marshal(obj)\r\n\t\tif err != nil {\r\n\t\t\treturn err\r\n\t\t}\r\n\t\tdata, err = yaml.JSONToYAML(data)\r\n\t\tif err != nil {\r\n\t\t\treturn err\r\n\t\t}\r\n\t\t_, err = w.Write(data)\r\n\t\treturn err\r\n\tcase *runtime.Unknown:\r\n\t\tdata, err := 
yaml.JSONToYAML(obj.Raw)\r\n\t\tif err != nil {\r\n\t\t\treturn err\r\n\t\t}\r\n\t\t_, err = w.Write(data)\r\n\t\treturn err\r\n\t}\r\n\r\n\tif obj.GetObjectKind().GroupVersionKind().Empty() {\r\n\t\treturn fmt.Errorf(\"missing apiVersion or kind; try GetObjectKind().SetGroupVersionKind() if you know the type\")\r\n\t}\r\n\r\n\toutput, err := yaml.Marshal(obj)\r\n\tif err != nil {\r\n\t\treturn err\r\n\t}\r\n\t_, err = fmt.Fprint(w, string(output))\r\n\treturn err\r\n}\r\n```\r\n\r\n##### 2.2.2 NamePrinter\r\n\r\n就是只打印resource/name\r\n\r\neg:\r\n\r\n```\r\nroot@k8s-master:~# kubectl get pod -o name\r\npod/nginx\r\npod/nginx1\r\n```\r\n\r\n```\r\n// PrintObj is an implementation of ResourcePrinter.PrintObj which decodes the object\r\n// and print \"resource/name\" pair. If the object is a List, print all items in it.\r\nfunc (p *NamePrinter) PrintObj(obj runtime.Object, w io.Writer) error {\r\n\tswitch castObj := obj.(type) {\r\n\tcase *metav1.WatchEvent:\r\n\t\tobj = castObj.Object.Object\r\n\t}\r\n\r\n\t// we use reflect.Indirect here in order to obtain the actual value from a pointer.\r\n\t// using reflect.Indirect indiscriminately is valid here, as all runtime.Objects are supposed to be pointers.\r\n\t// we need an actual value in order to retrieve the package path for an object.\r\n\tif InternalObjectPreventer.IsForbidden(reflect.Indirect(reflect.ValueOf(obj)).Type().PkgPath()) {\r\n\t\treturn fmt.Errorf(InternalObjectPrinterErr)\r\n\t}\r\n\r\n\tif meta.IsListType(obj) {\r\n\t\t// we allow unstructured lists for now because they always contain the GVK information.  
We should chase down\r\n\t\t// callers and stop them from passing unflattened lists\r\n\t\t// TODO chase the caller that is setting this and remove it.\r\n\t\tif _, ok := obj.(*unstructured.UnstructuredList); !ok {\r\n\t\t\treturn fmt.Errorf(\"list types are not supported by name printing: %T\", obj)\r\n\t\t}\r\n\r\n\t\titems, err := meta.ExtractList(obj)\r\n\t\tif err != nil {\r\n\t\t\treturn err\r\n\t\t}\r\n\t\tfor _, obj := range items {\r\n\t\t\tif err := p.PrintObj(obj, w); err != nil {\r\n\t\t\t\treturn err\r\n\t\t\t}\r\n\t\t}\r\n\t\treturn nil\r\n\t}\r\n\r\n\tif obj.GetObjectKind().GroupVersionKind().Empty() {\r\n\t\treturn fmt.Errorf(\"missing apiVersion or kind; try GetObjectKind().SetGroupVersionKind() if you know the type\")\r\n\t}\r\n\r\n\tname := \"<unknown>\"\r\n\tif acc, err := meta.Accessor(obj); err == nil {\r\n\t\tif n := acc.GetName(); len(n) > 0 {\r\n\t\t\tname = n\r\n\t\t}\r\n\t}\r\n\r\n\treturn printObj(w, name, p.Operation, p.ShortOutput, GetObjectGroupKind(obj))\r\n}\r\n```\r\n\r\n\r\n\r\n##### 2.2.3 GoTemplatePrinter\r\n\r\n```\r\nfunc (f *KubeTemplatePrintFlags) ToPrinter(outputFormat string) (printers.ResourcePrinter, error) {\r\n   if f == nil {\r\n      return nil, NoCompatiblePrinterError{}\r\n   }\r\n\r\n   if p, err := f.JSONPathPrintFlags.ToPrinter(outputFormat); !IsNoCompatiblePrinterError(err) {\r\n      return p, err\r\n   }\r\n   return f.GoTemplatePrintFlags.ToPrinter(outputFormat)\r\n}\r\n\r\n// PrintObj formats the obj with the Go Template.\r\nfunc (p *GoTemplatePrinter) PrintObj(obj runtime.Object, w io.Writer) error {\r\n\tif InternalObjectPreventer.IsForbidden(reflect.Indirect(reflect.ValueOf(obj)).Type().PkgPath()) {\r\n\t\treturn fmt.Errorf(InternalObjectPrinterErr)\r\n\t}\r\n\r\n\tvar data []byte\r\n\tvar err error\r\n\tdata, err = json.Marshal(obj)\r\n\tif err != nil {\r\n\t\treturn err\r\n\t}\r\n\r\n\tout := map[string]interface{}{}\r\n\tif err := json.Unmarshal(data, &out); err != nil {\r\n\t\treturn 
err\r\n\t}\r\n\tif err = p.safeExecute(w, out); err != nil {\r\n\t\t// It is way easier to debug this stuff when it shows up in\r\n\t\t// stdout instead of just stdin. So in addition to returning\r\n\t\t// a nice error, also print useful stuff with the writer.\r\n\t\tfmt.Fprintf(w, \"Error executing template: %v. Printing more information for debugging the template:\\n\", err)\r\n\t\tfmt.Fprintf(w, \"\\ttemplate was:\\n\\t\\t%v\\n\", p.rawTemplate)\r\n\t\tfmt.Fprintf(w, \"\\traw data was:\\n\\t\\t%v\\n\", string(data))\r\n\t\tfmt.Fprintf(w, \"\\tobject given to template engine was:\\n\\t\\t%+v\\n\\n\", out)\r\n\t\treturn fmt.Errorf(\"error executing template %q: %v\", p.rawTemplate, err)\r\n\t}\r\n\treturn nil\r\n}\r\n```\r\n\r\n### 3.总结\r\n\r\n（1）We have seen first-hand how powerful kubectl get's output formatting is\r\n\r\n（2）This approach is a good reference for building your own client CLI: populate an options struct from flags, define different Printers according to those options, and finally produce the customized output"
  },
  {
    "path": "k8s/kubectl/9-kubectl create整体流程分析.md",
    "content": "Table of Contents\n=================\n\n  * [1. kubectl create 命令定义](#1-kubectl-create-命令定义)\n     * [1.1 editBeforeCreate](#11-editbeforecreate)\n     * [1.2 AddApplyAnnotationFlags](#12-addapplyannotationflags)\n  * [2.RunCreate](#2runcreate)\n     * [2.1 CreateOptions](#21-createoptions)\n        * [(1) 直接创建pod](#1-直接创建pod)\n        * [(2) 加入--output=yaml参数](#2-加入--outputyaml参数)\n        * [(3) record 会在创建对象中的annotation中记录](#3-record-会在创建对象中的annotation中记录)\n        * [(4) --raw](#4---raw)\n  * [2.2 RunCreate源码分析](#22-runcreate源码分析)\n\nThis article takes the kubectl create command as an example and ties together all the kubectl ideas analyzed so far.\n\n### 1. kubectl create 命令定义\n\nThe create command is defined under kubectl's Basic Commands (Beginner).\n\nFrom its definition, the usage of kubectl create is kubectl create -f filename, which creates resource objects.\n\nThe main logic of kubectl create is as follows:\n\n（1）-f or -k must be specified\n\n（2）many flags are added, such as dry-run, editBeforeCreate, etc.\n\n（3）a set of subcommands is defined: kubectl create ns/job, etc.\n\n（4）the core of kubectl create is still the RunCreate function\n\n```\n// NewCmdCreate returns new initialized instance of create sub command\nfunc NewCmdCreate(f cmdutil.Factory, ioStreams genericclioptions.IOStreams) *cobra.Command {\n\to := NewCreateOptions(ioStreams)\n\n\tcmd := &cobra.Command{\n\t\tUse:                   \"create -f FILENAME\",\n\t\tDisableFlagsInUseLine: true,\n\t\tShort:                 i18n.T(\"Create a resource from a file or from stdin.\"),\n\t\tLong:                  createLong,\n\t\tExample:               createExample,\n\t\tRun: func(cmd *cobra.Command, args []string) {\n\t\t  // -f or -k must be specified\n\t\t\tif cmdutil.IsFilenameSliceEmpty(o.FilenameOptions.Filenames, o.FilenameOptions.Kustomize) {\n\t\t\t  // error message\n\t\t\t\tioStreams.ErrOut.Write([]byte(\"Error: must specify one of -f and -k\\n\\n\"))\n\t\t\t\t// if neither is given, DefaultSubCommandRun runs: it prompts you to specify one and prints the help\n\t\t\t\tdefaultRunFunc := cmdutil.DefaultSubCommandRun(ioStreams.ErrOut)\n\t\t\t\tdefaultRunFunc(cmd, args)\n\t\t\t\treturn\n\t\t\t}\n\t\t\t// complete and validate the options\n\t\t\tcmdutil.CheckErr(o.Complete(f, 
cmd))\n\t\t\tcmdutil.CheckErr(o.ValidateArgs(cmd, args))\n\t\t\t// RunCreate是真正的逻辑函数\n\t\t\tcmdutil.CheckErr(o.RunCreate(f, cmd))\n\t\t},\n\t}\n\n\t// bind flag structs\n\to.RecordFlags.AddFlags(cmd)\n\n\tusage := \"to use to create the resource\"\n\tcmdutil.AddFilenameOptionFlags(cmd, &o.FilenameOptions, usage)\n\tcmdutil.AddValidateFlags(cmd)\n\t\n\t// 1.实现了editBeforeCreate，详见1.1\n\tcmd.Flags().BoolVar(&o.EditBeforeCreate, \"edit\", o.EditBeforeCreate, \"Edit the API resource before creating\")\n\tcmd.Flags().Bool(\"windows-line-endings\", runtime.GOOS == \"windows\",\n\t\t\"Only relevant if --edit=true. Defaults to the line ending native to your platform.\")\n\t\n\t// 2.添加 AnnotationFlags flags，详见1.2\n\tcmdutil.AddApplyAnnotationFlags(cmd)\n\t// 3. 添加了dryrun, 如果dryrun的话，只会打印输出不会创建\n\tcmdutil.AddDryRunFlag(cmd)\n\tcmd.Flags().StringVarP(&o.Selector, \"selector\", \"l\", o.Selector, \"Selector (label query) to filter on, supports '=', '==', and '!='.(e.g. -l key1=value1,key2=value2)\")\n\tcmd.Flags().StringVar(&o.Raw, \"raw\", o.Raw, \"Raw URI to POST to the server.  Uses the transport specified by the kubeconfig file.\")\n   \n   // 3.1 绑定了printFlags\n\to.PrintFlags.AddFlags(cmd)\n\n\t// create subcommands\n\t// 4. 
定义了一堆子命令 kubectl create ns/secrete等等\n\tcmd.AddCommand(NewCmdCreateNamespace(f, ioStreams))\n\tcmd.AddCommand(NewCmdCreateQuota(f, ioStreams))\n\tcmd.AddCommand(NewCmdCreateSecret(f, ioStreams))\n\tcmd.AddCommand(NewCmdCreateConfigMap(f, ioStreams))\n\tcmd.AddCommand(NewCmdCreateServiceAccount(f, ioStreams))\n\tcmd.AddCommand(NewCmdCreateService(f, ioStreams))\n\tcmd.AddCommand(NewCmdCreateDeployment(f, ioStreams))\n\tcmd.AddCommand(NewCmdCreateClusterRole(f, ioStreams))\n\tcmd.AddCommand(NewCmdCreateClusterRoleBinding(f, ioStreams))\n\tcmd.AddCommand(NewCmdCreateRole(f, ioStreams))\n\tcmd.AddCommand(NewCmdCreateRoleBinding(f, ioStreams))\n\tcmd.AddCommand(NewCmdCreatePodDisruptionBudget(f, ioStreams))\n\tcmd.AddCommand(NewCmdCreatePriorityClass(f, ioStreams))\n\tcmd.AddCommand(NewCmdCreateJob(f, ioStreams))\n\tcmd.AddCommand(NewCmdCreateCronJob(f, ioStreams))\n\treturn cmd\n}\n```\n\n#### 1.1 editBeforeCreate\n\n可以看出来，editBeforeCreate，是先要你edit yaml, 然后才会创建。要是没有做出改动，是不会创建的！\n\n```\nroot@k8s-master:~# kubectl create -f pod.yaml --edit=true\nEdit cancelled, no changes made.\nroot@k8s-master:~#\nroot@k8s-master:~# kubectl get pod\nNo resources found in default namespace.\nroot@k8s-master:~#\nroot@k8s-master:~# kubectl get pod\nNo resources found in default namespace.\nroot@k8s-master:~# kubectl create -f pod.yaml --edit=true\npod/nginx1 created\nroot@k8s-master:~#\nroot@k8s-master:~# kubectl get pod\nNAME     READY   STATUS    RESTARTS   AGE\nnginx1   1/1     Running   0          3s\n```\n\n#### 1.2 AddApplyAnnotationFlags\n\n```\nroot@k8s-master:~# kubectl create -f pod.yaml --save-config=true\npod/nginx created\nroot@k8s-master:~# \nroot@k8s-master:~# kubectl get pod \nNAME    READY   STATUS    RESTARTS   AGE\nnginx   1/1     Running   0          4s\n// annotations里面有各种信息\nroot@k8s-master:~# kubectl get pod nginx -oyaml\napiVersion: v1\nkind: Pod\nmetadata:\n  annotations:\n    kubectl.kubernetes.io/last-applied-configuration: |\n      
{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"metadata\":{\"annotations\":{},\"name\":\"nginx\",\"namespace\":\"default\"},\"spec\":{\"containers\":[{\"command\":[\"sleep\",\"3600\"],\"image\":\"curlimages/curl:7.75.0\",\"name\":\"nginx\"}],\"nodeName\":\"k8s-node\",\"serviceAccountName\":\"sa-example\",\"terminationGracePeriodSeconds\":10}}\n  creationTimestamp: \"2021-11-04T14:32:30Z\"\n  name: nginx\n  namespace: default\n  resourceVersion: \"2416811\"\n  \nWhen creating without the flag:\nroot@k8s-master:~# kubectl create -f pod.yaml\npod/nginx created\nroot@k8s-master:~# kubectl get pod nginx -oyaml\napiVersion: v1\nkind: Pod\nmetadata:\n  creationTimestamp: \"2021-11-04T14:34:18Z\"\n  name: nginx\n  namespace: default\n  resourceVersion: \"2417067\"\n  selfLink: /api/v1/namespaces/default/pods/nginx\n  uid: 958b1d53-3035-4342-a048-321119d50a0b\nspec:\n```\n\n### 2.RunCreate\n\n#### 2.1 CreateOptions\n\nThis pattern recurs throughout k8s: all flags are ultimately packed into an Options struct (CreateOptions, DeleteOptions, and so on), and the subsequent behavior is driven by the fields of that Options struct.\n\n```\n// CreateOptions is the commandline options for 'create' sub command\ntype CreateOptions struct {\n   PrintFlags  *genericclioptions.PrintFlags      // how to print the created result: yaml/json, etc.\n   RecordFlags *genericclioptions.RecordFlags     // whether to record this create operation in the created object, see (3)\n\n   DryRun bool                   \n\n   FilenameOptions  resource.FilenameOptions    // filename-related options\n   Selector         string\n   EditBeforeCreate bool\n   Raw              string                // a raw API path exposed by the apiserver\n\n   Recorder genericclioptions.Recorder \n   PrintObj func(obj kruntime.Object) error     // prints the object returned after creation; by default just one line, e.g. pod/nginx created\n\n   genericclioptions.IOStreams\n}\n```\n\nTo understand this better, I added some log statements at the beginning of the RunCreate function.\n\n```\nk8s.io/kubectl/pkg/cmd/create/create.go\n// RunCreate performs the creation\nfunc (o *CreateOptions) RunCreate(f cmdutil.Factory, cmd *cobra.Command) error {\n\t// raw only makes sense for a single file resource multiple objects aren't likely to do what you want.\n\t// the validator enforces this, 
so\n\tklog.V(2).Infof(\"zoux CreateOptions is: %v\", o)\n\tklog.V(2).Infof(\"zoux FilenameOptions is: %v,%v,%v\", o.FilenameOptions.Filenames, o.FilenameOptions.Kustomize,o.FilenameOptions.Recursive)\n\tklog.V(2).Infof(\"zoux PrintFlags is: %v, %v,%v, %v, %v, %v, %v\", o.PrintFlags.JSONYamlPrintFlags, o.PrintFlags.OutputFormat, o.PrintFlags.NamePrintFlags, o.PrintFlags.OutputFlagSpecified, o.PrintFlags.OutputFormat, o.PrintFlags.TemplatePrinterFlags, o.PrintFlags.TypeSetterPrinter)\n\tklog.V(2).Infof(\"zoux PrintObj is: %v\", o.PrintObj)\n\tklog.V(2).Infof(\"zoux Recorder is: %v\", o.Recorder)\n\tklog.V(2).Infof(\"zoux RecordFlags is: %v\", o.RecordFlags.Record)\n```\n\n\n\n##### (1) 直接创建pod\n\n```\nroot@k8s-master:~# ./kubectl create -f pod.yaml -v 3\nI1105 11:19:31.373990    5031 create.go:215] zoux CreateOptions is: &{0xc000334ff0 0xc0002fc0c0 false {[pod.yaml]  false}  false  {} 0x14e0d80 {0xc0000b8000 0xc0000b8008 0xc0000b8010}}\nI1105 11:19:31.374118    5031 create.go:216] zoux FilenameOptions is: [pod.yaml],,false\nI1105 11:19:31.374142    5031 create.go:217] zoux PrintFlags is: , &{}, &{created}, 0x129b480, 0xc0002ae2d0, &{0xc0002ae300 0xc0002ae310 0xc0002fe22c 0xc0002ae2f0},&{0xc0002fdea8 0xc000269a40}\nI1105 11:19:31.374189    5031 create.go:218] zoux PrintObj is: 0x14e0d80\nI1105 11:19:31.374200    5031 create.go:219] zoux Recorder is: {}\nI1105 11:19:31.374212    5031 create.go:220] zoux RecordFlags is: 0xc0002fe22d\npod/nginx created\n```\n\n##### (2) 加入--output=yaml参数\n\n```\nroot@k8s-master:~# ./kubectl create -f pod.yaml -v 3\nI1105 11:19:31.373990    5031 create.go:215] zoux CreateOptions is: &{0xc000334ff0 0xc0002fc0c0 false {[pod.yaml]  false}  false  {} 0x14e0d80 {0xc0000b8000 0xc0000b8008 0xc0000b8010}}\nI1105 11:19:31.374118    5031 create.go:216] zoux FilenameOptions is: [pod.yaml],,false\nI1105 11:19:31.374142    5031 create.go:217] zoux PrintFlags is: , &{}, &{created}, 0x129b480, 0xc0002ae2d0, &{0xc0002ae300 0xc0002ae310 0xc0002fe22c 
0xc0002ae2f0},&{0xc0002fdea8 0xc000269a40}\nI1105 11:19:31.374189    5031 create.go:218] zoux PrintObj is: 0x14e0d80\nI1105 11:19:31.374200    5031 create.go:219] zoux Recorder is: {}\nI1105 11:19:31.374212    5031 create.go:220] zoux RecordFlags is: 0xc0002fe22d\npod/nginx created\nroot@k8s-master:~# \nroot@k8s-master:~# \nroot@k8s-master:~# ./kubectl create -f pod.yaml --output=yaml  -v 3\nI1105 11:20:37.221485    5527 create.go:215] zoux CreateOptions is: &{0xc0003b9e00 0xc0002fa0a8 false {[pod.yaml]  false}  false  {} 0x14e0d80 {0xc0000b8000 0xc0000b8008 0xc0000b8010}}\nI1105 11:20:37.221602    5527 create.go:216] zoux FilenameOptions is: [pod.yaml],,false\nI1105 11:20:37.221618    5527 create.go:217] zoux PrintFlags is: yaml, &{}, &{created}, 0x129b480, 0xc00041f2a0, &{0xc00041f2d0 0xc00041f2e0 0xc00003ebf8 0xc00041f2c0},&{0xc00003fb58 0xc00003b960}\nI1105 11:20:37.221648    5527 create.go:218] zoux PrintObj is: 0x14e0d80\nI1105 11:20:37.221656    5527 create.go:219] zoux Recorder is: {}\nI1105 11:20:37.221662    5527 create.go:220] zoux RecordFlags is: 0xc00003ebf9\napiVersion: v1\nkind: Pod\nmetadata:\n  creationTimestamp: \"2021-11-05T03:20:37Z\"\n  name: nginx\n  namespace: default\n  resourceVersion: \"2522253\"\n  selfLink: /api/v1/namespaces/default/pods/nginx\n  uid: 947964b7-c65b-4e05-b7e5-a9e83a1b4e20\nspec:\n  containers:\n  - command:\n    - sleep\n    - \"3600\"\n    image: curlimages/curl:7.75.0\n    imagePullPolicy: IfNotPresent\n    name: nginx\n    resources: {}\n    terminationMessagePath: /dev/termination-log\n    terminationMessagePolicy: File\n    volumeMounts:\n    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount\n      name: sa-example-token-lchv2\n      readOnly: true\n  dnsPolicy: ClusterFirst\n  enableServiceLinks: true\n  nodeName: k8s-node\n  priority: 0\n  restartPolicy: Always\n  schedulerName: default-scheduler\n  securityContext: {}\n  serviceAccount: sa-example\n  serviceAccountName: sa-example\n  
terminationGracePeriodSeconds: 10\n  tolerations:\n  - effect: NoExecute\n    key: node.kubernetes.io/not-ready\n    operator: Exists\n    tolerationSeconds: 300\n  - effect: NoExecute\n    key: node.kubernetes.io/unreachable\n    operator: Exists\n    tolerationSeconds: 300\n  volumes:\n  - name: sa-example-token-lchv2\n    secret:\n      defaultMode: 420\n      secretName: sa-example-token-lchv2\nstatus:\n  phase: Pending\n  qosClass: BestEffort\n```\n\n##### (3) record 会在创建对象中的annotation中记录\n\n```\nroot@k8s-master:~# ./kubectl create -f pod.yaml --record  -v 3\nI1105 11:22:41.875198    6398 create.go:215] zoux CreateOptions is: &{0xc00038ce40 0xc0002dc0a8 false {[pod.yaml]  false}  false  0xc0003ffd80 0x14e0d80 {0xc00000e010 0xc00000e018 0xc00000e020}}\nI1105 11:22:41.875342    6398 create.go:216] zoux FilenameOptions is: [pod.yaml],,false\nI1105 11:22:41.875369    6398 create.go:217] zoux PrintFlags is: , &{}, &{created}, 0x129b480, 0xc000448290, &{0xc0004482e0 0xc0004482f0 0xc00039379c 0xc0004482d0},&{0xc0002ddea8 0xc0002bfa40}\nI1105 11:22:41.875414    6398 create.go:218] zoux PrintObj is: 0x14e0d80\nI1105 11:22:41.875426    6398 create.go:219] zoux Recorder is: &{kubectl create --filename=pod.yaml --record=true --v=3}\nI1105 11:22:41.875445    6398 create.go:220] zoux RecordFlags is: 0xc00039379d\npod/nginx created\nroot@k8s-master:~# \nroot@k8s-master:~# kubectl get pod nginx -oyaml\napiVersion: v1\nkind: Pod\nmetadata:\n  annotations:\n    kubernetes.io/change-cause: kubectl create --filename=pod.yaml --record=true --v=3\n  creationTimestamp: \"2021-11-05T03:22:42Z\"\n  name: nginx\n  namespace: default\n  resourceVersion: \"2522555\"\n  selfLink: /api/v1/namespaces/default/pods/nginx\n  uid: a03f4733-0e5e-4ab6-a7ed-4aeae41d9740\n```\n\n##### (4) --raw\n\n```\n--raw='': Raw URI to POST to the server.  
Uses the transport specified by the kubeconfig file.\n看起来只支持这个\n{\n  \"paths\": [\n    \"/api\",\n    \"/api/v1\",\n    \"/apis\",\n    \"/apis/\",\n    \"/apis/admissionregistration.k8s.io\",\n    \"/apis/admissionregistration.k8s.io/v1\",\n    \"/apis/admissionregistration.k8s.io/v1beta1\",\n    \"/apis/apiextensions.k8s.io\",\n    \"/apis/apiextensions.k8s.io/v1\",\n    \"/apis/apiextensions.k8s.io/v1beta1\",\n    \"/apis/apiregistration.k8s.io\",\n    \"/apis/apiregistration.k8s.io/v1\",\n    \"/apis/apiregistration.k8s.io/v1beta1\",\n    \"/apis/apps\",\n    \"/apis/apps/v1\",\n    \"/apis/authentication.k8s.io\",\n    \"/apis/authentication.k8s.io/v1\",\n    \"/apis/authentication.k8s.io/v1beta1\",\n    \"/apis/authorization.k8s.io\",\n    \"/apis/authorization.k8s.io/v1\",\n    \"/apis/authorization.k8s.io/v1beta1\",\n    \"/apis/autoscaling\",\n    \"/apis/autoscaling/v1\",\n    \"/apis/autoscaling/v2beta1\",\n    \"/apis/autoscaling/v2beta2\",\n    \"/apis/batch\",\n    \"/apis/batch/v1\",\n    \"/apis/batch/v1beta1\",\n    \"/apis/certificates.k8s.io\",\n    \"/apis/certificates.k8s.io/v1beta1\",\n    \"/apis/coordination.k8s.io\",\n    \"/apis/coordination.k8s.io/v1\",\n    \"/apis/coordination.k8s.io/v1beta1\",\n    \"/apis/discovery.k8s.io\",\n    \"/apis/discovery.k8s.io/v1beta1\",\n    \"/apis/events.k8s.io\",\n    \"/apis/events.k8s.io/v1beta1\",\n    \"/apis/extensions\",\n    \"/apis/extensions/v1beta1\",\n    \"/apis/networking.k8s.io\",\n    \"/apis/networking.k8s.io/v1\",\n    \"/apis/networking.k8s.io/v1beta1\",\n    \"/apis/node.k8s.io\",\n    \"/apis/node.k8s.io/v1beta1\",\n    \"/apis/policy\",\n    \"/apis/policy/v1beta1\",\n    \"/apis/rbac.authorization.k8s.io\",\n    \"/apis/rbac.authorization.k8s.io/v1\",\n    \"/apis/rbac.authorization.k8s.io/v1beta1\",\n    \"/apis/scheduling.k8s.io\",\n    \"/apis/scheduling.k8s.io/v1\",\n    \"/apis/scheduling.k8s.io/v1beta1\",\n    \"/apis/storage.k8s.io\",\n    \"/apis/storage.k8s.io/v1\",\n    
\"/apis/storage.k8s.io/v1beta1\",\n    \"/healthz\",\n    \"/healthz/autoregister-completion\",\n    \"/healthz/etcd\",\n    \"/healthz/log\",\n    \"/healthz/ping\",\n    \"/healthz/poststarthook/apiservice-openapi-controller\",\n    \"/healthz/poststarthook/apiservice-registration-controller\",\n    \"/healthz/poststarthook/apiservice-status-available-controller\",\n    \"/healthz/poststarthook/bootstrap-controller\",\n    \"/healthz/poststarthook/crd-informer-synced\",\n    \"/healthz/poststarthook/generic-apiserver-start-informers\",\n    \"/healthz/poststarthook/kube-apiserver-autoregistration\",\n    \"/healthz/poststarthook/rbac/bootstrap-roles\",\n    \"/healthz/poststarthook/scheduling/bootstrap-system-priority-classes\",\n    \"/healthz/poststarthook/start-apiextensions-controllers\",\n    \"/healthz/poststarthook/start-apiextensions-informers\",\n    \"/healthz/poststarthook/start-cluster-authentication-info-controller\",\n    \"/healthz/poststarthook/start-kube-aggregator-informers\",\n    \"/healthz/poststarthook/start-kube-apiserver-admission-initializer\",\n    \"/livez\",\n    \"/livez/autoregister-completion\",\n    \"/livez/etcd\",\n    \"/livez/log\",\n    \"/livez/ping\",\n    \"/livez/poststarthook/apiservice-openapi-controller\",\n    \"/livez/poststarthook/apiservice-registration-controller\",\n    \"/livez/poststarthook/apiservice-status-available-controller\",\n    \"/livez/poststarthook/bootstrap-controller\",\n    \"/livez/poststarthook/crd-informer-synced\",\n    \"/livez/poststarthook/generic-apiserver-start-informers\",\n    \"/livez/poststarthook/kube-apiserver-autoregistration\",\n    \"/livez/poststarthook/rbac/bootstrap-roles\",\n    \"/livez/poststarthook/scheduling/bootstrap-system-priority-classes\",\n    \"/livez/poststarthook/start-apiextensions-controllers\",\n    \"/livez/poststarthook/start-apiextensions-informers\",\n    \"/livez/poststarthook/start-cluster-authentication-info-controller\",\n    
\"/livez/poststarthook/start-kube-aggregator-informers\",\n    \"/livez/poststarthook/start-kube-apiserver-admission-initializer\",\n    \"/logs\",\n    \"/metrics\",\n    \"/openapi/v2\",\n    \"/readyz\",\n    \"/readyz/autoregister-completion\",\n    \"/readyz/etcd\",\n    \"/readyz/log\",\n    \"/readyz/ping\",\n    \"/readyz/poststarthook/apiservice-openapi-controller\",\n    \"/readyz/poststarthook/apiservice-registration-controller\",\n    \"/readyz/poststarthook/apiservice-status-available-controller\",\n    \"/readyz/poststarthook/bootstrap-controller\",\n    \"/readyz/poststarthook/crd-informer-synced\",\n    \"/readyz/poststarthook/generic-apiserver-start-informers\",\n    \"/readyz/poststarthook/kube-apiserver-autoregistration\",\n    \"/readyz/poststarthook/rbac/bootstrap-roles\",\n    \"/readyz/poststarthook/scheduling/bootstrap-system-priority-classes\",\n    \"/readyz/poststarthook/start-apiextensions-controllers\",\n    \"/readyz/poststarthook/start-apiextensions-informers\",\n    \"/readyz/poststarthook/start-cluster-authentication-info-controller\",\n    \"/readyz/poststarthook/start-kube-aggregator-informers\",\n    \"/readyz/poststarthook/start-kube-apiserver-admission-initializer\",\n    \"/readyz/shutdown\",\n    \"/version\"\n  ]\n}\n```\n\n\n<br>\n\n### 2.2 RunCreate源码分析\n\n```\n// RunCreate performs the creation\nfunc (o *CreateOptions) RunCreate(f cmdutil.Factory, cmd *cobra.Command) error {\n\t// raw only makes sense for a single file resource multiple objects aren't likely to do what you want.\n\t// the validator enforces this, so\n\t// 1.如果指定了url（apiserver暴露的restful路径）， 直接发送到这个url处理\n\tif len(o.Raw) > 0 {\n\t\trestClient, err := f.RESTClient()\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\treturn rawhttp.RawPost(restClient, o.IOStreams, o.Raw, o.FilenameOptions.Filenames[0])\n\t}\n  \n  // 2.是否指定了EditBeforeCreate\n\tif o.EditBeforeCreate {\n\t\treturn RunEditOnCreate(f, o.PrintFlags, o.RecordFlags, o.IOStreams, cmd, 
&o.FilenameOptions)\n\t}\n\t\n\t// \n\tschema, err := f.Validator(cmdutil.GetFlagBool(cmd, \"validate\"))\n\tif err != nil {\n\t\treturn err\n\t}\n\n\tcmdNamespace, enforceNamespace, err := f.ToRawKubeConfigLoader().Namespace()\n\tif err != nil {\n\t\treturn err\n\t}\n    \n    \n    // 3.这个是关键函数。先补充一波基础知识再额外分析\n\tr := f.NewBuilder().\n\t\tUnstructured().\n\t\tSchema(schema).\n\t\tContinueOnError().\n\t\tNamespaceParam(cmdNamespace).DefaultNamespace().\n\t\tFilenameParam(enforceNamespace, &o.FilenameOptions).\n\t\tLabelSelectorParam(o.Selector).\n\t\tFlatten().\n\t\tDo()\n\terr = r.Err()\n\tif err != nil {\n\t\treturn err\n\t}\n\n\tcount := 0\n\terr = r.Visit(func(info *resource.Info, err error) error {\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tif err := util.CreateOrUpdateAnnotation(cmdutil.GetFlagBool(cmd, cmdutil.ApplyAnnotationsFlag), info.Object, scheme.DefaultJSONEncoder()); err != nil {\n\t\t\treturn cmdutil.AddSourceToErr(\"creating\", info.Source, err)\n\t\t}\n\n\t\tif err := o.Recorder.Record(info.Object); err != nil {\n\t\t\tklog.V(4).Infof(\"error recording current command: %v\", err)\n\t\t}\n\n\t\tif !o.DryRun {\n\t\t\tif err := createAndRefresh(info); err != nil {\n\t\t\t\treturn cmdutil.AddSourceToErr(\"creating\", info.Source, err)\n\t\t\t}\n\t\t}\n\n\t\tcount++\n        \n        \n        // 格式化输出\n\t\treturn o.PrintObj(info.Object)\n\t})\n\tif err != nil {\n\t\treturn err\n\t}\n\tif count == 0 {\n\t\treturn fmt.Errorf(\"no objects passed to create\")\n\t}\n\treturn nil\n}\n```\n\nRunCreate函数通过`f.NewBuilder().XX.XX.Do`定义了这些visitor：DecoratedVisitor，ContinueOnErrorVisitor，FlattenListVisitor， FlattenListVisitor，EagerVisitorList FileVisitor， StreamVisitor。用于处理info\n\n然后o.PrintObj 通过之前的printFlags定义好了printer，格式化输出创建好的obj\n\n### 3. 
总结\n\n（1）kubectl create 必须指定 -f 或者 -k\n\n（2）增加了很多 flag，比如 --dry-run、--edit（EditBeforeCreate）等等\n\n（3）定义了一堆 subcommands，kubectl create ns/job 等等\n\n（4）kubectl create 的核心还是 RunCreate 函数\n\n（5）RunCreate 函数通过 `f.NewBuilder().XX.XX.Do()` 定义了这些 visitor：DecoratedVisitor、ContinueOnErrorVisitor、FlattenListVisitor、EagerVisitorList、FileVisitor、StreamVisitor 等，用于处理 info\n\n（6）o.PrintObj 通过之前的 printFlags 定义好的 printer，格式化输出创建好的 obj\n"
  },
  {
    "path": "k8s/kubelet/0-readme.md",
    "content": "本章节基于1.17.4版本的kubelet代码。力求从源码角度了解：\n\n（1）kubelet的启动过程\n\n（2）创建/删除/更新 pod 的整个流程\n\n（3）以此为基础了解 CSI、CNI、CRI 相关知识"
  },
  {
    "path": "k8s/kubelet/1-kubelet 架构浅析.md",
    "content": "Table of Contents\r\n\r\n  * [1. 概要](#1-概要)\r\n  * [2. kubelet 的主要功能](#2-kubelet-的主要功能)\r\n     * [2.1 kubelet 默认监听四个端口，分别为 10250 、10255、10248、4194。](#21-kubelet-默认监听四个端口分别为-10250-10255102484194)\r\n     * [2.2 kubelet 主要功能：](#22-kubelet-主要功能)\r\n  * [3. kubelet的架构](#3-kubelet的架构)\r\n  * [4. 参考文章](#4-参考文章)\r\n\r\n### 1. 概要\r\n\r\nkubelet 是运行在每个节点上的主要的“节点代理”，每个节点都会启动 kubelet 进程，用来处理 Master 节点下发到本节点的任务，按照 PodSpec 描述来管理 Pod 和其中的容器（PodSpec 是用来描述一个 pod 的 YAML 或者 JSON 对象）。\r\n\r\nkubelet 通过各种机制（主要通过 apiserver ）获取一组 PodSpec 并保证在这些 PodSpec 中描述的容器健康运行。\r\n\r\n<br>\r\n\r\n### 2. kubelet 的主要功能\r\n\r\n#### 2.1 kubelet 默认监听四个端口，分别为 10250 、10255、10248、4194。\r\n\r\n- 10250 --port：kubelet 服务监听的端口，apiserver 会检测它是否存活。\r\n- 10248 --healthz-port：健康检查服务的端口。\r\n- 10255 --read-only-port：只读端口，可以不用验证和授权机制，直接访问。\r\n- 4194 --cadvisor-port：cAdvisor 监听的端口（较新版本中该参数已被移除）。\r\n\r\n<br>\r\n\r\n**10250（kubelet API）**：kubelet server 与 apiserver 通信的端口，定期请求 apiserver 获取自己所应当处理的任务，通过该端口可以访问获取 node 资源以及状态。比如：\r\n\r\n**注意：** 在 master 节点上，或者其他 node 上执行也是可以访问的：` curl -k https://127.0.0.1:10250/stats/summary`\r\n\r\n```\r\nroot@node:home/zoux# curl -k https://127.0.0.1:10250/stats/summary\r\n{\r\n \"node\": {\r\n  \"nodeName\": \"10.248.34.20\",\r\n  \"systemContainers\": [\r\n   {\r\n    \"name\": \"kubelet\",\r\n    \"startTime\": \"2021-01-22T01:13:44Z\",\r\n    \"cpu\": {\r\n     \"time\": \"2021-02-22T02:11:01Z\",\r\n     \"usageNanoCores\": 110007173,\r\n     \"usageCoreNanoSeconds\": 1957269260250881\r\n    },\r\n    \"memory\": {\r\n     \"time\": \"2021-02-22T02:11:01Z\",\r\n     \"usageBytes\": 26810236928,\r\n     \"workingSetBytes\": 9253462016,\r\n     \"rssBytes\": 22356553728,\r\n     \"pageFaults\": 25889794623,\r\n     \"majorPageFaults\": 6567\r\n    }\r\n   },\r\n   {\r\n    \"name\": \"runtime\",\r\n    \"startTime\": \"2021-01-20T14:03:17Z\",\r\n    \"cpu\": {\r\n     \"time\": \"2021-02-22T02:11:06Z\",\r\n     \"usageNanoCores\": 17624832,\r\n     \"usageCoreNanoSeconds\": 803524762604160\r\n    },\r\n    \"memory\": {\r\n     \"time\": 
\"2021-02-22T02:11:06Z\",\r\n     \"usageBytes\": 482652160,\r\n     \"workingSetBytes\": 378077184,\r\n     \"rssBytes\": 138530816,\r\n     \"pageFaults\": 9593704428,\r\n     \"majorPageFaults\": 231\r\n    }\r\n   }\r\n  }\r\n```\r\n\r\ncAdvisor 监听\r\n\r\n```\r\n  curl -k https://127.0.0.1:10250/metrics/cadvisor\r\n```\r\n\r\n- 10248（健康检查端口）：通过访问该端口可以判断 kubelet 是否正常工作, 通过 kubelet 的启动参数 `--healthz-port` 和 `--healthz-bind-address` 来指定监听的地址和端口。\r\n\r\n  ```\r\n  $ curl http://127.0.0.1:10248/healthz\r\n  ok\r\n  ```\r\n\r\n- 10255 （readonly API）：提供了 pod 和 node 的信息，接口以只读形式暴露出去，访问该端口不需要认证和鉴权。\r\n\r\n  ```\r\n  root@k8s-node:~# curl  http://127.0.0.1:10255/pods\r\n  {\"kind\":\"PodList\",\"apiVersion\":\"v1\",\"metadata\":{},\"items\":[{\"metadata\":{\"name\":\"kube-flannel-ds-97qn4\",\"generateName\":\"kube-flannel-ds-\",\"namespace\":\"kube-system\",\"selfLink\":\"/api/v1/namespaces/kube-s]\r\n  ....\r\n  343294ac385c400b076a0d0c62979909cede65e90b2a0d8615ddba36c19cd\"}},\"ready\":true,\"restartCount\":10,\"image\":\"quay.io/coreos/flannel:v0.15.1\",\"imageID\":\"docker-pullable://quay.io/coreos/flannel@sha256:9a296fbb67790659adc3701e287adde3c59803b7fcefe354f1fc482840cdb3d9\",\"containerID\":\"docker://8c397ea4bc0ab5f8b68255be78593bb5b05b73174ed858848576ef0ce8702292\",\"started\":true}],\"qosClass\":\"Burstable\"}}]}\r\n  ```\r\n\r\n**注意：**\r\n\r\n**以上都是默认的**，代码中都能找到，cmd\\kubeadm\\app\\constants\\constants.go。\r\n\r\n<br>\r\n\r\n#### 2.2 kubelet 主要功能：\r\n\r\n- pod 管理：kubelet 定期从所监听的数据源获取节点上 pod/container 的期望状态（运行什么容器、运行的副本数量、网络或者存储如何配置等等），并调用对应的容器平台接口达到这个状态。\r\n- 容器健康检查：kubelet 创建了容器之后还要查看容器是否正常运行，如果容器运行出错，就要根据 pod 设置的重启策略进行处理。\r\n- 容器监控：kubelet 会监控所在节点的资源使用情况，并定时向 master 报告，资源使用数据都是通过 cAdvisor 获取的。知道整个集群所有节点的资源情况，对于 pod 的调度和正常运行至关重要。\r\n\r\n<br>\r\n\r\n### 3. 
kubelet的架构\r\n\r\n![kubelet-struct](../images/kubelet-struct.png)\r\n\r\n这一部分包括图片，摘自 https://zhuanlan.zhihu.com/p/338462784\r\n\r\n上图展示了 kubelet 组件中的模块以及模块间的划分。\r\n\r\n- 1、PLEG(Pod Lifecycle Event Generator） PLEG 是 kubelet 的核心模块,PLEG 会一直调用 container runtime 获取本节点 containers/sandboxes 的信息，并与自身维护的 pods cache 信息进行对比，生成对应的 PodLifecycleEvent，然后输出到 eventChannel 中，通过 eventChannel 发送到 kubelet syncLoop 进行消费，然后由 kubelet syncPod 来触发 pod 同步处理过程，最终达到用户的期望状态。\r\n- 2、cAdvisor cAdvisor（https://github.com/google/cadvisor）是 google 开发的容器监控工具，集成在 kubelet 中，起到收集本节点和容器的监控信息，大部分公司对容器的监控数据都是从 cAdvisor 中获取的 ，cAvisor 模块对外提供了 interface 接口，该接口也被 imageManager，OOMWatcher，containerManager 等所使用。\r\n- 3、OOMWatcher 系统 OOM 的监听器，会与 cadvisor 模块之间建立 SystemOOM,通过 Watch方式从 cadvisor 那里收到的 OOM 信号，并产生相关事件。\r\n- 4、probeManager probeManager 依赖于 statusManager,livenessManager,containerRefManager，会定时去监控 pod 中容器的健康状况，当前支持两种类型的探针：livenessProbe 和readinessProbe。 livenessProbe：用于判断容器是否存活，如果探测失败，kubelet 会 kill 掉该容器，并根据容器的重启策略做相应的处理。 readinessProbe：用于判断容器是否启动完成，将探测成功的容器加入到该 pod 所在 service 的 endpoints 中，反之则移除。readinessProbe 和 livenessProbe 有三种实现方式：http、tcp 以及 cmd。\r\n- 5、statusManager statusManager 负责维护状态信息，并把 pod 状态更新到 apiserver，但是它并不负责监控 pod 状态的变化，而是提供对应的接口供其他组件调用，比如 probeManager。\r\n- 6、containerRefManager 容器引用的管理，相对简单的Manager，用来报告容器的创建，失败等事件，通过定义 map 来实现了 containerID 与 v1.ObjectReferece 容器引用的映射。\r\n- 7、evictionManager 当节点的内存、磁盘或 inode 等资源不足时，达到了配置的 evict 策略， node 会变为 pressure 状态，此时 kubelet 会按照 qosClass 顺序来驱赶 pod，以此来保证节点的稳定性。可以通过配置 kubelet 启动参数 `--eviction-hard=` 来决定 evict 的策略值。\r\n- 8、imageGC imageGC 负责 node 节点的镜像回收，当本地的存放镜像的本地磁盘空间达到某阈值的时候，会触发镜像的回收，删除掉不被 pod 所使用的镜像，回收镜像的阈值可以通过 kubelet 的启动参数 `--image-gc-high-threshold` 和 `--image-gc-low-threshold` 来设置。\r\n- 9、containerGC containerGC 负责清理 node 节点上已消亡的 container，具体的 GC 操作由runtime 来实现。\r\n- 10、imageManager 调用 kubecontainer 提供的PullImage/GetImageRef/ListImages/RemoveImage/ImageStates 方法来保证pod 运行所需要的镜像。\r\n- 11、volumeManager 负责 node 节点上 pod 所使用 volume 的管理，volume 与 pod 
的生命周期关联，负责 pod 创建删除过程中 volume 的 mount/umount/attach/detach 流程，kubernetes 采用 volume Plugins 的方式，实现存储卷的挂载等操作，内置几十种存储插件。\r\n- 12、containerManager 负责 node 节点上运行的容器的 cgroup 配置信息，kubelet 启动参数如果指定 `--cgroups-per-qos` 的时候，kubelet 会启动 goroutine 来周期性的更新 pod 的 cgroup 信息，维护其正确性，该参数默认为 `true`，实现了 pod 的Guaranteed/BestEffort/Burstable 三种级别的 Qos。\r\n- 13、runtimeManager containerRuntime 负责 kubelet 与不同的 runtime 实现进行对接，实现对于底层 container 的操作，初始化之后得到的 runtime 实例将会被之前描述的组件所使用。可以通过 kubelet 的启动参数 `--container-runtime` 来定义是使用docker 还是 rkt，默认是 `docker`。\r\n- 14、podManager podManager 提供了接口来存储和访问 pod 的信息，维持 static pod 和 mirror pods 的关系，podManager 会被statusManager/volumeManager/runtimeManager 所调用，podManager 的接口处理流程里面会调用 secretManager 以及 configMapManager。\r\n\r\n<br>\r\n\r\n### 4. 参考文章\r\n\r\nhttps://zhuanlan.zhihu.com/p/338462784\r\n\r\nhttps://www.bookstack.cn/read/source-code-reading-notes/kubernetes-kubelet-modules.md"
  },
  {
    "path": "k8s/kubelet/10-k8s驱逐机制汇总.md",
    "content": "### 1. 驱逐\n\nEviction 即驱逐，指当节点出现异常时，为了保证工作负载的可用性，kubernetes 会通过相应的机制驱逐该节点上的 Pod。\n\n### 2. 驱逐类型\n\n目前有 4 个主要的驱逐场景，分别是手工驱逐、节点压力驱逐、污点驱逐、pod 抢占驱逐。一般而言主要关注的是节点压力导致的驱逐。\n\n#### 2.1 手工驱逐\n\n可以使用 `drain` 手工排空当前的计算节点。在一般实践中都是先禁止调度，而后才是排空当前节点的 pod。\n\n```\nroot# kubectl drain nodeXX\nnode/nodeXX already cordoned\nevicting pod \"xx\"\npod/xx evicted\nnode/nodeXX evicted\n```\n\n手动驱逐是 kubectl 侧先把节点设置为不可调度，然后驱逐/删除节点上的所有 pod。\n\n这里可以通过查看源代码和实验验证。核心代码：\n\n```\ndrain.NewCmdDrain(f, ioStreams)\n\n// RunDrain runs the 'drain' command\nfunc (o *DrainCmdOptions) RunDrain() error {\n\tif err := o.RunCordonOrUncordon(true); err != nil {\n\t\treturn err\n\t}\n\n\tprintObj, err := o.ToPrinter(\"drained\")\n\tif err != nil {\n\t\treturn err\n\t}\n\n\tdrainedNodes := sets.NewString()\n\tvar fatal error\n\n\tfor _, info := range o.nodeInfos {\n\t\tvar err error\n\t\tif !o.drainer.DryRun {\n\t\t\terr = o.deleteOrEvictPodsSimple(info)\n\t\t}\n\t\tif err == nil || o.drainer.DryRun {\n\t\t\tdrainedNodes.Insert(info.Name)\n\t\t\tprintObj(info.Object, o.Out)\n\t\t} else {\n\t\t\tfmt.Fprintf(o.ErrOut, \"error: unable to drain node %q, aborting command...\\n\\n\", info.Name)\n\t\t\tremainingNodes := []string{}\n\t\t\tfatal = err\n\t\t\tfor _, remainingInfo := range o.nodeInfos {\n\t\t\t\tif drainedNodes.Has(remainingInfo.Name) {\n\t\t\t\t\tcontinue\n\t\t\t\t}\n\t\t\t\tremainingNodes = append(remainingNodes, remainingInfo.Name)\n\t\t\t}\n\n\t\t\tif len(remainingNodes) > 0 {\n\t\t\t\tfmt.Fprintf(o.ErrOut, \"There are pending nodes to be drained:\\n\")\n\t\t\t\tfor _, nodeName := range remainingNodes {\n\t\t\t\t\tfmt.Fprintf(o.ErrOut, \" %s\\n\", nodeName)\n\t\t\t\t}\n\t\t\t}\n\t\t\tbreak\n\t\t}\n\t}\n\n\treturn fatal\n}\n```\n\n额外补充：单纯使用 `cordon` 的时候，并**不会**对已经存在于这个节点上的 pod 产生驱逐。
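\n\n上面 drain 与 cordon 的区别可以抽象成下面这个纯逻辑的 sketch。注意：这里的 node 结构体和 cordon/drain/evict 函数都是为了示意而假设的，并非 kubectl 源码，仅用于理解“drain = cordon + 逐个驱逐 pod，而 cordon 本身不驱逐”这一控制流（代码中刻意全部使用反引号字符串，包括 import 路径，这在 Go 中是合法的）：\n
```go
package main

import (
	`fmt`
)

// node 是假设的最小节点模型，仅用于示意
type node struct {
	name          string
	unschedulable bool
	pods          []string
}

// cordon 仅把节点标记为不可调度（对应 NoSchedule 污点），不会动已有 pod
func cordon(n *node) {
	n.unschedulable = true
}

// drain 模拟 RunDrain 的顺序：先 cordon，再逐个驱逐节点上的 pod；
// 任一 pod 驱逐失败则中止并返回错误
func drain(n *node, evict func(pod string) error) error {
	cordon(n)
	for len(n.pods) > 0 {
		if err := evict(n.pods[0]); err != nil {
			return fmt.Errorf(`unable to drain node %v: %v`, n.name, err)
		}
		n.pods = n.pods[1:]
	}
	return nil
}

func main() {
	a := &node{name: `node-a`, pods: []string{`p1`, `p2`}}
	cordon(a)
	// cordon 之后：节点不可调度，但 pod 仍然留在节点上
	fmt.Println(a.unschedulable, len(a.pods))

	b := &node{name: `node-b`, pods: []string{`p1`, `p2`}}
	// drain 之后：pod 被全部驱逐
	fmt.Println(drain(b, func(string) error { return nil }), len(b.pods))
}
```
\n这与上面 RunDrain 先调用 RunCordonOrUncordon、再对每个节点调用 deleteOrEvictPodsSimple 的顺序是一致的。\n\n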
NoSchedule 只影响调度器的调度行为，NoExecute 才会导致驱逐。\n\n```\nroot# kubectl  cordon xx\nnode/xx cordoned\nspec:\n  taints:\n  - effect: NoSchedule\n    key: node.kubernetes.io/unschedulable\n    timeAdded: \"xxx\"\n  unschedulable: true\n```\n\n#### 2.2 压力驱逐-kubelet驱逐\n\n参考 [kubelet驱逐原理分析](https://github.com/zoux86/learning-k8s-source-code/blob/master/k8s/kubelet/9-kubelet%E9%A9%B1%E9%80%90%E6%BA%90%E7%A0%81%E5%88%86%E6%9E%90.md)\n\n#### 2.3 污点驱逐\n\n参考 [10-kcm-NodeLifecycleController源码分析](https://github.com/zoux86/learning-k8s-source-code/blob/master/k8s/kcm/10-kcm-NodeLifecycleController%E6%BA%90%E7%A0%81%E5%88%86%E6%9E%90.md)\n\n#### 2.4 pod抢占驱逐\n\nscheduler 开启抢占的时候用到，后面再分析。"
  },
  {
    "path": "k8s/kubelet/2-kubelet初始化流程-上.md",
    "content": "Table of Contents\r\n=================\r\n\r\n  * [1. kubelet入口函数](#1-kubelet入口函数)\r\n  * [2. NewKubeletCommand函数](#2-newkubeletcommand函数)\r\n     * [2.1 总结](#21-总结)\r\n  * [3. Run(kubeletServer, kubeletDeps, stopCh) 函数](#3-runkubeletserver-kubeletdeps-stopch-函数)\r\n     * [3.1. KubeletServer  函数](#31-kubeletserver--函数)\r\n     * [3.2. Dependencies 函数](#32-dependencies-函数)\r\n  * [4. run(s *options.KubeletServer, kubeDeps *kubelet.Dependencies, stopCh &lt;-chan struct{})](#4-runs-optionskubeletserver-kubedeps-kubeletdependencies-stopch--chan-struct)\r\n  * [5. RunKubelet(s, kubeDeps, s.RunOnce) 函数](#5-runkubelets-kubedeps-srunonce-函数)\r\n     * [5.1. CreateAndInitKubelet 函数](#51-createandinitkubelet-函数)\r\n     * [5.2 NewMainKubelet](#52-newmainkubelet)\r\n     * [5.3 startKubelet](#53-startkubelet)\r\n  * [6. 总结](#6-总结)\r\n        * [补充各种 Manager](#补充各种-manager)\r\n  * [7 参考](#7-参考)\r\n\r\n\r\n\r\n这里主要弄懂kubelet的启动流程。先上图，图片来源: https://www.bookstack.cn/read/source-code-reading-notes/kubernetes-kubelet_init.md\r\n\r\n![kubelet-func-chanel](../images/kubelet-func-chanel.png)\r\n\r\n\r\n\r\n### 1. kubelet入口函数\r\n\r\ncmd\\kubelet\\kubelet.go\r\n\r\n```\r\nfunc main() {\r\n   rand.Seed(time.Now().UTC().UnixNano())\r\n\r\n   command := app.NewKubeletCommand(server.SetupSignalHandler())\r\n   logs.InitLogs()\r\n   defer logs.FlushLogs()\r\n\r\n   if err := command.Execute(); err != nil {\r\n      fmt.Fprintf(os.Stderr, \"%v\\n\", err)\r\n      os.Exit(1)\r\n   }\r\n}\r\n```\r\n\r\n这里还是和k8s其他组件一样，使用了Cobra框架，核心代码如下：\r\n\r\n```go\r\n// 初始化命令行\r\ncommand := app.NewKubeletCommand(server.SetupSignalHandler())\r\n// 执行Execute\r\nerr := command.Execute()\r\n```\r\n\r\n<br>\r\n\r\n### 2. 
NewKubeletCommand函数\r\n\r\n该函数整体逻辑如下：`当前不关心初始化，验证等逻辑，现在直接奔 Run(kubeletServer, kubeletDeps, stopCh) 去`\r\n\r\n（1）初始化参数解析，初始化cleanFlagSet，kubeletFlags，kubeletConfig。这些都是初始化kubelet时要用到的\r\n\r\n（2）定义一个cobra命令，这里核心就是 Run函数。Run函数的核心逻辑如下：\r\n\r\n* 针对不规范的参数输入，或者help, version情况输出帮助或者版本信息\r\n* 加载并验证 kubelet-config是否规范\r\n* 更加kubelet-config生成kubeletServer和kubeletDeps，这个是生成kubelet的必要条件\r\n* 运行kubelet，核心是Run函数\r\n\r\n（3）AddFlags的具体描述如下\r\n\r\nhttps://github.com/kubernetes/kubernetes/blob/0ed33881dc4355495f623c6f22e7dd0b7632b7c0/cmd/kubelet/app/options/options.go#L323\r\n\r\n```go\r\ncmd\\kubelet\\app\\server.go\r\n// NewKubeletCommand creates a *cobra.Command object with default parameters\r\nfunc NewKubeletCommand(stopCh <-chan struct{}) *cobra.Command {\r\n    // 初始化参数解析，初始化cleanFlagSet，kubeletFlags，kubeletConfig。这些都是初始化kubelet时要用到的\r\n\tcleanFlagSet := pflag.NewFlagSet(componentKubelet, pflag.ContinueOnError)\r\n\tcleanFlagSet.SetNormalizeFunc(flag.WordSepNormalizeFunc)\r\n\tkubeletFlags := options.NewKubeletFlags()\r\n\tkubeletConfig, err := options.NewKubeletConfiguration()\r\n\t// programmer error\r\n\tif err != nil {\r\n\t\tglog.Fatal(err)\r\n\t}\r\n\t\r\n    // 定义一个cobra命令，这里核心就是 Run函数。\r\n\tcmd := &cobra.Command{\r\n\t\tUse: componentKubelet,\r\n\t\tLong: `The kubelet is the primary \"node agent\" that runs on each\r\nnode. The kubelet works in terms of a PodSpec. A PodSpec is a YAML or JSON object\r\nthat describes a pod. The kubelet takes a set of PodSpecs that are provided through\r\nvarious mechanisms (primarily through the apiserver) and ensures that the containers\r\ndescribed in those PodSpecs are running and healthy. The kubelet doesn't manage\r\ncontainers which were not created by Kubernetes.\r\n\r\nOther than from an PodSpec from the apiserver, there are three ways that a container\r\nmanifest can be provided to the Kubelet.\r\n\r\nFile: Path passed as a flag on the command line. Files under this path will be monitored\r\nperiodically for updates. 
The monitoring period is 20s by default and is configurable\r\nvia a flag.\r\n\r\nHTTP endpoint: HTTP endpoint passed as a parameter on the command line. This endpoint\r\nis checked every 20 seconds (also configurable with a flag).\r\n\r\nHTTP server: The kubelet can also listen for HTTP and respond to a simple API\r\n(underspec'd currently) to submit a new manifest.`,\r\n\t\t// The Kubelet has special flag parsing requirements to enforce flag precedence rules,\r\n\t\t// so we do all our parsing manually in Run, below.\r\n\t\t// DisableFlagParsing=true provides the full set of flags passed to the kubelet in the\r\n\t\t// `args` arg to Run, without Cobra's interference.\r\n\t\tDisableFlagParsing: true,    // 没有使用Cobra框架中的默认参数解析，而是自定义参数解析。\r\n\t\tRun: func(cmd *cobra.Command, args []string) {\r\n            // 接下来都是针对不规范的参数，输出使用帮助\r\n\t\t\t// initial flag parse, since we disable cobra's flag parsing\r\n\t\t\tif err := cleanFlagSet.Parse(args); err != nil {\r\n\t\t\t\tcmd.Usage()\r\n\t\t\t\tglog.Fatal(err)\r\n\t\t\t}\r\n\r\n\t\t\t// check if there are non-flag arguments in the command line\r\n\t\t\tcmds := cleanFlagSet.Args()\r\n\t\t\tif len(cmds) > 0 {\r\n\t\t\t\tcmd.Usage()\r\n\t\t\t\tglog.Fatalf(\"unknown command: %s\", cmds[0])\r\n\t\t\t}\r\n            \r\n            // 输出help\r\n\t\t\t// short-circuit on help\r\n\t\t\thelp, err := cleanFlagSet.GetBool(\"help\")\r\n\t\t\tif err != nil {\r\n\t\t\t\tglog.Fatal(`\"help\" flag is non-bool, programmer error, please correct`)\r\n\t\t\t}\r\n\t\t\tif help {\r\n\t\t\t\tcmd.Help()\r\n\t\t\t\treturn\r\n\t\t\t}\r\n            \r\n            // 输出version\r\n\t\t\t// short-circuit on verflag\r\n\t\t\tverflag.PrintAndExitIfRequested()\r\n\t\t\tutilflag.PrintFlags(cleanFlagSet)\r\n\r\n            // 加载并校验kubelet config。其中包括校验初始化的kubeletFlags，并从kubeletFlags的KubeletConfigFile参数获取kubelet config的内容。\r\n\t\t\t// set feature gates from initial flags-based config\r\n\t\t\tif err := 
utilfeature.DefaultFeatureGate.SetFromMap(kubeletConfig.FeatureGates); err != nil {\r\n\t\t\t\tglog.Fatal(err)\r\n\t\t\t}\r\n\r\n\t\t\t// validate the initial KubeletFlags\r\n\t\t\tif err := options.ValidateKubeletFlags(kubeletFlags); err != nil {\r\n\t\t\t\tglog.Fatal(err)\r\n\t\t\t}\r\n\r\n\t\t\tif kubeletFlags.ContainerRuntime == \"remote\" && cleanFlagSet.Changed(\"pod-infra-container-image\") {\r\n\t\t\t\tglog.Warning(\"Warning: For remote container runtime, --pod-infra-container-image is ignored in kubelet, which should be set in that remote runtime instead\")\r\n\t\t\t}\r\n\r\n\t\t\t// load kubelet config file, if provided\r\n\t\t\tif configFile := kubeletFlags.KubeletConfigFile; len(configFile) > 0 {\r\n\t\t\t\tkubeletConfig, err = loadConfigFile(configFile)\r\n\t\t\t\tif err != nil {\r\n\t\t\t\t\tglog.Fatal(err)\r\n\t\t\t\t}\r\n\t\t\t\t// We must enforce flag precedence by re-parsing the command line into the new object.\r\n\t\t\t\t// This is necessary to preserve backwards-compatibility across binary upgrades.\r\n\t\t\t\t// See issue #56171 for more details.\r\n\t\t\t\tif err := kubeletConfigFlagPrecedence(kubeletConfig, args); err != nil {\r\n\t\t\t\t\tglog.Fatal(err)\r\n\t\t\t\t}\r\n\t\t\t\t// update feature gates based on new config\r\n\t\t\t\tif err := utilfeature.DefaultFeatureGate.SetFromMap(kubeletConfig.FeatureGates); err != nil {\r\n\t\t\t\t\tglog.Fatal(err)\r\n\t\t\t\t}\r\n\t\t\t}\r\n\r\n\t\t\t// We always validate the local configuration (command line + config file).\r\n\t\t\t// This is the default \"last-known-good\" config for dynamic config, and must always remain valid.\r\n\t\t\tif err := kubeletconfigvalidation.ValidateKubeletConfiguration(kubeletConfig); err != nil {\r\n\t\t\t\tglog.Fatal(err)\r\n\t\t\t}\r\n            \r\n            // 设置了，就使用动态kubelet config\r\n\t\t\t// use dynamic kubelet config, if enabled\r\n\t\t\tvar kubeletConfigController *dynamickubeletconfig.Controller\r\n\t\t\tif dynamicConfigDir := 
kubeletFlags.DynamicConfigDir.Value(); len(dynamicConfigDir) > 0 {\r\n\t\t\t\tvar dynamicKubeletConfig *kubeletconfiginternal.KubeletConfiguration\r\n\t\t\t\tdynamicKubeletConfig, kubeletConfigController, err = BootstrapKubeletConfigController(dynamicConfigDir,\r\n\t\t\t\t\tfunc(kc *kubeletconfiginternal.KubeletConfiguration) error {\r\n\t\t\t\t\t\t// Here, we enforce flag precedence inside the controller, prior to the controller's validation sequence,\r\n\t\t\t\t\t\t// so that we get a complete validation at the same point where we can decide to reject dynamic config.\r\n\t\t\t\t\t\t// This fixes the flag-precedence component of issue #63305.\r\n\t\t\t\t\t\t// See issue #56171 for general details on flag precedence.\r\n\t\t\t\t\t\treturn kubeletConfigFlagPrecedence(kc, args)\r\n\t\t\t\t\t})\r\n\t\t\t\tif err != nil {\r\n\t\t\t\t\tglog.Fatal(err)\r\n\t\t\t\t}\r\n\t\t\t\t// If we should just use our existing, local config, the controller will return a nil config\r\n\t\t\t\tif dynamicKubeletConfig != nil {\r\n\t\t\t\t\tkubeletConfig = dynamicKubeletConfig\r\n\t\t\t\t\t// Note: flag precedence was already enforced in the controller, prior to validation,\r\n\t\t\t\t\t// by our above transform function. 
Now we simply update feature gates from the new config.\r\n\t\t\t\t\tif err := utilfeature.DefaultFeatureGate.SetFromMap(kubeletConfig.FeatureGates); err != nil {\r\n\t\t\t\t\t\tglog.Fatal(err)\r\n\t\t\t\t\t}\r\n\t\t\t\t}\r\n\t\t\t}\r\n           \r\n            // 初始化kubeletServer和kubeletDeps\r\n\t\t\t// construct a KubeletServer from kubeletFlags and kubeletConfig\r\n\t\t\tkubeletServer := &options.KubeletServer{\r\n\t\t\t\tKubeletFlags:         *kubeletFlags,\r\n\t\t\t\tKubeletConfiguration: *kubeletConfig,\r\n\t\t\t}\r\n\r\n\t\t\t// use kubeletServer to construct the default KubeletDeps\r\n\t\t\tkubeletDeps, err := UnsecuredDependencies(kubeletServer)\r\n\t\t\tif err != nil {\r\n\t\t\t\tglog.Fatal(err)\r\n\t\t\t}\r\n\r\n\t\t\t// add the kubelet config controller to kubeletDeps\r\n\t\t\tkubeletDeps.KubeletConfigController = kubeletConfigController\r\n\t\t\t\r\n            // 如果开启了docker-shim 实验特效，则执行RunDockershim，这个只是调试用的，一般是false不开的\r\n\t\t\t// start the experimental docker shim, if enabled\r\n\t\t\tif kubeletServer.KubeletFlags.ExperimentalDockershim {\r\n\t\t\t\tif err := RunDockershim(&kubeletServer.KubeletFlags, kubeletConfig, stopCh); err != nil {\r\n\t\t\t\t\tglog.Fatal(err)\r\n\t\t\t\t}\r\n\t\t\t\treturn\r\n\t\t\t}\r\n\t\t \r\n\t\t\t// run the kubelet，运行kubelet\r\n\t\t\tglog.V(5).Infof(\"KubeletConfiguration: %#v\", kubeletServer.KubeletConfiguration)\r\n\t\t\tif err := Run(kubeletServer, kubeletDeps, stopCh); err != nil {\r\n\t\t\t\tglog.Fatal(err)\r\n\t\t\t}\r\n\t\t},\r\n\t}\r\n\t\t\r\n \r\n\t// keep cleanFlagSet separate, so Cobra doesn't pollute it with the global flags\r\n\tkubeletFlags.AddFlags(cleanFlagSet)\r\n\toptions.AddKubeletConfigFlags(cleanFlagSet, kubeletConfig)\r\n\toptions.AddGlobalFlags(cleanFlagSet)\r\n\tcleanFlagSet.BoolP(\"help\", \"h\", false, fmt.Sprintf(\"help for %s\", cmd.Name()))\r\n\r\n\t// ugly, but necessary, because Cobra's default UsageFunc and HelpFunc pollute the flagset with global flags\r\n\tconst usageFmt = 
\"Usage:\\n  %s\\n\\nFlags:\\n%s\"\r\n\tcmd.SetUsageFunc(func(cmd *cobra.Command) error {\r\n\t\tfmt.Fprintf(cmd.OutOrStderr(), usageFmt, cmd.UseLine(), cleanFlagSet.FlagUsagesWrapped(2))\r\n\t\treturn nil\r\n\t})\r\n\tcmd.SetHelpFunc(func(cmd *cobra.Command, args []string) {\r\n\t\tfmt.Fprintf(cmd.OutOrStdout(), \"%s\\n\\n\"+usageFmt, cmd.Long, cmd.UseLine(), cleanFlagSet.FlagUsagesWrapped(2))\r\n\t})\r\n\r\n\treturn cmd\r\n}\r\n```\r\n\r\n\r\n\r\n<br>\r\n\r\n### 3. Run(kubeletServer, kubeletDeps, stopCh) 函数\r\n\r\n核心就是调用\r\n\r\n```go\r\ncmd\\kubelet\\app\\server.go\r\n// Run runs the specified KubeletServer with the given Dependencies. This should never exit.\r\n// The kubeDeps argument may be nil - if so, it is initialized from the settings on KubeletServer.\r\n// Otherwise, the caller is assumed to have set up the Dependencies object and a default one will\r\n// not be generated.\r\nfunc Run(s *options.KubeletServer, kubeDeps *kubelet.Dependencies, featureGate featuregate.FeatureGate, stopCh <-chan struct{}) error {\r\n\t// To help debugging, immediately log version\r\n\tklog.Infof(\"Version: %+v\", version.Get())\r\n  // 当运行环境是Windows的时候，初始化操作，但是该操作为空，只是预留。具体执行run(s, kubeDeps, stopCh)函数。\r\n\tif err := initForOS(s.KubeletFlags.WindowsService); err != nil {\r\n\t\treturn fmt.Errorf(\"failed OS init: %v\", err)\r\n\t}\r\n\tif err := run(s, kubeDeps, featureGate, stopCh); err != nil {\r\n\t\treturn fmt.Errorf(\"failed to run Kubelet: %v\", err)\r\n\t}\r\n\treturn nil\r\n}\r\n```\r\n\r\n<br>\r\n\r\n这里先看一下函数参数 s-kubeletServer和kubeletDeps是什么\r\n\r\n#### 3.1. KubeletServer \r\n\r\nKubeletServer  就是配置参数\r\n\r\n基本看参数名字和注释就能知道什么意思\r\n\r\n```\r\n// KubeletServer encapsulates all of the parameters necessary for starting up\r\n// a kubelet. 
These can either be set via command line or directly.\r\ntype KubeletServer struct {\r\n\tKubeletFlags\r\n\tkubeletconfig.KubeletConfiguration\r\n}\r\n```\r\n\r\n```\r\n// A configuration field should go in KubeletFlags instead of KubeletConfiguration if any of these are true:\r\n// - its value will never, or cannot safely be changed during the lifetime of a node\r\n// - its value cannot be safely shared between nodes at the same time (e.g. a hostname)\r\n//   KubeletConfiguration is intended to be shared between nodes\r\n// In general, please try to avoid adding flags or configuration fields,\r\n// we already have a confusingly large amount of them.\r\ntype KubeletFlags struct {\r\n\tKubeConfig          string            // 连接kmaster集群用的\r\n\tBootstrapKubeconfig string            // 申请进入集群的config\r\n\r\n\t// Insert a probability of random errors during calls to the master.\r\n\tChaosChance float64\r\n\t// Crash immediately, rather than eating panics.\r\n\tReallyCrashForTesting bool\r\n\r\n\t// TODO(mtaufen): It is increasingly looking like nobody actually uses the\r\n\t//                Kubelet's runonce mode anymore, so it may be a candidate\r\n\t//                for deprecation and removal.\r\n\t// If runOnce is true, the Kubelet will check the API server once for pods,\r\n\t// run those in addition to the pods specified by static pod files, and exit.\r\n\tRunOnce bool\r\n\r\n\t// enableServer enables the Kubelet's server\r\n\tEnableServer bool\r\n\r\n\t// HostnameOverride is the hostname used to identify the kubelet instead\r\n\t// of the actual hostname.\r\n\tHostnameOverride string\r\n\t// NodeIP is IP address of the node.\r\n\t// If set, kubelet will use this IP address for the node.\r\n\tNodeIP string\r\n\r\n\t// This flag, if set, sets the unique id of the instance that an external provider (i.e. 
cloudprovider)\r\n\t// can use to identify a specific node\r\n\tProviderID string\r\n   \r\n  // 包括了cni,cri等配置，比如cni bin目录\r\n\t// Container-runtime-specific options.\r\n\tconfig.ContainerRuntimeOptions\r\n\r\n\t// certDirectory is the directory where the TLS certs are located (by\r\n\t// default /var/run/kubernetes). If tlsCertFile and tlsPrivateKeyFile\r\n\t// are provided, this flag will be ignored.\r\n\tCertDirectory string\r\n\r\n\t// cloudProvider is the provider for cloud services.\r\n\t// +optional\r\n\tCloudProvider string\r\n\r\n\t// cloudConfigFile is the path to the cloud provider configuration file.\r\n\t// +optional\r\n\tCloudConfigFile string\r\n\r\n\t// rootDirectory is the directory path to place kubelet files (volume\r\n\t// mounts,etc).\r\n\t// kubelet的root目录，存放挂载, mount,etc等信息。默认是/var/lib/kubelet, 可以通过kubelet启动参数--root-dir 修改\r\n\tRootDirectory string\r\n\r\n\t// The Kubelet will use this directory for checkpointing downloaded configurations and tracking configuration health.\r\n\t// The Kubelet will create this directory if it does not already exist.\r\n\t// The path may be absolute or relative; relative paths are under the Kubelet's current working directory.\r\n\t// Providing this flag enables dynamic kubelet configuration.\r\n\t// To use this flag, the DynamicKubeletConfig feature gate must be enabled.\r\n\tDynamicConfigDir flag.StringFlag\r\n\r\n\t// The Kubelet will load its initial configuration from this file.\r\n\t// The path may be absolute or relative; relative paths are under the Kubelet's current working directory.\r\n\t// Omit this flag to use the combination of built-in default configuration values and flags.\r\n\t// 通过--config指定一个文件，里面是指定的Kubelet配置，比如maxPods=100等\r\n\tKubeletConfigFile string\r\n\r\n\t// registerNode enables automatic registration with the apiserver.\r\n\tRegisterNode bool\r\n\r\n\t// registerWithTaints are an array of taints to add to a node object when\r\n\t// the kubelet registers itself. 
This only takes effect when registerNode\r\n\t// is true and upon the initial registration of the node.\r\n\tRegisterWithTaints []core.Taint\r\n\r\n\t// WindowsService should be set to true if kubelet is running as a service on Windows.\r\n\t// Its corresponding flag only gets registered in Windows builds.\r\n\tWindowsService bool\r\n\r\n\t// EXPERIMENTAL FLAGS\r\n\t// Whitelist of unsafe sysctls or sysctl patterns (ending in *).\r\n\t// +optional\r\n\tAllowedUnsafeSysctls []string\r\n\t// containerized should be set to true if kubelet is running in a container.\r\n\tContainerized bool\r\n\t// remoteRuntimeEndpoint is the endpoint of remote runtime service\r\n\t// docker, containerd等容器运行时接口地址，默认是docker\r\n\tRemoteRuntimeEndpoint string\r\n\t// remoteImageEndpoint is the endpoint of remote image service\r\n\tRemoteImageEndpoint string\r\n\t// experimentalMounterPath is the path of mounter binary. Leave empty to use the default mount path\r\n\tExperimentalMounterPath string\r\n\t// If enabled, the kubelet will integrate with the kernel memcg notification to determine if memory eviction thresholds are crossed rather than polling.\r\n\t// +optional\r\n\tExperimentalKernelMemcgNotification bool\r\n\t// This flag, if set, enables a check prior to mount operations to verify that the required components\r\n\t// (binaries, etc.) to mount the volume are available on the underlying node. 
If the check is enabled\r\n\t// and fails the mount operation fails.\r\n\tExperimentalCheckNodeCapabilitiesBeforeMount bool\r\n\t// This flag, if set, will avoid including `EvictionHard` limits while computing Node Allocatable.\r\n\t// Refer to [Node Allocatable](https://git.k8s.io/community/contributors/design-proposals/node-allocatable.md) doc for more information.\r\n\tExperimentalNodeAllocatableIgnoreEvictionThreshold bool\r\n\t// Node Labels are the node labels to add when registering the node in the cluster\r\n\tNodeLabels map[string]string\r\n\t// volumePluginDir is the full path of the directory in which to search\r\n\t// for additional third party volume plugins\r\n\tVolumePluginDir string\r\n\t// lockFilePath is the path that kubelet will use to as a lock file.\r\n\t// It uses this file as a lock to synchronize with other kubelet processes\r\n\t// that may be running.\r\n\tLockFilePath string\r\n\t// ExitOnLockContention is a flag that signifies to the kubelet that it is running\r\n\t// in \"bootstrap\" mode. This requires that 'LockFilePath' has been set.\r\n\t// This will cause the kubelet to listen to inotify events on the lock file,\r\n\t// releasing it and exiting when another process tries to open that file.\r\n\tExitOnLockContention bool\r\n\t// seccompProfileRoot is the directory path for seccomp profiles.\r\n\tSeccompProfileRoot string\r\n\t// bootstrapCheckpointPath is the path to the directory containing pod checkpoints to\r\n\t// run on restore\r\n\tBootstrapCheckpointPath string\r\n\t// NodeStatusMaxImages caps the number of images reported in Node.Status.Images.\r\n\t// This is an experimental, short-term flag to help with node scalability.\r\n\tNodeStatusMaxImages int32\r\n\r\n\t// DEPRECATED FLAGS\r\n\t// minimumGCAge is the minimum age for a finished container before it is\r\n\t// garbage collected.\r\n\tMinimumGCAge metav1.Duration\r\n\t// maxPerPodContainerCount is the maximum number of old instances to\r\n\t// retain per container. 
Each container takes up some disk space.\r\n\tMaxPerPodContainerCount int32\r\n\t// maxContainerCount is the maximum number of old instances of containers\r\n\t// to retain globally. Each container takes up some disk space.\r\n\tMaxContainerCount int32\r\n\t// masterServiceNamespace is The namespace from which the kubernetes\r\n\t// master services should be injected into pods.\r\n\tMasterServiceNamespace string\r\n\t// registerSchedulable tells the kubelet to register the node as\r\n\t// schedulable. Won't have any effect if register-node is false.\r\n\t// DEPRECATED: use registerWithTaints instead\r\n\tRegisterSchedulable bool\r\n\t// nonMasqueradeCIDR configures masquerading: traffic to IPs outside this range will use IP masquerade.\r\n\tNonMasqueradeCIDR string\r\n\t// This flag, if set, instructs the kubelet to keep volumes from terminated pods mounted to the node.\r\n\t// This can be useful for debugging volume related issues.\r\n\tKeepTerminatedPodVolumes bool\r\n\t// allowPrivileged enables containers to request privileged mode.\r\n\t// Defaults to true.\r\n\tAllowPrivileged bool\r\n\t// hostNetworkSources is a comma-separated list of sources from which the\r\n\t// Kubelet allows pods to use of host network. Defaults to \"*\". Valid\r\n\t// options are \"file\", \"http\", \"api\", and \"*\" (all sources).\r\n\tHostNetworkSources []string\r\n\t// hostPIDSources is a comma-separated list of sources from which the\r\n\t// Kubelet allows pods to use the host pid namespace. Defaults to \"*\".\r\n\tHostPIDSources []string\r\n\t// hostIPCSources is a comma-separated list of sources from which the\r\n\t// Kubelet allows pods to use the host ipc namespace. 
Defaults to \"*\".\r\n\tHostIPCSources []string\r\n}\r\n```\r\n\r\n```\r\n// KubeletConfiguration contains the configuration for the Kubelet\r\ntype KubeletConfiguration struct {\r\n\tmetav1.TypeMeta\r\n\r\n\t// staticPodPath is the path to the directory containing local (static) pods to\r\n\t// run, or the path to a single static pod file.\r\n\tStaticPodPath string\r\n\t// syncFrequency is the max period between synchronizing running\r\n\t// containers and config\r\n\tSyncFrequency metav1.Duration\r\n\t// fileCheckFrequency is the duration between checking config files for\r\n\t// new data\r\n\tFileCheckFrequency metav1.Duration\r\n\t// httpCheckFrequency is the duration between checking http for new data\r\n\tHTTPCheckFrequency metav1.Duration\r\n\t// staticPodURL is the URL for accessing static pods to run\r\n\tStaticPodURL string\r\n\t// staticPodURLHeader is a map of slices with HTTP headers to use when accessing the podURL\r\n\tStaticPodURLHeader map[string][]string\r\n\t// address is the IP address for the Kubelet to serve on (set to 0.0.0.0\r\n\t// for all interfaces)\r\n\tAddress string\r\n\t// port is the port for the Kubelet to serve on.\r\n\tPort int32\r\n\t// readOnlyPort is the read-only port for the Kubelet to serve on with\r\n\t// no authentication/authorization (set to 0 to disable)\r\n\tReadOnlyPort int32\r\n\t// tlsCertFile is the file containing x509 Certificate for HTTPS.  (CA cert,\r\n\t// if any, concatenated after server cert). 
If tlsCertFile and\r\n\t// tlsPrivateKeyFile are not provided, a self-signed certificate\r\n\t// and key are generated for the public address and saved to the directory\r\n\t// passed to the Kubelet's --cert-dir flag.\r\n\tTLSCertFile string\r\n\t// tlsPrivateKeyFile is the file containing x509 private key matching tlsCertFile\r\n\tTLSPrivateKeyFile string\r\n\t// TLSCipherSuites is the list of allowed cipher suites for the server.\r\n\t// Values are from tls package constants (https://golang.org/pkg/crypto/tls/#pkg-constants).\r\n\tTLSCipherSuites []string\r\n\t// TLSMinVersion is the minimum TLS version supported.\r\n\t// Values are from tls package constants (https://golang.org/pkg/crypto/tls/#pkg-constants).\r\n\tTLSMinVersion string\r\n\t// rotateCertificates enables client certificate rotation. The Kubelet will request a\r\n\t// new certificate from the certificates.k8s.io API. This requires an approver to approve the\r\n\t// certificate signing requests. The RotateKubeletClientCertificate feature\r\n\t// must be enabled.\r\n\tRotateCertificates bool\r\n\t// serverTLSBootstrap enables server certificate bootstrap. Instead of self\r\n\t// signing a serving certificate, the Kubelet will request a certificate from\r\n\t// the certificates.k8s.io API. This requires an approver to approve the\r\n\t// certificate signing requests. 
The RotateKubeletServerCertificate feature\r\n\t// must be enabled.\r\n\tServerTLSBootstrap bool\r\n\t// authentication specifies how requests to the Kubelet's server are authenticated\r\n\tAuthentication KubeletAuthentication\r\n\t// authorization specifies how requests to the Kubelet's server are authorized\r\n\tAuthorization KubeletAuthorization\r\n\t// registryPullQPS is the limit of registry pulls per second.\r\n\t// Set to 0 for no limit.\r\n\tRegistryPullQPS int32\r\n\t// registryBurst is the maximum size of bursty pulls, temporarily allows\r\n\t// pulls to burst to this number, while still not exceeding registryPullQPS.\r\n\t// Only used if registryPullQPS > 0.\r\n\tRegistryBurst int32\r\n\t// eventRecordQPS is the maximum event creations per second. If 0, there\r\n\t// is no limit enforced.\r\n\tEventRecordQPS int32\r\n\t// eventBurst is the maximum size of a burst of event creations, temporarily\r\n\t// allows event creations to burst to this number, while still not exceeding\r\n\t// eventRecordQPS. Only used if eventRecordQPS > 0.\r\n\tEventBurst int32\r\n\t// enableDebuggingHandlers enables server endpoints for log collection\r\n\t// and local running of containers and commands\r\n\tEnableDebuggingHandlers bool\r\n\t// enableContentionProfiling enables lock contention profiling, if enableDebuggingHandlers is true.\r\n\tEnableContentionProfiling bool\r\n\t// healthzPort is the port of the localhost healthz endpoint (set to 0 to disable)\r\n\tHealthzPort int32\r\n\t// healthzBindAddress is the IP address for the healthz server to serve on\r\n\tHealthzBindAddress string\r\n\t// oomScoreAdj is The oom-score-adj value for kubelet process. Values\r\n\t// must be within the range [-1000, 1000].\r\n\tOOMScoreAdj int32\r\n\t// clusterDomain is the DNS domain for this cluster. 
If set, kubelet will\r\n\t// configure all containers to search this domain in addition to the\r\n\t// host's search domains.\r\n\tClusterDomain string\r\n\t// clusterDNS is a list of IP addresses for a cluster DNS server. If set,\r\n\t// kubelet will configure all containers to use this for DNS resolution\r\n\t// instead of the host's DNS servers.\r\n\tClusterDNS []string\r\n\t// streamingConnectionIdleTimeout is the maximum time a streaming connection\r\n\t// can be idle before the connection is automatically closed.\r\n\tStreamingConnectionIdleTimeout metav1.Duration\r\n\t// nodeStatusUpdateFrequency is the frequency that kubelet posts node\r\n\t// status to master. Note: be cautious when changing the constant, it\r\n\t// must work with nodeMonitorGracePeriod in nodecontroller.\r\n\tNodeStatusUpdateFrequency metav1.Duration\r\n\t// nodeLeaseDurationSeconds is the duration the Kubelet will set on its corresponding Lease.\r\n\tNodeLeaseDurationSeconds int32\r\n\t// imageMinimumGCAge is the minimum age for an unused image before it is\r\n\t// garbage collected.\r\n\tImageMinimumGCAge metav1.Duration\r\n\t// imageGCHighThresholdPercent is the percent of disk usage after which\r\n\t// image garbage collection is always run. The percent is calculated as\r\n\t// this field value out of 100.\r\n\tImageGCHighThresholdPercent int32\r\n\t// imageGCLowThresholdPercent is the percent of disk usage before which\r\n\t// image garbage collection is never run. Lowest disk usage to garbage\r\n\t// collect to. The percent is calculated as this field value out of 100.\r\n\tImageGCLowThresholdPercent int32\r\n\t// How frequently to calculate and cache volume disk usage for all pods\r\n\tVolumeStatsAggPeriod metav1.Duration\r\n\t// KubeletCgroups is the absolute name of cgroups to isolate the kubelet in\r\n\tKubeletCgroups string\r\n\t// SystemCgroups is absolute name of cgroups in which to place\r\n\t// all non-kernel processes that are not already in a container. 
Empty\r\n\t// for no container. Rolling back the flag requires a reboot.\r\n\tSystemCgroups string\r\n\t// CgroupRoot is the root cgroup to use for pods.\r\n\t// If CgroupsPerQOS is enabled, this is the root of the QoS cgroup hierarchy.\r\n\tCgroupRoot string\r\n\t// Enable QoS based Cgroup hierarchy: top level cgroups for QoS Classes\r\n\t// And all Burstable and BestEffort pods are brought up under their\r\n\t// specific top level QoS cgroup.\r\n\tCgroupsPerQOS bool\r\n\t// driver that the kubelet uses to manipulate cgroups on the host (cgroupfs or systemd)\r\n\tCgroupDriver string\r\n\t// CPUManagerPolicy is the name of the policy to use.\r\n\t// Requires the CPUManager feature gate to be enabled.\r\n\tCPUManagerPolicy string\r\n\t// CPU Manager reconciliation period.\r\n\t// Requires the CPUManager feature gate to be enabled.\r\n\tCPUManagerReconcilePeriod metav1.Duration\r\n\t// Map of QoS resource reservation percentages (memory only for now).\r\n\t// Requires the QOSReserved feature gate to be enabled.\r\n\tQOSReserved map[string]string\r\n\t// runtimeRequestTimeout is the timeout for all runtime requests except long running\r\n\t// requests - pull, logs, exec and attach.\r\n\tRuntimeRequestTimeout metav1.Duration\r\n\t// hairpinMode specifies how the Kubelet should configure the container\r\n\t// bridge for hairpin packets.\r\n\t// Setting this flag allows endpoints in a Service to loadbalance back to\r\n\t// themselves if they should try to access their own Service. 
Values:\r\n\t//   \"promiscuous-bridge\": make the container bridge promiscuous.\r\n\t//   \"hairpin-veth\":       set the hairpin flag on container veth interfaces.\r\n\t//   \"none\":               do nothing.\r\n\t// Generally, one must set --hairpin-mode=hairpin-veth to achieve hairpin NAT,\r\n\t// because promiscuous-bridge assumes the existence of a container bridge named cbr0.\r\n\tHairpinMode string\r\n\t// maxPods is the number of pods that can run on this Kubelet.\r\n\tMaxPods int32\r\n\t// The CIDR to use for pod IP addresses, only used in standalone mode.\r\n\t// In cluster mode, this is obtained from the master.\r\n\tPodCIDR string\r\n\t// PodPidsLimit is the maximum number of pids in any pod.\r\n\tPodPidsLimit int64\r\n\t// ResolverConfig is the resolver configuration file used as the basis\r\n\t// for the container DNS resolution configuration.\r\n\tResolverConfig string\r\n\t// cpuCFSQuota enables CPU CFS quota enforcement for containers that\r\n\t// specify CPU limits\r\n\tCPUCFSQuota bool\r\n\t// CPUCFSQuotaPeriod sets the CPU CFS quota period value, cpu.cfs_period_us, defaults to 100ms\r\n\tCPUCFSQuotaPeriod metav1.Duration\r\n\t// maxOpenFiles is Number of files that can be opened by Kubelet process.\r\n\tMaxOpenFiles int64\r\n\t// contentType is contentType of requests sent to apiserver.\r\n\tContentType string\r\n\t// kubeAPIQPS is the QPS to use while talking with kubernetes apiserver\r\n\tKubeAPIQPS int32\r\n\t// kubeAPIBurst is the burst to allow while talking with kubernetes\r\n\t// apiserver\r\n\tKubeAPIBurst int32\r\n\t// serializeImagePulls when enabled, tells the Kubelet to pull images one at a time.\r\n\tSerializeImagePulls bool\r\n\t// Map of signal names to quantities that defines hard eviction thresholds. For example: {\"memory.available\": \"300Mi\"}.\r\n\tEvictionHard map[string]string\r\n\t// Map of signal names to quantities that defines soft eviction thresholds.  
For example: {\"memory.available\": \"300Mi\"}.\r\n\tEvictionSoft map[string]string\r\n\t// Map of signal names to quantities that defines grace periods for each soft eviction signal. For example: {\"memory.available\": \"30s\"}.\r\n\tEvictionSoftGracePeriod map[string]string\r\n\t// Duration for which the kubelet has to wait before transitioning out of an eviction pressure condition.\r\n\tEvictionPressureTransitionPeriod metav1.Duration\r\n\t// Maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.\r\n\tEvictionMaxPodGracePeriod int32\r\n\t// Map of signal names to quantities that defines minimum reclaims, which describe the minimum\r\n\t// amount of a given resource the kubelet will reclaim when performing a pod eviction while\r\n\t// that resource is under pressure. For example: {\"imagefs.available\": \"2Gi\"}\r\n\tEvictionMinimumReclaim map[string]string\r\n\t// podsPerCore is the maximum number of pods per core. Cannot exceed MaxPods.\r\n\t// If 0, this field is ignored.\r\n\tPodsPerCore int32\r\n\t// enableControllerAttachDetach enables the Attach/Detach controller to\r\n\t// manage attachment/detachment of volumes scheduled to this node, and\r\n\t// disables kubelet from executing any attach/detach operations\r\n\tEnableControllerAttachDetach bool\r\n\t// protectKernelDefaults, if true, causes the Kubelet to error if kernel\r\n\t// flags are not as it expects. Otherwise the Kubelet will attempt to modify\r\n\t// kernel flags to match its expectation.\r\n\tProtectKernelDefaults bool\r\n\t// If true, Kubelet ensures a set of iptables rules are present on host.\r\n\t// These rules will serve as utility for various components, e.g. kube-proxy.\r\n\t// The rules will be created based on IPTablesMasqueradeBit and IPTablesDropBit.\r\n\tMakeIPTablesUtilChains bool\r\n\t// iptablesMasqueradeBit is the bit of the iptables fwmark space to mark for SNAT\r\n\t// Values must be within the range [0, 31]. 
Must be different from other mark bits.\r\n\t// Warning: Please match the value of the corresponding parameter in kube-proxy.\r\n\t// TODO: clean up IPTablesMasqueradeBit in kube-proxy\r\n\tIPTablesMasqueradeBit int32\r\n\t// iptablesDropBit is the bit of the iptables fwmark space to mark for dropping packets.\r\n\t// Values must be within the range [0, 31]. Must be different from other mark bits.\r\n\tIPTablesDropBit int32\r\n\t// featureGates is a map of feature names to bools that enable or disable alpha/experimental\r\n\t// features. This field modifies piecemeal the built-in default values from\r\n\t// \"k8s.io/kubernetes/pkg/features/kube_features.go\".\r\n\tFeatureGates map[string]bool\r\n\t// Tells the Kubelet to fail to start if swap is enabled on the node.\r\n\tFailSwapOn bool\r\n\t// A quantity defines the maximum size of the container log file before it is rotated. For example: \"5Mi\" or \"256Ki\".\r\n\tContainerLogMaxSize string\r\n\t// Maximum number of container log files that can be present for a container.\r\n\tContainerLogMaxFiles int32\r\n\t// ConfigMapAndSecretChangeDetectionStrategy is a mode in which config map and secret managers are running.\r\n\tConfigMapAndSecretChangeDetectionStrategy ResourceChangeDetectionStrategy\r\n\r\n\t/* the following fields are meant for Node Allocatable */\r\n\r\n\t// A set of ResourceName=ResourceQuantity (e.g. cpu=200m,memory=150G) pairs\r\n\t// that describe resources reserved for non-kubernetes components.\r\n\t// Currently only cpu and memory are supported.\r\n\t// See http://kubernetes.io/docs/user-guide/compute-resources for more detail.\r\n\tSystemReserved map[string]string\r\n\t// A set of ResourceName=ResourceQuantity (e.g. 
cpu=200m,memory=150G) pairs\r\n\t// that describe resources reserved for kubernetes system components.\r\n\t// Currently cpu, memory and local ephemeral storage for root file system are supported.\r\n\t// See http://kubernetes.io/docs/user-guide/compute-resources for more detail.\r\n\tKubeReserved map[string]string\r\n\t// This flag helps kubelet identify absolute name of top level cgroup used to enforce `SystemReserved` compute resource reservation for OS system daemons.\r\n\t// Refer to [Node Allocatable](https://git.k8s.io/community/contributors/design-proposals/node/node-allocatable.md) doc for more information.\r\n\tSystemReservedCgroup string\r\n\t// This flag helps kubelet identify absolute name of top level cgroup used to enforce `KubeReserved` compute resource reservation for Kubernetes node system daemons.\r\n\t// Refer to [Node Allocatable](https://git.k8s.io/community/contributors/design-proposals/node/node-allocatable.md) doc for more information.\r\n\tKubeReservedCgroup string\r\n\t// This flag specifies the various Node Allocatable enforcements that Kubelet needs to perform.\r\n\t// This flag accepts a list of options. Acceptable options are `pods`, `system-reserved` & `kube-reserved`.\r\n\t// Refer to [Node Allocatable](https://git.k8s.io/community/contributors/design-proposals/node/node-allocatable.md) doc for more information.\r\n\tEnforceNodeAllocatable []string\r\n}\r\n```\r\n\r\n<br>\r\n\r\n#### 3.2. Dependencies \r\n\r\nkubeletDeps就是运行kubelet所需要的依赖。比如docker客户端，kube客户端，csi客户端等等。\r\n\r\n```\r\n// Dependencies is a bin for things we might consider \"injected dependencies\" -- objects constructed\r\n// at runtime that are necessary for running the Kubelet. 
This is a temporary solution for grouping\r\n// these objects while we figure out a more comprehensive dependency injection story for the Kubelet.\r\ntype Dependencies struct {\r\n\tOptions []Option\r\n\r\n\t// Injected Dependencies\r\n\tAuth                    server.AuthInterface\r\n\tCAdvisorInterface       cadvisor.Interface\r\n\tCloud                   cloudprovider.Interface\r\n\tContainerManager        cm.ContainerManager\r\n\tDockerClientConfig      *dockershim.ClientConfig\r\n\tEventClient             v1core.EventsGetter\r\n\tHeartbeatClient         clientset.Interface\r\n\tOnHeartbeatFailure      func()\r\n\tKubeClient              clientset.Interface\r\n\tCSIClient               csiclientset.Interface\r\n\tDynamicKubeClient       dynamic.Interface\r\n\tMounter                 mount.Interface\r\n\tOOMAdjuster             *oom.OOMAdjuster\r\n\tOSInterface             kubecontainer.OSInterface\r\n\tPodConfig               *config.PodConfig\r\n\tRecorder                record.EventRecorder\r\n\tVolumePlugins           []volume.VolumePlugin\r\n\tDynamicPluginProber     volume.DynamicPluginProber\r\n\tTLSOptions              *server.TLSOptions\r\n\tKubeletConfigController *kubeletconfig.Controller\r\n}\r\n```\r\n\r\n<br>\r\n\r\n### 4. func run(s *options.KubeletServer, kubeDeps *kubelet.Dependencies, featureGate featuregate.FeatureGate, stopCh <-chan struct{}) (err error)\r\n\r\nBack to the `run` function. Its main logic is as follows (for the full list of kubelet feature gates, see https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/):\r\n\r\n1. Store the enabled feature gates in a map.\r\n2. Validate the kubelet configuration. Validation runs after the feature gates are applied, because some checks depend on which gates are enabled.\r\n3. If ExitOnLockContention is enabled, acquire the file lock at lockFilePath and watch it for contention. Rarely used in practice.\r\n4. Register the current KubeletConfiguration with the /configz endpoint.\r\n5. Detect standalone mode: if no kubeconfig is provided, the kubelet runs standalone for local debugging and does not connect to kube-apiserver.\r\n6. If a cloud provider is in use, call InitCloudProvider to initialize it.\r\n7. 
If not in standalone mode, initialize the various clients, including:\r\n   - the kubeclient\r\n   - the event client:\r\n     configured with the EventRecordQPS and EventBurst parameters,\r\n     created via `NewForConfig` from `k8s.io/client-go/kubernetes/typed/core/v1`\r\n   - the heartbeat client:\r\n     configured with `QPS` and `Timeout` (the timeout is capped by `NodeLeaseDurationSeconds`)\r\n   - the auth client\r\n8. Set up cgroups: the kubelet adds a kubepods level on top of the existing cgroup hierarchy to implement QoS, by default /sys/fs/cgroup/memory/kubepods.\r\n9. Initialize cadvisor.\r\n10. Initialize the eventRecorder, used to report events.\r\n11. Parse the system reserved resources, which are never allocated to pods. For example: --system-reserved=cpu=2000m,memory=20000Mi\r\n12. Set the eviction thresholds, e.g. --eviction-hard=memory.available<1Mi,nodefs.available<1Mi,nodefs.inodesFree<1\r\n13. Use the oomAdjuster to set the oom score of the kubelet process.\r\n14. Call RunKubelet, which runs the kubelet's core logic.\r\n15. Start the healthz health-check server.\r\n\r\nAll of this is still preparation driven by the configuration: initializing clients, instantiating the containerManager object, and so on. Next, let's move on to RunKubelet.\r\n\r\n```\r\nfunc run(s *options.KubeletServer, kubeDeps *kubelet.Dependencies, featureGate featuregate.FeatureGate, stopCh <-chan struct{}) (err error) {\r\n\t// Set global feature gates based on the value on the initial KubeletServer\r\n\t// 1. 将打开的FeatureGates用map保存\r\n\terr = utilfeature.DefaultMutableFeatureGate.SetFromMap(s.KubeletConfiguration.FeatureGates)\r\n\tif err != nil {\r\n\t\treturn err\r\n\t}\r\n\t// validate the initial KubeletServer (we set feature gates first, because this validation depends on feature gates)\r\n\t// 2.对kubelet的配置进行校验\r\n\tif err := options.ValidateKubeletServer(s); err != nil {\r\n\t\treturn err\r\n\t}\r\n\r\n\t// Obtain Kubelet Lock File\r\n\t// 3. 
如果开启了ExitOnLockContention，就获取lockFilePath上的文件锁，并监听锁竞争事件\r\n\tif s.ExitOnLockContention && s.LockFilePath == \"\" {\r\n\t\treturn errors.New(\"cannot exit on lock file contention: no lock file specified\")\r\n\t}\r\n\tdone := make(chan struct{})\r\n\tif s.LockFilePath != \"\" {\r\n\t\tklog.Infof(\"acquiring file lock on %q\", s.LockFilePath)\r\n\t\tif err := flock.Acquire(s.LockFilePath); err != nil {\r\n\t\t\treturn fmt.Errorf(\"unable to acquire file lock on %q: %v\", s.LockFilePath, err)\r\n\t\t}\r\n\t\tif s.ExitOnLockContention {\r\n\t\t\tklog.Infof(\"watching for inotify events for: %v\", s.LockFilePath)\r\n\t\t\tif err := watchForLockfileContention(s.LockFilePath, done); err != nil {\r\n\t\t\t\treturn err\r\n\t\t\t}\r\n\t\t}\r\n\t}\r\n\r\n\t// Register current configuration with /configz endpoint\r\n\t// 4. 将当前的KubeletConfiguration注册到 /configz endpoint（注意：不是初始化kubeconfig文件）\r\n\terr = initConfigz(&s.KubeletConfiguration)\r\n\tif err != nil {\r\n\t\tklog.Errorf(\"unable to register KubeletConfiguration with configz, error: %v\", err)\r\n\t}\r\n   \r\n  // 5.判断是否为standaloneMode，如果是的话，就本地调试用的，不用连接kube-apiserver\r\n\t// About to get clients and such, detect standaloneMode\r\n\tstandaloneMode := true\r\n\tif len(s.KubeConfig) > 0 {\r\n\t\tstandaloneMode = false\r\n\t}\r\n\r\n\tif kubeDeps == nil {\r\n\t\tkubeDeps, err = UnsecuredDependencies(s, featureGate)\r\n\t\tif err != nil {\r\n\t\t\treturn err\r\n\t\t}\r\n\t}\r\n  \r\n  // 6.如果是使用的云厂商的服务，调用InitCloudProvider初始化\r\n\tif kubeDeps.Cloud == nil {\r\n\t\tif !cloudprovider.IsExternal(s.CloudProvider) {\r\n\t\t\tcloud, err := cloudprovider.InitCloudProvider(s.CloudProvider, s.CloudConfigFile)\r\n\t\t\tif err != nil {\r\n\t\t\t\treturn err\r\n\t\t\t}\r\n\t\t\tif cloud == nil {\r\n\t\t\t\tklog.V(2).Infof(\"No cloud provider specified: %q from the config file: %q\\n\", s.CloudProvider, s.CloudConfigFile)\r\n\t\t\t} else {\r\n\t\t\t\tklog.V(2).Infof(\"Successfully initialized cloud provider: %q from the config file: %q\\n\", s.CloudProvider, 
s.CloudConfigFile)\r\n\t\t\t}\r\n\t\t\tkubeDeps.Cloud = cloud\r\n\t\t}\r\n\t}\r\n\r\n\thostName, err := nodeutil.GetHostname(s.HostnameOverride)\r\n\tif err != nil {\r\n\t\treturn err\r\n\t}\r\n\tnodeName, err := getNodeName(kubeDeps.Cloud, hostName)\r\n\tif err != nil {\r\n\t\treturn err\r\n\t}\r\n\r\n\t// if in standalone mode, indicate as much by setting all clients to nil\r\n\t// 7.如果不是standaloneMode, 初始化各种客户端\r\n\tswitch {\r\n\tcase standaloneMode:\r\n\t\tkubeDeps.KubeClient = nil\r\n\t\tkubeDeps.EventClient = nil\r\n\t\tkubeDeps.HeartbeatClient = nil\r\n\t\tklog.Warningf(\"standalone mode, no API client\")\r\n\r\n\tcase kubeDeps.KubeClient == nil, kubeDeps.EventClient == nil, kubeDeps.HeartbeatClient == nil:\r\n\t\tclientConfig, closeAllConns, err := buildKubeletClientConfig(s, nodeName)\r\n\t\tif err != nil {\r\n\t\t\treturn err\r\n\t\t}\r\n\t\tif closeAllConns == nil {\r\n\t\t\treturn errors.New(\"closeAllConns must be a valid function other than nil\")\r\n\t\t}\r\n\t\tkubeDeps.OnHeartbeatFailure = closeAllConns\r\n\r\n\t\tkubeDeps.KubeClient, err = clientset.NewForConfig(clientConfig)\r\n\t\tif err != nil {\r\n\t\t\treturn fmt.Errorf(\"failed to initialize kubelet client: %v\", err)\r\n\t\t}\r\n\r\n\t\t// make a separate client for events\r\n\t\teventClientConfig := *clientConfig\r\n\t\teventClientConfig.QPS = float32(s.EventRecordQPS)\r\n\t\teventClientConfig.Burst = int(s.EventBurst)\r\n\t\tkubeDeps.EventClient, err = v1core.NewForConfig(&eventClientConfig)\r\n\t\tif err != nil {\r\n\t\t\treturn fmt.Errorf(\"failed to initialize kubelet event client: %v\", err)\r\n\t\t}\r\n\r\n\t\t// make a separate client for heartbeat with throttling disabled and a timeout attached\r\n\t\theartbeatClientConfig := *clientConfig\r\n\t\theartbeatClientConfig.Timeout = s.KubeletConfiguration.NodeStatusUpdateFrequency.Duration\r\n\t\t// The timeout is the minimum of the lease duration and status update frequency\r\n\t\tleaseTimeout := 
time.Duration(s.KubeletConfiguration.NodeLeaseDurationSeconds) * time.Second\r\n\t\tif heartbeatClientConfig.Timeout > leaseTimeout {\r\n\t\t\theartbeatClientConfig.Timeout = leaseTimeout\r\n\t\t}\r\n\r\n\t\theartbeatClientConfig.QPS = float32(-1)\r\n\t\tkubeDeps.HeartbeatClient, err = clientset.NewForConfig(&heartbeatClientConfig)\r\n\t\tif err != nil {\r\n\t\t\treturn fmt.Errorf(\"failed to initialize kubelet heartbeat client: %v\", err)\r\n\t\t}\r\n\t}\r\n  \r\n\tif kubeDeps.Auth == nil {\r\n\t\tauth, err := BuildAuth(nodeName, kubeDeps.KubeClient, s.KubeletConfiguration)\r\n\t\tif err != nil {\r\n\t\t\treturn err\r\n\t\t}\r\n\t\tkubeDeps.Auth = auth\r\n\t}\r\n\r\n\tvar cgroupRoots []string\r\n  \r\n  \r\n  // 8.配置cgroup，kubelet在已有的cgroup上加了kubepods这一层，用来实现QOS。一般默认是 /sys/fs/cgroup/memory/kubepods\r\n\tcgroupRoots = append(cgroupRoots, cm.NodeAllocatableRoot(s.CgroupRoot, s.CgroupDriver))\r\n\tkubeletCgroup, err := cm.GetKubeletContainer(s.KubeletCgroups)\r\n\tif err != nil {\r\n\t\tklog.Warningf(\"failed to get the kubelet's cgroup: %v.  Kubelet system container metrics may be missing.\", err)\r\n\t} else if kubeletCgroup != \"\" {\r\n\t\tcgroupRoots = append(cgroupRoots, kubeletCgroup)\r\n\t}\r\n  \r\n\truntimeCgroup, err := cm.GetRuntimeContainer(s.ContainerRuntime, s.RuntimeCgroups)\r\n\tif err != nil {\r\n\t\tklog.Warningf(\"failed to get the container runtime's cgroup: %v. 
Runtime system container metrics may be missing.\", err)\r\n\t} else if runtimeCgroup != \"\" {\r\n\t\t// RuntimeCgroups is optional, so ignore if it isn't specified\r\n\t\tcgroupRoots = append(cgroupRoots, runtimeCgroup)\r\n\t}\r\n\r\n\tif s.SystemCgroups != \"\" {\r\n\t\t// SystemCgroups is optional, so ignore if it isn't specified\r\n\t\tcgroupRoots = append(cgroupRoots, s.SystemCgroups)\r\n\t}\r\n  \r\n  // 9.初始化cadvisor\r\n\tif kubeDeps.CAdvisorInterface == nil {\r\n\t\timageFsInfoProvider := cadvisor.NewImageFsInfoProvider(s.ContainerRuntime, s.RemoteRuntimeEndpoint)\r\n\t\tkubeDeps.CAdvisorInterface, err = cadvisor.New(imageFsInfoProvider, s.RootDirectory, cgroupRoots, cadvisor.UsingLegacyCadvisorStats(s.ContainerRuntime, s.RemoteRuntimeEndpoint))\r\n\t\tif err != nil {\r\n\t\t\treturn err\r\n\t\t}\r\n\t}\r\n\r\n\t// Setup event recorder if required.\r\n\t// 10. 初始化eventRecorder，用于上报event\r\n\tmakeEventRecorder(kubeDeps, nodeName)\r\n\r\n\tif kubeDeps.ContainerManager == nil {\r\n\t\tif s.CgroupsPerQOS && s.CgroupRoot == \"\" {\r\n\t\t\tklog.Info(\"--cgroups-per-qos enabled, but --cgroup-root was not specified.  defaulting to /\")\r\n\t\t\ts.CgroupRoot = \"/\"\r\n\t\t}\r\n    \r\n    // 11. 
解析系统保留资源，这些资源是不会给pod分配的。例如：--system-reserved=cpu=2000m,memory=20000Mi\r\n\t\tvar reservedSystemCPUs cpuset.CPUSet\r\n\t\tvar errParse error\r\n\t\tif s.ReservedSystemCPUs != \"\" {\r\n\t\t\treservedSystemCPUs, errParse = cpuset.Parse(s.ReservedSystemCPUs)\r\n\t\t\tif errParse != nil {\r\n\t\t\t\t// invalid cpu list is provided, set reservedSystemCPUs to empty, so it won't overwrite kubeReserved/systemReserved\r\n\t\t\t\tklog.Infof(\"Invalid ReservedSystemCPUs \\\"%s\\\"\", s.ReservedSystemCPUs)\r\n\t\t\t\treturn errParse\r\n\t\t\t}\r\n\t\t\t// is it safe do use CAdvisor here ??\r\n\t\t\tmachineInfo, err := kubeDeps.CAdvisorInterface.MachineInfo()\r\n\t\t\tif err != nil {\r\n\t\t\t\t// if can't use CAdvisor here, fall back to non-explicit cpu list behavor\r\n\t\t\t\tklog.Warning(\"Failed to get MachineInfo, set reservedSystemCPUs to empty\")\r\n\t\t\t\treservedSystemCPUs = cpuset.NewCPUSet()\r\n\t\t\t} else {\r\n\t\t\t\treservedList := reservedSystemCPUs.ToSlice()\r\n\t\t\t\tfirst := reservedList[0]\r\n\t\t\t\tlast := reservedList[len(reservedList)-1]\r\n\t\t\t\tif first < 0 || last >= machineInfo.NumCores {\r\n\t\t\t\t\t// the specified cpuset is outside of the range of what the machine has\r\n\t\t\t\t\tklog.Infof(\"Invalid cpuset specified by --reserved-cpus\")\r\n\t\t\t\t\treturn fmt.Errorf(\"Invalid cpuset %q specified by --reserved-cpus\", s.ReservedSystemCPUs)\r\n\t\t\t\t}\r\n\t\t\t}\r\n\t\t} else {\r\n\t\t\treservedSystemCPUs = cpuset.NewCPUSet()\r\n\t\t}\r\n\r\n\t\tif reservedSystemCPUs.Size() > 0 {\r\n\t\t\t// at cmd option valication phase it is tested either --system-reserved-cgroup or --kube-reserved-cgroup is specified, so overwrite should be ok\r\n\t\t\tklog.Infof(\"Option --reserved-cpus is specified, it will overwrite the cpu setting in KubeReserved=\\\"%v\\\", SystemReserved=\\\"%v\\\".\", s.KubeReserved, s.SystemReserved)\r\n\t\t\tif s.KubeReserved != nil {\r\n\t\t\t\tdelete(s.KubeReserved, \"cpu\")\r\n\t\t\t}\r\n\t\t\tif s.SystemReserved == nil 
{\r\n\t\t\t\ts.SystemReserved = make(map[string]string)\r\n\t\t\t}\r\n\t\t\ts.SystemReserved[\"cpu\"] = strconv.Itoa(reservedSystemCPUs.Size())\r\n\t\t\tklog.Infof(\"After cpu setting is overwritten, KubeReserved=\\\"%v\\\", SystemReserved=\\\"%v\\\"\", s.KubeReserved, s.SystemReserved)\r\n\t\t}\r\n\t\t// 这里会处理内存和其他资源例如pid等\r\n\t\tkubeReserved, err := parseResourceList(s.KubeReserved)\r\n\t\tif err != nil {\r\n\t\t\treturn err\r\n\t\t}\r\n\t\tsystemReserved, err := parseResourceList(s.SystemReserved)\r\n\t\tif err != nil {\r\n\t\t\treturn err\r\n\t\t}\r\n\t\t\r\n\t\t// 12.设置驱逐阈值，例如--eviction-hard=memory.available<1Mi,nodefs.available<1Mi,nodefs.inodesFree<1\r\n\t\tvar hardEvictionThresholds []evictionapi.Threshold\r\n\t\t// If the user requested to ignore eviction thresholds, then do not set valid values for hardEvictionThresholds here.\r\n\t\tif !s.ExperimentalNodeAllocatableIgnoreEvictionThreshold {\r\n\t\t\thardEvictionThresholds, err = eviction.ParseThresholdConfig([]string{}, s.EvictionHard, nil, nil, nil)\r\n\t\t\tif err != nil {\r\n\t\t\t\treturn err\r\n\t\t\t}\r\n\t\t}\r\n\t\texperimentalQOSReserved, err := cm.ParseQOSReserved(s.QOSReserved)\r\n\t\tif err != nil {\r\n\t\t\treturn err\r\n\t\t}\r\n\r\n\t\tdevicePluginEnabled := utilfeature.DefaultFeatureGate.Enabled(features.DevicePlugins)\r\n    \r\n    //13.利用上面的配置，实例化NewContainerManager对象\r\n\t\tkubeDeps.ContainerManager, err = cm.NewContainerManager(\r\n\t\t\tkubeDeps.Mounter,\r\n\t\t\tkubeDeps.CAdvisorInterface,\r\n\t\t\tcm.NodeConfig{\r\n\t\t\t\tRuntimeCgroupsName:    s.RuntimeCgroups,\r\n\t\t\t\tSystemCgroupsName:     s.SystemCgroups,\r\n\t\t\t\tKubeletCgroupsName:    s.KubeletCgroups,\r\n\t\t\t\tContainerRuntime:      s.ContainerRuntime,\r\n\t\t\t\tCgroupsPerQOS:         s.CgroupsPerQOS,\r\n\t\t\t\tCgroupRoot:            s.CgroupRoot,\r\n\t\t\t\tCgroupDriver:          s.CgroupDriver,\r\n\t\t\t\tKubeletRootDir:        s.RootDirectory,\r\n\t\t\t\tProtectKernelDefaults: 
s.ProtectKernelDefaults,\r\n\t\t\t\tNodeAllocatableConfig: cm.NodeAllocatableConfig{\r\n\t\t\t\t\tKubeReservedCgroupName:   s.KubeReservedCgroup,\r\n\t\t\t\t\tSystemReservedCgroupName: s.SystemReservedCgroup,\r\n\t\t\t\t\tEnforceNodeAllocatable:   sets.NewString(s.EnforceNodeAllocatable...),\r\n\t\t\t\t\tKubeReserved:             kubeReserved,\r\n\t\t\t\t\tSystemReserved:           systemReserved,\r\n\t\t\t\t\tReservedSystemCPUs:       reservedSystemCPUs,\r\n\t\t\t\t\tHardEvictionThresholds:   hardEvictionThresholds,\r\n\t\t\t\t},\r\n\t\t\t\tQOSReserved:                           *experimentalQOSReserved,\r\n\t\t\t\tExperimentalCPUManagerPolicy:          s.CPUManagerPolicy,\r\n\t\t\t\tExperimentalCPUManagerReconcilePeriod: s.CPUManagerReconcilePeriod.Duration,\r\n\t\t\t\tExperimentalPodPidsLimit:              s.PodPidsLimit,\r\n\t\t\t\tEnforceCPULimits:                      s.CPUCFSQuota,\r\n\t\t\t\tCPUCFSQuotaPeriod:                     s.CPUCFSQuotaPeriod.Duration,\r\n\t\t\t\tExperimentalTopologyManagerPolicy:     s.TopologyManagerPolicy,\r\n\t\t\t},\r\n\t\t\ts.FailSwapOn,\r\n\t\t\tdevicePluginEnabled,\r\n\t\t\tkubeDeps.Recorder)\r\n\r\n\t\tif err != nil {\r\n\t\t\treturn err\r\n\t\t}\r\n\t}\r\n \r\n\tif err := checkPermissions(); err != nil {\r\n\t\tklog.Error(err)\r\n\t}\r\n\r\n\tutilruntime.ReallyCrash = s.ReallyCrashForTesting\r\n\r\n\t// TODO(vmarmol): Do this through container config.\r\n\t// 13. oomAdjuster设置容器进程的oom score\r\n\toomAdjuster := kubeDeps.OOMAdjuster\r\n\tif err := oomAdjuster.ApplyOOMScoreAdj(0, int(s.OOMScoreAdj)); err != nil {\r\n\t\tklog.Warning(err)\r\n\t}\r\n  \r\n  //14. 
调用RunKubelet，继续运行Kubelet核心逻辑\r\n\tif err := RunKubelet(s, kubeDeps, s.RunOnce); err != nil {\r\n\t\treturn err\r\n\t}\r\n\r\n\t// If the kubelet config controller is available, and dynamic config is enabled, start the config and status sync loops\r\n\tif utilfeature.DefaultFeatureGate.Enabled(features.DynamicKubeletConfig) && len(s.DynamicConfigDir.Value()) > 0 &&\r\n\t\tkubeDeps.KubeletConfigController != nil && !standaloneMode && !s.RunOnce {\r\n\t\tif err := kubeDeps.KubeletConfigController.StartSync(kubeDeps.KubeClient, kubeDeps.EventClient, string(nodeName)); err != nil {\r\n\t\t\treturn err\r\n\t\t}\r\n\t}\r\n  \r\n  //15.开启健康检查服务（healthz）\r\n\tif s.HealthzPort > 0 {\r\n\t\tmux := http.NewServeMux()\r\n\t\thealthz.InstallHandler(mux)\r\n\t\tgo wait.Until(func() {\r\n\t\t\terr := http.ListenAndServe(net.JoinHostPort(s.HealthzBindAddress, strconv.Itoa(int(s.HealthzPort))), mux)\r\n\t\t\tif err != nil {\r\n\t\t\t\tklog.Errorf(\"Starting healthz server failed: %v\", err)\r\n\t\t\t}\r\n\t\t}, 5*time.Second, wait.NeverStop)\r\n\t}\r\n  \r\n\tif s.RunOnce {\r\n\t\treturn nil\r\n\t}\r\n   \r\n\t// If systemd is used, notify it that we have started\r\n\tgo daemon.SdNotify(false, \"READY=1\")\r\n\r\n\tselect {\r\n\tcase <-done:\r\n\t\tbreak\r\n\tcase <-stopCh:\r\n\t\tbreak\r\n\t}\r\n\r\n\treturn nil\r\n}\r\n```\r\n\r\n<br>\r\n\r\n### 5. RunKubelet\r\n\r\nRunKubelet 主要流程：\r\n\r\n1. 获取主机名，并初始化 event recorder，用于往apiserver发送事件\r\n2. 设置Kubelet拥有特权\r\n3. 设置kubelet rootdir，默认是 /var/lib/kubelet\r\n4. 调用createAndInitKubelet函数初始化kubelet对象\r\n5. 若设置 `runonce` 参数，则只拉取一次容器组配置，并在启动容器组后退出，否则将以 server 形式持续运行\r\n\r\n- 对于runonce，首先创建所需的目录，监听 pod update 信息，得到 pod 信息后，创建pod 并返回它们的状态\r\n- 否则以 server 模式启动，调用startKubelet函数。\r\n\r\n到这里关注两个点：\r\n\r\n* 第一：createAndInitKubelet函数做了什么\r\n\r\n* 第二：跟进startKubelet的逻辑\r\n\r\n```\r\n// RunKubelet is responsible for setting up and running a kubelet.  
It is used in three different applications:\r\n//   1 Integration tests\r\n//   2 Kubelet binary\r\n//   3 Standalone 'kubernetes' binary\r\n// Eventually, #2 will be replaced with instances of #3\r\nfunc RunKubelet(kubeServer *options.KubeletServer, kubeDeps *kubelet.Dependencies, runOnce bool) error {\r\n  // 1.获取主机名, 并建立并初始化 event recorder, 用于往apiserver发送事件\r\n\thostname, err := nodeutil.GetHostname(kubeServer.HostnameOverride)\r\n\tif err != nil {\r\n\t\treturn err\r\n\t}\r\n\t// Query the cloud provider for our node name, default to hostname if kubeDeps.Cloud == nil\r\n\tnodeName, err := getNodeName(kubeDeps.Cloud, hostname)\r\n\tif err != nil {\r\n\t\treturn err\r\n\t}\r\n\t// Setup event recorder if required.\r\n\tmakeEventRecorder(kubeDeps, nodeName)\r\n  \r\n  //2.设置Kubelet特权\r\n\tcapabilities.Initialize(capabilities.Capabilities{\r\n\t\tAllowPrivileged: true,\r\n\t})\r\n  \r\n  // 3.设置kubelet rootdir，默认是 /var/lib/kubelet\r\n\tcredentialprovider.SetPreferredDockercfgPath(kubeServer.RootDirectory)\r\n\tklog.V(2).Infof(\"Using root directory: %v\", kubeServer.RootDirectory)\r\n\r\n\tif kubeDeps.OSInterface == nil {\r\n\t\tkubeDeps.OSInterface = kubecontainer.RealOS{}\r\n\t}\r\n  \r\n  // 4. 
调用createAndInitKubelet函数初始化kubelet对象\r\n\tk, err := createAndInitKubelet(&kubeServer.KubeletConfiguration,\r\n\t\tkubeDeps,\r\n\t\t&kubeServer.ContainerRuntimeOptions,\r\n\t\tkubeServer.ContainerRuntime,\r\n\t\tkubeServer.RuntimeCgroups,\r\n\t\tkubeServer.HostnameOverride,\r\n\t\tkubeServer.NodeIP,\r\n\t\tkubeServer.ProviderID,\r\n\t\tkubeServer.CloudProvider,\r\n\t\tkubeServer.CertDirectory,\r\n\t\tkubeServer.RootDirectory,\r\n\t\tkubeServer.RegisterNode,\r\n\t\tkubeServer.RegisterWithTaints,\r\n\t\tkubeServer.AllowedUnsafeSysctls,\r\n\t\tkubeServer.RemoteRuntimeEndpoint,\r\n\t\tkubeServer.RemoteImageEndpoint,\r\n\t\tkubeServer.ExperimentalMounterPath,\r\n\t\tkubeServer.ExperimentalKernelMemcgNotification,\r\n\t\tkubeServer.ExperimentalCheckNodeCapabilitiesBeforeMount,\r\n\t\tkubeServer.ExperimentalNodeAllocatableIgnoreEvictionThreshold,\r\n\t\tkubeServer.MinimumGCAge,\r\n\t\tkubeServer.MaxPerPodContainerCount,\r\n\t\tkubeServer.MaxContainerCount,\r\n\t\tkubeServer.MasterServiceNamespace,\r\n\t\tkubeServer.RegisterSchedulable,\r\n\t\tkubeServer.NonMasqueradeCIDR,\r\n\t\tkubeServer.KeepTerminatedPodVolumes,\r\n\t\tkubeServer.NodeLabels,\r\n\t\tkubeServer.SeccompProfileRoot,\r\n\t\tkubeServer.BootstrapCheckpointPath,\r\n\t\tkubeServer.NodeStatusMaxImages)\r\n\tif err != nil {\r\n\t\treturn fmt.Errorf(\"failed to create kubelet: %v\", err)\r\n\t}\r\n\r\n\t// NewMainKubelet should have set up a pod source config if one didn't exist\r\n\t// when the builder was run. 
This is just a precaution.\r\n\tif kubeDeps.PodConfig == nil {\r\n\t\treturn fmt.Errorf(\"failed to create kubelet, pod source config was nil\")\r\n\t}\r\n\tpodCfg := kubeDeps.PodConfig\r\n\r\n\trlimit.RlimitNumFiles(uint64(kubeServer.MaxOpenFiles))\r\n  \r\n  // 5.调用startKubelet运行kubelet\r\n\t// process pods and exit.\r\n\tif runOnce {\r\n\t\tif _, err := k.RunOnce(podCfg.Updates()); err != nil {\r\n\t\t\treturn fmt.Errorf(\"runonce failed: %v\", err)\r\n\t\t}\r\n\t\tklog.Info(\"Started kubelet as runonce\")\r\n\t} else {\r\n\t\tstartKubelet(k, podCfg, &kubeServer.KubeletConfiguration, kubeDeps, kubeServer.EnableCAdvisorJSONEndpoints, kubeServer.EnableServer)\r\n\t\tklog.Info(\"Started kubelet\")\r\n\t}\r\n\treturn nil\r\n}\r\n```\r\n\r\n#### 5.1. CreateAndInitKubelet 函数\r\n\r\n主要逻辑：\r\n\r\n（1）根据各种配置生成 NewMainKubelet，并且初始化各种manager，例如livenessManager,statusManager,podManager等等\r\n\r\n（2）BirthCry ,往apiserver发送一个启动 kubelet的事件\r\n\r\n```\r\n// BirthCry sends an event that the kubelet has started up.\r\nfunc (kl *Kubelet) BirthCry() {\r\n\t// Make an event that kubelet restarted.\r\n\tkl.recorder.Eventf(kl.nodeRef, v1.EventTypeNormal, events.StartingKubelet, \"Starting kubelet.\")\r\n}\r\n```\r\n\r\n（3）启动垃圾回收，具体就是后台启动多个协程，进行container, image的垃圾回收。\r\n\r\n```\r\nfunc CreateAndInitKubelet(kubeCfg *kubeletconfiginternal.KubeletConfiguration,\r\n\tkubeDeps *kubelet.Dependencies,\r\n\tcrOptions *config.ContainerRuntimeOptions,\r\n\tcontainerRuntime string,\r\n\truntimeCgroups string,\r\n\thostnameOverride string,\r\n\tnodeIP string,\r\n\tproviderID string,\r\n\tcloudProvider string,\r\n\tcertDirectory string,\r\n\trootDirectory string,\r\n\tregisterNode bool,\r\n\tregisterWithTaints []api.Taint,\r\n\tallowedUnsafeSysctls []string,\r\n\tremoteRuntimeEndpoint string,\r\n\tremoteImageEndpoint string,\r\n\texperimentalMounterPath string,\r\n\texperimentalKernelMemcgNotification bool,\r\n\texperimentalCheckNodeCapabilitiesBeforeMount 
bool,\r\n\texperimentalNodeAllocatableIgnoreEvictionThreshold bool,\r\n\tminimumGCAge metav1.Duration,\r\n\tmaxPerPodContainerCount int32,\r\n\tmaxContainerCount int32,\r\n\tmasterServiceNamespace string,\r\n\tregisterSchedulable bool,\r\n\tnonMasqueradeCIDR string,\r\n\tkeepTerminatedPodVolumes bool,\r\n\tnodeLabels map[string]string,\r\n\tseccompProfileRoot string,\r\n\tbootstrapCheckpointPath string,\r\n\tnodeStatusMaxImages int32) (k kubelet.Bootstrap, err error) {\r\n\t// TODO: block until all sources have delivered at least one update to the channel, or break the sync loop\r\n\t// up into \"per source\" synchronizations\r\n\r\n\tk, err = kubelet.NewMainKubelet(kubeCfg,\r\n\t\tkubeDeps,\r\n\t\tcrOptions,\r\n\t\tcontainerRuntime,\r\n\t\truntimeCgroups,\r\n\t\thostnameOverride,\r\n\t\tnodeIP,\r\n\t\tproviderID,\r\n\t\tcloudProvider,\r\n\t\tcertDirectory,\r\n\t\trootDirectory,\r\n\t\tregisterNode,\r\n\t\tregisterWithTaints,\r\n\t\tallowedUnsafeSysctls,\r\n\t\tremoteRuntimeEndpoint,\r\n\t\tremoteImageEndpoint,\r\n\t\texperimentalMounterPath,\r\n\t\texperimentalKernelMemcgNotification,\r\n\t\texperimentalCheckNodeCapabilitiesBeforeMount,\r\n\t\texperimentalNodeAllocatableIgnoreEvictionThreshold,\r\n\t\tminimumGCAge,\r\n\t\tmaxPerPodContainerCount,\r\n\t\tmaxContainerCount,\r\n\t\tmasterServiceNamespace,\r\n\t\tregisterSchedulable,\r\n\t\tnonMasqueradeCIDR,\r\n\t\tkeepTerminatedPodVolumes,\r\n\t\tnodeLabels,\r\n\t\tseccompProfileRoot,\r\n\t\tbootstrapCheckpointPath,\r\n\t\tnodeStatusMaxImages)\r\n\tif err != nil {\r\n\t\treturn nil, err\r\n\t}\r\n\r\n\tk.BirthCry()\r\n\r\n\tk.StartGarbageCollection()\r\n\r\n\treturn k, nil\r\n}\r\n```\r\n\r\n<br>\r\n\r\n#### 5.2 
NewMainKubelet\r\n\r\n`CreateAndInitKubelet`方法中执行的核心函数是`NewMainKubelet`，`NewMainKubelet`实例化一个`kubelet`对象，该部分的具体代码在`kubernetes/pkg/kubelet`中，具体参考：[kubernetes/pkg/kubelet/kubelet.go#L325](https://github.com/kubernetes/kubernetes/blob/0ed33881dc4355495f623c6f22e7dd0b7632b7c0/pkg/kubelet/kubelet.go#L325)。\r\n\r\n**NewMainKubelet 实例化一个 kubelet 对象，并对 kubelet 内部各个 component 进行初始化工作:**\r\n\r\n- containerGC      // 容器的垃圾回收\r\n- statusManager  // pod 状态的管理\r\n- imageManager  // 镜像的管理\r\n- probeManager   // 容器健康检测\r\n- gpuManager      // GPU 的支持\r\n- PodCache          // Pod 缓存的管理\r\n- secretManager   // secret 资源的管理\r\n- configMapManager  // configMap 资源的管理\r\n- InitNetworkPlugin     // 网络插件的初始化\r\n- PodManager // 对 pod 的管理, e.g., CRUD\r\n- makePodSourceConfig // pod 元数据的来源 (FILE, URL, api-server)\r\n- diskSpaceManager  // 磁盘空间的管理\r\n- ContainerRuntime  // 容器运行时的选择(docker 或 rkt)\r\n\r\n<br>\r\n\r\n#### 5.3 startKubelet\r\n\r\nCreateAndInitKubelet之后，马上就是startKubelet了，其核心是调用 kubelet.Run 函数。\r\n\r\n主要逻辑如下：\r\n\r\n1. 检查 logserver 以及 apiserver 是否可用\r\n\r\n2. 如有 cloud provider 配置，则启动`cloudResourceSyncManager`，将请求发送给 cloud provider\r\n\r\n3. 启动 volumeManager，VolumeManager运行一组异步循环，这些循环根据在此节点上调度的Pod来确定需要附加/装入/卸载/分离的卷。\r\n\r\n4. 调用 `kubelet.syncNodeStatus`同步 node 状态，如果从上次同步起有任何更改 或 经过了足够的时间，它将节点状态同步到主节点，并在必要时先注册kubelet。\r\n\r\n5. 调用 `kubelet.updateRuntimeUp`，updateRuntimeUp调用容器运行时状态回调，在容器运行时首次出现时初始化依赖于运行时的模块，如果状态检查失败，则返回错误。 如果状态检查正常，则在kubelet runtimeState中更新容器运行时正常运行时间。\r\n\r\n6. 开启循环，同步 iptables 规则（*但是在源码中无任何操作*）\r\n\r\n7. 启动一个用于“杀死pod” 的 goroutine，如果尚未使用其他goroutine，则podKiller会从通道(`podKillingCh`)中接收到一个pod，然后启动goroutine杀死它\r\n\r\n8. 启动 statusManager 和 probeManager （都是无限循环的同步机制），statusManager 与 apiserver 同步pod状态；probeManager 管理并接收 container 探针。\r\n\r\n9. 启动 runtimeClass manager，默认是docker\r\n\r\n   > runtimeClass 是 K8s 的一个 api 对象，可以通过定义 runtimeClass 实现 K8s 对接不同的 容器运行时。\r\n\r\n10. 
启动 pleg （pod lifecycle event generator），用于生成 pod 相关的 event。\r\n\r\n至此，kubelet 整体的启动流程完毕，进入无限循环中，实时同步不同组件的状态。同时也对端口进行监听，响应 http 请求。\r\n\r\n```\r\nfunc startKubelet(k kubelet.Bootstrap, podCfg *config.PodConfig, kubeCfg *kubeletconfiginternal.KubeletConfiguration, kubeDeps *kubelet.Dependencies, enableServer bool) {\r\n\t// start the kubelet\r\n\tgo wait.Until(func() {\r\n\t\tk.Run(podCfg.Updates())\r\n\t}, 0, wait.NeverStop)\r\n\r\n\t// start the kubelet server\r\n\tif enableServer {\r\n\t\tgo k.ListenAndServe(net.ParseIP(kubeCfg.Address), uint(kubeCfg.Port), kubeDeps.TLSOptions, kubeDeps.Auth, kubeCfg.EnableDebuggingHandlers, kubeCfg.EnableContentionProfiling)\r\n\r\n\t}\r\n\tif kubeCfg.ReadOnlyPort > 0 {\r\n\t\tgo k.ListenAndServeReadOnly(net.ParseIP(kubeCfg.Address), uint(kubeCfg.ReadOnlyPort))\r\n\t}\r\n}\r\n```\r\n\r\n```\r\n// Run starts the kubelet reacting to config updates\r\nfunc (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {\r\n\tif kl.logServer == nil {\r\n\t\tkl.logServer = http.StripPrefix(\"/logs/\", http.FileServer(http.Dir(\"/var/log/\")))\r\n\t}\r\n\tif kl.kubeClient == nil {\r\n\t\tklog.Warning(\"No api server defined - no node status update will be sent.\")\r\n\t}\r\n\r\n\t// Start the cloud provider sync manager\r\n\tif kl.cloudResourceSyncManager != nil {\r\n\t\tgo kl.cloudResourceSyncManager.Run(wait.NeverStop)\r\n\t}\r\n\r\n\tif err := kl.initializeModules(); err != nil {\r\n\t\tkl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.KubeletSetupFailed, err.Error())\r\n\t\tklog.Fatal(err)\r\n\t}\r\n\r\n\t// Start volume manager\r\n\tgo kl.volumeManager.Run(kl.sourcesReady, wait.NeverStop)\r\n\r\n\tif kl.kubeClient != nil {\r\n\t\t// Start syncing node status immediately, this may set up things the runtime needs to run.\r\n\t\tgo wait.Until(kl.syncNodeStatus, kl.nodeStatusUpdateFrequency, wait.NeverStop)\r\n\t\tgo kl.fastStatusUpdateOnce()\r\n\r\n\t\t// start syncing lease\r\n\t\tgo 
kl.nodeLeaseController.Run(wait.NeverStop)\r\n\t}\r\n\tgo wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)\r\n\r\n\t// Set up iptables util rules\r\n\tif kl.makeIPTablesUtilChains {\r\n\t\tkl.initNetworkUtil()\r\n\t}\r\n\r\n\t// Start a goroutine responsible for killing pods (that are not properly\r\n\t// handled by pod workers).\r\n\tgo wait.Until(kl.podKiller, 1*time.Second, wait.NeverStop)\r\n\r\n\t// Start component sync loops.\r\n\tkl.statusManager.Start()\r\n\tkl.probeManager.Start()\r\n\r\n\t// Start syncing RuntimeClasses if enabled.\r\n\tif kl.runtimeClassManager != nil {\r\n\t\tkl.runtimeClassManager.Start(wait.NeverStop)\r\n\t}\r\n\r\n\t// Start the pod lifecycle event generator.\r\n\tkl.pleg.Start()\r\n\tkl.syncLoop(updates, kl)\r\n}\r\n```\r\n\r\n<br>\r\n\r\n### 6. 总结\r\n\r\n1. kubelet采用[Cobra](https://github.com/spf13/cobra)命令行框架和[pflag](https://github.com/spf13/pflag)参数解析框架，和apiserver、scheduler、controller-manager形成统一的代码风格。\r\n2. `kubernetes/cmd/kubelet`部分主要对运行参数进行定义和解析，初始化和构造相关的依赖组件（主要在`kubeDeps`结构体中），并没有kubelet运行的详细逻辑，该部分位于`kubernetes/pkg/kubelet`模块。\r\n3. cmd部分调用流程如下：`Main-->NewKubeletCommand-->Run(kubeletServer, kubeletDeps, stopCh)-->run(s *options.KubeletServer, kubeDeps ..., stopCh ...)--> RunKubelet(s, kubeDeps, s.RunOnce)-->startKubelet-->k.Run(podCfg.Updates())-->pkg/kubelet`。同时`RunKubelet(s, kubeDeps, s.RunOnce)-->CreateAndInitKubelet-->kubelet.NewMainKubelet-->pkg/kubelet`。\r\n4. 整体而言，到目前为止都是进行初始化工作。初始化kubelet的各种控制器，然后运行kl.syncLoop(updates, kl)处理Pod。\r\n\r\n另一种描述：\r\n\r\n1. 初始化模块，其实就是运行`imageManager`、`serverCertificateManager`、`oomWatcher`、`resourceAnalyzer`。\r\n2. 运行各种manager，大部分以常驻goroutine的方式运行，其中包括`volumeManager`、`statusManager`等。\r\n3. 
执行处理变更的循环函数`syncLoop`，对pod的生命周期进行管理。\r\n\r\n这里直接上图：图片来源: https://www.bookstack.cn/read/source-code-reading-notes/kubernetes-kubelet_init.md\r\n\r\n![kubelet-func-chanel](../images/kubelet-func-chanel.png)\r\n\r\n\r\n\r\n\r\n\r\n##### 补充各种 Manager\r\n\r\n以下介绍kubelet运行时涉及到的manager的内容。\r\n\r\n| manager                  | 说明                                               |\r\n| ------------------------ | -------------------------------------------------- |\r\n| imageManager             | 负责镜像垃圾回收                                   |\r\n| serverCertificateManager | 负责处理证书                                       |\r\n| oomWatcher               | 监控内存使用，是否发生内存耗尽即OOM                |\r\n| resourceAnalyzer         | 监控资源使用情况                                   |\r\n| volumeManager            | 对pod执行`attached/detached/mounted/unmounted`操作 |\r\n| statusManager            | 使用apiserver同步pods状态; 也用作状态缓存          |\r\n| probeManager             | 处理容器探针                                       |\r\n| runtimeClassManager      | 同步RuntimeClasses                                 |\r\n| podKiller                | 负责杀死pod                                        |\r\n\r\n<br>\r\n\r\n### 7 参考\r\n\r\nhttps://www.huweihuang.com/kubernetes-notes/code-analysis/kubelet/NewKubeletCommand.html\r\n\r\nhttps://blog.csdn.net/jimzbq/article/details/104282753\r\n\r\nhttps://www.bookstack.cn/read/source-code-reading-notes/kubernetes-kubelet_init.md   "
  },
  {
    "path": "k8s/kubelet/3-kubelet初始化流程-下.md",
    "content": "* [1\\. 背景](#1-背景)\n* [2\\. pleg\\.Start](#2-plegstart)\n* [3\\.syncLoop](#3syncloop)\n  * [3\\.1 syncLoopIteration 相关channel介绍](#31-syncloopiteration-相关channel介绍)\n    * [3\\.1\\.1 configCh](#311-configch)\n    * [3\\.1\\.2 plegCh](#312-plegch)\n    * [3\\.1\\.3 syncCh](#313-syncch)\n    * [3\\.1\\.4 houseKeepingCh](#314-housekeepingch)\n    * [3\\.1\\.5 livenessManager\\.Updates](#315-livenessmanagerupdates)\n    * [3\\.1\\.6 SyncHandler](#316-synchandler)\n  * [3\\.2 syncLoopIteration源码分析](#32-syncloopiteration源码分析)\n    * [3\\.2\\.1 case1\\-configCh](#321-case1-configch)\n    * [3\\.2\\.2 case2\\-plegCh](#322-case2-plegch)\n    * [3\\.2\\.3 case3\\-syncCh](#323-case3-syncch)\n    * [3\\.2\\.4 case4\\-livenessManager\\.Updates](#324-case4-livenessmanagerupdates)\n    * [3\\.2\\.5 housekeepingCh](#325-housekeepingch)\n* [4\\. 总结](#4-总结)\n\n### 1. 背景\n\n书接上文，在Kubelet.Run函数中，通过pleg.Start和kl.syncLoop来处理pod，所以本文从这里开始，从源码角度了解pod创建的整个过程。\n\n```\nkl.pleg.Start()\nkl.syncLoop(updates, kl)\n```\n\n### 2. pleg.Start\n\npleg.Start每隔1秒，运行一次relist函数，relist的逻辑如下：\n\n1. 记录上一次relist的时间和间隔\n2. 通过runtimeApi获取所有的pod，包括exit的Pod\n3. 更新pods container状态,以及记录pod数量等metrics\n4. 和旧的Pod进行对比，podRecord结构体保存了旧的，和当前pod的信息，可以理解为和1s前的所有pods进行对比。对比完产生event，保存在一个map中。这里主要产生的事件为：ContainerStarted，ContainerDied, ContainerRemoved, ContainerChanged等等\n5. 如果event和Pod有绑定，并且kubelet开启了cache缓存pod信息，根据最新的信息同步缓存。注意，这里updateCache可能会删除cache。当pod创建时，cache保存这个Pod数据，当pod所有容器died的时候，cache删除这个Pod信息。 后面判断pod是否可能被删除的时候，会判断cache是否有这个数据。\n6. 将新的record赋值为旧的，为下一轮做准备，然后依次处理event，逻辑为：不是ContainerChanged状态的event都发送到eventChannel中去\n7. 
更新缓存，如果有更新失败的，记录到needsReinspection，表示下一次还需要重试\n\n这里需要注意的是：查看generateEvents函数，可以发现只有当新的container状态为plegContainerUnknown，才会产生ContainerChanged。这就是发送event时为什么会跳过ContainerChanged的原因。\n\n```\n// Start spawns a goroutine to relist periodically.\nfunc (g *GenericPLEG) Start() {\n\tgo wait.Until(g.relist, g.relistPeriod, wait.NeverStop)\n}\n\n// 每隔1s进行一次g.relist\nplegRelistPeriod = time.Second * 1\n\n\n// relist queries the container runtime for list of pods/containers, compare\n// with the internal pods/containers, and generates events accordingly.\nfunc (g *GenericPLEG) relist() {\n\tklog.V(5).Infof(\"GenericPLEG: Relisting\")\n  \n  // 1.记录上一次relist的时间和间隔\n\tif lastRelistTime := g.getRelistTime(); !lastRelistTime.IsZero() {\n\t\tmetrics.PLEGRelistInterval.Observe(metrics.SinceInSeconds(lastRelistTime))\n\t\tmetrics.DeprecatedPLEGRelistInterval.Observe(metrics.SinceInMicroseconds(lastRelistTime))\n\t}\n  \n\ttimestamp := g.clock.Now()\n\tdefer func() {\n\t\tmetrics.PLEGRelistDuration.Observe(metrics.SinceInSeconds(timestamp))\n\t\tmetrics.DeprecatedPLEGRelistLatency.Observe(metrics.SinceInMicroseconds(timestamp))\n\t}()\n\n\t// Get all the pods.\n\t// 2. 
通过runtimeApi获取所有的pod，包括exit的Pod\n\tpodList, err := g.runtime.GetPods(true)\n\tif err != nil {\n\t\tklog.Errorf(\"GenericPLEG: Unable to retrieve pods: %v\", err)\n\t\treturn\n\t}\n\n\tg.updateRelistTime(timestamp)\n  \n  // 3.更新pods container状态,以及记录有哪些pod\n\tpods := kubecontainer.Pods(podList)\n\t// update running pod and container count\n\tupdateRunningPodAndContainerMetrics(pods)\n\tg.podRecords.setCurrent(pods)\n\n\t// Compare the old and the current pods, and generate events.\n\teventsByPodID := map[types.UID][]*PodLifecycleEvent{}\n\t// 4.和旧的Pod进行对比，podRecord保存了旧的，和当前pod的信息。可以理解为和1s前的所有pods进行对比\n\t// type podRecord struct {\n\t//    old     *kubecontainer.Pod\n\t//    current *kubecontainer.Pod\n  //  }\n\t// 这里主要产生的事件为：ContainerStarted，ContainerDie, ContainerRemoved, ContainerChanged等等，然后保存在一个map中\n\tfor pid := range g.podRecords {\n\t\toldPod := g.podRecords.getOld(pid)\n\t\tpod := g.podRecords.getCurrent(pid)\n\t\t// Get all containers in the old and the new pod.\n\t\tallContainers := getContainersFromPods(oldPod, pod)\n\t\tfor _, container := range allContainers {\n\t\t\tevents := computeEvents(oldPod, pod, &container.ID)\n\t\t\tfor _, e := range events {\n\t\t\t\tupdateEvents(eventsByPodID, e)\n\t\t\t}\n\t\t}\n\t}\n\n\tvar needsReinspection map[types.UID]*kubecontainer.Pod\n\tif g.cacheEnabled() {\n\t\tneedsReinspection = make(map[types.UID]*kubecontainer.Pod)\n\t}\n\n\t// If there are events associated with a pod, we should update the\n\t// podCache.\n\t// 5.如果event和Pod有绑定，并且kubelet开启了cache缓存pod信息，根据最新的信息同步缓存\n\tfor pid, events := range eventsByPodID {\n\t\tpod := g.podRecords.getCurrent(pid)\n\t\tif g.cacheEnabled() {\n\t\t\t// updateCache() will inspect the pod and update the cache. If an\n\t\t\t// error occurs during the inspection, we want PLEG to retry again\n\t\t\t// in the next relist. 
To achieve this, we do not update the\n\t\t\t// associated podRecord of the pod, so that the change will be\n\t\t\t// detect again in the next relist.\n\t\t\t// TODO: If many pods changed during the same relist period,\n\t\t\t// inspecting the pod and getting the PodStatus to update the cache\n\t\t\t// serially may take a while. We should be aware of this and\n\t\t\t// parallelize if needed.\n\t\t\tif err := g.updateCache(pod, pid); err != nil {\n\t\t\t\t// Rely on updateCache calling GetPodStatus to log the actual error.\n\t\t\t\tklog.V(4).Infof(\"PLEG: Ignoring events for pod %s/%s: %v\", pod.Name, pod.Namespace, err)\n\n\t\t\t\t// make sure we try to reinspect the pod during the next relisting\n\t\t\t\tneedsReinspection[pid] = pod\n\n\t\t\t\tcontinue\n\t\t\t} else {\n\t\t\t\t// this pod was in the list to reinspect and we did so because it had events, so remove it\n\t\t\t\t// from the list (we don't want the reinspection code below to inspect it a second time in\n\t\t\t\t// this relist execution)\n\t\t\t\tdelete(g.podsToReinspect, pid)\n\t\t\t}\n\t\t}\n\t\t// Update the internal storage and send out the events.\n\t\t// 6. 
将新的record赋值为旧的，为下一轮做准备，然后依次处理event，逻辑为：不是ContainerChanged状态的event都发送到eventChannel中去\n\t\tg.podRecords.update(pid)\n\t\tfor i := range events {\n\t\t\t// Filter out events that are not reliable and no other components use yet.\n\t\t\tif events[i].Type == ContainerChanged {\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tselect {\n\t\t\tcase g.eventChannel <- events[i]:\n\t\t\tdefault:\n\t\t\t\tmetrics.PLEGDiscardEvents.WithLabelValues().Inc()\n\t\t\t\tklog.Error(\"event channel is full, discard this relist() cycle event\")\n\t\t\t}\n\t\t}\n\t}\n  \n  // 7.更新缓存，如果有更新失败的，记录到needsReinspection，表示下一次还需要重试\n\tif g.cacheEnabled() {\n\t\t// reinspect any pods that failed inspection during the previous relist\n\t\tif len(g.podsToReinspect) > 0 {\n\t\t\tklog.V(5).Infof(\"GenericPLEG: Reinspecting pods that previously failed inspection\")\n\t\t\tfor pid, pod := range g.podsToReinspect {\n\t\t\t\tif err := g.updateCache(pod, pid); err != nil {\n\t\t\t\t\t// Rely on updateCache calling GetPodStatus to log the actual error.\n\t\t\t\t\tklog.V(5).Infof(\"PLEG: pod %s/%s failed reinspection: %v\", pod.Name, pod.Namespace, err)\n\t\t\t\t\tneedsReinspection[pid] = pod\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\n\t\t// Update the cache timestamp.  
This needs to happen *after*\n\t\t// all pods have been properly updated in the cache.\n\t\tg.cache.UpdateTime(timestamp)\n\t}\n\n\t// make sure we retain the list of pods that need reinspecting the next time relist is called\n\tg.podsToReinspect = needsReinspection\n}\n```\n\n<br>\n\n从generateEvents可以看出来，这里产生的event和k8s中的event不同。这里指的是PodLifecycleEvent。\n\n```\nfunc generateEvents(podID types.UID, cid string, oldState, newState plegContainerState) []*PodLifecycleEvent {\n\tif newState == oldState {\n\t\treturn nil\n\t}\n\n\tklog.V(4).Infof(\"GenericPLEG: %v/%v: %v -> %v\", podID, cid, oldState, newState)\n\tswitch newState {\n\tcase plegContainerRunning:\n\t\treturn []*PodLifecycleEvent{{ID: podID, Type: ContainerStarted, Data: cid}}\n\tcase plegContainerExited:\n\t\treturn []*PodLifecycleEvent{{ID: podID, Type: ContainerDied, Data: cid}}\n\tcase plegContainerUnknown:\n\t\treturn []*PodLifecycleEvent{{ID: podID, Type: ContainerChanged, Data: cid}}\n\tcase plegContainerNonExistent:\n\t\tswitch oldState {\n\t\tcase plegContainerExited:\n\t\t\t// We already reported that the container died before.\n\t\t\treturn []*PodLifecycleEvent{{ID: podID, Type: ContainerRemoved, Data: cid}}\n\t\tdefault:\n\t\t\treturn []*PodLifecycleEvent{{ID: podID, Type: ContainerDied, Data: cid}, {ID: podID, Type: ContainerRemoved, Data: cid}}\n\t\t}\n\tdefault:\n\t\tpanic(fmt.Sprintf(\"unrecognized container state: %v\", newState))\n\t}\n}\n```\n\n<br>\n\n### 3.syncLoop\n\n从函数的注释了解到：syncLoop是核心的同步逻辑。它监听file, apiserver, and http三个channel的变化，然后进行期望状态和当前状态的同步。\n\nsyncLoop核心是调用了`syncLoopIteration`的函数来执行更具体的监控pod变化的循环。\n\n```\n// syncLoop is the main loop for processing changes. It watches for changes from\n// three channels (file, apiserver, and http) and creates a union of them. For\n// any new change seen, will run a sync against desired state and running state. If\n// no changes are seen to the configuration, will synchronize the last known desired\n// state every sync-frequency seconds. 
Never returns.\nfunc (kl *Kubelet) syncLoop(updates <-chan kubetypes.PodUpdate, handler SyncHandler) {\n\tklog.Info(\"Starting kubelet main sync loop.\")\n\t// The syncTicker wakes up kubelet to checks if there are any pod workers\n\t// that need to be sync'd. A one-second period is sufficient because the\n\t// sync interval is defaulted to 10s.\n\tsyncTicker := time.NewTicker(time.Second)\n\tdefer syncTicker.Stop()\n\thousekeepingTicker := time.NewTicker(housekeepingPeriod)\n\tdefer housekeepingTicker.Stop()\n\tplegCh := kl.pleg.Watch()\n\tconst (\n\t\tbase   = 100 * time.Millisecond\n\t\tmax    = 5 * time.Second\n\t\tfactor = 2\n\t)\n\tduration := base\n\t// Responsible for checking limits in resolv.conf\n\t// The limits do not have anything to do with individual pods\n\t// Since this is called in syncLoop, we don't need to call it anywhere else\n\tif kl.dnsConfigurer != nil && kl.dnsConfigurer.ResolverConfig != \"\" {\n\t\tkl.dnsConfigurer.CheckLimitsForResolvConf()\n\t}\n  \n  // for循环调用syncLoopIteration\n\tfor {\n\t\tif err := kl.runtimeState.runtimeErrors(); err != nil {\n\t\t\tklog.Errorf(\"skipping pod synchronization - %v\", err)\n\t\t\t// exponential backoff\n\t\t\ttime.Sleep(duration)\n\t\t\tduration = time.Duration(math.Min(float64(max), factor*float64(duration)))\n\t\t\tcontinue\n\t\t}\n\t\t// reset backoff if we have a success\n\t\tduration = base\n\n\t\t// add by gzchenyifan\n\t\tif _, err := kl.nodeLister.Get(string(kl.nodeName)); err != nil {\n\t\t\tklog.Errorf(\"skipping pod synchronization until get nodeInfo from apiserver success - %v\", err)\n\t\t\ttime.Sleep(duration)\n\t\t\tduration = time.Duration(math.Min(float64(max), factor*float64(duration)))\n\t\t\tcontinue\n\t\t}\n\t\t// reset backoff if we have a success\n\t\tduration = base\n\n\t\tkl.syncLoopMonitor.Store(kl.clock.Now())\n\t\tif !kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) 
{\n\t\t\tbreak\n\t\t}\n\t\tkl.syncLoopMonitor.Store(kl.clock.Now())\n\t}\n}\n```\n\n#### 3.1 syncLoopIteration 相关channel介绍\n\n`syncLoopIteration`主要通过几种`channel`来对不同类型的事件进行监听并处理。其中包括：`configCh`、`plegCh`、`syncCh`、`houseKeepingCh`、`livenessManager.Updates()`。\n\n`syncLoopIteration`实际执行了pod的操作，此部分设置了几种不同的channel:\n\n- `configCh`：将配置更改的pod分派给事件类型的相应处理程序回调。\n- `plegCh`：更新runtime缓存，同步pod。\n- `syncCh`：同步所有等待同步的pod。\n- `houseKeepingCh`：触发清理pod。\n- `livenessManager.Updates()`：对失败的pod或者liveness检查失败的pod进行sync操作。\n\n<br>\n\n##### 3.1.1 configCh\n\n在NewMainKubelet函数中有一个重要的步骤，就是makePodSourceConfig。\n\n```\nif kubeDeps.PodConfig == nil {\n\t\tvar err error\n\t\tkubeDeps.PodConfig, err = makePodSourceConfig(kubeCfg, kubeDeps, nodeName, bootstrapCheckpointPath)\n\t\tif err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n```\n\nmakePodSourceConfig的核心逻辑如下：\n\n1. 如果StaticPodPath不为空，调用NewSourceFile监听该目录下定义的Pod，并且将它们发送到cfg.Channel中去\n\n2. 如果url不为空，从url监听获取Pod，并且将它们发送到cfg.Channel中去\n3. 如果kubeclient不为空，从apiserver监听pod，并且发送到cfg.Channel中去\n\n```\n// makePodSourceConfig creates a config.PodConfig from the given\n// KubeletConfiguration or returns an error.\nfunc makePodSourceConfig(kubeCfg *kubeletconfiginternal.KubeletConfiguration, kubeDeps *Dependencies, nodeName types.NodeName, bootstrapCheckpointPath string) (*config.PodConfig, error) {\n\tmanifestURLHeader := make(http.Header)\n\tif len(kubeCfg.StaticPodURLHeader) > 0 {\n\t\tfor k, v := range kubeCfg.StaticPodURLHeader {\n\t\t\tfor i := range v {\n\t\t\t\tmanifestURLHeader.Add(k, v[i])\n\t\t\t}\n\t\t}\n\t}\n\n\t// source of all configuration\n\tcfg := config.NewPodConfig(config.PodConfigNotificationIncremental, kubeDeps.Recorder)\n  \n  // 1.如果StaticPodPath不为空，调用NewSourceFile监听该目录下定义的Pod,并且将它们发送到cfg.Channel中去\n\t// define file config source\n\tif kubeCfg.StaticPodPath != \"\" {\n\t\tklog.Infof(\"Adding pod path: %v\", kubeCfg.StaticPodPath)\n\t\tconfig.NewSourceFile(kubeCfg.StaticPodPath, nodeName, kubeCfg.FileCheckFrequency.Duration, 
cfg.Channel(kubetypes.FileSource))\n\t}\n\n\t// define url config source\n\t// 2.如果url不为空，从url监听获取Pod, 并且将他们发送到cfg.Channel中去\n\tif kubeCfg.StaticPodURL != \"\" {\n\t\tklog.Infof(\"Adding pod url %q with HTTP header %v\", kubeCfg.StaticPodURL, manifestURLHeader)\n\t\tconfig.NewSourceURL(kubeCfg.StaticPodURL, manifestURLHeader, nodeName, kubeCfg.HTTPCheckFrequency.Duration, cfg.Channel(kubetypes.HTTPSource))\n\t}\n\n\t// Restore from the checkpoint path\n\t// NOTE: This MUST happen before creating the apiserver source\n\t// below, or the checkpoint would override the source of truth.\n\n\tvar updatechannel chan<- interface{}\n\tif bootstrapCheckpointPath != \"\" {\n\t\tklog.Infof(\"Adding checkpoint path: %v\", bootstrapCheckpointPath)\n\t\tupdatechannel = cfg.Channel(kubetypes.ApiserverSource)\n\t\terr := cfg.Restore(bootstrapCheckpointPath, updatechannel)\n\t\tif err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n   \n   // 3.如果kubeclient不为空，从apiserver监听pod，并且发送都cfg.channel中去\n\tif kubeDeps.KubeClient != nil {\n\t\tklog.Infof(\"Watching apiserver\")\n\t\tif updatechannel == nil {\n\t\t\tupdatechannel = cfg.Channel(kubetypes.ApiserverSource)\n\t\t}\n\t\tconfig.NewSourceApiserver(kubeDeps.KubeClient, nodeName, updatechannel)\n\t}\n\treturn cfg, nil\n}\n```\n\n**所以：** update这个channel是从远端传过来的期望状态。这个channel就是configCh\n\n##### 3.1.2 plegCh\n\n```\nplegCh := kl.pleg.Watch()\n\n就是Pleg.Start发送的那个eventChannel\nfunc (g *GenericPLEG) Watch() chan *PodLifecycleEvent {\n\treturn g.eventChannel\n}\n```\n\n**所以**：plegCh就是本地当前pod的状态\n\n##### 3.1.3 syncCh\n\n```\nsyncTicker := time.NewTicker(time.Second)\n```\n\n每秒一次往该channel发送数据。手动触发同步\n\n##### 3.1.4 houseKeepingCh\n\n默认值housekeepingPeriod=2s。每2秒一次往该channel发送数据。手动触发同步\n\n```\nhousekeepingTicker := time.NewTicker(housekeepingPeriod)\n```\n\n<br>\n\n##### 3.1.5 livenessManager.Updates\n\n在NewMainKubelet的时候就定义了。不用想，就是livenessManager定期探测pod，有失败的将往这个channel发送。\n\n##### 3.1.6 SyncHandler\n\n```\n// SyncHandler is an interface implemented 
by Kubelet, for testability\ntype SyncHandler interface {\n\tHandlePodAdditions(pods []*v1.Pod)\n\tHandlePodUpdates(pods []*v1.Pod)\n\tHandlePodRemoves(pods []*v1.Pod)\n\tHandlePodReconcile(pods []*v1.Pod)\n\tHandlePodSyncs(pods []*v1.Pod)\n\tHandlePodCleanups() error\n}\n```\n\n`SyncHandler`是一个定义了Pod各种Handler的接口，具体实现者是`kubelet`。这里直接将kubelet对象传了进去。\n\n```\nkl.syncLoop(updates, kl)\n```\n\n具体在pkg/kubelet/kubelet.go中。\n\n<br>\n\n#### 3.2 syncLoopIteration源码分析\n\n了解了syncLoopIteration函数的核心参数，接下来就很好分析syncLoopIteration的逻辑了。\n\nsyncLoopIteration整体结构是通过 select 监听每个channel，然后执行对应的操作。因为调用者syncLoop是在for循环中调用它的，所以syncLoopIteration会被反复执行。\n\n<br>\n\n```\n// syncLoopIteration reads from various channels and dispatches pods to the\n// given handler.\n//\n// Arguments:\n// 1.  configCh:       a channel to read config events from\n// 2.  handler:        the SyncHandler to dispatch pods to\n// 3.  syncCh:         a channel to read periodic sync events from\n// 4.  housekeepingCh: a channel to read housekeeping events from\n// 5.  plegCh:         a channel to read PLEG updates from\n//\n// Events are also read from the kubelet liveness manager's update channel.\n//\n// The workflow is to read from one of the channels, handle that event, and\n// update the timestamp in the sync loop monitor.\n//\n// Here is an appropriate place to note that despite the syntactical\n// similarity to the switch statement, the case statements in a select are\n// evaluated in a pseudorandom order if there are multiple channels ready to\n// read from when the select is evaluated.  
In other words, case statements\n// are evaluated in random order, and you can not assume that the case\n// statements evaluate in order if multiple channels have events.\n//\n// With that in mind, in truly no particular order, the different channels\n// are handled as follows:\n//\n// * configCh: dispatch the pods for the config change to the appropriate\n//             handler callback for the event type\n// * plegCh: update the runtime cache; sync pod\n// * syncCh: sync all pods waiting for sync\n// * housekeepingCh: trigger cleanup of pods\n// * liveness manager: sync pods that have failed or in which one or more\n//                     containers have failed liveness checks\nfunc (kl *Kubelet) syncLoopIteration(configCh <-chan kubetypes.PodUpdate, handler SyncHandler,\n\tsyncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {\n\tselect {\n\tcase u, open := <-configCh:\n\t\t// Update from a config source; dispatch it to the right handler\n\t\t// callback.\n\t\tif !open {\n\t\t\tklog.Errorf(\"Update channel is closed. Exiting the sync loop.\")\n\t\t\treturn false\n\t\t}\n\n\t\tswitch u.Op {\n\t\tcase kubetypes.ADD:\n\t\t\tklog.V(2).Infof(\"SyncLoop (ADD, %q): %q\", u.Source, format.Pods(u.Pods))\n\t\t\t// After restarting, kubelet will get all existing pods through\n\t\t\t// ADD as if they are new pods. These pods will then go through the\n\t\t\t// admission process and *may* be rejected. 
This can be resolved\n\t\t\t// once we have checkpointing.\n\t\t\thandler.HandlePodAdditions(u.Pods)\n\t\tcase kubetypes.UPDATE:\n\t\t\tklog.V(2).Infof(\"SyncLoop (UPDATE, %q): %q\", u.Source, format.PodsWithDeletionTimestamps(u.Pods))\n\t\t\thandler.HandlePodUpdates(u.Pods)\n\t\tcase kubetypes.REMOVE:\n\t\t\tklog.V(2).Infof(\"SyncLoop (REMOVE, %q): %q\", u.Source, format.Pods(u.Pods))\n\t\t\thandler.HandlePodRemoves(u.Pods)\n\t\tcase kubetypes.RECONCILE:\n\t\t\tklog.V(4).Infof(\"SyncLoop (RECONCILE, %q): %q\", u.Source, format.Pods(u.Pods))\n\t\t\thandler.HandlePodReconcile(u.Pods)\n\t\tcase kubetypes.DELETE:\n\t\t\tklog.V(2).Infof(\"SyncLoop (DELETE, %q): %q\", u.Source, format.Pods(u.Pods))\n\t\t\t// DELETE is treated as a UPDATE because of graceful deletion.\n\t\t\thandler.HandlePodUpdates(u.Pods)\n\t\tcase kubetypes.RESTORE:\n\t\t\tklog.V(2).Infof(\"SyncLoop (RESTORE, %q): %q\", u.Source, format.Pods(u.Pods))\n\t\t\t// These are pods restored from the checkpoint. Treat them as new\n\t\t\t// pods.\n\t\t\thandler.HandlePodAdditions(u.Pods)\n\t\tcase kubetypes.SET:\n\t\t\t// TODO: Do we want to support this?\n\t\t\tklog.Errorf(\"Kubelet does not support snapshot update\")\n\t\t}\n\n\t\tif u.Op != kubetypes.RESTORE {\n\t\t\t// If the update type is RESTORE, it means that the update is from\n\t\t\t// the pod checkpoints and may be incomplete. Do not mark the\n\t\t\t// source as ready.\n\n\t\t\t// Mark the source ready after receiving at least one update from the\n\t\t\t// source. Once all the sources are marked ready, various cleanup\n\t\t\t// routines will start reclaiming resources. 
It is important that this\n\t\t\t// takes place only after kubelet calls the update handler to process\n\t\t\t// the update to ensure the internal pod cache is up-to-date.\n\t\t\tkl.sourcesReady.AddSource(u.Source)\n\t\t}\n\tcase e := <-plegCh:\n\t\tif isSyncPodWorthy(e) {\n\t\t\t// PLEG event for a pod; sync it.\n\t\t\tif pod, ok := kl.podManager.GetPodByUID(e.ID); ok {\n\t\t\t\tklog.V(2).Infof(\"SyncLoop (PLEG): %q, event: %#v\", format.Pod(pod), e)\n\t\t\t\thandler.HandlePodSyncs([]*v1.Pod{pod})\n\t\t\t} else {\n\t\t\t\t// If the pod no longer exists, ignore the event.\n\t\t\t\tklog.V(4).Infof(\"SyncLoop (PLEG): ignore irrelevant event: %#v\", e)\n\t\t\t}\n\t\t}\n\n\t\tif e.Type == pleg.ContainerDied {\n\t\t\tif containerID, ok := e.Data.(string); ok {\n\t\t\t\tkl.cleanUpContainersInPod(e.ID, containerID)\n\t\t\t}\n\t\t}\n\tcase <-syncCh:\n\t\t// Sync pods waiting for sync\n\t\tpodsToSync := kl.getPodsToSync()\n\t\tif len(podsToSync) == 0 {\n\t\t\tbreak\n\t\t}\n\t\tklog.V(4).Infof(\"SyncLoop (SYNC): %d pods; %s\", len(podsToSync), format.Pods(podsToSync))\n\t\thandler.HandlePodSyncs(podsToSync)\n\tcase update := <-kl.livenessManager.Updates():\n\t\tif update.Result == proberesults.Failure {\n\t\t\t// The liveness manager detected a failure; sync the pod.\n\n\t\t\t// We should not use the pod from livenessManager, because it is never updated after\n\t\t\t// initialization.\n\t\t\tpod, ok := kl.podManager.GetPodByUID(update.PodUID)\n\t\t\tif !ok {\n\t\t\t\t// If the pod no longer exists, ignore the update.\n\t\t\t\tklog.V(4).Infof(\"SyncLoop (container unhealthy): ignore irrelevant update: %#v\", update)\n\t\t\t\tbreak\n\t\t\t}\n\t\t\tklog.V(1).Infof(\"SyncLoop (container unhealthy): %q\", format.Pod(pod))\n\t\t\thandler.HandlePodSyncs([]*v1.Pod{pod})\n\t\t}\n\tcase <-housekeepingCh:\n\t\tif !kl.sourcesReady.AllReady() {\n\t\t\t// If the sources aren't ready or volume manager has not yet synced the states,\n\t\t\t// skip housekeeping, as we may accidentally delete 
pods from unready sources.\n\t\t\tklog.V(4).Infof(\"SyncLoop (housekeeping, skipped): sources aren't ready yet.\")\n\t\t} else {\n\t\t\tklog.V(4).Infof(\"SyncLoop (housekeeping)\")\n\t\t\tif err := handler.HandlePodCleanups(); err != nil {\n\t\t\t\tklog.Errorf(\"Failed cleaning pods: %v\", err)\n\t\t\t}\n\t\t}\n\t}\n\treturn true\n}\n```\n\n##### 3.2.1 case1-configCh\n\n这个channel有数据，说明pod的配置发生了改变。具体处理逻辑如下：\n\n（1）如果是 ADD 或者RESTORE，表明是新创建了一个Pod，调用 HandlePodAdditions函数处理\n\n（2）如果是UPDATE，调用HandlePodUpdates处理\n\n（3）如果是REMOVE，调用HandlePodRemoves处理。REMOVE和DELETE不一样，REMOVE表示pod已经从数据源中彻底消失（第二次删除）\n\n（4）如果是RECONCILE，调用HandlePodReconcile处理。RECONCILE表示pod有修改，但是状态还没同步, 需要协调\n\n（5）如果是DELETE，调用HandlePodUpdates处理。因为这是第一次删除（优雅删除），只是给Pod.DeletionTimestamp赋了值而已\n\n后面根据pod创建、删除、更新等具体的情况进行分析\n\n```\ncase u, open := <-configCh:\n\t\t// Update from a config source; dispatch it to the right handler\n\t\t// callback.\n\t\tif !open {\n\t\t\tklog.Errorf(\"Update channel is closed. Exiting the sync loop.\")\n\t\t\treturn false\n\t\t}\n\n\t\tswitch u.Op {\n\t\tcase kubetypes.ADD:\n\t\t\tklog.V(2).Infof(\"SyncLoop (ADD, %q): %q\", u.Source, format.Pods(u.Pods))\n\t\t\t// After restarting, kubelet will get all existing pods through\n\t\t\t// ADD as if they are new pods. These pods will then go through the\n\t\t\t// admission process and *may* be rejected. 
This can be resolved\n\t\t\t// once we have checkpointing.\n\t\t\thandler.HandlePodAdditions(u.Pods)\n\t\tcase kubetypes.UPDATE:\n\t\t\tklog.V(2).Infof(\"SyncLoop (UPDATE, %q): %q\", u.Source, format.PodsWithDeletionTimestamps(u.Pods))\n\t\t\thandler.HandlePodUpdates(u.Pods)\n\t\tcase kubetypes.REMOVE:\n\t\t\tklog.V(2).Infof(\"SyncLoop (REMOVE, %q): %q\", u.Source, format.Pods(u.Pods))\n\t\t\thandler.HandlePodRemoves(u.Pods)\n\t\tcase kubetypes.RECONCILE:\n\t\t\tklog.V(4).Infof(\"SyncLoop (RECONCILE, %q): %q\", u.Source, format.Pods(u.Pods))\n\t\t\thandler.HandlePodReconcile(u.Pods)\n\t\tcase kubetypes.DELETE:\n\t\t\tklog.V(2).Infof(\"SyncLoop (DELETE, %q): %q\", u.Source, format.Pods(u.Pods))\n\t\t\t// DELETE is treated as a UPDATE because of graceful deletion.\n\t\t\thandler.HandlePodUpdates(u.Pods)\n\t\tcase kubetypes.RESTORE:\n\t\t\tklog.V(2).Infof(\"SyncLoop (RESTORE, %q): %q\", u.Source, format.Pods(u.Pods))\n\t\t\t// These are pods restored from the checkpoint. Treat them as new\n\t\t\t// pods.\n\t\t\thandler.HandlePodAdditions(u.Pods)\n\t\tcase kubetypes.SET:\n\t\t\t// TODO: Do we want to support this?\n\t\t\tklog.Errorf(\"Kubelet does not support snapshot update\")\n\t\t}\n\n\t\tif u.Op != kubetypes.RESTORE {\n\t\t\t// If the update type is RESTORE, it means that the update is from\n\t\t\t// the pod checkpoints and may be incomplete. Do not mark the\n\t\t\t// source as ready.\n\n\t\t\t// Mark the source ready after receiving at least one update from the\n\t\t\t// source. Once all the sources are marked ready, various cleanup\n\t\t\t// routines will start reclaiming resources. 
It is important that this\n\t\t\t// takes place only after kubelet calls the update handler to process\n\t\t\t// the update to ensure the internal pod cache is up-to-date.\n\t\t\tkl.sourcesReady.AddSource(u.Source)\n\t\t}\n```\n\n##### 3.2.2 case2-plegCh\n\n该channel表示底层container的状态发生了改变。核心逻辑如下：\n\n（1）如果pod的状态发生了改变，并且Pod还存在，调用HandlePodSyncs进行同步\n\n（2）如果是ContainerDied, 清理pod的容器。具体是调用apiserver接口更新Podstatus，然后再清理这个容器\n\n```\ncase e := <-plegCh:\n    // return event.Type != pleg.ContainerRemoved\n    // 只要不是containerRemoved,都有同步的意义\n\t\tif isSyncPodWorthy(e) {\n\t\t\t// PLEG event for a pod; sync it.\n\t\t\tif pod, ok := kl.podManager.GetPodByUID(e.ID); ok {\n\t\t\t\tklog.V(2).Infof(\"SyncLoop (PLEG): %q, event: %#v\", format.Pod(pod), e)\n\t\t\t\thandler.HandlePodSyncs([]*v1.Pod{pod})\n\t\t\t} else {\n\t\t\t\t// If the pod no longer exists, ignore the event.\n\t\t\t\tklog.V(4).Infof(\"SyncLoop (PLEG): ignore irrelevant event: %#v\", e)\n\t\t\t}\n\t\t}\n   \n    // 如果是ContainerDied, 清理pod的容器。具体是调用apiserver接口更新Podstatus，然后再清理这个容器\n \t\tif e.Type == pleg.ContainerDied {\n\t\t\tif containerID, ok := e.Data.(string); ok {\n\t\t\t\tkl.cleanUpContainersInPod(e.ID, containerID)\n\t\t\t}\n\t\t}\n```\n\n<br>\n\n##### 3.2.3 case3-syncCh\n\n这个channel是每1s调用1次。核心逻辑如下：\n\n（1）调用getPodsToSync得到需要同步的pod列表。 （pod中有一个参数activeDeadlineSeconds 可以设置 Pod 最长的运行时间。如果pod的存活时间>activeDeadlineSeconds, 表示这个pod需要同步 ）\n\n（2）调用HandlePodSyncs同步pod\n\n```\n\tcase <-syncCh:\n\t\t// Sync pods waiting for sync\n\t\tpodsToSync := kl.getPodsToSync()\n\t\tif len(podsToSync) == 0 {\n\t\t\tbreak\n\t\t}\n\t\tklog.V(4).Infof(\"SyncLoop (SYNC): %d pods; %s\", len(podsToSync), format.Pods(podsToSync))\n\t\thandler.HandlePodSyncs(podsToSync)\n```\n\n<br>\n\n##### 3.2.4 case4-livenessManager.Updates\n\n这个channel来自于 livenessManager，核心功能也很明显。就是对探测失败的pod，调用HandlePodSyncs同步一下状态。\n\n```\n\tcase update := <-kl.livenessManager.Updates():\n\t\tif update.Result == proberesults.Failure {\n\t\t\t// The liveness manager detected a 
failure; sync the pod.\n\n\t\t\t// We should not use the pod from livenessManager, because it is never updated after\n\t\t\t// initialization.\n\t\t\tpod, ok := kl.podManager.GetPodByUID(update.PodUID)\n\t\t\tif !ok {\n\t\t\t\t// If the pod no longer exists, ignore the update.\n\t\t\t\tklog.V(4).Infof(\"SyncLoop (container unhealthy): ignore irrelevant update: %#v\", update)\n\t\t\t\tbreak\n\t\t\t}\n\t\t\tklog.V(1).Infof(\"SyncLoop (container unhealthy): %q\", format.Pod(pod))\n\t\t\thandler.HandlePodSyncs([]*v1.Pod{pod})\n\t\t}\n```\n\n##### 3.2.5 housekeepingCh\n\n这个channel每2秒更新一次数据，然后调用HandlePodCleanups进行清理工作\n\n```\ncase <-housekeepingCh:\n\t\tif !kl.sourcesReady.AllReady() {\n\t\t\t// If the sources aren't ready or volume manager has not yet synced the states,\n\t\t\t// skip housekeeping, as we may accidentally delete pods from unready sources.\n\t\t\tklog.V(4).Infof(\"SyncLoop (housekeeping, skipped): sources aren't ready yet.\")\n\t\t} else {\n\t\t\tklog.V(4).Infof(\"SyncLoop (housekeeping)\")\n\t\t\tif err := handler.HandlePodCleanups(); err != nil {\n\t\t\t\tklog.Errorf(\"Failed cleaning pods: %v\", err)\n\t\t\t}\n\t\t}\n\t}\n```\n\n<br>\n\n### 4. 总结\n\n本章节主要分析了kubelet的 syncLoop函数。它的核心逻辑如下：\n\n死循环处理以下五个channel的数据：\n\n（1）来自configCh的数据。这个数据来源于apiserver/url/file，表示pod的期望状态已经更新了，然后根据不同的操作类型(add, del, update等)进行不同的处理\n\n（2）来自plegCh的数据。这个数据来源于底层运行时，表示pod真实的状态已经发生了改变，调用HandlePodSyncs同步状态\n\n（3）来自syncCh的数据。这个数据来源于定时器，每1秒运行一次。找出运行时间已经大于activeDeadlineSeconds的pod，调用HandlePodSyncs同步状态\n\n（4）来自livenessManager.Updates的数据。这个数据来源于livenessManager探针的结果。如果有pod探针失败，调用HandlePodSyncs同步数据\n\n（5）来自housekeepingCh的数据。这个数据来源于定时器，每2秒运行1次。调用HandlePodCleanups进行数据清理\n\n至此，kubelet的核心流程分析完了。接下来针对每个场景进行具体分析。比如Pod的创建流程，HandlePodCleanups是如何工作的。\n"
  },
  {
    "path": "k8s/kubelet/4-kubelet 监听pod变化.md",
"content": "* [1\\.背景](#1背景)\n* [1\\. makePodSourceConfig](#1-makepodsourceconfig)\n* [2\\. PodConfig 结构体介绍](#2-podconfig-结构体介绍)\n* [3\\.Merge](#3merge)\n* [4\\. s\\.merge](#4-smerge)\n  * [4\\.1 updatePodsFunc](#41-updatepodsfunc)\n  * [4\\.2 checkAndUpdatePod](#42-checkandupdatepod)\n* [5\\.总结](#5总结)\n\n### 1.背景\n\n从上文中知道，kubelet 监听了apiserver, file, url等来源的pod资源，然后送到 configCh channel中去。\n\n但是在处理configCh channel的时候，拿到的却是 ADD、UPDATE、REMOVE等操作类型，这些肯定是经过转换的。\n\n所以本章节就是了解kubelet到底是如何监听处理apiserver等来源的Pod\n\n<br>\n\n### 1. makePodSourceConfig\n\n在之前的分析中，NewMainKubelet函数中有一个重要的步骤, 就是makePodSourceConfig。\n\n```\nif kubeDeps.PodConfig == nil {\n\t\tvar err error\n\t\tkubeDeps.PodConfig, err = makePodSourceConfig(kubeCfg, kubeDeps, nodeName, bootstrapCheckpointPath)\n\t\tif err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n```\n\n<br>\n\n### 2. PodConfig 结构体介绍\n\n```\n// PodConfig is a configuration mux that merges many sources of pod configuration into a single\n// consistent structure, and then delivers incremental change notifications to listeners\n// in order.\ntype PodConfig struct {\n\tpods *podStorage\n\tmux  *config.Mux\n\n\t// the channel of denormalized changes passed to listeners\n\tupdates chan kubetypes.PodUpdate\n\n\t// contains the list of all configured sources\n\tsourcesLock       sync.Mutex\n\tsources           sets.String\n\tcheckpointManager checkpointmanager.CheckpointManager\n}\n\n\n// PodConfig的构造函数\n// NewPodConfig creates an object that can merge many configuration sources into a stream\n// of normalized updates to a pod configuration.\nfunc NewPodConfig(mode PodConfigNotificationMode, recorder record.EventRecorder) *PodConfig {\n\tupdates := make(chan kubetypes.PodUpdate, 50)\n\tstorage := newPodStorage(updates, mode, recorder)\n\tpodConfig := &PodConfig{\n\t\tpods:    storage,\n\t\tmux:     config.NewMux(storage),\n\t\tupdates: updates,\n\t\tsources: sets.String{},\n\t}\n\treturn 
podConfig\n}\n```\n\npodStorage的构造函数及结构定义如下，由结构名得知它主要是负责pod的存储，且它的成员中有一个用于存储pod对象的map,查看了对updates通道的引用是往里面塞入对象，主要通过PodStorage的Merge方法传入\n\n```\n// TODO: PodConfigNotificationMode could be handled by a listener to the updates channel\n// in the future, especially with multiple listeners.\n// TODO: allow initialization of the current state of the store with snapshotted version.\nfunc newPodStorage(updates chan<- kubetypes.PodUpdate, mode PodConfigNotificationMode, recorder record.EventRecorder) *podStorage {\n\treturn &podStorage{\n\t\tpods:        make(map[string]map[types.UID]*v1.Pod),\n\t\tmode:        mode,\n\t\tupdates:     updates,\n\t\tsourcesSeen: sets.String{},\n\t\trecorder:    recorder,\n\t}\n}\n```\n\n### 3.Merge\n\nkubelet通过podStorage来实现对三个渠道的Pod的处理。在makePodSourceConfig函数中，针对三个渠道的更新都扔进去了updates channel。以apiserver为例。每个update是有来源的，这里来源kubetypes.ApiserverSource。\n\n```\n// newSourceApiserverFromLW holds creates a config source that watches and pulls from the apiserver.\nfunc newSourceApiserverFromLW(lw cache.ListerWatcher, updates chan<- interface{}) {\n\tsend := func(objs []interface{}) {\n\t\tvar pods []*v1.Pod\n\t\tfor _, o := range objs {\n\t\t\tpods = append(pods, o.(*v1.Pod))\n\t\t}\n\t\tupdates <- kubetypes.PodUpdate{Pods: pods, Op: kubetypes.SET, Source: kubetypes.ApiserverSource}\n\t}\n\tr := cache.NewReflector(lw, &v1.Pod{}, cache.NewUndeltaStore(send, cache.MetaNamespaceKeyFunc), 0)\n\tgo r.Run(wait.NeverStop)\n}\n```\n\n<br>\n\n而podStorage.Merge对每个来源的数据进行了统一的处理。进行合并。\n\nMerge函数的核心就是调用 s.merge  整理处理 add/updates/del/remove等等的pods。\n\nadds, updates, deletes, removes, reconciles, restores := s.merge(source, change)\n\n```\n// Merge normalizes a set of incoming changes from different sources into a map of all Pods\n// and ensures that redundant changes are filtered out, and then pushes zero or more minimal\n// updates onto the update channel.  
Ensures that updates are delivered in order.\nfunc (s *podStorage) Merge(source string, change interface{}) error {\n\ts.updateLock.Lock()\n\tdefer s.updateLock.Unlock()\n\n\tseenBefore := s.sourcesSeen.Has(source)\n\tadds, updates, deletes, removes, reconciles, restores := s.merge(source, change)\n\tfirstSet := !seenBefore && s.sourcesSeen.Has(source)\n\n\t// deliver update notifications\n\tswitch s.mode {\n\tcase PodConfigNotificationIncremental:\n\t\tif len(removes.Pods) > 0 {\n\t\t\ts.updates <- *removes\n\t\t}\n\t\tif len(adds.Pods) > 0 {\n\t\t\ts.updates <- *adds\n\t\t}\n\t\tif len(updates.Pods) > 0 {\n\t\t\ts.updates <- *updates\n\t\t}\n\t\tif len(deletes.Pods) > 0 {\n\t\t\ts.updates <- *deletes\n\t\t}\n\t\tif len(restores.Pods) > 0 {\n\t\t\ts.updates <- *restores\n\t\t}\n\t\tif firstSet && len(adds.Pods) == 0 && len(updates.Pods) == 0 && len(deletes.Pods) == 0 {\n\t\t\t// Send an empty update when first seeing the source and there are\n\t\t\t// no ADD or UPDATE or DELETE pods from the source. 
This signals kubelet that\n\t\t\t// the source is ready.\n\t\t\ts.updates <- *adds\n\t\t}\n\t\t// Only add reconcile support here, because kubelet doesn't support Snapshot update now.\n\t\tif len(reconciles.Pods) > 0 {\n\t\t\ts.updates <- *reconciles\n\t\t}\n\n\tcase PodConfigNotificationSnapshotAndUpdates:\n\t\tif len(removes.Pods) > 0 || len(adds.Pods) > 0 || firstSet {\n\t\t\ts.updates <- kubetypes.PodUpdate{Pods: s.MergedState().([]*v1.Pod), Op: kubetypes.SET, Source: source}\n\t\t}\n\t\tif len(updates.Pods) > 0 {\n\t\t\ts.updates <- *updates\n\t\t}\n\t\tif len(deletes.Pods) > 0 {\n\t\t\ts.updates <- *deletes\n\t\t}\n\n\tcase PodConfigNotificationSnapshot:\n\t\tif len(updates.Pods) > 0 || len(deletes.Pods) > 0 || len(adds.Pods) > 0 || len(removes.Pods) > 0 || firstSet {\n\t\t\ts.updates <- kubetypes.PodUpdate{Pods: s.MergedState().([]*v1.Pod), Op: kubetypes.SET, Source: source}\n\t\t}\n\n\tcase PodConfigNotificationUnknown:\n\t\tfallthrough\n\tdefault:\n\t\tpanic(fmt.Sprintf(\"unsupported PodConfigNotificationMode: %#v\", s.mode))\n\t}\n\n\treturn nil\n}\n```\n\n### 4. s.merge\n\n```\nfunc (s *podStorage) merge(source string, change interface{}) (adds, updates, deletes, removes, reconciles, restores *kubetypes.PodUpdate) {\n  ... 
\n  \n  // 1.关注这个updatePodFunc\n\t// updatePodFunc is the local function which updates the pod cache *oldPods* with new pods *newPods*.\n\t// After updated, new pod will be stored in the pod cache *pods*.\n\t// Notice that *pods* and *oldPods* could be the same cache.\n\tupdatePodsFunc := func(newPods []*v1.Pod, oldPods, pods map[types.UID]*v1.Pod) {\n\t\tfiltered := filterInvalidPods(newPods, source, s.recorder)\n\t\tfor _, ref := range filtered {\n\t\t\t// Annotate the pod with the source before any comparison.\n\t\t\tif ref.Annotations == nil {\n\t\t\t\tref.Annotations = make(map[string]string)\n\t\t\t}\n\t\t\tref.Annotations[kubetypes.ConfigSourceAnnotationKey] = source\n\t\t\tif existing, found := oldPods[ref.UID]; found {\n\t\t\t\tpods[ref.UID] = existing\n\t\t\t\tneedUpdate, needReconcile, needGracefulDelete := checkAndUpdatePod(existing, ref)\n\t\t\t\tif needUpdate {\n\t\t\t\t\tupdatePods = append(updatePods, existing)\n\t\t\t\t} else if needReconcile {\n\t\t\t\t\treconcilePods = append(reconcilePods, existing)\n\t\t\t\t} else if needGracefulDelete {\n\t\t\t\t\tdeletePods = append(deletePods, existing)\n\t\t\t\t}\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\trecordFirstSeenTime(ref)\n\t\t\tpods[ref.UID] = ref\n\t\t\taddPods = append(addPods, ref)\n\t\t}\n\t}\n\n\tupdate := change.(kubetypes.PodUpdate)\n\tswitch update.Op {\n\tcase kubetypes.ADD, kubetypes.UPDATE, kubetypes.DELETE:\n\t...\n\tcase kubetypes.REMOVE:\n\t...\n\t// 2.只用关心这个case就行了。因为三个来源塞入数据的时候都是 kubetypes.SET\n\tcase kubetypes.SET:\n\t\tklog.V(4).Infof(\"Setting pods for source %s\", source)\n\t\ts.markSourceSet(source)\n\t\t// Clear the old map entries by just creating a new map\n\t\toldPods := pods\n\t\tpods = make(map[types.UID]*v1.Pod)\n\t\tupdatePodsFunc(update.Pods, oldPods, pods)\n\t  // 遍历旧的pods, 如果发现旧的pods有，但是新的pod就认识pod已经删除了，就是remove事件\n\t\tfor uid, existing := range oldPods {\n\t\t\tif _, found := pods[uid]; !found {\n\t\t\t\t// this is a delete\n\t\t\t\tremovePods = append(removePods, 
existing)\n\t\t\t}\n\t\t}\n\tcase kubetypes.RESTORE:\n\tdefault:\n\t\tklog.Warningf(\"Received invalid update type: %v\", update)\n\t}\n\treturn adds, updates, deletes, removes, reconciles, restores\n}\n```\n\n#### 4.1 updatePodsFunc\n\nupdatePodsFunc的核心逻辑如下：\n\n（1）根据PodName进行去重\n\n（2）通过pod Annotations表明pod的来源（apiserver/url/file）\n\n（3）开始分类，逻辑如下：oldPods是podStorage缓存的pods数据。\n\n* 如果pod在oldPods没有找到，那说明肯定就是add，加入addPods\n* 如果找到了，调用checkAndUpdatePod进行进一步判断\n\n（4）checkAndUpdatePod逻辑如下：\n\n* 如果本地pod和新pod除了状态外，其他都一样，那就是Reconcile，加入reconcilePods\n\n* 如果新pod有DeletionTimestamp，那就是needGracefulDelete，加入deletePods\n\n* 否则就是update，加入updatePods\n\n（5）遍历旧的pods，如果发现旧的pods中有、而新的pods中没有的pod，就认为该pod已经被删除了，对应remove事件（这个是SET分支的逻辑）\n\n**注意**：kubelet没有使用其他控制器常用的informer机制，而是使用了更底层的Reflector。所以新的Pods可以认为是list出来的该来源的全部Pods，这样就可以判断一个pod有没有被删除。\n\n```\nupdatePodsFunc := func(newPods []*v1.Pod, oldPods, pods map[types.UID]*v1.Pod) {\n    // 1.根据PodName进行去重\n\t\tfiltered := filterInvalidPods(newPods, source, s.recorder)\n\t\tfor _, ref := range filtered {\n\t\t\t// Annotate the pod with the source before any comparison.\n\t\t\tif ref.Annotations == nil {\n\t\t\t\tref.Annotations = make(map[string]string)\n\t\t\t}\n\t\t\t// 2.通过pod Annotations表明pod的来源（apiserver/url/file）\n\t\t\tref.Annotations[kubetypes.ConfigSourceAnnotationKey] = source\n\t\t\t// oldPods是本地缓存的Pods数据。\n\t\t\t// 3.开始分类，逻辑如下：\n\t\t\tif existing, found := oldPods[ref.UID]; found {\n\t\t\t\tpods[ref.UID] = existing\n\t\t\t\tneedUpdate, needReconcile, needGracefulDelete := checkAndUpdatePod(existing, ref)\n\t\t\t\tif needUpdate {\n\t\t\t\t\tupdatePods = append(updatePods, existing)\n\t\t\t\t} else if needReconcile {\n\t\t\t\t\treconcilePods = append(reconcilePods, existing)\n\t\t\t\t} else if needGracefulDelete {\n\t\t\t\t\tdeletePods = append(deletePods, existing)\n\t\t\t\t}\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\trecordFirstSeenTime(ref)\n\t\t\tpods[ref.UID] = ref\n\t\t\taddPods = append(addPods, ref)\n\t\t}\n\t}\n```\n\n#### 4.2 
checkAndUpdatePod\n\n参数：existing是本地的pod,  ref是新pod\n\n逻辑：\n\n（1）如果本地pod和新pod除了状态外，其他都一样，那就是Reconcile\n\n（2）如果新pod有DeletionTimestamp，那就是needGracefulDelete\n\n（3）否则就是update\n\n```\n// checkAndUpdatePod updates existing, and:\n//   * if ref makes a meaningful change, returns needUpdate=true\n//   * if ref makes a meaningful change, and this change is graceful deletion, returns needGracefulDelete=true\n//   * if ref makes no meaningful change, but changes the pod status, returns needReconcile=true\n//   * else return all false\n//   Now, needUpdate, needGracefulDelete and needReconcile should never be both true\nfunc checkAndUpdatePod(existing, ref *v1.Pod) (needUpdate, needReconcile, needGracefulDelete bool) {\n\n\t// 1. this is a reconcile\n\t// TODO: it would be better to update the whole object and only preserve certain things\n\t//       like the source annotation or the UID (to ensure safety)\n\tif !podsDifferSemantically(existing, ref) {\n\t\t// this is not an update\n\t\t// Only check reconcile when it is not an update, because if the pod is going to\n\t\t// be updated, an extra reconcile is unnecessary\n\t\tif !reflect.DeepEqual(existing.Status, ref.Status) {\n\t\t\t// Pod with changed pod status needs reconcile, because kubelet should\n\t\t\t// be the source of truth of pod status.\n\t\t\texisting.Status = ref.Status\n\t\t\tneedReconcile = true\n\t\t}\n\t\treturn\n\t}\n\n\t// Overwrite the first-seen time with the existing one. This is our own\n\t// internal annotation, there is no need to update.\n\tref.Annotations[kubetypes.ConfigFirstSeenAnnotationKey] = existing.Annotations[kubetypes.ConfigFirstSeenAnnotationKey]\n\n\texisting.Spec = ref.Spec\n\texisting.Labels = ref.Labels\n\texisting.DeletionTimestamp = ref.DeletionTimestamp\n\texisting.DeletionGracePeriodSeconds = ref.DeletionGracePeriodSeconds\n\texisting.Status = ref.Status\n\tupdateAnnotations(existing, ref)\n\n\t// 2. 
this is an graceful delete\n\tif ref.DeletionTimestamp != nil {\n\t\tneedGracefulDelete = true\n\t} else {\n\t\t// 3. this is an update\n\t\tneedUpdate = true\n\t}\n\n\treturn\n}\n\n\n\n// 这些字段都一样，说明pod没有语义上的变化\nfunc podsDifferSemantically(existing, ref *v1.Pod) bool {\n\tif reflect.DeepEqual(existing.Spec, ref.Spec) &&\n\t\treflect.DeepEqual(existing.Labels, ref.Labels) &&\n\t\treflect.DeepEqual(existing.DeletionTimestamp, ref.DeletionTimestamp) &&\n\t\treflect.DeepEqual(existing.DeletionGracePeriodSeconds, ref.DeletionGracePeriodSeconds) &&\n\t\tisAnnotationMapEqual(existing.Annotations, ref.Annotations) {\n\t\treturn false\n\t}\n\treturn true\n}\n```\n\n<br>\n\n### 5.总结\n\nkubelet通过podStorage缓存了所有来源的旧数据，然后监听三个来源的新数据并与缓存进行对比，根据以下逻辑分类：\n\n（1）旧的有，新的没有，那就是删除，对应remove\n\n（2）旧的有，新的有，但是元数据不一样，那就是更新，对应update\n\n（3）旧的有，新的有，元数据也一样，但是status不一样，那就是同步，对应reconcile\n\n（4）旧的没有，新的有，那就是新增，对应add\n\n（5）旧的有，新的有，新的带有DeletionTimestamp，那就是删除，对应delete\n\n![image-20220310171935572](../images/kubelet-sourceConfig.png)"
  },
  {
    "path": "k8s/kubelet/5-pod创建流程.md",
    "content": "* [1\\. 背景](#1-背景)\n* [2\\. HandlePodAdditions](#2-handlepodadditions)\n  * [2\\.1 dispatchWork](#21-dispatchwork)\n    * [2\\.1\\.1 podWorkers\\.UpdatePod](#211-podworkersupdatepod)\n    * [2\\.1\\.2 managePodLoop](#212-managepodloop)\n    * [2\\.1\\.3 syncPodFn](#213-syncpodfn)\n    * [2\\.1\\.4 containerRuntime\\.SyncPod](#214-containerruntimesyncpod)\n    * [2\\.1\\.5 总结](#215-总结)\n* [3\\. pod创建过程详细分析](#3-pod创建过程详细分析)\n  * [3\\.1 创建sandbox过程做了什么工作](#31-创建sandbox过程做了什么工作)\n    * [3\\.1\\.1 createPodSandbox](#311-createpodsandbox)\n    * [3\\.1\\.2 runtimeService\\.RunPodSandbox](#312-runtimeservicerunpodsandbox)\n    * [3\\.1\\.3 runtimeClient\\.RunPodSandbox](#313-runtimeclientrunpodsandbox)\n    * [3\\.1\\.4 srv\\.(RuntimeServiceServer)\\.RunPodSandbox](#314-srvruntimeserviceserverrunpodsandbox)\n    * [3\\.1\\.5 总结](#315-总结)\n  * [3\\.2 start init/业务容器做了什么操作](#32-start-init业务容器做了什么操作)\n    * [3\\.2\\.1 start](#321-start)\n    * [3\\.2\\.2 startContainer](#322-startcontainer)\n* [4\\.总结](#4总结)\n\n### 1. 背景\n\n本章节分析kubelet创建pod的详细过程。在 上一章节中了解到。kubelet通过configCh这个channel获取了来自apiserver/file/url的pod。并且调用了 HandlePodAdditions进行了处理。\n\n### 2. HandlePodAdditions\n\nHandlePodAdditions核心逻辑如下：\n\n（1）获取所有待创建的Pods, 并且根据创建时间排序，并依次加入podManager。可以认为podManager是Kubelet的本地缓存，如果一个pod没在podManager中，那kubelet就认为该Pod已经被删除了\n\n（2）如果是staticpod, 直接调用handleMirrorPod进行处理。从file/url来的Pod基本都是属于staticpod。kubelet为每一个staticpod创建了对应的MirrorPod. 
apiserver可以看见这个pod，但是不能管理它。这里基本很少用，跳过。\n\n（3）否则的话，首先判断该pod能不能在该节点运行，这里是调用了canAdmitPod进行判断。canAdmitPod的核心是拿该节点已经运行的所有pod一个一个的和newPod进行校验。比如是资源不足或者其他系统问题等等。\n\n（4）调用dispatchWork开始处理pod\n\n（5）probeManager.AddPod将newPod加入到probeManager中去\n\n```\n// HandlePodAdditions is the callback in SyncHandler for pods being added from\n// a config source.\nfunc (kl *Kubelet) HandlePodAdditions(pods []*v1.Pod) {\n\tstart := kl.clock.Now()\n\tsort.Sort(sliceutils.PodsByCreationTime(pods))\n\tfor _, pod := range pods {\n\t\texistingPods := kl.podManager.GetPods()\n\t\t// Always add the pod to the pod manager. Kubelet relies on the pod\n\t\t// manager as the source of truth for the desired state. If a pod does\n\t\t// not exist in the pod manager, it means that it has been deleted in\n\t\t// the apiserver and no action (other than cleanup) is required.\n\t\tkl.podManager.AddPod(pod)\n\n\t\tif kubetypes.IsMirrorPod(pod) {\n\t\t\tkl.handleMirrorPod(pod, start)\n\t\t\tcontinue\n\t\t}\n\n\t\tif !kl.podIsTerminated(pod) {\n\t\t\t// Only go through the admission process if the pod is not\n\t\t\t// terminated.\n\n\t\t\t// We failed pods that we rejected, so activePods include all admitted\n\t\t\t// pods that are alive.\n\t\t\tactivePods := kl.filterOutTerminatedPods(existingPods)\n\n\t\t\t// Check if we can admit the pod; if not, reject it.\n\t\t\tif ok, reason, message := kl.canAdmitPod(activePods, pod); !ok {\n\t\t\t\tkl.rejectPod(pod, reason, message)\n\t\t\t\tcontinue\n\t\t\t}\n\t\t}\n\t\tmirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)\n\t\tkl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)\n\t\tkl.probeManager.AddPod(pod)\n\t}\n}\n```\n\n<br>\n\n#### 2.1 dispatchWork\n\n这里的类型是SyncPodCreate。核心就是调用UpdatePod进行处理\n\n```\n// dispatchWork starts the asynchronous sync of the pod in a pod worker.\n// If the pod is terminated, dispatchWork\nfunc (kl *Kubelet) dispatchWork(pod *v1.Pod, syncType kubetypes.SyncPodType, mirrorPod *v1.Pod, start time.Time) {\n\tif 
kl.podIsTerminated(pod) {\n\t\tif pod.DeletionTimestamp != nil {\n\t\t\t// If the pod is in a terminated state, there is no pod worker to\n\t\t\t// handle the work item. Check if the DeletionTimestamp has been\n\t\t\t// set, and force a status update to trigger a pod deletion request\n\t\t\t// to the apiserver.\n\t\t\tkl.statusManager.TerminatePod(pod)\n\t\t}\n\t\treturn\n\t}\n\t// Run the sync in an async worker.\n\tkl.podWorkers.UpdatePod(&UpdatePodOptions{\n\t\tPod:        pod,\n\t\tMirrorPod:  mirrorPod,\n\t\tUpdateType: syncType,\n\t\tOnCompleteFunc: func(err error) {\n\t\t\tif err != nil {\n\t\t\t\tmetrics.PodWorkerDuration.WithLabelValues(syncType.String()).Observe(metrics.SinceInSeconds(start))\n\t\t\t\tmetrics.DeprecatedPodWorkerLatency.WithLabelValues(syncType.String()).Observe(metrics.SinceInMicroseconds(start))\n\t\t\t}\n\t\t},\n\t})\n\t// 记录metric\n\t// Note the number of containers for new pods.\n\tif syncType == kubetypes.SyncPodCreate {\n\t\tmetrics.ContainersPerPodCount.Observe(float64(len(pod.Spec.Containers)))\n\t}\n}\n```\n\n##### 2.1.1 podWorkers.UpdatePod\n\n**podWorkers**结构体如下，核心：\n\n**podUpdates**： 有变化的pod列表\n\n**isWorking**： 正在处理的pod列表\n\n**lastUndeliveredWorkUpdate**：最新的还没来得及处理的pod列表。（pod在isWorking队列中的时候，又来了。就放入lastUndeliveredWorkUpdate中去）\n\n**podCache**： 本地缓存的podStatus状态，runc状态，是Pod当前真实的状态\n\n在newMainKubelet函数中定于了 klet.podCache = kubecontainer.NewCache()。 而pleg的channel中有一个判断就是，如果enable cache就会更新cache。其实就是cache\n\n```\ntype podWorkers struct {\n\t// Protects all per worker fields.\n\tpodLock sync.Mutex\n\n\t// Tracks all running per-pod goroutines - per-pod goroutine will be\n\t// processing updates received through its corresponding channel.\n\t// 有变化的pod列表\n\tpodUpdates map[types.UID]chan UpdatePodOptions\n\t\n\t// Track the current state of per-pod goroutines.\n\t// Currently all update request for a given pod coming when another\n\t// update of this pod is being processed are ignored.\n\t// 正在处理的pod列表\n\tisWorking 
map[types.UID]bool\n\t\n\t\n\t// Tracks the last undelivered work item for this pod - a work item is\n\t// undelivered if it comes in while the worker is working.\n\t// 最新的还没来得及处理的pod列表。（pod在isWorking队列中的时候，又来了。就放入lastUndeliveredWorkUpdate中去）\n\tlastUndeliveredWorkUpdate map[types.UID]UpdatePodOptions\n\n\tworkQueue queue.WorkQueue\n\n\t// This function is run to sync the desired stated of pod.\n\t// NOTE: This function has to be thread-safe - it can be called for\n\t// different pods at the same time.\n\tsyncPodFn syncPodFnType\n\n\t// The EventRecorder to use\n\trecorder record.EventRecorder\n\n\t// backOffPeriod is the duration to back off when there is a sync error.\n\tbackOffPeriod time.Duration\n\n\t// resyncInterval is the duration to wait until the next sync.\n\tresyncInterval time.Duration\n    \n    // 本地缓存的podStatus，即运行时层面的状态，是Pod当前真实的状态\n\t// podCache stores kubecontainer.PodStatus for all pods.\n\tpodCache kubecontainer.Cache\n}\n```\n\n<br>\n\nUpdatePod的核心逻辑是：\n\n（1）如果是第一次处理该pod（创建pod），会将pod加入p.podUpdates列表，并且启动managePodLoop这个协程来处理要更新的Pod。注意这里managePodLoop会一直运行，所以后面pod有update等变化时，只要往podUpdates扔进数据，managePodLoop就会自动处理\n\n（2）如果该pod没有正在处理，标记该pod正在处理，处理该pod；否则就记录到lastUndeliveredWorkUpdate中，等待下次处理；这里如果lastUndeliveredWorkUpdate中已有的待处理更新是kill类型的话，就不会被新的更新覆盖。所以看得出来，Pod删除是不可逆的。就是pod有了DeletionTimestamp后，再强行删掉 DeletionTimestamp，pod还是会被删除的\n\n```\n// Apply the new setting to the specified pod.\n// If the options provide an OnCompleteFunc, the function is invoked if the update is accepted.\n// Update requests are ignored if a kill pod request is pending.\nfunc (p *podWorkers) UpdatePod(options *UpdatePodOptions) {\n\tpod := options.Pod\n\tuid := pod.UID\n\tvar podUpdates chan UpdatePodOptions\n\tvar exists bool\n\n\tp.podLock.Lock()\n\tdefer p.podLock.Unlock()\n\t// 1.第一次处理该pod（创建pod）。会将pod加入p.podUpdates列表，并且启动managePodLoop这个协程处理要更新的Pod\n\tif podUpdates, exists = p.podUpdates[uid]; !exists {\n\t\t// We need to have a buffer here, because checkForUpdates() method that\n\t\t// puts an update into 
channel is called from the same goroutine where\n\t\t// the channel is consumed. However, it is guaranteed that in such case\n\t\t// the channel is empty, so buffer of size 1 is enough.\n\t\tpodUpdates = make(chan UpdatePodOptions, 1)\n\t\tp.podUpdates[uid] = podUpdates\n\n\t\t// Creating a new pod worker either means this is a new pod, or that the\n\t\t// kubelet just restarted. In either case the kubelet is willing to believe\n\t\t// the status of the pod for the first pod worker sync. See corresponding\n\t\t// comment in syncPod.\n\t\tgo func() {\n\t\t\tdefer runtime.HandleCrash()\n\t\t\tp.managePodLoop(podUpdates)\n\t\t}()\n\t}\n\t\n\t// 2.如果该pod没有正在处理，标记该pod正在处理，处理该pod；否则就记录到lastUndeliveredWorkUpdate中，等待下次处理\n\t// 如果已有待处理的更新且是kill类型，新的更新不会覆盖它。所以Pod删除是不可逆的：pod有了deletionTimestamp后，\n\t// 再强行删掉deletionTimestamp，pod还是会被删除\n\tif !p.isWorking[pod.UID] {\n\t\tp.isWorking[pod.UID] = true\n\t\tpodUpdates <- *options\n\t} else {\n\t\t// if a request to kill a pod is pending, we do not let anything overwrite that request.\n\t\tupdate, found := p.lastUndeliveredWorkUpdate[pod.UID]\n\t\tif !found || update.UpdateType != kubetypes.SyncPodKill {\n\t\t\tp.lastUndeliveredWorkUpdate[pod.UID] = *options\n\t\t}\n\t}\n}\n```\n\n##### 2.1.2 managePodLoop\n\nmanagePodLoop的核心就是获取Pod当前最新的本地状态，然后调用syncPodFn进行同步。\n\n注意这里处理完syncPodFn后调用了p.wrapUp(update.Pod.UID, err)函数。wrapUp->checkForUpdates，在checkForUpdates函数中，会将lastUndeliveredWorkUpdate的数据发送到updates中去，然后删除lastUndeliveredWorkUpdate中的记录。\n\n假设podA-1, podA-2, podA-3分别表示podA更新的3个状态。\n\npodA-1在isWorking队列中时，podA-2来了会进入lastUndeliveredWorkUpdate队列。podA-3来了，就将podA-2替换。\n\n等podA-1处理完了，podA-3会被扔进managePodLoop再次处理。\n\n```\nfunc (p *podWorkers) managePodLoop(podUpdates <-chan UpdatePodOptions) {\n\tvar lastSyncTime time.Time\n\tfor update := range podUpdates {\n\t\terr := func() error {\n\t\t\tpodUID := update.Pod.UID\n\t\t\t// This is a blocking call that would return only if the cache\n\t\t\t// has an entry for the pod that is 
newer than minRuntimeCache\n\t\t\t// Time. This ensures the worker doesn't start syncing until\n\t\t\t// after the cache is at least newer than the finished time of\n\t\t\t// the previous sync.\n\t\t\tstatus, err := p.podCache.GetNewerThan(podUID, lastSyncTime)\n\t\t\tif err != nil {\n\t\t\t\t// This is the legacy event thrown by manage pod loop\n\t\t\t\t// all other events are now dispatched from syncPodFn\n\t\t\t\tp.recorder.Eventf(update.Pod, v1.EventTypeWarning, events.FailedSync, \"error determining status: %v\", err)\n\t\t\t\treturn err\n\t\t\t}\n\t\t\terr = p.syncPodFn(syncPodOptions{\n\t\t\t\tmirrorPod:      update.MirrorPod,\n\t\t\t\tpod:            update.Pod,\n\t\t\t\tpodStatus:      status,\n\t\t\t\tkillPodOptions: update.KillPodOptions,\n\t\t\t\tupdateType:     update.UpdateType,\n\t\t\t})\n\t\t\tlastSyncTime = time.Now()\n\t\t\treturn err\n\t\t}()\n\t\t// notify the call-back function if the operation succeeded or not\n\t\tif update.OnCompleteFunc != nil {\n\t\t\tupdate.OnCompleteFunc(err)\n\t\t}\n\t\tif err != nil {\n\t\t\t// IMPORTANT: we do not log errors here, the syncPodFn is responsible for logging errors\n\t\t\tklog.Errorf(\"Error syncing pod %s (%q), skipping: %v\", update.Pod.UID, format.Pod(update.Pod), err)\n\t\t}\n\t\tp.wrapUp(update.Pod.UID, err)\n\t}\n}\n```\n\n##### 2.1.3 syncPodFn\n\n在初始化syncPodFn的时候指定了该函数为syncPod。该函数的具体逻辑如下：\n\n（1）如果是要kill pod，调用SetPodStatus设置状态，并且调用killPod\n\n（2）如果是创建pod，先记录pod的 firstSeenTime时间\n\n（3）设置pod的 apiStatus。kubelet 监听得到的pod是没有status的，所以第一次创建的时候，kubelet会根据spec的内容生成status，例如hostIP、podIP等等\n\n（4）如果pod已经running了，记录pod从 firstSeenTime到running的时间\n\n（5）判断该Pod能否运行在这个节点上，如果不行给出原因，比如pid不够等原因\n\n（6）更新statusManager中该pod的status；如果pod的DeletionTimestamp != nil，可能会触发向apiserver发起删除pod的请求\n\n（7）如果pod不能运行在该node上，或者有DeletionTimestamp，或者Pod状态为failed，调用killPod函数删除该pod\n\n（8）如果pod使用的不是hostNetwork，并且网络插件没有准备好，报错后返回\n\n（9）如果pod没有被设置删除，并且是第一次出现，更新pod的cgroup。这里最终会调用func (m *qosContainerManagerImpl) setCPUCgroupConfig 函数设置 
cpu/mem等cgroup\n\n（10）如果是staticpod，创建mirrorpod\n\n（11）为pod创建data目录。会在root-dir目录下创建 pods，以及pods/volume等目录。默认的root-dir为 /var/lib/kubelet，可以通过--root-dir修改\n\n（12）如果pod没有被terminate，等待volume的attach/mount完成\n\n（13）获取pod的secrets，并且调用containerRuntime.SyncPod同步\n\n**总结：**\n\n对于创建Pod而言，syncPod其实就是更新pod的status，然后创建data目录，最核心的就是调用containerRuntime.SyncPod进行底层容器的创建。接下来看看containerRuntime.SyncPod做了什么工作。\n\n```\n// syncPod is the transaction script for the sync of a single pod.\n//\n// Arguments:\n//\n// o - the SyncPodOptions for this invocation\n//\n// The workflow is:\n// * If the pod is being created, record pod worker start latency\n// * Call generateAPIPodStatus to prepare an v1.PodStatus for the pod\n// * If the pod is being seen as running for the first time, record pod\n//   start latency\n// * Update the status of the pod in the status manager\n// * Kill the pod if it should not be running\n// * Create a mirror pod if the pod is a static pod, and does not\n//   already have a mirror pod\n// * Create the data directories for the pod if they do not exist\n// * Wait for volumes to attach/mount\n// * Fetch the pull secrets for the pod\n// * Call the container runtime's SyncPod callback\n// * Update the traffic shaping for the pod's ingress and egress limits\n//\n// If any step of this workflow errors, the error is returned, and is repeated\n// on the next syncPod call.\n//\n// This operation writes all events that are dispatched in order to provide\n// the most accurate information possible about an error situation to aid debugging.\n// Callers should not throw an event if this operation returns an error.\nfunc (kl *Kubelet) syncPod(o syncPodOptions) error {\n\t// pull out the required options\n\tpod := o.pod\n\tmirrorPod := o.mirrorPod\n\tpodStatus := o.podStatus\n\tupdateType := o.updateType\n\n\t// if we want to kill a pod, do it now!\n\t// 1.如果是要kill pod，调用SetPodStatus设置状态，并且调用killPod\n\tif updateType == kubetypes.SyncPodKill {\n\t\tkillPodOptions := o.killPodOptions\n\t\tif killPodOptions 
== nil || killPodOptions.PodStatusFunc == nil {\n\t\t\treturn fmt.Errorf(\"kill pod options are required if update type is kill\")\n\t\t}\n\t\tapiPodStatus := killPodOptions.PodStatusFunc(pod, podStatus)\n\t\tkl.statusManager.SetPodStatus(pod, apiPodStatus)\n\t\t// we kill the pod with the specified grace period since this is a termination\n\t\tif err := kl.killPod(pod, nil, podStatus, killPodOptions.PodTerminationGracePeriodSecondsOverride); err != nil {\n\t\t\tkl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToKillPod, \"error killing pod: %v\", err)\n\t\t\t// there was an error killing the pod, so we return that error directly\n\t\t\tutilruntime.HandleError(err)\n\t\t\treturn err\n\t\t}\n\t\treturn nil\n\t}\n\n\t// Latency measurements for the main workflow are relative to the\n\t// first time the pod was seen by the API server.\n\tvar firstSeenTime time.Time\n\tif firstSeenTimeStr, ok := pod.Annotations[kubetypes.ConfigFirstSeenAnnotationKey]; ok {\n\t\tfirstSeenTime = kubetypes.ConvertToTimestamp(firstSeenTimeStr).Get()\n\t}\n\n\t// Record pod worker start latency if being created\n\t// TODO: make pod workers record their own latencies\n\t// 2.如果是创建pod，先纪录pod的 firstSeenTime时间\n\tif updateType == kubetypes.SyncPodCreate {\n\t\tif !firstSeenTime.IsZero() {\n\t\t\t// This is the first time we are syncing the pod. Record the latency\n\t\t\t// since kubelet first saw the pod if firstSeenTime is set.\n\t\t\tmetrics.PodWorkerStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))\n\t\t\tmetrics.DeprecatedPodWorkerStartLatency.Observe(metrics.SinceInMicroseconds(firstSeenTime))\n\t\t} else {\n\t\t\tklog.V(3).Infof(\"First seen time not recorded for pod %q\", pod.UID)\n\t\t}\n\t}\n\n\t// Generate final API pod status with pod and status manager status\n\t// 3.设置pod的 apiStatus\n\tapiPodStatus := kl.generateAPIPodStatus(pod, podStatus)\n\t// The pod IP may be changed in generateAPIPodStatus if the pod is using host network. 
(See #24576)\n\t// TODO(random-liu): After writing pod spec into container labels, check whether pod is using host network, and\n\t// set pod IP to hostIP directly in runtime.GetPodStatus\n\tpodStatus.IPs = make([]string, 0, len(apiPodStatus.PodIPs))\n\tfor _, ipInfo := range apiPodStatus.PodIPs {\n\t\tpodStatus.IPs = append(podStatus.IPs, ipInfo.IP)\n\t}\n\n\tif len(podStatus.IPs) == 0 && len(apiPodStatus.PodIP) > 0 {\n\t\tpodStatus.IPs = []string{apiPodStatus.PodIP}\n\t}\n\n\t// Record the time it takes for the pod to become running.\n\t// 4.纪录pod从 firstSeenTime到running的时间\n\texistingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)\n\tif !ok || existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&\n\t\t!firstSeenTime.IsZero() {\n\t\tmetrics.PodStartDuration.Observe(metrics.SinceInSeconds(firstSeenTime))\n\t\tmetrics.DeprecatedPodStartLatency.Observe(metrics.SinceInMicroseconds(firstSeenTime))\n\t}\n    \n    // 5.判断该Pod能否运行在这个节点上，如果不行给出原因，比如pid不够等原因\n\trunnable := kl.canRunPod(pod)\n\tif !runnable.Admit {\n\t\t// Pod is not runnable; update the Pod and Container statuses to why.\n\t\tapiPodStatus.Reason = runnable.Reason\n\t\tapiPodStatus.Message = runnable.Message\n\t\t// Waiting containers are not creating.\n\t\tconst waitingReason = \"Blocked\"\n\t\tfor _, cs := range apiPodStatus.InitContainerStatuses {\n\t\t\tif cs.State.Waiting != nil {\n\t\t\t\tcs.State.Waiting.Reason = waitingReason\n\t\t\t}\n\t\t}\n\t\tfor _, cs := range apiPodStatus.ContainerStatuses {\n\t\t\tif cs.State.Waiting != nil {\n\t\t\t\tcs.State.Waiting.Reason = waitingReason\n\t\t\t}\n\t\t}\n\t}\n\n\t// Update status in the status manager\n\t// 6. 更新statusManager中该pod status。查看下去最终就是调用了apisever client更新了pod状态\n\tkl.statusManager.SetPodStatus(pod, apiPodStatus)\n    \n    // 7. 
如果pod不能运行在该node上，或者有DeletionTimestamp，或者Pod状态为failed，调用killpod函数删除该pod\n\t// Kill pod if it should not be running\n\tif !runnable.Admit || pod.DeletionTimestamp != nil || apiPodStatus.Phase == v1.PodFailed {\n\t\tvar syncErr error\n\t\tif err := kl.killPod(pod, nil, podStatus, nil); err != nil {\n\t\t\tkl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToKillPod, \"error killing pod: %v\", err)\n\t\t\tsyncErr = fmt.Errorf(\"error killing pod: %v\", err)\n\t\t\tutilruntime.HandleError(syncErr)\n\t\t} else {\n\t\t\tif !runnable.Admit {\n\t\t\t\t// There was no error killing the pod, but the pod cannot be run.\n\t\t\t\t// Return an error to signal that the sync loop should back off.\n\t\t\t\tsyncErr = fmt.Errorf(\"pod cannot be run: %s\", runnable.Message)\n\t\t\t}\n\t\t}\n\t\treturn syncErr\n\t}\n     \n    // 8.pod使用的不是hostNetwork，并且如果网络插件没有准备好，报错后返回\n\t// If the network plugin is not ready, only start the pod if it uses the host network\n\tif err := kl.runtimeState.networkErrors(); err != nil && !kubecontainer.IsHostNetworkPod(pod) {\n\t\tkl.recorder.Eventf(pod, v1.EventTypeWarning, events.NetworkNotReady, \"%s: %v\", NetworkNotReadyErrorMsg, err)\n\t\treturn fmt.Errorf(\"%s: %v\", NetworkNotReadyErrorMsg, err)\n\t}\n\n\t// Create Cgroups for the pod and apply resource parameters\n\t// to them if cgroups-per-qos flag is enabled.\n\tpcm := kl.containerManager.NewPodContainerManager()\n\t// If pod has already been terminated then we need not create\n\t// or update the pod's cgroup\n\t// 9.如果pod没有被设置删除，并且是第一次出现，更新pod的cgroup。\n\tif !kl.podIsTerminated(pod) {\n\t\t// When the kubelet is restarted with the cgroups-per-qos\n\t\t// flag enabled, all the pod's running containers\n\t\t// should be killed intermittently and brought back up\n\t\t// under the qos cgroup hierarchy.\n\t\t// Check if this is the pod's first sync\n\t\tfirstSync := true\n\t\tfor _, containerStatus := range apiPodStatus.ContainerStatuses {\n\t\t\tif containerStatus.State.Running != nil 
{\n\t\t\t\tfirstSync = false\n\t\t\t\tbreak\n\t\t\t}\n\t\t}\n\t\t// Don't kill containers in pod if pod's cgroups already\n\t\t// exists or the pod is running for the first time\n\t\tpodKilled := false\n\t\tif !pcm.Exists(pod) && !firstSync {\n\t\t\tif err := kl.killPod(pod, nil, podStatus, nil); err == nil {\n\t\t\t\tpodKilled = true\n\t\t\t}\n\t\t}\n\t\t// Create and Update pod's Cgroups\n\t\t// Don't create cgroups for run once pod if it was killed above\n\t\t// The current policy is not to restart the run once pods when\n\t\t// the kubelet is restarted with the new flag as run once pods are\n\t\t// expected to run only once and if the kubelet is restarted then\n\t\t// they are not expected to run again.\n\t\t// We don't create and apply updates to cgroup if its a run once pod and was killed above\n\t\tif !(podKilled && pod.Spec.RestartPolicy == v1.RestartPolicyNever) {\n\t\t\tif !pcm.Exists(pod) {\n\t\t\t\tif err := kl.containerManager.UpdateQOSCgroups(); err != nil {\n\t\t\t\t\tklog.V(2).Infof(\"Failed to update QoS cgroups while syncing pod: %v\", err)\n\t\t\t\t}\n\t\t\t\tif err := pcm.EnsureExists(pod); err != nil {\n\t\t\t\t\tkl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToCreatePodContainer, \"unable to ensure pod container exists: %v\", err)\n\t\t\t\t\treturn fmt.Errorf(\"failed to ensure that the pod: %v cgroups exist and are correctly applied: %v\", pod.UID, err)\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\n\t// Create Mirror Pod for Static Pod if it doesn't already exist\n\t// 10.如果是staticpod，创建mirrorpod\n\tif kubetypes.IsStaticPod(pod) {\n\t\tpodFullName := kubecontainer.GetPodFullName(pod)\n\t\tdeleted := false\n\t\tif mirrorPod != nil {\n\t\t\tif mirrorPod.DeletionTimestamp != nil || !kl.podManager.IsMirrorPodOf(mirrorPod, pod) {\n\t\t\t\t// The mirror pod is semantically different from the static pod. Remove\n\t\t\t\t// it. 
The mirror pod will get recreated later.\n\t\t\t\tklog.Infof(\"Trying to delete pod %s %v\", podFullName, mirrorPod.ObjectMeta.UID)\n\t\t\t\tvar err error\n\t\t\t\tdeleted, err = kl.podManager.DeleteMirrorPod(podFullName, &mirrorPod.ObjectMeta.UID)\n\t\t\t\tif deleted {\n\t\t\t\t\tklog.Warningf(\"Deleted mirror pod %q because it is outdated\", format.Pod(mirrorPod))\n\t\t\t\t} else if err != nil {\n\t\t\t\t\tklog.Errorf(\"Failed deleting mirror pod %q: %v\", format.Pod(mirrorPod), err)\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tif mirrorPod == nil || deleted {\n\t\t\tnode, err := kl.GetNode()\n\t\t\tif err != nil || node.DeletionTimestamp != nil {\n\t\t\t\tklog.V(4).Infof(\"No need to create a mirror pod, since node %q has been removed from the cluster\", kl.nodeName)\n\t\t\t} else {\n\t\t\t\tklog.V(4).Infof(\"Creating a mirror pod for static pod %q\", format.Pod(pod))\n\t\t\t\tif err := kl.podManager.CreateMirrorPod(pod); err != nil {\n\t\t\t\t\tklog.Errorf(\"Failed creating a mirror pod for %q: %v\", format.Pod(pod), err)\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n     \n    // 11. 
为pod创建data目录。会在root-dir目录下创建 pods,以及pods/volume等目录。\n\t// Make data directories for the pod\n\tif err := kl.makePodDataDirs(pod); err != nil {\n\t\tkl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToMakePodDataDirectories, \"error making pod data directories: %v\", err)\n\t\tklog.Errorf(\"Unable to make pod data directories for pod %q: %v\", format.Pod(pod), err)\n\t\treturn err\n\t}\n\t\n    // 12.如果pod没有被terminate，等待volume attach/mount完成\n\t// Volume manager will not mount volumes for terminated pods\n\tif !kl.podIsTerminated(pod) {\n\t\t// Wait for volumes to attach/mount\n\t\tif err := kl.volumeManager.WaitForAttachAndMount(pod); err != nil {\n\t\t\tkl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedMountVolume, \"Unable to attach or mount volumes: %v\", err)\n\t\t\tklog.Errorf(\"Unable to attach or mount volumes for pod %q: %v; skipping pod\", format.Pod(pod), err)\n\t\t\treturn err\n\t\t}\n\t}\n    \n    // 13. 获取pod的secrets，并且调用containerRuntime.SyncPod同步\n\t// Fetch the pull secrets for the pod\n\tpullSecrets := kl.getPullSecretsForPod(pod)\n\n\t// Call the container runtime's SyncPod callback\n\tresult := kl.containerRuntime.SyncPod(pod, podStatus, pullSecrets, kl.backOff)\n\tkl.reasonCache.Update(pod.UID, result)\n\tif err := result.Error(); err != nil {\n\t\t// Do not return error if the only failures were pods in backoff\n\t\tfor _, r := range result.SyncResults {\n\t\t\tif r.Error != kubecontainer.ErrCrashLoopBackOff && r.Error != images.ErrImagePullBackOff {\n\t\t\t\t// Do not record an event here, as we keep all event logging for sync pod failures\n\t\t\t\t// local to container runtime so we get better errors\n\t\t\t\treturn err\n\t\t\t}\n\t\t}\n\n\t\treturn nil\n\t}\n\n\treturn nil\n}\n```\n\n##### 2.1.4 containerRuntime.SyncPod\n\n该函数是最核心的部分。它的主要功能是：输入Pod的期望状态，然后调用CRI对容器执行对应的操作，以达到期望状态。\n\n具体流程如下：\n\n（1）调用computePodActions 判断pod状态。主要是完善这个结构\n\n```\n\tchanges := podActions{\n\t\tKillPod:           createPodSandbox,     // bool值 判断是否需要kill 
pod\n\t\tCreateSandbox:     createPodSandbox,     // bool值 判断是否需要创建sandbox\n\t\tSandboxID:         sandboxID,            // string, sandboxId\n\t\tAttempt:           attempt,              // int，sandbox创建次数，每次+1\n\t\tContainersToStart: []int{},              // []int，需要start的容器下标 \n\t\tContainersToKill:  make(map[kubecontainer.ContainerID]containerToKillInfo),   // 哪些容器需要kill\n\t}\n```\n\n（2）如果sandbox改变了，kill pod。这里不会将pod重建，Pod名字等不会改变，只是清空pod的容器，然后重新创建init容器，pod ip是可能会变的；否则，就根据第一步得到的ContainersToKill列表，清理这些container\n\n（3）清理unknown、failed状态的initContainer\n\n（4）如果需要的话，创建sandbox\n\n（5）得到podSandboxConfig，用于后面的容器的启动\n\n（6）定义start 容器函数\n\n（7）start 临时容器，临时容器是什么可以参考：https://cloud.tencent.com/developer/article/1645954\n\n（8）启动Init容器\n\n（9）启动业务容器\n\n```\n// SyncPod syncs the running pod into the desired pod by executing following steps:\n//\n//  1. Compute sandbox and container changes.\n//  2. Kill pod sandbox if necessary.\n//  3. Kill any containers that should not be running.\n//  4. Create sandbox if necessary.\n//  5. Create ephemeral containers.\n//  6. Create init containers.\n//  7. 
Create normal containers.\nfunc (m *kubeGenericRuntimeManager) SyncPod(pod *v1.Pod, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, backOff *flowcontrol.Backoff) (result kubecontainer.PodSyncResult) {\n\t// Step 1: Compute sandbox and container changes.\n\t// 1.调用computePodActions 判断pod状态\n\tpodContainerChanges := m.computePodActions(pod, podStatus)\n\tklog.V(3).Infof(\"computePodActions got %+v for pod %q\", podContainerChanges, format.Pod(pod))\n\tif podContainerChanges.CreateSandbox {\n\t\tref, err := ref.GetReference(legacyscheme.Scheme, pod)\n\t\tif err != nil {\n\t\t\tklog.Errorf(\"Couldn't make a ref to pod %q: '%v'\", format.Pod(pod), err)\n\t\t}\n\t\tif podContainerChanges.SandboxID != \"\" {\n\t\t\tm.recorder.Eventf(ref, v1.EventTypeNormal, events.SandboxChanged, \"Pod sandbox changed, it will be killed and re-created.\")\n\t\t} else {\n\t\t\tklog.V(4).Infof(\"SyncPod received new pod %q, will create a sandbox for it\", format.Pod(pod))\n\t\t}\n\t}\n\n\t// Step 2: Kill the pod if the sandbox has changed.\n\t// 2.如果sandbox改变了，kill pod，这里不会将pod重建，Pod名字啥的不会改变。只是清空pod的容器，然后重新创建init容器。pod ip是    可能会变的, 否则，就根据第一步得的的ContainersToKill列表，清理这些container\n\tif podContainerChanges.KillPod {\n\t\tif podContainerChanges.CreateSandbox {\n\t\t\tklog.V(4).Infof(\"Stopping PodSandbox for %q, will start new one\", format.Pod(pod))\n\t\t} else {\n\t\t\tklog.V(4).Infof(\"Stopping PodSandbox for %q because all other containers are dead.\", format.Pod(pod))\n\t\t}\n\n\t\tkillResult := m.killPodWithSyncResult(pod, kubecontainer.ConvertPodStatusToRunningPod(m.runtimeName, podStatus), nil)\n\t\tresult.AddPodSyncResult(killResult)\n\t\tif killResult.Error() != nil {\n\t\t\tklog.Errorf(\"killPodWithSyncResult failed: %v\", killResult.Error())\n\t\t\treturn\n\t\t}\n\n\t\tif podContainerChanges.CreateSandbox {\n\t\t\tm.purgeInitContainers(pod, podStatus)\n\t\t}\n\t} else {\n\t\t// Step 3: kill any running containers in this pod which are not to keep.\n\t\tfor containerID, 
containerInfo := range podContainerChanges.ContainersToKill {\n\t\t\tklog.V(3).Infof(\"Killing unwanted container %q(id=%q) for pod %q\", containerInfo.name, containerID, format.Pod(pod))\n\t\t\tkillContainerResult := kubecontainer.NewSyncResult(kubecontainer.KillContainer, containerInfo.name)\n\t\t\tresult.AddSyncResult(killContainerResult)\n\t\t\tif err := m.killContainer(pod, containerID, containerInfo.name, containerInfo.message, nil); err != nil {\n\t\t\t\tkillContainerResult.Fail(kubecontainer.ErrKillContainer, err.Error())\n\t\t\t\tklog.Errorf(\"killContainer %q(id=%q) for pod %q failed: %v\", containerInfo.name, containerID, format.Pod(pod), err)\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\t}\n\n\t// Keep terminated init containers fairly aggressively controlled\n\t// This is an optimization because container removals are typically handled\n\t// by container garbage collector.\n\t// 3.清理unknow，faild状态的的initContainer\n\tm.pruneInitContainersBeforeStart(pod, podStatus)\n\n\t// We pass the value of the PRIMARY podIP and list of podIPs down to\n\t// generatePodSandboxConfig and generateContainerConfig, which in turn\n\t// passes it to various other functions, in order to facilitate functionality\n\t// that requires this value (hosts file and downward API) and avoid races determining\n\t// the pod IP in cases where a container requires restart but the\n\t// podIP isn't in the status manager yet. 
The list of podIPs is used to\n\t// generate the hosts file.\n\t//\n\t// We default to the IPs in the passed-in pod status, and overwrite them if the\n\t// sandbox needs to be (re)started.\n\tvar podIPs []string\n\tif podStatus != nil {\n\t\tpodIPs = podStatus.IPs\n\t}\n   \n  \n  // 4.如果需要的话，创建SandboxID\n\t// Step 4: Create a sandbox for the pod if necessary.\n\tpodSandboxID := podContainerChanges.SandboxID\n\tif podContainerChanges.CreateSandbox {\n\t\tvar msg string\n\t\tvar err error\n\n\t\tklog.V(4).Infof(\"Creating sandbox for pod %q\", format.Pod(pod))\n\t\tcreateSandboxResult := kubecontainer.NewSyncResult(kubecontainer.CreatePodSandbox, format.Pod(pod))\n\t\tresult.AddSyncResult(createSandboxResult)\n\t\tpodSandboxID, msg, err = m.createPodSandbox(pod, podContainerChanges.Attempt)\n\t\tif err != nil {\n\t\t\tcreateSandboxResult.Fail(kubecontainer.ErrCreatePodSandbox, msg)\n\t\t\tklog.Errorf(\"createPodSandbox for pod %q failed: %v\", format.Pod(pod), err)\n\t\t\tref, referr := ref.GetReference(legacyscheme.Scheme, pod)\n\t\t\tif referr != nil {\n\t\t\t\tklog.Errorf(\"Couldn't make a ref to pod %q: '%v'\", format.Pod(pod), referr)\n\t\t\t}\n\t\t\tm.recorder.Eventf(ref, v1.EventTypeWarning, events.FailedCreatePodSandBox, \"Failed to create pod sandbox: %v\", err)\n\t\t\treturn\n\t\t}\n\t\tklog.V(4).Infof(\"Created PodSandbox %q for pod %q\", podSandboxID, format.Pod(pod))\n\n\t\tpodSandboxStatus, err := m.runtimeService.PodSandboxStatus(podSandboxID)\n\t\tif err != nil {\n\t\t\tref, referr := ref.GetReference(legacyscheme.Scheme, pod)\n\t\t\tif referr != nil {\n\t\t\t\tklog.Errorf(\"Couldn't make a ref to pod %q: '%v'\", format.Pod(pod), referr)\n\t\t\t}\n\t\t\tm.recorder.Eventf(ref, v1.EventTypeWarning, events.FailedStatusPodSandBox, \"Unable to get pod sandbox status: %v\", err)\n\t\t\tklog.Errorf(\"Failed to get pod sandbox status: %v; Skipping pod %q\", err, format.Pod(pod))\n\t\t\tresult.Fail(err)\n\t\t\treturn\n\t\t}\n\n\t\t// If we ever allow updating 
a pod from non-host-network to\n\t\t// host-network, we may use a stale IP.\n\t\tif !kubecontainer.IsHostNetworkPod(pod) {\n\t\t\t// Overwrite the podIPs passed in the pod status, since we just started the pod sandbox.\n\t\t\tpodIPs = m.determinePodSandboxIPs(pod.Namespace, pod.Name, podSandboxStatus)\n\t\t\tklog.V(4).Infof(\"Determined the ip %v for pod %q after sandbox changed\", podIPs, format.Pod(pod))\n\t\t}\n\t}\n\n\t// the start containers routines depend on pod ip(as in primary pod ip)\n\t// instead of trying to figure out if we have 0 < len(podIPs)\n\t// everytime, we short circuit it here\n\tpodIP := \"\"\n\tif len(podIPs) != 0 {\n\t\tpodIP = podIPs[0]\n\t}\n  \n  // 5.得到podSandboxConfig，用于后面的容器的启动\n\t// Get podSandboxConfig for containers to start.\n\tconfigPodSandboxResult := kubecontainer.NewSyncResult(kubecontainer.ConfigPodSandbox, podSandboxID)\n\tresult.AddSyncResult(configPodSandboxResult)\n\tpodSandboxConfig, err := m.generatePodSandboxConfig(pod, podContainerChanges.Attempt)\n\tif err != nil {\n\t\tmessage := fmt.Sprintf(\"GeneratePodSandboxConfig for pod %q failed: %v\", format.Pod(pod), err)\n\t\tklog.Error(message)\n\t\tconfigPodSandboxResult.Fail(kubecontainer.ErrConfigPodSandbox, message)\n\t\treturn\n\t}\n\n\t// Helper containing boilerplate common to starting all types of containers.\n\t// typeName is a label used to describe this type of container in log messages,\n\t// currently: \"container\", \"init container\" or \"ephemeral container\"\n\t// 6.定义start 容器函数\n\tstart := func(typeName string, container *v1.Container) error {\n\t\tstartContainerResult := kubecontainer.NewSyncResult(kubecontainer.StartContainer, container.Name)\n\t\tresult.AddSyncResult(startContainerResult)\n\n\t\tisInBackOff, msg, err := m.doBackOff(pod, container, podStatus, backOff)\n\t\tif isInBackOff {\n\t\t\tstartContainerResult.Fail(err, msg)\n\t\t\tklog.V(4).Infof(\"Backing Off restarting %v %+v in pod %v\", typeName, container, format.Pod(pod))\n\t\t\treturn 
err\n\t\t}\n\n\t\tklog.V(4).Infof(\"Creating %v %+v in pod %v\", typeName, container, format.Pod(pod))\n\t\t// NOTE (aramase) podIPs are populated for single stack and dual stack clusters. Send only podIPs.\n\t\tif msg, err := m.startContainer(podSandboxID, podSandboxConfig, container, pod, podStatus, pullSecrets, podIP, podIPs); err != nil {\n\t\t\tstartContainerResult.Fail(err, msg)\n\t\t\t// known errors that are logged in other places are logged at higher levels here to avoid\n\t\t\t// repetitive log spam\n\t\t\tswitch {\n\t\t\tcase err == images.ErrImagePullBackOff:\n\t\t\t\tklog.V(3).Infof(\"%v start failed: %v: %s\", typeName, err, msg)\n\t\t\tdefault:\n\t\t\t\tutilruntime.HandleError(fmt.Errorf(\"%v start failed: %v: %s\", typeName, err, msg))\n\t\t\t}\n\t\t\treturn err\n\t\t}\n\n\t\treturn nil\n\t}\n  \n  // 7. start 临时容器\n\t// Step 5: start ephemeral containers\n\t// These are started \"prior\" to init containers to allow running ephemeral containers even when there\n\t// are errors starting an init container. In practice init containers will start first since ephemeral\n\t// containers cannot be specified on pod creation.\n\tif utilfeature.DefaultFeatureGate.Enabled(features.EphemeralContainers) {\n\t\tfor _, idx := range podContainerChanges.EphemeralContainersToStart {\n\t\t\tc := (*v1.Container)(&pod.Spec.EphemeralContainers[idx].EphemeralContainerCommon)\n\t\t\tstart(\"ephemeral container\", c)\n\t\t}\n\t}\n  \n  // 8. 启动init 容器\n\t// Step 6: start the init container.\n\tif container := podContainerChanges.NextInitContainerToStart; container != nil {\n\t\t// Start the next init container.\n\t\tif err := start(\"init container\", container); err != nil {\n\t\t\treturn\n\t\t}\n\n\t\t// Successfully started the container; clear the entry in the failure\n\t\tklog.V(4).Infof(\"Completed init container %q for pod %q\", container.Name, format.Pod(pod))\n\t}\n  \n  // 9. 
启动业务容器\n\t// Step 7: start containers in podContainerChanges.ContainersToStart.\n\tfor _, idx := range podContainerChanges.ContainersToStart {\n\t\tstart(\"container\", &pod.Spec.Containers[idx])\n\t}\n\n\treturn\n}\n```\n\n##### 2.1.5 总结\n\n再次梳理总结一下pod创建过程：有pod创建时，会被kubelet监听到，然后经过 dispatchWork -> managePodLoop -> syncPod\n\nkubelet主要做了以下几件事情：\n\n（1）更新pod的status，比如status.startTime， containerStatus等等\n\n（2）调用containerRuntime.SyncPod 完成 创建sandbox, 启动容器等过程\n\n（3）后面容器启动好了后，会触发pleg channel的update，这个后面再分析\n\n### 3. pod创建过程详细分析\n\n#### 3.1 创建sandbox过程做了什么工作\n\n##### 3.1.1 createPodSandbox\n\nSyncPod中通过 m.createPodSandbox 调用了 createPodSandbox\n\nkubeGenericRuntimeManager.createPodSandbox核心逻辑如下：\n\n（1）创建logdir。 默认是 /var/log/pods/XXX\n\n（2）调用runtimeService.RunPodSandbox创建sandbox\n\n```\n// createPodSandbox creates a pod sandbox and returns (podSandBoxID, message, error).\nfunc (m *kubeGenericRuntimeManager) createPodSandbox(pod *v1.Pod, attempt uint32) (string, string, error) {\n\tpodSandboxConfig, err := m.generatePodSandboxConfig(pod, attempt)\n\tif err != nil {\n\t\tmessage := fmt.Sprintf(\"GeneratePodSandboxConfig for pod %q failed: %v\", format.Pod(pod), err)\n\t\tklog.Error(message)\n\t\treturn \"\", message, err\n\t}\n\n\t// Create pod logs directory\n\t// 1.创建logdir。 默认是 /var/log/pods/XXX\n\terr = m.osInterface.MkdirAll(podSandboxConfig.LogDirectory, 0755)\n\tif err != nil {\n\t\tmessage := fmt.Sprintf(\"Create pod log directory for pod %q failed: %v\", format.Pod(pod), err)\n\t\tklog.Errorf(message)\n\t\treturn \"\", message, err\n\t}\n\n\truntimeHandler := \"\"\n\tif utilfeature.DefaultFeatureGate.Enabled(features.RuntimeClass) && m.runtimeClassManager != nil {\n\t\truntimeHandler, err = m.runtimeClassManager.LookupRuntimeHandler(pod.Spec.RuntimeClassName)\n\t\tif err != nil {\n\t\t\tmessage := fmt.Sprintf(\"CreatePodSandbox for pod %q failed: %v\", format.Pod(pod), err)\n\t\t\treturn \"\", message, err\n\t\t}\n\t\tif runtimeHandler != \"\" 
{\n\t\t\tklog.V(2).Infof(\"Running pod %s with RuntimeHandler %q\", format.Pod(pod), runtimeHandler)\n\t\t}\n\t}\n  \n  // 2.调用runtimeService.RunPodSandbox创建sandbox\n\tpodSandBoxID, err := m.runtimeService.RunPodSandbox(podSandboxConfig, runtimeHandler)\n\tif err != nil {\n\t\tmessage := fmt.Sprintf(\"CreatePodSandbox for pod %q failed: %v\", format.Pod(pod), err)\n\t\tklog.Error(message)\n\t\treturn \"\", message, err\n\t}\n\n\treturn podSandBoxID, \"\", nil\n}\n```\n\n##### 3.1.2 runtimeService.RunPodSandbox\n\nruntimeService.RunPodSandbox是一个接口，这里主要调用runtimeClient.RunPodSandbox\n\n```\n// RunPodSandbox creates and starts a pod-level sandbox. Runtimes should ensure\n// the sandbox is in ready state.\nfunc (r *RemoteRuntimeService) RunPodSandbox(config *runtimeapi.PodSandboxConfig, runtimeHandler string) (string, error) {\n\t// Use 2 times longer timeout for sandbox operation (4 mins by default)\n\t// TODO: Make the pod sandbox timeout configurable.\n\tctx, cancel := getContextWithTimeout(r.timeout * 2)\n\tdefer cancel()\n\n\tresp, err := r.runtimeClient.RunPodSandbox(ctx, &runtimeapi.RunPodSandboxRequest{\n\t\tConfig:         config,\n\t\tRuntimeHandler: runtimeHandler,\n\t})\n\tif err != nil {\n\t\tklog.Errorf(\"RunPodSandbox from runtime service failed: %v\", err)\n\t\treturn \"\", err\n\t}\n\n\tif resp.PodSandboxId == \"\" {\n\t\terrorMessage := fmt.Sprintf(\"PodSandboxId is not set for sandbox %q\", config.GetMetadata())\n\t\tklog.Errorf(\"RunPodSandbox failed: %s\", errorMessage)\n\t\treturn \"\", errors.New(errorMessage)\n\t}\n\n\treturn resp.PodSandboxId, nil\n}\n```\n\n##### 3.1.3 runtimeClient.RunPodSandbox\n\n这个也是个接口，在 k8s.io/cri-api/pkg/apis/runtime/v1alpha2/api.pb.go 中实现。并且指定了该服务的Handler是_RuntimeService_RunPodSandbox_Handler\n\n```\nfunc (c *runtimeServiceClient) RunPodSandbox(ctx context.Context, in *RunPodSandboxRequest, opts ...grpc.CallOption) (*RunPodSandboxResponse, error) {\n\tout := new(RunPodSandboxResponse)\n\terr := c.cc.Invoke(ctx, 
\"/runtime.v1alpha2.RuntimeService/RunPodSandbox\", in, out, opts...)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\treturn out, nil\n}\n\nvar _RuntimeService_serviceDesc = grpc.ServiceDesc{\n\tServiceName: \"runtime.v1alpha2.RuntimeService\",\n\tHandlerType: (*RuntimeServiceServer)(nil),\n\tMethods: []grpc.MethodDesc{\n\t\t{\n\t\t\tMethodName: \"Version\",\n\t\t\tHandler:    _RuntimeService_Version_Handler,\n\t\t},\n\t\t{\n\t\t\tMethodName: \"RunPodSandbox\",\n\t\t\tHandler:    _RuntimeService_RunPodSandbox_Handler,\n\t\t},\n```\n\n<br>\n\n_RuntimeService_RunPodSandbox_Handler调用了srv.(RuntimeServiceServer).RunPodSandbox处理。\n\n```\nfunc _RuntimeService_RunPodSandbox_Handler(srv interface{}, ctx context.Context, dec func(interface{}) error, interceptor grpc.UnaryServerInterceptor) (interface{}, error) {\n\tin := new(RunPodSandboxRequest)\n\tif err := dec(in); err != nil {\n\t\treturn nil, err\n\t}\n\tif interceptor == nil {\n\t\treturn srv.(RuntimeServiceServer).RunPodSandbox(ctx, in)\n\t}\n\tinfo := &grpc.UnaryServerInfo{\n\t\tServer:     srv,\n\t\tFullMethod: \"/runtime.v1alpha2.RuntimeService/RunPodSandbox\",\n\t}\n\thandler := func(ctx context.Context, req interface{}) (interface{}, error) {\n\t\treturn srv.(RuntimeServiceServer).RunPodSandbox(ctx, req.(*RunPodSandboxRequest))\n\t}\n\treturn interceptor(ctx, in, info, handler)\n}\n```\n\n<br>\n\n##### 3.1.4 srv.(RuntimeServiceServer).RunPodSandbox\n\n这个也是接口，主要是针对不同的容器运行时。这里分析的是docker，所以是如下的函数。\n\n该函数逻辑为：\n\n（1）拉取sandbox镜像，其实就是 pause\n\n（2）调用docker-shim往docker发送创建 sandbox容器的请求，就是pause容器\n\n（3）初始化networkReady为false，即默认网络还没有就绪\n\n（4）设置Sandbox Checkpoint\n\n（5）启动sandbox， 相当于docker start\n\n（6）覆盖容器的dnsConfig\n\n（7）如果是hostNetwork模式，到这里就可以返回了，因为不用分配ip\n\n（8）进行网络设置，包括分配 IP、设置 sandbox 内的路由、创建虚拟网卡等。这里主要是通过 network.SetUpPod 调用创建网络；\n\n通过network.TearDownPod 回收网络。这里具体是通过cni操作的。后面分析cni的章节再根据这2个函数展开。\n\n```\npkg/kubelet/dockershim/docker_sandbox.go\n// RunPodSandbox creates and starts a pod-level sandbox. 
Runtimes should ensure\n// the sandbox is in ready state.\n// For docker, PodSandbox is implemented by a container holding the network\n// namespace for the pod.\n// Note: docker doesn't use LogDirectory (yet).\nfunc (ds *dockerService) RunPodSandbox(ctx context.Context, r *runtimeapi.RunPodSandboxRequest) (*runtimeapi.RunPodSandboxResponse, error) {\n\tconfig := r.GetConfig()\n  \n  // 1.拉取sandbox镜像，其实就是 pause\n\t// Step 1: Pull the image for the sandbox.\n\timage := defaultSandboxImage\n\tpodSandboxImage := ds.podSandboxImage\n\tif len(podSandboxImage) != 0 {\n\t\timage = podSandboxImage\n\t}\n\n\t// NOTE: To use a custom sandbox image in a private repository, users need to configure the nodes with credentials properly.\n\t// see: http://kubernetes.io/docs/user-guide/images/#configuring-nodes-to-authenticate-to-a-private-repository\n\t// Only pull sandbox image when it's not present - v1.PullIfNotPresent.\n\tif err := ensureSandboxImageExists(ds.client, image); err != nil {\n\t\treturn nil, err\n\t}\n  \n  // 2.调用docker-shim往docker发送创建 sandbox容器的请求，就是pause容器\n\t// Step 2: Create the sandbox container.\n\tif r.GetRuntimeHandler() != \"\" && r.GetRuntimeHandler() != runtimeName {\n\t\treturn nil, fmt.Errorf(\"RuntimeHandler %q not supported\", r.GetRuntimeHandler())\n\t}\n\tcreateConfig, err := ds.makeSandboxDockerConfig(config, image)\n\tif err != nil {\n\t\treturn nil, fmt.Errorf(\"failed to make sandbox docker config for pod %q: %v\", config.Metadata.Name, err)\n\t}\n\tcreateResp, err := ds.client.CreateContainer(*createConfig)\n\tif err != nil {\n\t\tcreateResp, err = recoverFromCreationConflictIfNeeded(ds.client, *createConfig, err)\n\t}\n\n\tif err != nil || createResp == nil {\n\t\treturn nil, fmt.Errorf(\"failed to create a sandbox for pod %q: %v\", config.Metadata.Name, err)\n\t}\n\tresp := &runtimeapi.RunPodSandboxResponse{PodSandboxId: createResp.ID}\n  \n  // 3.初始化networkReady为false，就是默认网络还没好\n\tds.setNetworkReady(createResp.ID, false)\n\tdefer func(e 
*error) {\n\t\t// Set networking ready depending on the error return of\n\t\t// the parent function\n\t\tif *e == nil {\n\t\t\tds.setNetworkReady(createResp.ID, true)\n\t\t}\n\t}(&err)\n  \n  // 4.设置Sandbox Checkpoint\n\t// Step 3: Create Sandbox Checkpoint.\n\tif err = ds.checkpointManager.CreateCheckpoint(createResp.ID, constructPodSandboxCheckpoint(config)); err != nil {\n\t\treturn nil, err\n\t}\n  \n  // 5.启动sandbox， 相当于docker start\n\t// Step 4: Start the sandbox container.\n\t// Assume kubelet's garbage collector would remove the sandbox later, if\n\t// startContainer failed.\n\terr = ds.client.StartContainer(createResp.ID)\n\tif err != nil {\n\t\treturn nil, fmt.Errorf(\"failed to start sandbox container for pod %q: %v\", config.Metadata.Name, err)\n\t}\n\n\t// Rewrite resolv.conf file generated by docker.\n\t// NOTE: cluster dns settings aren't passed anymore to docker api in all cases,\n\t// not only for pods with host network: the resolver conf will be overwritten\n\t// after sandbox creation to override docker's behaviour. This resolv.conf\n\t// file is shared by all containers of the same pod, and needs to be modified\n\t// only once per pod.\n\t// 6. 
覆盖容器的dnsConfig\n\tif dnsConfig := config.GetDnsConfig(); dnsConfig != nil {\n\t\tcontainerInfo, err := ds.client.InspectContainer(createResp.ID)\n\t\tif err != nil {\n\t\t\treturn nil, fmt.Errorf(\"failed to inspect sandbox container for pod %q: %v\", config.Metadata.Name, err)\n\t\t}\n\n\t\tif err := rewriteResolvFile(containerInfo.ResolvConfPath, dnsConfig.Servers, dnsConfig.Searches, dnsConfig.Options); err != nil {\n\t\t\treturn nil, fmt.Errorf(\"rewrite resolv.conf failed for pod %q: %v\", config.Metadata.Name, err)\n\t\t}\n\t}\n  \n  // 7.如果是hostNetWork模式，到这里就可以返回了，因为不用分配ip\n\t// Do not invoke network plugins if in hostNetwork mode.\n\tif config.GetLinux().GetSecurityContext().GetNamespaceOptions().GetNetwork() == runtimeapi.NamespaceMode_NODE {\n\t\treturn resp, nil\n\t}\n\n\t// Step 5: Setup networking for the sandbox.\n\t// All pod networking is setup by a CNI plugin discovered at startup time.\n\t// This plugin assigns the pod ip, sets up routes inside the sandbox,\n\t// creates interfaces etc. 
In theory, its jurisdiction ends with pod\n\t// sandbox networking, but it might insert iptables rules or open ports\n\t// on the host as well, to satisfy parts of the pod spec that aren't\n\t// recognized by the CNI standard yet.\n\t// 8.进行网络设置，包括分配 IP、设置 sandbox 内的路由、创建虚拟网卡等。\n\tcID := kubecontainer.BuildContainerID(runtimeName, createResp.ID)\n\tnetworkOptions := make(map[string]string)\n\tif dnsConfig := config.GetDnsConfig(); dnsConfig != nil {\n\t\t// Build DNS options.\n\t\tdnsOption, err := json.Marshal(dnsConfig)\n\t\tif err != nil {\n\t\t\treturn nil, fmt.Errorf(\"failed to marshal dns config for pod %q: %v\", config.Metadata.Name, err)\n\t\t}\n\t\tnetworkOptions[\"dns\"] = string(dnsOption)\n\t}\n\terr = ds.network.SetUpPod(config.GetMetadata().Namespace, config.GetMetadata().Name, cID, config.Annotations, networkOptions)\n\tif err != nil {\n\t\terrList := []error{fmt.Errorf(\"failed to set up sandbox container %q network for pod %q: %v\", createResp.ID, config.Metadata.Name, err)}\n\n\t\t// Ensure network resources are cleaned up even if the plugin\n\t\t// succeeded but an error happened between that success and here.\n\t\terr = ds.network.TearDownPod(config.GetMetadata().Namespace, config.GetMetadata().Name, cID)\n\t\tif err != nil {\n\t\t\terrList = append(errList, fmt.Errorf(\"failed to clean up sandbox container %q network for pod %q: %v\", createResp.ID, config.Metadata.Name, err))\n\t\t}\n\n\t\terr = ds.client.StopContainer(createResp.ID, defaultSandboxGracePeriod)\n\t\tif err != nil {\n\t\t\terrList = append(errList, fmt.Errorf(\"failed to stop sandbox container %q for pod %q: %v\", createResp.ID, config.Metadata.Name, err))\n\t\t}\n\n\t\treturn resp, utilerrors.NewAggregate(errList)\n\t}\n\n\treturn resp, nil\n}\n```\n\n##### 3.1.5 总结\n\n创建sandbox就是 启动了pause容器，同时通过cni 创建了网络资源\n\n<br>\n\n#### 3.2 start init/业务容器做了什么操作\n\n##### 3.2.1 start\n\nstart init/业务容器其实都是一样的，只不过是顺序不同而已。先init 容器，再业务容器。具体是调用start函数。\n\n```\nstart := func(typeName string, 
container *v1.Container) error {\n\t\tstartContainerResult := kubecontainer.NewSyncResult(kubecontainer.StartContainer, container.Name)\n\t\tresult.AddSyncResult(startContainerResult)\n    \n    // 1.检查容器是否backoff（可以理解为失败），并给出具体原因\n\t\tisInBackOff, msg, err := m.doBackOff(pod, container, podStatus, backOff)\n\t\tif isInBackOff {\n\t\t\tstartContainerResult.Fail(err, msg)\n\t\t\tklog.V(4).Infof(\"Backing Off restarting %v %+v in pod %v\", typeName, container, format.Pod(pod))\n\t\t\treturn err\n\t\t}\n    \n    // 2.调用startContainer启动容器\n\t\tklog.V(4).Infof(\"Creating %v %+v in pod %v\", typeName, container, format.Pod(pod))\n\t\t// NOTE (aramase) podIPs are populated for single stack and dual stack clusters. Send only podIPs.\n\t\tif msg, err := m.startContainer(podSandboxID, podSandboxConfig, container, pod, podStatus, pullSecrets, podIP, podIPs); err != nil {\n\t\t\tstartContainerResult.Fail(err, msg)\n\t\t\t// known errors that are logged in other places are logged at higher levels here to avoid\n\t\t\t// repetitive log spam\n\t\t\tswitch {\n\t\t\tcase err == images.ErrImagePullBackOff:\n\t\t\t\tklog.V(3).Infof(\"%v start failed: %v: %s\", typeName, err, msg)\n\t\t\tdefault:\n\t\t\t\tutilruntime.HandleError(fmt.Errorf(\"%v start failed: %v: %s\", typeName, err, msg))\n\t\t\t}\n\t\t\treturn err\n\t\t}\n\n\t\treturn nil\n\t}\n```\n\n##### 3.2.2 startContainer\n\n看注释该函数的逻辑就很清楚了。\n\n（1）拉取镜像\n\n（2）创建容器\n\n（3）start容器\n\n（4）执行postStart hook，就是容器启动后执行的钩子操作\n\n```\n// startContainer starts a container and returns a message indicates why it is failed on error.\n// It starts the container through the following steps:\n// * pull the image\n// * create the container\n// * start the container\n// * run the post start lifecycle hooks (if applicable)\nfunc (m *kubeGenericRuntimeManager) startContainer(podSandboxID string, podSandboxConfig *runtimeapi.PodSandboxConfig, container *v1.Container, pod *v1.Pod, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, podIP string, 
podIPs []string) (string, error) {\n\t// Step 1: pull the image.\n\timageRef, msg, err := m.imagePuller.EnsureImageExists(pod, container, pullSecrets, podSandboxConfig)\n\tif err != nil {\n\t\ts, _ := grpcstatus.FromError(err)\n\t\tm.recordContainerEvent(pod, container, \"\", v1.EventTypeWarning, events.FailedToCreateContainer, \"Error: %v\", s.Message())\n\t\treturn msg, err\n\t}\n\n\t// Step 2: create the container.\n\tref, err := kubecontainer.GenerateContainerRef(pod, container)\n\tif err != nil {\n\t\tklog.Errorf(\"Can't make a ref to pod %q, container %v: %v\", format.Pod(pod), container.Name, err)\n\t}\n\tklog.V(4).Infof(\"Generating ref for container %s: %#v\", container.Name, ref)\n\n\t// For a new container, the RestartCount should be 0\n\trestartCount := 0\n\tcontainerStatus := podStatus.FindContainerStatusByName(container.Name)\n\tif containerStatus != nil {\n\t\trestartCount = containerStatus.RestartCount + 1\n\t}\n\n\tcontainerConfig, cleanupAction, err := m.generateContainerConfig(container, pod, restartCount, podIP, imageRef, podIPs)\n\tif cleanupAction != nil {\n\t\tdefer cleanupAction()\n\t}\n\tif err != nil {\n\t\ts, _ := grpcstatus.FromError(err)\n\t\tm.recordContainerEvent(pod, container, \"\", v1.EventTypeWarning, events.FailedToCreateContainer, \"Error: %v\", s.Message())\n\t\treturn s.Message(), ErrCreateContainerConfig\n\t}\n\n\tcontainerID, err := m.runtimeService.CreateContainer(podSandboxID, containerConfig, podSandboxConfig)\n\tif err != nil {\n\t\ts, _ := grpcstatus.FromError(err)\n\t\tm.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToCreateContainer, \"Error: %v\", s.Message())\n\t\treturn s.Message(), ErrCreateContainer\n\t}\n\terr = m.internalLifecycle.PreStartContainer(pod, container, containerID)\n\tif err != nil {\n\t\ts, _ := grpcstatus.FromError(err)\n\t\tm.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToStartContainer, \"Internal PreStartContainer hook 
failed: %v\", s.Message())\n\t\treturn s.Message(), ErrPreStartHook\n\t}\n\tm.recordContainerEvent(pod, container, containerID, v1.EventTypeNormal, events.CreatedContainer, fmt.Sprintf(\"Created container %s\", container.Name))\n\n\tif ref != nil {\n\t\tm.containerRefManager.SetRef(kubecontainer.ContainerID{\n\t\t\tType: m.runtimeName,\n\t\t\tID:   containerID,\n\t\t}, ref)\n\t}\n\n\t// Step 3: start the container.\n\terr = m.runtimeService.StartContainer(containerID)\n\tif err != nil {\n\t\ts, _ := grpcstatus.FromError(err)\n\t\tm.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToStartContainer, \"Error: %v\", s.Message())\n\t\treturn s.Message(), kubecontainer.ErrRunContainer\n\t}\n\tm.recordContainerEvent(pod, container, containerID, v1.EventTypeNormal, events.StartedContainer, fmt.Sprintf(\"Started container %s\", container.Name))\n\n\t// Symlink container logs to the legacy container log location for cluster logging\n\t// support.\n\t// TODO(random-liu): Remove this after cluster logging supports CRI container log path.\n\tcontainerMeta := containerConfig.GetMetadata()\n\tsandboxMeta := podSandboxConfig.GetMetadata()\n\tlegacySymlink := legacyLogSymlink(containerID, containerMeta.Name, sandboxMeta.Name,\n\t\tsandboxMeta.Namespace)\n\tcontainerLog := filepath.Join(podSandboxConfig.LogDirectory, containerConfig.LogPath)\n\t// only create legacy symlink if containerLog path exists (or the error is not IsNotExist).\n\t// Because if containerLog path does not exist, only dandling legacySymlink is created.\n\t// This dangling legacySymlink is later removed by container gc, so it does not make sense\n\t// to create it in the first place. 
it happens when journald logging driver is used with docker.\n\tif _, err := m.osInterface.Stat(containerLog); !os.IsNotExist(err) {\n\t\tif err := m.osInterface.Symlink(containerLog, legacySymlink); err != nil {\n\t\t\tklog.Errorf(\"Failed to create legacy symbolic link %q to container %q log %q: %v\",\n\t\t\t\tlegacySymlink, containerID, containerLog, err)\n\t\t}\n\t}\n\n\t// Step 4: execute the post start hook.\n\tif container.Lifecycle != nil && container.Lifecycle.PostStart != nil {\n\t\tkubeContainerID := kubecontainer.ContainerID{\n\t\t\tType: m.runtimeName,\n\t\t\tID:   containerID,\n\t\t}\n\t\tmsg, handlerErr := m.runner.Run(kubeContainerID, pod, container, container.Lifecycle.PostStart)\n\t\tif handlerErr != nil {\n\t\t\tm.recordContainerEvent(pod, container, kubeContainerID.ID, v1.EventTypeWarning, events.FailedPostStartHook, msg)\n\t\t\tif err := m.killContainer(pod, kubeContainerID, container.Name, \"FailedPostStartHook\", nil); err != nil {\n\t\t\t\tklog.Errorf(\"Failed to kill container %q(id=%q) in pod %q: %v, %v\",\n\t\t\t\t\tcontainer.Name, kubeContainerID.String(), format.Pod(pod), ErrPostStartHook, err)\n\t\t\t}\n\t\t\treturn msg, fmt.Errorf(\"%s: %v\", ErrPostStartHook, handlerErr)\n\t\t}\n\t}\n\n\treturn \"\", nil\n}\n```\n\n<br>\n\n### 4.总结\n\n到这里为止，Pod在kubelet的创建流程就清楚了。了解整个过程会对排查问题，优化调度有所帮助。\n\n比如看到pod有开始拉取业务容器image的动作时，可以确定网络初始化已经完成了，如果这个时候发现pod没有podip，那肯定就是有问题的。\n\n这里再次粗略总结一下整个流程：\n\n（1）kubelet监听pod的创建，然后进行处理\n\n（2）更新Pod应有的status，然后调用containerRuntime.SyncPod 完成 创建sandbox, 启动容器等过程, 以达到pod的期望状态\n\n（3）先创建了pod 的目录包括 log，volume目录等等。然后启动sandbox，sandbox启动后会完成网络的初始化\n\n（4）成功启动sandbox后，再依次启动Init容器、业务容器。启动的过程是先拉取镜像，然后再create container， start container\n\n"
  },
  {
    "path": "k8s/kubelet/6-pod pleg更新流程.md",
    "content": "* [1\\. 背景](#1-背景)\n* [2\\. HandlePodSyncs](#2-handlepodsyncs)\n  * [2\\.1 dispatchWork](#21-dispatchwork)\n  * [2\\.2 UpdatePod](#22-updatepod)\n  * [2\\.3 managePodLoop](#23-managepodloop)\n  * [2\\.4 syncPod](#24-syncpod)\n  * [2\\.5 containerRuntime\\.SyncPod](#25-containerruntimesyncpod)\n* [3 总结](#3-总结)\n\n### 1. 背景\n\n本节介绍pod更新流程。本节目的是假设创建pod后。podA的容器啥的都起来了，kubelet是如何更新podA的。\n\n在 kubelet初始化流程-下 的章节中，介绍到pleg的生产逻辑是：\n\npleg.Start每隔1秒，运行以此relist函数，relist的逻辑如下：\n\n1. 记录上一次relist的时间和间隔\n2. 通过runtimeApi获取所有的pod，包括exit的Pod\n3. 更新pods container状态,以及记录pod数量等metrics\n4. 和旧的Pod进行对比，podRecord结构体保存了旧的，和当前pod的信息，可以理解为和1s前的所有pods进行对比。对比完产生event，保存在一个map中。这里主要产生的事件为：ContainerStarted，ContainerDie, ContainerRemoved, ContainerChanged等等\n5. 如果event和Pod有绑定，并且kubelet开启了cache缓存pod信息，根据最新的信息同步缓存\n6. 将新的record赋值为旧的，为下一轮做准备，然后依次处理event，逻辑为：不是ContainerChanged状态的event都发送到eventChannel中去\n7. 更新缓存，如果有更新失败的，记录到needsReinspection，表示下一次还需要重试\n\n<br>\n\npleg的消费逻辑是:  如果pod的状态发生了改变，并且Pod还存在，调用HandlePodSyncs进行同步\n\n本节就是分析，podA的容器都起来了，pleg 调用HandlePodSyncs做了哪些工作。\n\n<br>\n\n### 2. 
HandlePodSyncs\n\nHandlePodSyncs调用了dispatchWork，接下来就和pod创建的流程一样了。只不过执行的函数逻辑不一样而已。\n\n在上一节中比较详细的描述了这个调用链的函数流程。接下来只是会简单过一下用到的流程，不会贴出来每个函数的详细实现了。建议看的时候，打开上一节的内容对比看。\n\n```\n// HandlePodSyncs is the callback in the syncHandler interface for pods\n// that should be dispatched to pod workers for sync.\nfunc (kl *Kubelet) HandlePodSyncs(pods []*v1.Pod) {\n\tstart := kl.clock.Now()\n\tfor _, pod := range pods {\n\t\tmirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)\n\t\tkl.dispatchWork(pod, kubetypes.SyncPodSync, mirrorPod, start)\n\t}\n}\n```\n\n<br>\n\n#### 2.1 dispatchWork\n\ndispatchWork调用UpdatePod函数\n\n```\nkl.podWorkers.UpdatePod(&UpdatePodOptions{\n\t\tPod:        pod,\n\t\tMirrorPod:  mirrorPod,\n\t\tUpdateType: syncType,\n\t\tOnCompleteFunc: func(err error) {\n\t\t\tif err != nil {\n\t\t\t\tmetrics.PodWorkerDuration.WithLabelValues(syncType.String()).Observe(metrics.SinceInSeconds(start))\n\t\t\t\tmetrics.DeprecatedPodWorkerLatency.WithLabelValues(syncType.String()).Observe(metrics.SinceInMicroseconds(start))\n\t\t\t}\n\t\t},\n\t})\n```\n\n<br>\n\n#### 2.2 UpdatePod\n\nUpdatePod主要是通过managePodLoop处理更新。因为kubelet收到Pod创建的时候已经启动了这个协程处理。\n\n```\np.managePodLoop(podUpdates)\n```\n\n#### 2.3 managePodLoop\n\nmanagePodLoop继续往下调用syncPodFn进行同步\n\n```\nerr = p.syncPodFn(syncPodOptions{\n\t\t\t\tmirrorPod:      update.MirrorPod,\n\t\t\t\tpod:            update.Pod,\n\t\t\t\tpodStatus:      status,   //当前最新状态\n\t\t\t\tkillPodOptions: update.KillPodOptions,\n\t\t\t\tupdateType:     update.UpdateType,\n\t\t\t})\n\t\t\tlastSyncTime = time.Now()\n\t\t\treturn err\n\t\t}()\n```\n\n#### 2.4 syncPod\n\n因为是更新Pod status，所以执行到这个函数只会执行以下逻辑：\n\n（1）设置pod的 apiStatus，然后更新statusManager中Pod状态\n\n（2）创建目录，目录已经存在不会报错\n\n（3）调用containerRuntime.SyncPod同步\n\n#### 2.5 containerRuntime.SyncPod\n\n这个函数的第一步就是调用computePodActions 判断pod状态。从这里可以得到：不需要创建新的sandbox，也没有容器需要创建或删除，所以就直接返回了，不用继续往下调用。\n\n### 3 总结\n\nPleg 
每隔1s执行一次对比，将有变化的容器产生的event发送到channel。对应的HandlePodSyncs会和创建Pod时一样处理，只不过跳过了很多步骤。\n\n该流程核心就是：生成最新状态，发送到apiserver"
  },
  {
    "path": "k8s/kubelet/7-pod delete流程.md",
    "content": "* [1\\.背景](#1背景)\n* [2\\. HandlePodUpdates](#2-handlepodupdates)\n  * [2\\.1 syncPodFn](#21-syncpodfn)\n  * [2\\.2 kl\\.killPod(pod, nil, podStatus, nil)](#22-klkillpodpod-nil-podstatus-nil)\n  * [2\\.3  kl\\.containerRuntime\\.KillPod(pod, p, gracePeriodOverride)](#23--klcontainerruntimekillpodpod-p-graceperiodoverride)\n  * [2\\.4 killContainersWithSyncResult 删除业务容器](#24-killcontainerswithsyncresult-删除业务容器)\n    * [2\\.4\\.1 StopContainer](#241-stopcontainer)\n  * [2\\.5 StopPodSandbox](#25-stoppodsandbox)\n  * [2\\.6 总结](#26-总结)\n* [3\\. Pod是如何被删除的](#3-pod是如何被删除的)\n  * [3\\.1 SetPodStatus](#31-setpodstatus)\n  * [3\\.2 updateStatusInternal](#32-updatestatusinternal)\n  * [3\\.3 statusManager\\.start](#33-statusmanagerstart)\n  * [3\\.4 m\\.syncPod(syncRequest\\.podUID, syncRequest\\.status)](#34-msyncpodsyncrequestpoduid-syncrequeststatus)\n  * [3\\.4 总结](#34-总结)\n* [4 kubelet监听到删除pod操作后做了什么操作](#4-kubelet监听到删除pod操作后做了什么操作)\n  * [4\\.1 HandlePodRemoves](#41-handlepodremoves)\n  * [4\\.2  kl\\.deletePod](#42--kldeletepod)\n  * [4\\.3 podKiller处理 podKillingCh](#43-podkiller处理-podkillingch)\n* [5\\. 总结](#5-总结)\n\n### 1.背景\n\n当一个pod删除时，client端向apiserver发送请求，apiserver将pod的deletionTimestamp打上时间。kubelet watch到该事件，开始处理。所以一开始 kubele监听到的其实是 update事件。\n\n所以通过分析kubelet delete其实也是分析了 apiserver更新pod流程。所以不单独写 apiserver更新pod，kubelet的流程处理了。\n\n<br>\n\n### 2. 
HandlePodUpdates\n\n由于在分析pod创建流程的时候已经将函数的流程都说明了，这里就不重复说明了，还是建议对比着看。\n\nHandlePodUpdates也是调用了dispatchWork进行处理\n\ndispatchWork调用UpdatePod\n\nUpdatePod调用managePodLoop\n\nmanagePodLoop调用syncPodFn\n\n```\n// dispatchWork调用UpdatePod\n// Run the sync in an async worker.\n\tkl.podWorkers.UpdatePod(&UpdatePodOptions{\n\t\tPod:        pod,\n\t\tMirrorPod:  mirrorPod,\n\t\tUpdateType: syncType,\n\t\tOnCompleteFunc: func(err error) {\n\t\t\tif err != nil {\n\t\t\t\tmetrics.PodWorkerDuration.WithLabelValues(syncType.String()).Observe(metrics.SinceInSeconds(start))\n\t\t\t\tmetrics.DeprecatedPodWorkerLatency.WithLabelValues(syncType.String()).Observe(metrics.SinceInMicroseconds(start))\n\t\t\t}\n\t\t},\n\t})\n\t\n\t// UpdatePod调用managePodLoop\n\tp.managePodLoop(podUpdates)\n\t\n\t// managePodLoop调用syncPodFn进行处理\n\terr = p.syncPodFn(syncPodOptions{\n\t\t\t\tmirrorPod:      update.MirrorPod,\n\t\t\t\tpod:            update.Pod,\n\t\t\t\tpodStatus:      status,\n\t\t\t\tkillPodOptions: update.KillPodOptions,\n\t\t\t\tupdateType:     update.UpdateType,\n\t\t\t})\n\t\t\tlastSyncTime = time.Now()\n\t\t\treturn err\n```\n\n<br>\n\n这里注意一个点就是：在dispatchWork函数中, kubelet是不会执行kl.statusManager.TerminatePod(pod) 函数的。\n\n因为podIsTerminated 判断pod是否terminated，并不是有了DeletionTimestamp就会认为是Terminated状态，而是有DeletionTimestamp且所有的容器都不再运行了。\n\n这个时候kubelet收到了pod update事件，并且有DeletionTimestamp，但是它的container还是running的。\n\n```\n// dispatchWork starts the asynchronous sync of the pod in a pod worker.\n// If the pod is terminated, dispatchWork\nfunc (kl *Kubelet) dispatchWork(pod *v1.Pod, syncType kubetypes.SyncPodType, mirrorPod *v1.Pod, start time.Time) {\n    \n\tif kl.podIsTerminated(pod) {\n\t\tif pod.DeletionTimestamp != nil {\n\t\t\t// If the pod is in a terminated state, there is no pod worker to\n\t\t\t// handle the work item. 
Check if the DeletionTimestamp has been\n\t\t\t// set, and force a status update to trigger a pod deletion request\n\t\t\t// to the apiserver.\n\t\t\tkl.statusManager.TerminatePod(pod)\n\t\t}\n\t\treturn\n\t}\n\t// Run the sync in an async worker.\n\tkl.podWorkers.UpdatePod(&UpdatePodOptions{\n\t\tPod:        pod,\n\t\tMirrorPod:  mirrorPod,\n\t\tUpdateType: syncType,\n\t\tOnCompleteFunc: func(err error) {\n\t\t\tif err != nil {\n\t\t\t\tmetrics.PodWorkerDuration.WithLabelValues(syncType.String()).Observe(metrics.SinceInSeconds(start))\n\t\t\t\tmetrics.DeprecatedPodWorkerLatency.WithLabelValues(syncType.String()).Observe(metrics.SinceInMicroseconds(start))\n\t\t\t}\n\t\t},\n\t})\n\t// Note the number of containers for new pods.\n\tif syncType == kubetypes.SyncPodCreate {\n\t\tmetrics.ContainersPerPodCount.Observe(float64(len(pod.Spec.Containers)))\n\t}\n}\n\n\n并不是有了DeletionTimestamp就会认为是Terminated状态，而是有DeletionTimestamp且所有的容器不在运行了\n// podIsTerminated returns true if pod is in the terminated state (\"Failed\" or \"Succeeded\").\nfunc (kl *Kubelet) podIsTerminated(pod *v1.Pod) bool {\n\t// Check the cached pod status which was set after the last sync.\n\tstatus, ok := kl.statusManager.GetPodStatus(pod.UID)\n\tif !ok {\n\t\t// If there is no cached status, use the status from the\n\t\t// apiserver. 
This is useful if kubelet has recently been\n\t\t// restarted.\n\t\tstatus = pod.Status\n\t}\n\treturn status.Phase == v1.PodFailed || status.Phase == v1.PodSucceeded || (pod.DeletionTimestamp != nil && notRunning(status.ContainerStatuses))\n}\n```\n\n<br>\n\n####  2.1 syncPodFn\n\n```\n在初始化syncPodFn函数的时候指定了函数为：syncPod。该函数的具体逻辑如下：\n\n（1）如果是要kill pod，调用SetPodStatus设置状态，并且调用killPod\n\n（2）如果是创建pod，先记录pod的 firstSeenTime时间\n\n（3）设置pod的 apiStatus，kubelet 监听得到的pod是没有status的。所以第一次创建的时候，kubelet会根据spec的内容，创建status，例如hostIP, podIP等等。\n\n（4）如果pod已经running了，记录pod从 firstSeenTime到running的时间\n\n（5）判断该Pod能否运行在这个节点上，如果不行给出原因，比如pid不够等原因\n\n（6）更新statusManager中该pod status。查看下去最终就是调用了apiserver client更新了pod状态\n\n（7）如果pod不能运行在该node上，或者有DeletionTimestamp，或者Pod状态为failed，调用killpod函数删除该pod\n\n（8）pod使用的不是hostNetwork，并且如果网络插件没有准备好，报错后返回\n\n（9）如果pod没有被设置删除，并且是第一次出现，更新pod的cgroup。这里最终会调用func (m *qosContainerManagerImpl) setCPUCgroupConfig 函数设置 cpu/mem等cgroup\n\n（10）如果是staticpod，创建mirrorpod\n\n（11）为pod创建data目录。会在root-dir目录下创建 pods,以及pods/volume等目录。 默认的root-dir为 /var/lib/kubelet，可以通过--root-dir修改\n\n（12）如果pod要删除，attach mount\n\n（13）获取pod的secrets，并且调用containerRuntime.SyncPod同步\n```\n\n到这里了，还是不要被欺骗了，这里的updateType还是update，而不是kill。所以会跳过第一步，而是执行第7步。\n\n```\n\t// Kill pod if it should not be running\n\tif !runnable.Admit || pod.DeletionTimestamp != nil || apiPodStatus.Phase == v1.PodFailed {\n\t\tvar syncErr error\n\t\tif err := kl.killPod(pod, nil, podStatus, nil); err != nil {\n\t\t\tkl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToKillPod, \"error killing pod: %v\", err)\n\t\t\tsyncErr = fmt.Errorf(\"error killing pod: %v\", err)\n\t\t\tutilruntime.HandleError(syncErr)\n\t\t} else {\n\t\t\tif !runnable.Admit {\n\t\t\t\t// There was no error killing the pod, but the pod cannot be run.\n\t\t\t\t// Return an error to signal that the sync loop should back off.\n\t\t\t\tsyncErr = fmt.Errorf(\"pod cannot be run: %s\", runnable.Message)\n\t\t\t}\n\t\t}\n\t\treturn syncErr\n\t}\n```\n\n<br>\n\n#### 
2.2 kl.killPod(pod, nil, podStatus, nil)\n\n参数介绍：\n\npod： apiserver传下的将要删除的pod\n\nrunningPod:  nil\n\nstatus: kubelet从runtime获得的真实pod状态\n\ngracePeriodOverride： nil\n\n该函数核心就是调用 kl.containerRuntime.KillPod(pod, p, gracePeriodOverride) 函数\n\n```\n// One of the following arguments must be non-nil: runningPod, status.\n// TODO: Modify containerRuntime.KillPod() to accept the right arguments.\nfunc (kl *Kubelet) killPod(pod *v1.Pod, runningPod *kubecontainer.Pod, status *kubecontainer.PodStatus, gracePeriodOverride *int64) error {\n\tvar p kubecontainer.Pod\n\tif runningPod != nil {\n\t\tp = *runningPod\n\t} else if status != nil {\n\t\tp = kubecontainer.ConvertPodStatusToRunningPod(kl.getRuntime().Type(), status)\n\t} else {\n\t\treturn fmt.Errorf(\"one of the two arguments must be non-nil: runningPod, status\")\n\t}\n\n\t// Call the container runtime KillPod method which stops all running containers of the pod\n\tif err := kl.containerRuntime.KillPod(pod, p, gracePeriodOverride); err != nil {\n\t\treturn err\n\t}\n\tif err := kl.containerManager.UpdateQOSCgroups(); err != nil {\n\t\tklog.V(2).Infof(\"Failed to update QoS cgroups while killing pod: %v\", err)\n\t}\n\treturn nil\n}\n```\n\n#### 2.3  kl.containerRuntime.KillPod(pod, p, gracePeriodOverride)\n\nKillPod 直接调用的是 killPodWithSyncResult。注意gracePeriodOverride=nil。 表示这一次是优雅删除，不是强制的。\n\n```\n// KillPod kills all the containers of a pod. 
Pod may be nil, running pod must not be.\n// gracePeriodOverride if specified allows the caller to override the pod default grace period.\n// only hard kill paths are allowed to specify a gracePeriodOverride in the kubelet in order to not corrupt user data.\n// it is useful when doing SIGKILL for hard eviction scenarios, or max grace period during soft eviction scenarios.\nfunc (m *kubeGenericRuntimeManager) KillPod(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) error {\n\terr := m.killPodWithSyncResult(pod, runningPod, gracePeriodOverride)\n\treturn err.Error()\n}\n```\n\n<br>\n\nkillPodWithSyncResult 核心是：\n\n（1）调用killContainersWithSyncResult 函数删除业务容器\n\n（2）调用StopPodSandbox 函数停止sandbox容器。这里有一点：只是停止sandbox，清理工作会由GC来做\n\n```\n// killPodWithSyncResult kills a runningPod and returns SyncResult.\n// Note: The pod passed in could be *nil* when kubelet restarted.\nfunc (m *kubeGenericRuntimeManager) killPodWithSyncResult(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) (result kubecontainer.PodSyncResult) {\n\tkillContainerResults := m.killContainersWithSyncResult(pod, runningPod, gracePeriodOverride)\n\tfor _, containerResult := range killContainerResults {\n\t\tresult.AddSyncResult(containerResult)\n\t}\n\n\t// stop sandbox, the sandbox will be removed in GarbageCollect\n\tkillSandboxResult := kubecontainer.NewSyncResult(kubecontainer.KillPodSandbox, runningPod.ID)\n\tresult.AddSyncResult(killSandboxResult)\n\t// Stop all sandboxes belongs to same pod\n\tfor _, podSandbox := range runningPod.Sandboxes {\n\t\tif err := m.runtimeService.StopPodSandbox(podSandbox.ID.ID); err != nil {\n\t\t\tkillSandboxResult.Fail(kubecontainer.ErrKillPodSandbox, err.Error())\n\t\t\tklog.Errorf(\"Failed to stop sandbox %q\", podSandbox.ID)\n\t\t}\n\t}\n\n\treturn\n}\n```\n\n<br>\n\n#### 2.4 killContainersWithSyncResult 删除业务容器\n\n这里就是并发调用m.killContainer，kill掉所有容器\n\n```\n// killContainersWithSyncResult kills all pod's containers with sync 
results.\nfunc (m *kubeGenericRuntimeManager) killContainersWithSyncResult(pod *v1.Pod, runningPod kubecontainer.Pod, gracePeriodOverride *int64) (syncResults []*kubecontainer.SyncResult) {\n\tcontainerResults := make(chan *kubecontainer.SyncResult, len(runningPod.Containers))\n\twg := sync.WaitGroup{}\n\n\twg.Add(len(runningPod.Containers))\n\tfor _, container := range runningPod.Containers {\n\t\tgo func(container *kubecontainer.Container) {\n\t\t\tdefer utilruntime.HandleCrash()\n\t\t\tdefer wg.Done()\n\n\t\t\tkillContainerResult := kubecontainer.NewSyncResult(kubecontainer.KillContainer, container.Name)\n\t\t\tif err := m.killContainer(pod, container.ID, container.Name, \"\", gracePeriodOverride); err != nil {\n\t\t\t\tkillContainerResult.Fail(kubecontainer.ErrKillContainer, err.Error())\n\t\t\t}\n\t\t\tcontainerResults <- killContainerResult\n\t\t}(container)\n\t}\n\twg.Wait()\n\tclose(containerResults)\n\n\tfor containerResult := range containerResults {\n\t\tsyncResults = append(syncResults, containerResult)\n\t}\n\treturn\n}\n```\n\n<br>\n\nkillContainer的逻辑也很简单：\n\n（1）在stop container之前运行 pre-stop的hooks命令\n\n（2）stop container。如果gracePeriodOverride!=nil， 这里会将gracePeriod带到stop container函数\n\n```\n// killContainer kills a container through the following steps:\n// * Run the pre-stop lifecycle hooks (if applicable).\n// * Stop the container.\nfunc (m *kubeGenericRuntimeManager) killContainer(pod *v1.Pod, containerID kubecontainer.ContainerID, containerName string, message string, gracePeriodOverride *int64) error {\n\tvar containerSpec *v1.Container\n\tif pod != nil {\n\t\tif containerSpec = kubecontainer.GetContainerSpec(pod, containerName); containerSpec == nil {\n\t\t\treturn fmt.Errorf(\"failed to get containerSpec %q(id=%q) in pod %q when killing container for reason %q\",\n\t\t\t\tcontainerName, containerID.String(), format.Pod(pod), message)\n\t\t}\n\t} else {\n\t\t// Restore necessary information if one of the specs is nil.\n\t\trestoredPod, 
restoredContainer, err := m.restoreSpecsFromContainerLabels(containerID)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tpod, containerSpec = restoredPod, restoredContainer\n\t}\n\n\t// From this point, pod and container must be non-nil.\n\tgracePeriod := int64(minimumGracePeriodInSeconds)\n\tswitch {\n\tcase pod.DeletionGracePeriodSeconds != nil:\n\t\tgracePeriod = *pod.DeletionGracePeriodSeconds\n\tcase pod.Spec.TerminationGracePeriodSeconds != nil:\n\t\tgracePeriod = *pod.Spec.TerminationGracePeriodSeconds\n\t}\n\n\tif len(message) == 0 {\n\t\tmessage = fmt.Sprintf(\"Stopping container %s\", containerSpec.Name)\n\t}\n\tm.recordContainerEvent(pod, containerSpec, containerID.ID, v1.EventTypeNormal, events.KillingContainer, message)\n\n\t// Run internal pre-stop lifecycle hook\n\tif err := m.internalLifecycle.PreStopContainer(containerID.ID); err != nil {\n\t\treturn err\n\t}\n\n\t// Run the pre-stop lifecycle hooks if applicable and if there is enough time to run it\n\tif containerSpec.Lifecycle != nil && containerSpec.Lifecycle.PreStop != nil && gracePeriod > 0 {\n\t\tgracePeriod = gracePeriod - m.executePreStopHook(pod, containerID, containerSpec, gracePeriod)\n\t}\n\t// always give containers a minimal shutdown window to avoid unnecessary SIGKILLs\n\tif gracePeriod < minimumGracePeriodInSeconds {\n\t\tgracePeriod = minimumGracePeriodInSeconds\n\t}\n\tif gracePeriodOverride != nil {\n\t\tgracePeriod = *gracePeriodOverride\n\t\tklog.V(3).Infof(\"Killing container %q, but using %d second grace period override\", containerID, gracePeriod)\n\t}\n\n\tklog.V(2).Infof(\"Killing container %q with %d second grace period\", containerID.String(), gracePeriod)\n\n\terr := m.runtimeService.StopContainer(containerID.ID, gracePeriod)\n\tif err != nil {\n\t\tklog.Errorf(\"Container %q termination failed with gracePeriod %d: %v\", containerID.String(), gracePeriod, err)\n\t} else {\n\t\tklog.V(3).Infof(\"Container %q exited normally\", 
containerID.String())\n\t}\n\n\tm.containerRefManager.ClearRef(containerID)\n\n\treturn err\n}\n```\n\n<br>\n\n##### 2.4.1 StopContainer\n\n最终是通过调用docker stop停止了容器。前面的gracePeriodOverride就是这里的超时参数。如果为nil，上面传入的默认超时是2s。\n\n```\n// StopContainer stops a running container with a grace period (i.e., timeout).\nfunc (r *RemoteRuntimeService) StopContainer(containerID string, timeout int64) error {\n\t// Use timeout + default timeout (2 minutes) as timeout to leave extra time\n\t// for SIGKILL container and request latency.\n\tt := r.timeout + time.Duration(timeout)*time.Second\n\tctx, cancel := getContextWithTimeout(t)\n\tdefer cancel()\n\n\tr.logReduction.ClearID(containerID)\n\t_, err := r.runtimeClient.StopContainer(ctx, &runtimeapi.StopContainerRequest{\n\t\tContainerId: containerID,\n\t\tTimeout:     timeout,\n\t})\n\tif err != nil {\n\t\tklog.Errorf(\"StopContainer %q from runtime service failed: %v\", containerID, err)\n\t\treturn err\n\t}\n\n\treturn nil\n}\n\n\n\n// StopContainer stops a running container with a grace period (i.e., timeout).\nfunc (ds *dockerService) StopContainer(_ context.Context, r *runtimeapi.StopContainerRequest) (*runtimeapi.StopContainerResponse, error) {\n\terr := ds.client.StopContainer(r.ContainerId, time.Duration(r.Timeout)*time.Second)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\treturn &runtimeapi.StopContainerResponse{}, nil\n}\n```\n\n<br>\n\n####  2.5 StopPodSandbox\n\nStopPodSandbox最终调用的是 dockershim的StopPodSandbox。该函数逻辑如下：\n\n（1）docker inspect 获取元数据\n\n（2）通过TearDownPod 清理网络，ip，网桥啥的\n\n（3）stop sandbox container\n\n```\npkg/kubelet/dockershim/docker_sandbox.go\n// StopPodSandbox stops the sandbox. If there are any running containers in the\n// sandbox, they should be force terminated.\n// TODO: This function blocks sandbox teardown on networking teardown. 
Is it\n// better to cut our losses assuming an out of band GC routine will cleanup\n// after us?\nfunc (ds *dockerService) StopPodSandbox(ctx context.Context, r *runtimeapi.StopPodSandboxRequest) (*runtimeapi.StopPodSandboxResponse, error) {\n\tvar namespace, name string\n\tvar hostNetwork bool\n\n\tpodSandboxID := r.PodSandboxId\n\tresp := &runtimeapi.StopPodSandboxResponse{}\n\n\t// Try to retrieve minimal sandbox information from docker daemon or sandbox checkpoint.\n\t// 1.docker inspect 获取元数据。这里看到了checkpoint的作用了，如果失败，还可以通过checkpint获取\n\tinspectResult, metadata, statusErr := ds.getPodSandboxDetails(podSandboxID)\n\tif statusErr == nil {\n\t\tnamespace = metadata.Namespace\n\t\tname = metadata.Name\n\t\thostNetwork = (networkNamespaceMode(inspectResult) == runtimeapi.NamespaceMode_NODE)\n\t} else {\n\t\tcheckpoint := NewPodSandboxCheckpoint(\"\", \"\", &CheckpointData{})\n\t\tcheckpointErr := ds.checkpointManager.GetCheckpoint(podSandboxID, checkpoint)\n\n\t\t// Proceed if both sandbox container and checkpoint could not be found. This means that following\n\t\t// actions will only have sandbox ID and not have pod namespace and name information.\n\t\t// Return error if encounter any unexpected error.\n\t\tif checkpointErr != nil {\n\t\t\tif checkpointErr != errors.ErrCheckpointNotFound {\n\t\t\t\terr := ds.checkpointManager.RemoveCheckpoint(podSandboxID)\n\t\t\t\tif err != nil {\n\t\t\t\t\tklog.Errorf(\"Failed to delete corrupt checkpoint for sandbox %q: %v\", podSandboxID, err)\n\t\t\t\t}\n\t\t\t}\n\t\t\tif libdocker.IsContainerNotFoundError(statusErr) {\n\t\t\t\tklog.Warningf(\"Both sandbox container and checkpoint for id %q could not be found. 
\"+\n\t\t\t\t\t\"Proceed without further sandbox information.\", podSandboxID)\n\t\t\t} else {\n\t\t\t\treturn nil, utilerrors.NewAggregate([]error{\n\t\t\t\t\tfmt.Errorf(\"failed to get checkpoint for sandbox %q: %v\", podSandboxID, checkpointErr),\n\t\t\t\t\tfmt.Errorf(\"failed to get sandbox status: %v\", statusErr)})\n\t\t\t}\n\t\t} else {\n\t\t\t_, name, namespace, _, hostNetwork = checkpoint.GetData()\n\t\t}\n\t}\n\n\t// WARNING: The following operations made the following assumption:\n\t// 1. kubelet will retry on any error returned by StopPodSandbox.\n\t// 2. tearing down network and stopping sandbox container can succeed in any sequence.\n\t// This depends on the implementation detail of network plugin and proper error handling.\n\t// For kubenet, if tearing down network failed and sandbox container is stopped, kubelet\n\t// will retry. On retry, kubenet will not be able to retrieve network namespace of the sandbox\n\t// since it is stopped. With empty network namespcae, CNI bridge plugin will conduct best\n\t// effort clean up and will not return error.\n\terrList := []error{}\n\tready, ok := ds.getNetworkReady(podSandboxID)\n\tif !hostNetwork && (ready || !ok) {\n\t\t// Only tear down the pod network if we haven't done so already\n\t\tcID := kubecontainer.BuildContainerID(runtimeName, podSandboxID)\n\t\terr := ds.network.TearDownPod(namespace, name, cID)\n\t\tif err == nil {\n\t\t\tds.setNetworkReady(podSandboxID, false)\n\t\t} else {\n\t\t\terrList = append(errList, err)\n\t\t}\n\t}\n\tif err := ds.client.StopContainer(podSandboxID, defaultSandboxGracePeriod); err != nil {\n\t\t// Do not return error if the container does not exist\n\t\tif !libdocker.IsContainerNotFoundError(err) {\n\t\t\tklog.Errorf(\"Failed to stop sandbox %q: %v\", podSandboxID, err)\n\t\t\terrList = append(errList, err)\n\t\t} else {\n\t\t\t// remove the checkpoint for any sandbox that is not found in the 
runtime\n\t\t\tds.checkpointManager.RemoveCheckpoint(podSandboxID)\n\t\t}\n\t}\n\n\tif len(errList) == 0 {\n\t\treturn resp, nil\n\t}\n\n\t// TODO: Stop all running containers in the sandbox.\n\treturn nil, utilerrors.NewAggregate(errList)\n}\n```\n\n<br>\n\n#### 2.6 总结\n\n客户端删除（优雅删除）的情况下，apiserver只是update了deleteTimestamp。然后kubelet监听到这个事件后停止了所有容器，清理了网络。\n\n### 3. Pod是如何被删除的\n\n从上面了解到客户端删除pod（一般是优雅删除），其实是apiserver给它打上了deleteTimestamp，这其实是一个update操作。\n\n然后kubelet收到这个update后，就会进行上面的操作：stop了所有容器，清理了网络。\n\n那pod是如何彻底被清除的呢？\n\n<br>\n\n从pleg的更新流程我们可以知道，当所有容器被stop的时候，其实也会触发pleg的一次update操作。这次update也会调用dispatchWork->UpdatePod->managePodLoop->syncPod。\n\n而syncPod的逻辑有一个很重要的步骤SetPodStatus。详细流程可以参考 pod创建流程那一章节。\n\nSetPodStatus核心是更新statusManager中Pod状态，但是如果pod deleteTimestamp!=nil 可能会调用apiserver删除pod操作。\n\n```\n（6）更新statusManager中该pod status，如果pod deleteTimestamp!=nil 可能会调用apiserver删除pod操作。\n// Update status in the status manager\n\tkl.statusManager.SetPodStatus(pod, apiPodStatus)\n```\n\n<br>\n\n#### 3.1 SetPodStatus\n\n这里核心是调用updateStatusInternal函数\n\n```\nfunc (m *manager) SetPodStatus(pod *v1.Pod, status v1.PodStatus) {\n\tm.podStatusesLock.Lock()\n\tdefer m.podStatusesLock.Unlock()\n\n\tfor _, c := range pod.Status.Conditions {\n\t\tif !kubetypes.PodConditionByKubelet(c.Type) {\n\t\t\tklog.Errorf(\"Kubelet is trying to update pod condition %q for pod %q. \"+\n\t\t\t\t\"But it is not owned by kubelet.\", string(c.Type), format.Pod(pod))\n\t\t}\n\t}\n\t// Make sure we're caching a deep copy.\n\tstatus = *status.DeepCopy()\n\n\t// Force a status update if deletion timestamp is set. 
This is necessary\n\t// because if the pod is in the non-running state, the pod worker still\n\t// needs to be able to trigger an update and/or deletion.\n\tm.updateStatusInternal(pod, status, pod.DeletionTimestamp != nil)\n}\n```\n\n#### 3.2 updateStatusInternal\n\nupdateStatusInternal的核心就是更新podStatus，然后往podStatusChannel发送状态\n\n```\n// updateStatusInternal updates the internal status cache, and queues an update to the api server if\n// necessary. Returns whether an update was triggered.\n// This method IS NOT THREAD SAFE and must be called from a locked function.\nfunc (m *manager) updateStatusInternal(pod *v1.Pod, status v1.PodStatus, forceUpdate bool) bool {\n\tvar oldStatus v1.PodStatus\n\t...\n\tm.podStatuses[pod.UID] = newStatus\n  \n  // 核心是往podStatusChannel发送状态\n\tselect {\n\tcase m.podStatusChannel <- podStatusSyncRequest{pod.UID, newStatus}:\n\t\tklog.V(5).Infof(\"Status Manager: adding pod: %q, with status: (%d, %v) to podStatusChannel\",\n\t\t\tpod.UID, newStatus.version, newStatus.status)\n\t\treturn true\n\tdefault:\n\t\t// Let the periodic syncBatch handle the update if the channel is full.\n\t\t// We can't block, since we hold the mutex lock.\n\t\tklog.V(4).Infof(\"Skipping the status update for pod %q for now because the channel is full; status: %+v\",\n\t\t\tformat.Pod(pod), status)\n\t\treturn false\n\t}\n}\n```\n\n<br>\n\n#### 3.3 statusManager.Start\n\n在Kubelet.Run的时候，statusManager.Start了起来\n\n```\n// Start component sync loops.\n\tkl.statusManager.Start()\n\tkl.probeManager.Start()\n```\n\n<br>\n\nStart函数的核心就是用一个协程处理podStatusChannel的数据。\n\n```\nfunc (m *manager) Start() {\n\t// Don't start the status manager if we don't have a client. 
This will happen\n\t// on the master, where the kubelet is responsible for bootstrapping the pods\n\t// of the master components.\n\tif m.kubeClient == nil {\n\t\tklog.Infof(\"Kubernetes client is nil, not starting status manager.\")\n\t\treturn\n\t}\n\n\tklog.Info(\"Starting to sync pod status with apiserver\")\n\t//lint:ignore SA1015 Ticker can link since this is only called once and doesn't handle termination.\n\tsyncTicker := time.Tick(syncPeriod)\n\t// syncPod and syncBatch share the same go routine to avoid sync races.\n\tgo wait.Forever(func() {\n\t\tselect {\n\t\tcase syncRequest := <-m.podStatusChannel:\n\t\t\tklog.V(5).Infof(\"Status Manager: syncing pod: %q, with status: (%d, %v) from podStatusChannel\",\n\t\t\t\tsyncRequest.podUID, syncRequest.status.version, syncRequest.status.status)\n\t\t\tm.syncPod(syncRequest.podUID, syncRequest.status)\n\t\tcase <-syncTicker:\n\t\t\tm.syncBatch()\n\t\t}\n\t}, 0)\n}\n```\n\n#### 3.4 m.syncPod(syncRequest.podUID, syncRequest.status)\n\nm.syncPod核心逻辑如下：\n\n（1）根据resourceVersion判断是否要更新\n\n（2）从apiserver获得pod的最新数据\n\n（3）调用apiserver接口更新podstatus\n\n（4）如果pod canBeDeleted，调用delete删除pod。只有这里的删除使用NewDeleteOptions(0)，表示不走优雅删除、立即删除\n\ncanBeDeleted的逻辑是同时满足以下条件：\n\n* pod不能是mirrorPod，并且有DeletionTimestamp\n\n* 没有容器运行\n\n* volumes已经被清除\n\n* cgroup已经被清除\n\n```\n// syncPod syncs the given status with the API server. 
The caller must not hold the lock.\nfunc (m *manager) syncPod(uid types.UID, status versionedPodStatus) {\n  // 1.根据resourceVersion判断是否要更新\n\tif !m.needsUpdate(uid, status) {\n\t\tklog.V(1).Infof(\"Status for pod %q is up-to-date; skipping\", uid)\n\t\treturn\n\t}\n  \n  // 2.从apiserver获得pod的最新数据\n\t// TODO: make me easier to express from client code\n\tpod, err := m.kubeClient.CoreV1().Pods(status.podNamespace).Get(status.podName, metav1.GetOptions{})\n\tif errors.IsNotFound(err) {\n\t\tklog.V(3).Infof(\"Pod %q does not exist on the server\", format.PodDesc(status.podName, status.podNamespace, uid))\n\t\t// If the Pod is deleted the status will be cleared in\n\t\t// RemoveOrphanedStatuses, so we just ignore the update here.\n\t\treturn\n\t}\n\tif err != nil {\n\t\tklog.Warningf(\"Failed to get status for pod %q: %v\", format.PodDesc(status.podName, status.podNamespace, uid), err)\n\t\treturn\n\t}\n\n\ttranslatedUID := m.podManager.TranslatePodUID(pod.UID)\n\t// Type convert original uid just for the purpose of comparison.\n\tif len(translatedUID) > 0 && translatedUID != kubetypes.ResolvedPodUID(uid) {\n\t\tklog.V(2).Infof(\"Pod %q was deleted and then recreated, skipping status update; old UID %q, new UID %q\", format.Pod(pod), uid, translatedUID)\n\t\tm.deletePodStatus(uid)\n\t\treturn\n\t}\n  \n  // 3.调用apiserver接口更新podstatus\n\toldStatus := pod.Status.DeepCopy()\n\tnewPod, patchBytes, err := statusutil.PatchPodStatus(m.kubeClient, pod.Namespace, pod.Name, pod.UID, *oldStatus, mergePodStatus(*oldStatus, status.status))\n\tklog.V(3).Infof(\"Patch status for pod %q with %q\", format.Pod(pod), patchBytes)\n\tif err != nil {\n\t\tklog.Warningf(\"Failed to update status for pod %q: %v\", format.Pod(pod), err)\n\t\treturn\n\t}\n\tpod = newPod\n\n\tklog.V(3).Infof(\"Status for pod %q updated successfully: (%d, %+v)\", format.Pod(pod), status.version, status.status)\n\tm.apiStatusVersions[kubetypes.MirrorPodUID(pod.UID)] = status.version\n  \n  // 4.如果pod canBeDeleted, 
调用delete删除pod。只有这里的删除使用NewDeleteOptions(0)，表示不走优雅删除、立即删除\n\t// We don't handle graceful deletion of mirror pods.\n\tif m.canBeDeleted(pod, status.status) {\n\t\tdeleteOptions := metav1.NewDeleteOptions(0)\n\t\t// Use the pod UID as the precondition for deletion to prevent deleting a newly created pod with the same name and namespace.\n\t\tdeleteOptions.Preconditions = metav1.NewUIDPreconditions(string(pod.UID))\n\t\terr = m.kubeClient.CoreV1().Pods(pod.Namespace).Delete(pod.Name, deleteOptions)\n\t\tif err != nil {\n\t\t\tklog.Warningf(\"Failed to delete status for pod %q: %v\", format.Pod(pod), err)\n\t\t\treturn\n\t\t}\n\t\tklog.V(3).Infof(\"Pod %q fully terminated and removed from etcd\", format.Pod(pod))\n\t\tm.deletePodStatus(uid)\n\t}\n}\n```\n\n而canBeDeleted的逻辑是什么样子的呢？\n\n```\nfunc (m *manager) canBeDeleted(pod *v1.Pod, status v1.PodStatus) bool {\n\tif pod.DeletionTimestamp == nil || kubetypes.IsMirrorPod(pod) {\n\t\treturn false\n\t}\n\treturn m.podDeletionSafety.PodResourcesAreReclaimed(pod, status)\n}\n```\n\n可以看出来canBeDeleted的逻辑是同时满足以下条件：\n\n（1）pod不能是mirrorPod，并且有DeletionTimestamp\n\n（2）没有容器运行\n\n（3）volumes已经被清除\n\n（4）cgroup已经被清除\n\n```\n// PodResourcesAreReclaimed returns true if all required node-level resources that a pod was consuming have\n// been reclaimed by the kubelet.  
Reclaiming resources is a prerequisite to deleting a pod from the API server.\nfunc (kl *Kubelet) PodResourcesAreReclaimed(pod *v1.Pod, status v1.PodStatus) bool {\n   if !notRunning(status.ContainerStatuses) {\n      // We shouldn't delete pods that still have running containers\n      klog.V(3).Infof(\"Pod %q is terminated, but some containers are still running\", format.Pod(pod))\n      return false\n   }\n   // pod's containers should be deleted\n   runtimeStatus, err := kl.podCache.Get(pod.UID)\n   if err != nil {\n      klog.V(3).Infof(\"Pod %q is terminated, Error getting runtimeStatus from the podCache: %s\", format.Pod(pod), err)\n      return false\n   }\n   if len(runtimeStatus.ContainerStatuses) > 0 {\n      var statusStr string\n      for _, status := range runtimeStatus.ContainerStatuses {\n         statusStr += fmt.Sprintf(\"%+v \", *status)\n      }\n      klog.V(3).Infof(\"Pod %q is terminated, but some containers have not been cleaned up: %s\", format.Pod(pod), statusStr)\n      return false\n   }\n   if kl.podVolumesExist(pod.UID) && !kl.keepTerminatedPodVolumes {\n      // We shouldn't delete pods whose volumes have not been cleaned up if we are not keeping terminated pod volumes\n      klog.V(3).Infof(\"Pod %q is terminated, but some volumes have not been cleaned up\", format.Pod(pod))\n      return false\n   }\n   if kl.kubeletConfiguration.CgroupsPerQOS {\n      pcm := kl.containerManager.NewPodContainerManager()\n      if pcm.Exists(pod) {\n         klog.V(3).Infof(\"Pod %q is terminated, but pod cgroup sandbox has not been cleaned up\", format.Pod(pod))\n         return false\n      }\n   }\n   return true\n}\n```\n\n#### 3.4 总结\n\n等容器清理完后，通过pleg的同步，在判断容器已停止、volume已清理后，最终会往apiserver发送一个强制删除pod的请求。这个时候apiserver才会往etcd调用删除pod的操作。\n\n### 4 kubelet监听到删除pod操作后做了什么操作\n\n参考 kubelet监听pod变化那一章节，可以知道apiserver从etcd删除了pod数据。kubelet最终会收到一个remove的update，对应HandlePodRemoves函数。\n\n#### 4.1 
HandlePodRemoves\n\n核心逻辑：\n\n（1）调用podManager.DeletePod，主要是从secretManager, configMapManager，checkpointManager等等地方删除这个pod，告诉他们这个pod真的不在了\n\n（2）调用kl.deletePod\n\n（3）从probeManager中移除该pod\n\n```\n// HandlePodRemoves is the callback in the SyncHandler interface for pods\n// being removed from a config source.\nfunc (kl *Kubelet) HandlePodRemoves(pods []*v1.Pod) {\n\tstart := kl.clock.Now()\n\tfor _, pod := range pods {\n\t\tkl.podManager.DeletePod(pod)\n\t\tif kubetypes.IsMirrorPod(pod) {\n\t\t\tkl.handleMirrorPod(pod, start)\n\t\t\tcontinue\n\t\t}\n\t\t// Deletion is allowed to fail because the periodic cleanup routine\n\t\t// will trigger deletion again.\n\t\tif err := kl.deletePod(pod); err != nil {\n\t\t\tklog.V(2).Infof(\"Failed to delete pod %q, err: %v\", format.Pod(pod), err)\n\t\t}\n\t\tkl.probeManager.RemovePod(pod)\n\t}\n}\n```\n\n<br>\n\n#### 4.2  kl.deletePod\n\n该函数核心就是：\n\n（1）停掉对应的PodWorker\n\n（2）发送到 kl.podKillingCh <- &podPair\n\n```\n// deletePod deletes the pod from the internal state of the kubelet by:\n// 1.  stopping the associated pod worker asynchronously\n// 2.  
signaling to kill the pod by sending on the podKillingCh channel\n//\n// deletePod returns an error if not all sources are ready or the pod is not\n// found in the runtime cache.\nfunc (kl *Kubelet) deletePod(pod *v1.Pod) error {\n\t...\n\t// 1.停掉对应的PodWorker\n\tkl.podWorkers.ForgetWorker(pod.UID)\n\n  \n  ...\n\t// 2.发送到podKillingCh\n\tpodPair := kubecontainer.PodPair{APIPod: pod, RunningPod: &runningPod}\n  \n  \n\tkl.podKillingCh <- &podPair\n\t// TODO: delete the mirror pod here?\n\n\t// We leave the volume/directory cleanup to the periodic cleanup routine.\n\treturn nil\n}\n```\n\n<br>\n\n#### 4.3 podKiller处理 podKillingCh\n\npodKiller就是调用killPod函数处理，这和第2节是一样的，做的是删除容器等操作，就不再展开了。\n\n如果pod是强制删除，那其实没有delete操作，而是直接remove，所以这一步也是必要的。\n\n```\n// podKiller launches a goroutine to kill a pod received from the channel if\n// another goroutine isn't already in action.\nfunc (kl *Kubelet) podKiller() {\n\tkilling := sets.NewString()\n\t// guard for the killing set\n\tlock := sync.Mutex{}\n\tfor podPair := range kl.podKillingCh {\n\t\trunningPod := podPair.RunningPod\n\t\tapiPod := podPair.APIPod\n\n\t\tlock.Lock()\n\t\texists := killing.Has(string(runningPod.ID))\n\t\tif !exists {\n\t\t\tkilling.Insert(string(runningPod.ID))\n\t\t}\n\t\tlock.Unlock()\n\n\t\tif !exists {\n\t\t\tgo func(apiPod *v1.Pod, runningPod *kubecontainer.Pod) {\n\t\t\t\tklog.V(2).Infof(\"Killing unwanted pod %q\", runningPod.Name)\n\t\t\t\t// 调用killPod函数处理\n\t\t\t\terr := kl.killPod(apiPod, runningPod, nil, nil)\n\t\t\t\tif err != nil {\n\t\t\t\t\tklog.Errorf(\"Failed killing the pod %q: %v\", runningPod.Name, err)\n\t\t\t\t}\n\t\t\t\tlock.Lock()\n\t\t\t\tkilling.Delete(string(runningPod.ID))\n\t\t\t\tlock.Unlock()\n\t\t\t}(apiPod, runningPod)\n\t\t}\n\t}\n}\n```\n\n### 5. 
总结\n\nKubelet 其实会进行2次删除。第一次是pod更新事件，带有deleteTimestamp，对应了 delete\n\n第二次是pod被删除了，执行了remove操作。\n\n通过一个实例来总结整个删除过程：下面是删除1个pod对应的kubelet日志和分析过程\n\n```\nroot:/home/zoux# tail -f /home/zoux/log/kubelet.stderr.log | grep 'SyncLoop' | grep test-pod2\n\n// 1、有了deleteTimestamp所以是删除事件\nI0310 11:44:02.640238  731714 kubelet.go:1932] SyncLoop (DELETE, \"api\"): \"test-pod2_default(9cf9dae8-5c99-43e1-8ff8-78e766e176db)\"\n\n// 2. SYNC 定时触发了一次同步\nI0310 11:44:03.768615  731714 kubelet.go:1980] SyncLoop (SYNC): 1 pods; test-pod2_default(9cf9dae8-5c99-43e1-8ff8-78e766e176db)\n\n// 3. 应该是业务容器died，然后触发了pleg的同步\nI0310 11:44:33.496057  731714 kubelet.go:1961] SyncLoop (PLEG): \"test-pod2_default(9cf9dae8-5c99-43e1-8ff8-78e766e176db)\", event: &pleg.PodLifecycleEvent{ID:\"9cf9dae8-5c99-43e1-8ff8-78e766e176db\", Type:\"ContainerDied\", Data:\"9a0d86b06f0558b73633e3e05efebc1bbd23f6c227e13e276552f96387aa2357\"}\n\n// 4. 应该是sandbox died，然后触发了pleg的同步\nI0310 11:44:33.496180  731714 kubelet.go:1961] SyncLoop (PLEG): \"test-pod2_default(9cf9dae8-5c99-43e1-8ff8-78e766e176db)\", event: &pleg.PodLifecycleEvent{ID:\"9cf9dae8-5c99-43e1-8ff8-78e766e176db\", Type:\"ContainerDied\", Data:\"4a65fcda2151b053f61ae08688fcf5adacc041ef672bf2d0fb31284882823e88\"}\n\n// 5.apiserver 触发了第一次同步，应该是1的更新status导致 （status还不同步）\nI0310 11:44:33.504168  731714 kubelet.go:1929] SyncLoop (RECONCILE, \"api\"): \"test-pod2_default(9cf9dae8-5c99-43e1-8ff8-78e766e176db)\"\n\n// 6. 
apiserver 触发了第2次同步，应该是2的更新status导致 （status还不同步）\nI0310 11:44:33.512140  731714 kubelet.go:1929] SyncLoop (RECONCILE, \"api\"): \"test-pod2_default(9cf9dae8-5c99-43e1-8ff8-78e766e176db)\"\n\n// 7.应该是容器都died了，这个时候status也同步了，但是有deleteTimestamp，所以又是一次delete\nI0310 11:44:34.533461  731714 kubelet.go:1932] SyncLoop (DELETE, \"api\"): \"test-pod2_default(9cf9dae8-5c99-43e1-8ff8-78e766e176db)\"\n\n// 8.etcd没有这个数据了，所以remove\nI0310 11:44:34.535965  731714 kubelet.go:1926] SyncLoop (REMOVE, \"api\"): \"test-pod2_default(9cf9dae8-5c99-43e1-8ff8-78e766e176db)\"\n```\n\n"
  },
  {
    "path": "k8s/kubelet/8-kubelet gc流程.md",
    "content": "* [1\\. 背景](#1-背景)\n* [2\\. StartGarbageCollection](#2-startgarbagecollection)\n* [3\\. container gc处理流程](#3-container-gc处理流程)\n  * [3\\.1 gc 参数设置](#31-gc-参数设置)\n  * [3\\.2 GarbageCollect](#32-garbagecollect)\n  * [3\\.3 containerGC\\.GarbageCollect](#33-containergcgarbagecollect)\n    * [3\\.2\\.1 移除需要驱逐的containers](#321-移除需要驱逐的containers)\n    * [3\\.2\\.2 移除sandboxes](#322-移除sandboxes)\n    * [3\\.2\\.3 回收log Directories](#323-回收log-directories)\n* [4\\. Image Gc处理流程](#4-image-gc处理流程)\n  * [4\\.1 freeSpace](#41-freespace)\n* [5 总结](#5-总结)\n\n### 1. 背景\n\n上文分析到，所有的容器都stop了，但是没有清理。这个清理工作就是GC做的。在kubelet初始化时的createAndInitKubelet函数中，开启了gc流程。接下来看看GC流程的处理逻辑。\n\n```\ncmd/kubelet/app/server.go\ncreateAndInitKubelet\n\nk.StartGarbageCollection()\n```\n\n<br>\n\n### 2. StartGarbageCollection\n\nStartGarbageCollection逻辑如下：\n\n（1）开启一个协程进行container的GC处理。间隔时间1分钟，ContainerGCPeriod=1 min\n\n（2）判断是否开启了image gc。HighThresholdPercent表示磁盘使用量超过多少开始GC\n\n（3）如果开启了image gc，开启一个协程进行image的GC处理。间隔时间5分钟，ImageGCPeriod=5 min\n\n可以看到，kubelet GC的核心就是container GC和image GC\n\n```\n// StartGarbageCollection starts garbage collection threads.\nfunc (kl *Kubelet) StartGarbageCollection() {\n\tloggedContainerGCFailure := false\n\t\n\t// 1.开启一个协程进行container的GC处理。间隔时间1分钟，ContainerGCPeriod=1 min\n\tgo wait.Until(func() {\n\t\tif err := kl.containerGC.GarbageCollect(); err != nil {\n\t\t\tklog.Errorf(\"Container garbage collection failed: %v\", err)\n\t\t\tkl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ContainerGCFailed, err.Error())\n\t\t\tloggedContainerGCFailure = true\n\t\t} else {\n\t\t\tvar vLevel klog.Level = 4\n\t\t\tif loggedContainerGCFailure {\n\t\t\t\tvLevel = 1\n\t\t\t\tloggedContainerGCFailure = false\n\t\t\t}\n\n\t\t\tklog.V(vLevel).Infof(\"Container garbage collection succeeded\")\n\t\t}\n\t}, ContainerGCPeriod, wait.NeverStop)\n  \n  \n \n\t// when the high threshold is set to 100, stub the image GC manager\n\t// 2.判断是否开启了image 
gc。HighThresholdPercent表示磁盘使用量超过多少开始GC。\n\tif kl.kubeletConfiguration.ImageGCHighThresholdPercent == 100 {\n\t\tklog.V(2).Infof(\"ImageGCHighThresholdPercent is set 100, Disable image GC\")\n\t\treturn\n\t}\n  \n  // 3.开启image gc\n\tprevImageGCFailed := false\n\tgo wait.Until(func() {\n\t\tif err := kl.imageManager.GarbageCollect(); err != nil {\n\t\t\tif prevImageGCFailed {\n\t\t\t\tklog.Errorf(\"Image garbage collection failed multiple times in a row: %v\", err)\n\t\t\t\t// Only create an event for repeated failures\n\t\t\t\tkl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.ImageGCFailed, err.Error())\n\t\t\t} else {\n\t\t\t\tklog.Errorf(\"Image garbage collection failed once. Stats initialization may not have completed yet: %v\", err)\n\t\t\t}\n\t\t\tprevImageGCFailed = true\n\t\t} else {\n\t\t\tvar vLevel klog.Level = 4\n\t\t\tif prevImageGCFailed {\n\t\t\t\tvLevel = 1\n\t\t\t\tprevImageGCFailed = false\n\t\t\t}\n\n\t\t\tklog.V(vLevel).Infof(\"Image garbage collection succeeded\")\n\t\t}\n\t}, ImageGCPeriod, wait.NeverStop)\n}\n```\n\n<br>\n\n### 3. 
container gc处理流程\n\n#### 3.1 gc 参数设置\n\nContainerGCPolicy 结构如下：\n\n**MinAge** 对应kubelet的启动参数 `--minimum-container-ttl-duration`，表示已经退出的容器可以存活的最小时间，默认为 0s。\n\n**MaxPerPodContainer** 对应kubelet的启动参数 `--maximum-dead-containers-per-container`，表示一个 pod 最多可以保存多少个已经停止的容器，默认为1；\n\n**MaxContainers** 对应kubelet的启动参数 `--maximum-dead-containers`，表示一个 node 上最多可以保留多少个已经停止的容器，默认为 -1，表示没有限制；\n\n```\n// Specified a policy for garbage collecting containers.\ntype ContainerGCPolicy struct {\n\t// Minimum age at which a container can be garbage collected, zero for no limit.\n\tMinAge time.Duration\n\n\t// Max number of dead containers any single pod (UID, container name) pair is\n\t// allowed to have, less than zero for no limit.\n\tMaxPerPodContainer int\n\n\t// Max number of total dead containers, less than zero for no limit.\n\tMaxContainers int\n}\n```\n\n<br>\n\n#### 3.2 GarbageCollect\n\n调用链为：GarbageCollect -> runtime.GarbageCollect -> containerGC.GarbageCollect\n\n```\nfunc (cgc *realContainerGC) GarbageCollect() error {\n\treturn cgc.runtime.GarbageCollect(cgc.policy, cgc.sourcesReadyProvider.AllReady(), false)\n}\n\n// GarbageCollect removes dead containers using the specified container gc policy.\nfunc (m *kubeGenericRuntimeManager) GarbageCollect(gcPolicy kubecontainer.ContainerGCPolicy, allSourcesReady bool, evictNonDeletedPods bool) error {\n\treturn m.containerGC.GarbageCollect(gcPolicy, allSourcesReady, evictNonDeletedPods)\n}\n```\n\n<br>\n\n#### 3.3 containerGC.GarbageCollect\n\n该函数主要逻辑为：\n（1）移除需要驱逐的containers\n\n* 得到所有需要驱逐的containers，即非running且退出时间已超过MinAge的容器\n\n* 根据参数MaxPerPodContainer，保留最新的MaxPerPodContainer个containers，其他的都要驱逐\n\n* 根据MaxContainers参数，保留最新的MaxContainers个containers，其他的按照创建时间依次驱逐\n\n（2）移除sandboxes\n\n* 获取 node 上所有的 container以及所有的 sandboxes\n\n* 收集所有 container 的 PodSandboxId， 构建 sandboxes 与 pod 的对应关系并将其保存在 sandboxesByPodUID 中\n\n* 遍历 sandboxesByPod，若 sandboxes 所在的 pod 处于 deleted 状态，则删除该 pod 中所有的 sandboxes ；否则只保留退出时间最短的一个 sandboxes\n\n（3）回收log 
Directories\n\n* 首先回收 deleted 状态 pod logs dir，遍历 pod logs dir /var/log/pods，/var/log/pods 为 pod logs 的默认目录，pod logs dir 的格式为 /var/log/pods/NAMESPACE_NAME_UID，解析 pod logs dir 获取 pod uid，判断 pod 是否处于 deleted 状态，若处于 deleted 状态则删除其 logs dir； \n\n* 回收 deleted 状态 container logs 链接目录，/var/log/containers 为 container log 的默认目录，其会软链接到 pod 的 log dir 下，例如： /var/log/containers/storage-provisioner_kube-system_storage-provisioner-acc8386e409dfb3cc01618cbd14c373d8ac6d7f0aaad9ced018746f31d0081e2.log -> /var/log/pods/kube-system_storage-provisioner_b448e496-eb5d-4d71-b93f-ff7ff77d2348/storage-provisioner/0.log\n\n```\n// GarbageCollect removes dead containers using the specified container gc policy.\n// Note that gc policy is not applied to sandboxes. Sandboxes are only removed when they are\n// not ready and containing no containers.\n//\n// GarbageCollect consists of the following steps:\n// * gets evictable containers which are not active and created more than gcPolicy.MinAge ago.\n// * removes oldest dead containers for each pod by enforcing gcPolicy.MaxPerPodContainer.\n// * removes oldest dead containers by enforcing gcPolicy.MaxContainers.\n// * gets evictable sandboxes which are not ready and contains no containers.\n// * removes evictable sandboxes.\nfunc (cgc *containerGC) GarbageCollect(gcPolicy kubecontainer.ContainerGCPolicy, allSourcesReady bool, evictTerminatedPods bool) error {\n   errors := []error{}\n   // Remove evictable containers\n   if err := cgc.evictContainers(gcPolicy, allSourcesReady, evictTerminatedPods); err != nil {\n      errors = append(errors, err)\n   }\n\n   // Remove sandboxes with zero containers\n   if err := cgc.evictSandboxes(evictTerminatedPods); err != nil {\n      errors = append(errors, err)\n   }\n\n   // Remove pod sandbox log directory\n   if err := cgc.evictPodLogsDirectories(allSourcesReady); err != nil {\n      errors = append(errors, err)\n   }\n   return utilerrors.NewAggregate(errors)\n}\n```\n\n<br>\n\n##### 3.2.1 
移除需要驱逐的containers\n\n（1）得到所有需要驱逐的containers，即非running且退出时间已超过MinAge的容器\n\n（2）根据参数MaxPerPodContainer，保留最新的MaxPerPodContainer个containers，其他的都要驱逐\n\n（3）根据MaxContainers参数，保留最新的MaxContainers个containers，其他的按照创建时间依次驱逐\n\n```\n// evict all containers that are evictable\nfunc (cgc *containerGC) evictContainers(gcPolicy kubecontainer.ContainerGCPolicy, allSourcesReady bool, evictTerminatedPods bool) error {\n\t// Separate containers by evict units.\n\t// 1.得到所有需要驱逐的containers，即非running且退出时间已超过MinAge的容器\n\tevictUnits, err := cgc.evictableContainers(gcPolicy.MinAge)\n\tif err != nil {\n\t\treturn err\n\t}\n  \n  // 2.根据配置参数，保留最新的N个containers，其他的都要驱逐\n\t// Remove deleted pod containers if all sources are ready.\n\tif allSourcesReady {\n\t\tfor key, unit := range evictUnits {\n\t\t\tif cgc.podStateProvider.IsPodDeleted(key.uid) || (cgc.podStateProvider.IsPodTerminated(key.uid) && evictTerminatedPods) {\n\t\t\t\tcgc.removeOldestN(unit, len(unit)) // Remove all.\n\t\t\t\tdelete(evictUnits, key)\n\t\t\t}\n\t\t}\n\t}\n\n\t// Enforce max containers per evict unit.\n\tif gcPolicy.MaxPerPodContainer >= 0 {\n\t\tcgc.enforceMaxContainersPerEvictUnit(evictUnits, gcPolicy.MaxPerPodContainer)\n\t}\n\n\t// Enforce max total number of containers.\n\t// 3.根据MaxContainers参数，超出部分按创建时间从旧到新驱逐\n\tif gcPolicy.MaxContainers >= 0 && evictUnits.NumContainers() > gcPolicy.MaxContainers {\n\t\t// Leave an equal number of containers per evict unit (min: 1).\n\t\tnumContainersPerEvictUnit := gcPolicy.MaxContainers / evictUnits.NumEvictUnits()\n\t\tif numContainersPerEvictUnit < 1 {\n\t\t\tnumContainersPerEvictUnit = 1\n\t\t}\n\t\tcgc.enforceMaxContainersPerEvictUnit(evictUnits, numContainersPerEvictUnit)\n\n\t\t// If we still need to evict, evict oldest first.\n\t\tnumContainers := evictUnits.NumContainers()\n\t\tif numContainers > gcPolicy.MaxContainers {\n\t\t\tflattened := make([]containerGCInfo, 0, numContainers)\n\t\t\tfor key := range evictUnits {\n\t\t\t\tflattened = append(flattened, 
evictUnits[key]...)\n\t\t\t}\n\t\t\tsort.Sort(byCreated(flattened))\n\n\t\t\tcgc.removeOldestN(flattened, numContainers-gcPolicy.MaxContainers)\n\t\t}\n\t}\n\treturn nil\n}\n```\n\n##### 3.2.2 移除sandboxes\n\n该函数逻辑如下：\n\n（1）获取 node 上所有的 container以及所有的 sandboxes\n\n （2）收集所有 container 的 PodSandboxId， 构建 sandboxes 与 pod 的对应关系并将其保存在 sandboxesByPodUID 中\n\n （3）遍历 sandboxesByPod，若 sandboxes 所在的 pod 处于 deleted 状态，则删除该 pod 中所有的 sandboxes ；否则只保留退出时间最短的一个 sandboxes\n\n```\n// evictSandboxes remove all evictable sandboxes. An evictable sandbox must\n// meet the following requirements:\n//   1. not in ready state\n//   2. contains no containers.\n//   3. belong to a non-existent (i.e., already removed) pod, or is not the\n//      most recently created sandbox for the pod.\nfunc (cgc *containerGC) evictSandboxes(evictTerminatedPods bool) error {\n\tcontainers, err := cgc.manager.getKubeletContainers(true)\n\tif err != nil {\n\t\treturn err\n\t}\n\n\tsandboxes, err := cgc.manager.getKubeletSandboxes(true)\n\tif err != nil {\n\t\treturn err\n\t}\n\n\t// collect all the PodSandboxId of container\n\tsandboxIDs := sets.NewString()\n\tfor _, container := range containers {\n\t\tsandboxIDs.Insert(container.PodSandboxId)\n\t}\n\n\tsandboxesByPod := make(sandboxesByPodUID)\n\tfor _, sandbox := range sandboxes {\n\t\tpodUID := types.UID(sandbox.Metadata.Uid)\n\t\tsandboxInfo := sandboxGCInfo{\n\t\t\tid:         sandbox.Id,\n\t\t\tcreateTime: time.Unix(0, sandbox.CreatedAt),\n\t\t}\n\n\t\t// Set ready sandboxes to be active.\n\t\tif sandbox.State == runtimeapi.PodSandboxState_SANDBOX_READY {\n\t\t\tsandboxInfo.active = true\n\t\t}\n\n\t\t// Set sandboxes that still have containers to be active.\n\t\tif sandboxIDs.Has(sandbox.Id) {\n\t\t\tsandboxInfo.active = true\n\t\t}\n\n\t\tsandboxesByPod[podUID] = append(sandboxesByPod[podUID], sandboxInfo)\n\t}\n\n\t// Sort the sandboxes by age.\n\tfor uid := range sandboxesByPod {\n\t\tsort.Sort(sandboxByCreated(sandboxesByPod[uid]))\n\t}\n\n\tfor 
podUID, sandboxes := range sandboxesByPod {\n\t\tif cgc.podStateProvider.IsPodDeleted(podUID) || (cgc.podStateProvider.IsPodTerminated(podUID) && evictTerminatedPods) {\n\t\t\t// Remove all evictable sandboxes if the pod has been removed.\n\t\t\t// Note that the latest dead sandbox is also removed if there is\n\t\t\t// already an active one.\n\t\t\tcgc.removeOldestNSandboxes(sandboxes, len(sandboxes))\n\t\t} else {\n\t\t\t// Keep latest one if the pod still exists.\n\t\t\tcgc.removeOldestNSandboxes(sandboxes, len(sandboxes)-1)\n\t\t}\n\t}\n\treturn nil\n}\n```\n\n<br>\n\n##### 3.2.3 回收log Directories\n\n该方法会回收所有可回收 pod 以及 container 的 log dir，其主要逻辑为：\n （1）首先回收 deleted 状态 pod logs dir，遍历 pod logs dir /var/log/pods，/var/log/pods 为 pod logs 的默认目录，pod logs dir 的格式为 /var/log/pods/NAMESPACE_NAME_UID，解析 pod logs dir 获取 pod uid，判断 pod 是否处于 deleted 状态，若处于 deleted 状态则删除其 logs dir； \n\n（2）回收 deleted 状态 container logs 链接目录，/var/log/containers 为 container log 的默认目录，其会软链接到 pod 的 log dir 下，例如： /var/log/containers/storage-provisioner_kube-system_storage-provisioner-acc8386e409dfb3cc01618cbd14c373d8ac6d7f0aaad9ced018746f31d0081e2.log -> /var/log/pods/kube-system_storage-provisioner_b448e496-eb5d-4d71-b93f-ff7ff77d2348/storage-provisioner/0.log\n\n```\n// evictPodLogsDirectories evicts all evictable pod logs directories. 
Pod logs directories\n// are evictable if there are no corresponding pods.\nfunc (cgc *containerGC) evictPodLogsDirectories(allSourcesReady bool) error {\n\tosInterface := cgc.manager.osInterface\n\tif allSourcesReady {\n\t\t// Only remove pod logs directories when all sources are ready.\n\t\tdirs, err := osInterface.ReadDir(podLogsRootDirectory)\n\t\tif err != nil {\n\t\t\treturn fmt.Errorf(\"failed to read podLogsRootDirectory %q: %v\", podLogsRootDirectory, err)\n\t\t}\n\t\tfor _, dir := range dirs {\n\t\t\tname := dir.Name()\n\t\t\tpodUID := parsePodUIDFromLogsDirectory(name)\n\t\t\tif !cgc.podStateProvider.IsPodDeleted(podUID) {\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\terr := osInterface.RemoveAll(filepath.Join(podLogsRootDirectory, name))\n\t\t\tif err != nil {\n\t\t\t\tklog.Errorf(\"Failed to remove pod logs directory %q: %v\", name, err)\n\t\t\t}\n\t\t}\n\t}\n\n\t// Remove dead container log symlinks.\n\t// TODO(random-liu): Remove this after cluster logging supports CRI container log path.\n\tlogSymlinks, _ := osInterface.Glob(filepath.Join(legacyContainerLogsDir, fmt.Sprintf(\"*.%s\", legacyLogSuffix)))\n\tfor _, logSymlink := range logSymlinks {\n\t\tif _, err := osInterface.Stat(logSymlink); os.IsNotExist(err) {\n\t\t\terr := osInterface.Remove(logSymlink)\n\t\t\tif err != nil {\n\t\t\t\tklog.Errorf(\"Failed to remove container log dead symlink %q: %v\", logSymlink, err)\n\t\t\t}\n\t\t}\n\t}\n\treturn nil\n}\n```\n\n<br>\n\n### 4. 
Image Gc处理流程\n\n该函数逻辑\n\n（1）获取容器镜像存储目录挂载点文件系统的磁盘信息\n\n（2）若当前使用率大于 HighThresholdPercent，此时需要回收镜像\n\n（3）调用 im.freeSpace 回收未使用的镜像信息\n\n```\nfunc (im *realImageGCManager) GarbageCollect() error {\n\t// Get disk usage on disk holding images.\n\t// 1.获取容器镜像存储目录挂载点文件系统的磁盘信息\n\tfsStats, err := im.statsProvider.ImageFsStats()\n\tif err != nil {\n\t\treturn err\n\t}\n\n\tvar capacity, available int64\n\tif fsStats.CapacityBytes != nil {\n\t\tcapacity = int64(*fsStats.CapacityBytes)\n\t}\n\tif fsStats.AvailableBytes != nil {\n\t\tavailable = int64(*fsStats.AvailableBytes)\n\t}\n\n\tif available > capacity {\n\t\tklog.Warningf(\"available %d is larger than capacity %d\", available, capacity)\n\t\tavailable = capacity\n\t}\n\n\t// Check valid capacity.\n\tif capacity == 0 {\n\t\terr := goerrors.New(\"invalid capacity 0 on image filesystem\")\n\t\tim.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.InvalidDiskCapacity, err.Error())\n\t\treturn err\n\t}\n\n\t// If over the max threshold, free enough to place us at the lower threshold.\n\t// 2.若当前使用率大于 HighThresholdPercent，此时需要回收镜像\n\tusagePercent := 100 - int(available*100/capacity)\n\tif usagePercent >= im.policy.HighThresholdPercent {\n\t\tamountToFree := capacity*int64(100-im.policy.LowThresholdPercent)/100 - available\n\t\tklog.Infof(\"[imageGCManager]: Disk usage on image filesystem is at %d%% which is over the high threshold (%d%%). Trying to free %d bytes down to the low threshold (%d%%).\", usagePercent, im.policy.HighThresholdPercent, amountToFree, im.policy.LowThresholdPercent)\n\t\t// 3. 调用 im.freeSpace 回收未使用的镜像信息\n\t\tfreed, err := im.freeSpace(amountToFree, time.Now())\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\n\t\tif freed < amountToFree {\n\t\t\terr := fmt.Errorf(\"failed to garbage collect required amount of images. 
Wanted to free %d bytes, but freed %d bytes\", amountToFree, freed)\n\t\t\tim.recorder.Eventf(im.nodeRef, v1.EventTypeWarning, events.FreeDiskSpaceFailed, err.Error())\n\t\t\treturn err\n\t\t}\n\t}\n\n\treturn nil\n}\n```\n\n<br>\n\n#### 4.1 freeSpace\n\n该函数的主要逻辑：\n （1）获取已经使用的 images 列表\n\n （2）获取所有未使用的 images 列表\n\n （3）按镜像最近使用时间进行排序\n\n （4）从旧到新，依次回收，达到了需要释放的空间，就停止\n\n```\n// Tries to free bytesToFree worth of images on the disk.\n//\n// Returns the number of bytes free and an error if any occurred. The number of\n// bytes freed is always returned.\n// Note that error may be nil and the number of bytes free may be less\n// than bytesToFree.\nfunc (im *realImageGCManager) freeSpace(bytesToFree int64, freeTime time.Time) (int64, error) {\n  // 1.获取已经使用的 images 列表\n\timagesInUse, err := im.detectImages(freeTime)\n\tif err != nil {\n\t\treturn 0, err\n\t}\n\n\tim.imageRecordsLock.Lock()\n\tdefer im.imageRecordsLock.Unlock()\n\n\t// Get all images in eviction order.\n\t// 2.获取所有未使用的 images列表\n\timages := make([]evictionInfo, 0, len(im.imageRecords))\n\tfor image, record := range im.imageRecords {\n\t\tif isImageUsed(image, imagesInUse) {\n\t\t\tklog.V(5).Infof(\"Image ID %s is being used\", image)\n\t\t\tcontinue\n\t\t}\n\t\timages = append(images, evictionInfo{\n\t\t\tid:          image,\n\t\t\timageRecord: *record,\n\t\t})\n\t}\n\t// 3.按镜像最近使用时间进行排序\n\tsort.Sort(byLastUsedAndDetected(images))\n  \n  // 4.从旧到新，依次回收，达到了需要释放的空间，就停止\n\t// Delete unused images until we've freed up enough space.\n\tvar deletionErrors []error\n\tspaceFreed := int64(0)\n\tfor _, image := range images {\n\t\tklog.V(5).Infof(\"Evaluating image ID %s for possible garbage collection\", image.id)\n\t\t// Images that are currently in used were given a newer lastUsed.\n\t\tif image.lastUsed.Equal(freeTime) || image.lastUsed.After(freeTime) {\n\t\t\tklog.V(5).Infof(\"Image ID %s has lastUsed=%v which is >= freeTime=%v, not eligible for garbage collection\", image.id, image.lastUsed, 
freeTime)\n\t\t\tcontinue\n\t\t}\n\n\t\t// Avoid garbage collect the image if the image is not old enough.\n\t\t// In such a case, the image may have just been pulled down, and will be used by a container right away.\n\n\t\tif freeTime.Sub(image.firstDetected) < im.policy.MinAge {\n\t\t\tklog.V(5).Infof(\"Image ID %s has age %v which is less than the policy's minAge of %v, not eligible for garbage collection\", image.id, freeTime.Sub(image.firstDetected), im.policy.MinAge)\n\t\t\tcontinue\n\t\t}\n\n\t\t// Remove image. Continue despite errors.\n\t\tklog.Infof(\"[imageGCManager]: Removing image %q to free %d bytes\", image.id, image.size)\n\t\terr := im.runtime.RemoveImage(container.ImageSpec{Image: image.id})\n\t\tif err != nil {\n\t\t\tdeletionErrors = append(deletionErrors, err)\n\t\t\tcontinue\n\t\t}\n\t\tdelete(im.imageRecords, image.id)\n\t\tspaceFreed += image.size\n\n\t\tif spaceFreed >= bytesToFree {\n\t\t\tbreak\n\t\t}\n\t}\n\n\tif len(deletionErrors) > 0 {\n\t\treturn spaceFreed, fmt.Errorf(\"wanted to free %d bytes, but freed %d bytes space with errors in image deletion: %v\", bytesToFree, spaceFreed, errors.NewAggregate(deletionErrors))\n\t}\n\treturn spaceFreed, nil\n}\n```\n\n<br>\n\n### 5 总结\n\n**容器gc的清理逻辑：**\n\n（1）移除需要驱逐的containers\n\n* 得到所有需要驱逐的containers：非running且已退出超过MinAge时间的\n\n* 根据MaxPerPodContainer参数，每个pod保留最新的MaxPerPodContainer个containers，其他的都要驱逐\n\n* 根据MaxContainers参数，保留最新的MaxContainers个containers，其他的按照创建时间依次驱逐\n\n（2）移除sandboxes\n\n* 获取 node 上所有的 container 以及所有的 sandboxes\n\n* 收集所有 container 的 PodSandboxId，构建 sandboxes 与 pod 的对应关系并将其保存在 sandboxesByPodUID 中\n\n* 遍历 sandboxesByPod，若 sandboxes 所在的 pod 处于 deleted 状态，则删除该 pod 中所有的 sandboxes；否则只保留最近退出的一个 sandbox\n\n（3）回收log Directories\n\n* 首先回收 deleted 状态 pod 的 logs dir：遍历 /var/log/pods（pod logs 的默认目录），pod logs dir 的格式为 /var/log/pods/NAMESPACE_NAME_UID，解析 pod logs dir 获取 pod uid，判断 pod 是否处于 deleted 状态，若是则删除其 logs dir；\n\n* 回收 deleted 状态 container logs 
链接目录，/var/log/containers 为 container log 的默认目录，其会软链接到 pod 的 log dir 下，例如： /var/log/containers/storage-provisioner_kube-system_storage-provisioner-acc8386e409dfb3cc01618cbd14c373d8ac6d7f0aaad9ced018746f31d0081e2.log -> /var/log/pods/kube-system_storage-provisioner_b448e496-eb5d-4d71-b93f-ff7ff77d2348/storage-provisioner/0.log\n\n<br>\n\n**镜像gc的清理逻辑：**\n\n（1）获取容器镜像存储目录挂载点文件系统的磁盘信息\n\n（2）若当前使用率大于 HighThresholdPercent，此时需要回收镜像\n\n（3）调用 im.freeSpace 回收未使用的镜像信息\n\n* 获取已经使用的 images 列表，然后过滤出未使用的 images 列表\n\n* 将未使用的 images 按镜像最近使用时间进行排序\n\n* 从旧到新依次回收，达到需要释放的空间后停止\n\n"
  },
  {
    "path": "k8s/kubelet/9-kubelet驱逐源码分析.md",
    "content": "- [1. 关键调用链路](#1-------)\n- [2. initializeRuntimeDependentModules](#2-initializeruntimedependentmodules)\n- [3. evictionManager.Start](#3-evictionmanagerstart)\n  * [3.1 synchronize](#31-synchronize)\n    + [3.1.1 summaryProvider.Get(updateStats)](#311-summaryproviderget-updatestats-)\n    + [3.1.2 makeSignalObservations](#312-makesignalobservations)\n  * [3.2 waitForPodsCleanup](#32-waitforpodscleanup)\n- [4. 总结](#4---)\n\nk8s版本信息：v1.17.4\n\n### 1. 关键调用链路\n\n![image-20220812170846761](../images/kubeletEvict.png)\n\n### 2. initializeRuntimeDependentModules\n\n省略kubelet->run->updateRuntimeUp，直接从initializeRuntimeDependentModules开始分析。\n\ninitializeRuntimeDependentModules的核心逻辑就是启动evictionManager和其他相关组件。该函数只在kubelet启动时执行一次。\n\n这里关注的核心函数：\n\n（1）启动cadvisor\n\n（2）启动containerManager\n\n（3）启动evictionManager\n\n因为evictionManager需要的数据来源于cadvisor，所以必须等cadvisor启动完成后再启动evictionManager。\n\n```\n// initializeRuntimeDependentModules will initialize internal modules that require the container runtime to be up.\nfunc (kl *Kubelet) initializeRuntimeDependentModules() {\n  // 1. 
启动cadvisor\n\tif err := kl.cadvisor.Start(); err != nil {\n\t\t// Fail kubelet and rely on the babysitter to retry starting kubelet.\n\t\t// TODO(random-liu): Add backoff logic in the babysitter\n\t\tklog.Fatalf(\"Failed to start cAdvisor %v\", err)\n\t}\n\n\t// trigger on-demand stats collection once so that we have capacity information for ephemeral storage.\n\t// ignore any errors, since if stats collection is not successful, the container manager will fail to start below.\n\tkl.StatsProvider.GetCgroupStats(\"/\", true)\n\t// Start container manager.\n\tnode, err := kl.getNodeAnyWay()\n\tif err != nil {\n\t\t// Fail kubelet and rely on the babysitter to retry starting kubelet.\n\t\tklog.Fatalf(\"Kubelet failed to get node info: %v\", err)\n\t}\n\t\n\t// 2.启动containerManager\n\t// containerManager must start after cAdvisor because it needs filesystem capacity information\n\tif err := kl.containerManager.Start(node, kl.GetActivePods, kl.sourcesReady, kl.statusManager, kl.runtimeService); err != nil {\n\t\t// Fail kubelet and rely on the babysitter to retry starting kubelet.\n\t\tklog.Fatalf(\"Failed to start ContainerManager %v\", err)\n\t}\n\t\n\t// 3.启动evictionManager\n\t// eviction manager must start after cadvisor because it needs to know if the container runtime has a dedicated imagefs\n\tkl.evictionManager.Start(kl.StatsProvider, kl.GetActivePods, kl.podResourcesAreReclaimed, evictionMonitoringPeriod)\n\n\t...\n}\n```\n\n### 3. 
evictionManager.Start\n\n核心逻辑如下：\n\n（1）是否利用kernel memcg notification机制。默认不启用，可以通过--kernel-memcg-notification参数开启\n\nkubelet 定期通过 cadvisor 接口采集节点内存使用数据，当节点短时间内内存使用率突增时，kubelet 无法及时感知，也不会产生 MemoryPressure 相关事件，但依然会触发 OOMKiller 停止容器。可以通过为 kubelet 配置 `--kernel-memcg-notification` 参数启用 memcg api，当 memory 使用率触发阈值时 memcg 会主动进行通知；\n\nmemcg 主动通知的功能是 cgroup 中已有的，kubelet 会在 `/sys/fs/cgroup/memory/cgroup.event_control` 文件中写入 memory.available 的阈值，而阈值与 inactive_file 文件的大小有关系，kubelet 也会定期更新阈值，当 memcg 使用率达到配置的阈值后会主动通知 kubelet，kubelet 通过 epoll 机制来接收通知。\n\n这个暂时先了解一下，不做深入。\n\n（2）循环调用synchronize和waitForPodsCleanup来驱逐清理pod。循环间隔monitoringInterval默认为10s\n\n```\nkl.evictionManager.Start(kl.StatsProvider, kl.GetActivePods, kl.podResourcesAreReclaimed, evictionMonitoringPeriod)\n\n// Start starts the control loop to observe and response to low compute resources.\nfunc (m *managerImpl) Start(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc, podCleanedUpFunc PodCleanedUpFunc, monitoringInterval time.Duration) {\n\tthresholdHandler := func(message string) {\n\t\tklog.Infof(message)\n\t\tm.synchronize(diskInfoProvider, podFunc)\n\t}\n\t// 1.是否利用kernel memcg notification机制。默认不启用，可以通过--kernel-memcg-notification参数开启\n\tif m.config.KernelMemcgNotification {\n\t\tfor _, threshold := range m.config.Thresholds {\n\t\t\tif threshold.Signal == evictionapi.SignalMemoryAvailable || threshold.Signal == evictionapi.SignalAllocatableMemoryAvailable {\n\t\t\t\tnotifier, err := NewMemoryThresholdNotifier(threshold, m.config.PodCgroupRoot, &CgroupNotifierFactory{}, thresholdHandler)\n\t\t\t\tif err != nil {\n\t\t\t\t\tklog.Warningf(\"eviction manager: failed to create memory threshold notifier: %v\", err)\n\t\t\t\t} else {\n\t\t\t\t\tgo notifier.Start()\n\t\t\t\t\tm.thresholdNotifiers = append(m.thresholdNotifiers, notifier)\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\n\t// 2.循环调用synchronize和waitForPodsCleanup来驱逐清理pod。循环间隔monitoringInterval默认为10s\n\t// start the eviction manager monitoring\n\tgo 
func() {\n\t\tfor {\n\t\t\tif evictedPods := m.synchronize(diskInfoProvider, podFunc); evictedPods != nil {\n\t\t\t\tklog.Infof(\"eviction manager: pods %s evicted, waiting for pod to be cleaned up\", format.Pods(evictedPods))\n\t\t\t\tm.waitForPodsCleanup(podCleanedUpFunc, evictedPods)\n\t\t\t} else {\n\t\t\t\ttime.Sleep(monitoringInterval)\n\t\t\t}\n\t\t}\n\t}()\n}\n```\n\n#### 3.1 synchronize\n\n核心逻辑：\n\n（1）得到该节点所有activePods（即所有Pod去掉status.Phase == v1.PodFailed || status.Phase == v1.PodSucceeded || (pod.DeletionTimestamp != nil && notRunning(status.ContainerStatuses))的Pod）\n\n（2）从cadvisor获取node、pod的资源统计信息（重要环节）\n\n（3）从统计数据中获得节点资源的使用情况observations\n\n（4）将资源实际使用量和资源容量进行比较，得到已触发的阈值结构体对象列表。举例来说：设置了pid、mem、fs三个thresholds，但通过观察可能只有mem这一个达到了驱逐阈值\n\n（5）再加上最小强制回收值（防止反复驱逐），算出最终有哪些阈值被触发。可以参考[最小强制回收](https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#minimum-eviction-reclaim)\n\n（6）记录每个阈值第一次被触发的时间。因为软驱逐有容忍时间，对软驱逐而言，过了容忍期仍然超过阈值时才会驱逐\n\n（7）回收节点级的资源，如果回收的资源足够，直接返回，不需要驱逐正在运行中的pod\n\n（8）不同阈值驱逐场景下pod有不同的排序，比如mem驱逐时按照request、limit的qos进行排序驱逐\n\n（9）按照排序后的结果每次驱逐一个pod，每个Pod的annotation会带有为什么被驱逐的关键信息，日志也会打印klog.Infof(\"eviction manager: pod %s is evicted successfully\", format.Pod(pod))\n\n```\n// synchronize is the main control loop that enforces eviction thresholds.\n// Returns the pod that was killed, or nil if no pod was killed.\nfunc (m *managerImpl) synchronize(diskInfoProvider DiskInfoProvider, podFunc ActivePodsFunc) []*v1.Pod {\n\t// if we have nothing to do, just return\n\t// 这个基本不会满足。\n\t// 条件1：thresholds包含evictionHard、evictionSoft等配置在内，有默认值，所以不会为空\n\t// 条件2：只有在没有设置thresholds且未开启LocalStorageCapacityIsolation（本地临时存储以及emptyDir卷的sizeLimit属性）时才跳过同步（基本不会）\n\tthresholds := m.config.Thresholds\n\tif len(thresholds) == 0 && !utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) {\n\t\treturn nil\n\t}\n\n\tklog.V(3).Infof(\"eviction manager: synchronize housekeeping\")\n\t// build the ranking functions (if not yet known)\n\t// TODO: 
have a function in cadvisor that lets us know if global housekeeping has completed\n\tif m.dedicatedImageFs == nil {\n\t\thasImageFs, ok := diskInfoProvider.HasDedicatedImageFs()\n\t\tif ok != nil {\n\t\t\treturn nil\n\t\t}\n\t\tm.dedicatedImageFs = &hasImageFs\n\t\tm.signalToRankFunc = buildSignalToRankFunc(hasImageFs)\n\t\tm.signalToNodeReclaimFuncs = buildSignalToNodeReclaimFuncs(m.imageGC, m.containerGC, hasImageFs)\n\t}\n\t\n\t// 1. 得到该节点所有activePods(得到所有Pod然后去掉了status.Phase == v1.PodFailed || status.Phase == v1.PodSucceeded || (pod.DeletionTimestamp != nil && notRunning(status.ContainerStatuses)))\n\tactivePods := podFunc()\n\tupdateStats := true\n\t// 2. 从cadvisor获取详细信息，就是node, pod的资源统计信息-（重要环节）\n\tsummary, err := m.summaryProvider.Get(updateStats)\n\tif err != nil {\n\t\tklog.Errorf(\"eviction manager: failed to get summary stats: %v\", err)\n\t\treturn nil\n\t}\n\t\n\t// 之前内核notify相关，一般不开启，这里忽略\n\tif m.clock.Since(m.thresholdsLastUpdated) > notifierRefreshInterval {\n\t\tm.thresholdsLastUpdated = m.clock.Now()\n\t\tfor _, notifier := range m.thresholdNotifiers {\n\t\t\tif err := notifier.UpdateThreshold(summary); err != nil {\n\t\t\t\tklog.Warningf(\"eviction manager: failed to update %s: %v\", notifier.Description(), err)\n\t\t\t}\n\t\t}\n\t}\n\t\n\t// 3. 从统计数据中获得节点资源的使用情况observations\n\t// make observations and get a function to derive pod usage stats relative to those observations.\n\tobservations, statsFunc := makeSignalObservations(summary)\n\tdebugLogObservations(\"observations\", observations)\n\t\n\t// 4. 将资源实际使用量和资源容量进行比较，最终得到阈值结构体对象的列表。举例来说就是，我设置了pid, mem, fs三个thresholds,但是通过观察，可能就是mem这一个限制达到了驱逐阈值\n\t// determine the set of thresholds met independent of grace period\n\tthresholds = thresholdsMet(thresholds, observations, false)\n\tdebugLogThresholdsWithObservation(\"thresholds - ignoring grace period\", thresholds, observations)\n  \n  // 5. 
加上enforceMinReclaim最小强制回收资源值。https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#minimum-eviction-reclaim\n\t// determine the set of thresholds previously met that have not yet satisfied the associated min-reclaim\n\tif len(m.thresholdsMet) > 0 {\n\t\tthresholdsNotYetResolved := thresholdsMet(m.thresholdsMet, observations, true)\n\t\tthresholds = mergeThresholds(thresholds, thresholdsNotYetResolved)\n\t}\n\tdebugLogThresholdsWithObservation(\"thresholds - reclaim not satisfied\", thresholds, observations)\n\t\n\t// 6.记录每个限制第一次驱逐时间，因为软驱逐会有时间容忍，所以对于软驱逐而言，过来容忍期还是超了阈值，这个时候就要驱逐\n\t// track when a threshold was first observed\n\tnow := m.clock.Now()\n\tthresholdsFirstObservedAt := thresholdsFirstObservedAt(thresholds, m.thresholdsFirstObservedAt, now)\n\n\t// the set of node conditions that are triggered by currently observed thresholds\n\tnodeConditions := nodeConditions(thresholds)\n\tif len(nodeConditions) > 0 {\n\t\tklog.V(3).Infof(\"eviction manager: node conditions - observed: %v\", nodeConditions)\n\t}\n\n\t// track when a node condition was last observed\n\tnodeConditionsLastObservedAt := nodeConditionsLastObservedAt(nodeConditions, m.nodeConditionsLastObservedAt, now)\n\n\t// node conditions report true if it has been observed within the transition period window\n\tnodeConditions = nodeConditionsObservedSince(nodeConditionsLastObservedAt, m.config.PressureTransitionPeriod, now)\n\tif len(nodeConditions) > 0 {\n\t\tklog.V(3).Infof(\"eviction manager: node conditions - transition period not met: %v\", nodeConditions)\n\t}\n\n\t// determine the set of thresholds we need to drive eviction behavior (i.e. 
all grace periods are met)\n\tthresholds = thresholdsMetGracePeriod(thresholdsFirstObservedAt, now)\n\tdebugLogThresholdsWithObservation(\"thresholds - grace periods satisfied\", thresholds, observations)\n\n\t// update internal state\n\tm.Lock()\n\tm.nodeConditions = nodeConditions\n\tm.thresholdsFirstObservedAt = thresholdsFirstObservedAt\n\tm.nodeConditionsLastObservedAt = nodeConditionsLastObservedAt\n\tm.thresholdsMet = thresholds\n\n\t// determine the set of thresholds whose stats have been updated since the last sync\n\tthresholds = thresholdsUpdatedStats(thresholds, observations, m.lastObservations)\n\tdebugLogThresholdsWithObservation(\"thresholds - updated stats\", thresholds, observations)\n\n\tm.lastObservations = observations\n\tm.Unlock()\n\n\t// evict pods if there is a resource usage violation from local volume temporary storage\n\t// If eviction happens in localStorageEviction function, skip the rest of eviction action\n\tif utilfeature.DefaultFeatureGate.Enabled(features.LocalStorageCapacityIsolation) {\n\t\tif evictedPods := m.localStorageEviction(summary, activePods); len(evictedPods) > 0 {\n\t\t\treturn evictedPods\n\t\t}\n\t}\n\n\tif len(thresholds) == 0 {\n\t\tklog.V(3).Infof(\"eviction manager: no resources are starved\")\n\t\treturn nil\n\t}\n\t\n\t// 对thresholds排序\n\t// rank the thresholds by eviction priority\n\tsort.Sort(byEvictionPriority(thresholds))\n\tthresholdToReclaim, resourceToReclaim, foundAny := getReclaimableThreshold(thresholds)\n\tif !foundAny {\n\t\treturn nil\n\t}\n\tklog.Warningf(\"eviction manager: attempting to reclaim %v\", resourceToReclaim)\n\n\t// record an event about the resources we are now attempting to reclaim via eviction\n\tm.recorder.Eventf(m.nodeRef, v1.EventTypeWarning, \"EvictionThresholdMet\", \"Attempting to reclaim %s\", resourceToReclaim)\n\n  \n  // 7.回收节点级的资源，如果回收的资源足够的话，直接返回，不需要驱逐正在运行中的pod\n\t// check if there are node-level resources we can reclaim to reduce pressure before evicting end-user 
pods.\n\tif m.reclaimNodeLevelResources(thresholdToReclaim.Signal, resourceToReclaim) {\n\t\tklog.Infof(\"eviction manager: able to reduce %v pressure without evicting pods.\", resourceToReclaim)\n\t\treturn nil\n\t}\n\n\tklog.Infof(\"eviction manager: must evict pod(s) to reclaim %v\", resourceToReclaim)\n\n\t// rank the pods for eviction\n\trank, ok := m.signalToRankFunc[thresholdToReclaim.Signal]\n\tif !ok {\n\t\tklog.Errorf(\"eviction manager: no ranking function for signal %s\", thresholdToReclaim.Signal)\n\t\treturn nil\n\t}\n\n\t// the only candidates viable for eviction are those pods that had anything running.\n\tif len(activePods) == 0 {\n\t\tklog.Errorf(\"eviction manager: eviction thresholds have been met, but no pods are active to evict\")\n\t\treturn nil\n\t}\n\t\n\t// 8.对不同阈值驱逐场景下pod有不同的排序，比如如果是mem驱逐，就是按照req limit的qos进行排序驱逐\n\t// rank the running pods for eviction for the specified resource\n\trank(activePods, statsFunc)\n\n\tklog.Infof(\"eviction manager: pods ranked for eviction: %s\", format.Pods(activePods))\n\n\t//record age of metrics for met thresholds that we are using for evictions.\n\tfor _, t := range thresholds {\n\t\ttimeObserved := observations[t.Signal].time\n\t\tif !timeObserved.IsZero() {\n\t\t\tmetrics.EvictionStatsAge.WithLabelValues(string(t.Signal)).Observe(metrics.SinceInSeconds(timeObserved.Time))\n\t\t\tmetrics.DeprecatedEvictionStatsAge.WithLabelValues(string(t.Signal)).Observe(metrics.SinceInMicroseconds(timeObserved.Time))\n\t\t}\n\t}\n\n\t// we kill at most a single pod during each eviction interval\n\t// 9.按照排序后的结果每次驱逐一个pod,每个Pod的annotation会带有为什么驱逐的关键信息，日志也会打印klog.Infof(\"eviction manager: pod %s is evicted successfully\", format.Pod(pod))\n\tfor i := range activePods {\n\t\tpod := activePods[i]\n\t\tgracePeriodOverride := int64(0)\n\t\tif !isHardEvictionThreshold(thresholdToReclaim) {\n\t\t\tgracePeriodOverride = m.config.MaxPodGracePeriodSeconds\n\t\t}\n\t\tmessage, annotations := evictionMessage(resourceToReclaim, pod, 
statsFunc)\n\t\tif m.evictPod(pod, gracePeriodOverride, message, annotations) {\n\t\t\tmetrics.Evictions.WithLabelValues(string(thresholdToReclaim.Signal)).Inc()\n\t\t\treturn []*v1.Pod{pod}\n\t\t}\n\t}\n\tklog.Infof(\"eviction manager: unable to evict any pods from the node\")\n\treturn nil\n}\n```\n\n##### 3.1.1 summaryProvider.Get(updateStats)\n\n可以看到，这里核心就是从cadvisor算出2个数据，nodeStats 和podStats。\n\n```\nfunc (sp *summaryProviderImpl) Get(updateStats bool) (*statsapi.Summary, error) {\n  。。。\n\tnodeStats := statsapi.NodeStats{\n\t\tNodeName:         node.Name,   \n\t\tCPU:              rootStats.CPU,\n\t\tMemory:           rootStats.Memory,\n\t\tNetwork:          networkStats,\n\t\tStartTime:        sp.systemBootTime,\n\t\tFs:               rootFsStats,\n\t\tRuntime:          &statsapi.RuntimeStats{ImageFs: imageFsStats},\n\t\tRlimit:           rlimit,\n\t\tSystemContainers: sp.GetSystemContainersStats(nodeConfig, podStats, updateStats),\n\t}\n\tsummary := statsapi.Summary{\n\t\tNode: nodeStats,\n\t\tPods: podStats,\n\t}\n\treturn &summary, nil\n}\n```\n\n以mem为例：\n\n可以看到，pods的memlimit, RSSBytes,UsageBytes等信息都统计在内\n\n```\nif info.Spec.HasMemory && cstat.Memory != nil {\n\t\tpageFaults := cstat.Memory.ContainerData.Pgfault\n\t\tmajorPageFaults := cstat.Memory.ContainerData.Pgmajfault\n\t\tmemoryStats = &statsapi.MemoryStats{\n\t\t\tTime:            metav1.NewTime(cstat.Timestamp),\n\t\t\tUsageBytes:      &cstat.Memory.Usage,\n\t\t\tWorkingSetBytes: &cstat.Memory.WorkingSet,\n\t\t\tRSSBytes:        &cstat.Memory.RSS,\n\t\t\tPageFaults:      &pageFaults,\n\t\t\tMajorPageFaults: &majorPageFaults,\n\t\t}\n\t\t// availableBytes = memory limit (if known) - workingset\n\t\tif !isMemoryUnlimited(info.Spec.Memory.Limit) {\n\t\t\tavailableBytes := info.Spec.Memory.Limit - cstat.Memory.WorkingSet\n\t\t\tmemoryStats.AvailableBytes = &availableBytes\n\t\t}\n\t}\n```\n\n##### 3.1.2 makeSignalObservations\n\n以Memory为例。构造的Observation就是\n\n```\n\tif memory := summary.Node.Memory; 
memory != nil && memory.AvailableBytes != nil && memory.WorkingSetBytes != nil {\n\t\tresult[evictionapi.SignalMemoryAvailable] = signalObservation{\n\t\t\tavailable: resource.NewQuantity(int64(*memory.AvailableBytes), resource.BinarySI),\n\t\t\tcapacity:  resource.NewQuantity(int64(*memory.AvailableBytes+*memory.WorkingSetBytes), resource.BinarySI),\n\t\t\ttime:      memory.Time,\n\t\t}\n\t}\n```\n\n这里针对内存计算需要注意的是：\n\ntotal_inactive_file 是非活动的文件缓存（page cache），内存紧张时内核可以直接回收这部分缓存，所以用 container_memory_working_set_bytes 判断内存压力会比 container_memory_usage_bytes 更准确\n\n```\nmemory.working_set = memory.usage - memory.total_inactive_file\nmemory.available = memory.total - memory.working_set = memory.total - memory.usage + memory.total_inactive_file\nmemory.capacity = memory.available + memory.working_set\n```\n\n#### 3.2 waitForPodsCleanup\n\nwaitForPodsCleanup逻辑很简单：轮询调用PodResourcesAreReclaimed，确认被驱逐pod的容器、volume、cgroup等资源都已清理完成\n\n```\nfunc (m *managerImpl) waitForPodsCleanup(podCleanedUpFunc PodCleanedUpFunc, pods []*v1.Pod) {\n\ttimeout := m.clock.NewTimer(podCleanupTimeout)\n\tdefer timeout.Stop()\n\tticker := m.clock.NewTicker(podCleanupPollFreq)\n\tdefer ticker.Stop()\n\tfor {\n\t\tselect {\n\t\tcase <-timeout.C():\n\t\t\tklog.Warningf(\"eviction manager: timed out waiting for pods %s to be cleaned up\", format.Pods(pods))\n\t\t\treturn\n\t\tcase <-ticker.C():\n\t\t\tfor i, pod := range pods {\n\t\t\t\tif !podCleanedUpFunc(pod) {\n\t\t\t\t\tbreak\n\t\t\t\t}\n\t\t\t\tif i == len(pods)-1 {\n\t\t\t\t\tklog.Infof(\"eviction manager: pods %s successfully cleaned up\", format.Pods(pods))\n\t\t\t\t\treturn\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n}\n```\n\n<br>\n\n```\n// PodResourcesAreReclaimed returns true if all required node-level resources that a pod was consuming have\n// been reclaimed by the kubelet.  
Reclaiming resources is a prerequisite to deleting a pod from the API server.\nfunc (kl *Kubelet) PodResourcesAreReclaimed(pod *v1.Pod, status v1.PodStatus) bool {\n\tif !notRunning(status.ContainerStatuses) {\n\t\t// We shouldn't delete pods that still have running containers\n\t\tklog.V(3).Infof(\"Pod %q is terminated, but some containers are still running\", format.Pod(pod))\n\t\treturn false\n\t}\n\t// pod's containers should be deleted\n\truntimeStatus, err := kl.podCache.Get(pod.UID)\n\tif err != nil {\n\t\tklog.V(3).Infof(\"Pod %q is terminated, Error getting runtimeStatus from the podCache: %s\", format.Pod(pod), err)\n\t\treturn false\n\t}\n\tif len(runtimeStatus.ContainerStatuses) > 0 {\n\t\tvar statusStr string\n\t\tfor _, status := range runtimeStatus.ContainerStatuses {\n\t\t\tstatusStr += fmt.Sprintf(\"%+v \", *status)\n\t\t}\n\t\tklog.V(3).Infof(\"Pod %q is terminated, but some containers have not been cleaned up: %s\", format.Pod(pod), statusStr)\n\t\treturn false\n\t}\n\tif kl.podVolumesExist(pod.UID) && !kl.keepTerminatedPodVolumes {\n\t\t// We shouldn't delete pods whose volumes have not been cleaned up if we are not keeping terminated pod volumes\n\t\tklog.V(3).Infof(\"Pod %q is terminated, but some volumes have not been cleaned up\", format.Pod(pod))\n\t\treturn false\n\t}\n\tif kl.kubeletConfiguration.CgroupsPerQOS {\n\t\tpcm := kl.containerManager.NewPodContainerManager()\n\t\tif pcm.Exists(pod) {\n\t\t\tklog.V(3).Infof(\"Pod %q is terminated, but pod cgroup sandbox has not been cleaned up\", format.Pod(pod))\n\t\t\treturn false\n\t\t}\n\t}\n\treturn true\n}\n```\n\n### 4. 
总结\n\nkubelet驱逐整体是比较明确的：每10s进行一次判断，超过了阈值就驱逐。\n\n使用上可以参考[官方的文档](https://kubernetes.io/zh-cn/docs/tasks/administer-cluster/reserve-compute-resources/)。但是官方文档有个错误：memory.available是包含system-reserved、kube-reserved这些的，它指的是宿主可用的资源。\n\n举个例子：\n\n下面这样配置基本上不可能触发mem驱逐。因为驱逐条件是宿主可用资源小于2Gi，但是给系统保留了20Gi，所以很难因为pod mem压力大而触发驱逐。反而会因为pod使用mem过大、超过limit而触发oom，不是驱逐。\n\n```\n--system-reserved=cpu=2000m,memory=20Gi --eviction-hard=memory.available<2Gi,nodefs.available<1Mi,nodefs.inodesFree<1\n```\n\n<br>\n\n可以这样设置：当宿主可用资源小于25Gi的时候进行驱逐。这样的设置给了pod 5Gi的缓冲空间，当pod可用资源只剩下5Gi的时候，先驱逐，而不是oom。\n\n```\n--system-reserved=cpu=2000m,memory=20Gi --eviction-hard=memory.available<25Gi,nodefs.available<1Mi,nodefs.inodesFree<1\n```\n\n但是需要注意：oom是实时触发的，而驱逐有最长10s的检测延迟。\n\n当pod只剩下5Gi空间可用时，如果10s内pod使用的mem超过5Gi，oom会先发生。\n\n当pod只剩下5Gi空间可用时，如果10s内pod使用的mem不超过5Gi，驱逐会先发生。\n\n<br>\n\n代码详见：\n\nSignalMemoryAvailable直接就是mem threshold，设置多少就是多少，包含了system-reserved、kube-reserved\n\n```\n// hardEvictionReservation returns a resourcelist that includes reservation of resources based on hard eviction thresholds.\nfunc hardEvictionReservation(thresholds []evictionapi.Threshold, capacity v1.ResourceList) v1.ResourceList {\n\tif len(thresholds) == 0 {\n\t\treturn nil\n\t}\n\tret := v1.ResourceList{}\n\tfor _, threshold := range thresholds {\n\t\tif threshold.Operator != evictionapi.OpLessThan {\n\t\t\tcontinue\n\t\t}\n\t\tswitch threshold.Signal {\n\t\tcase evictionapi.SignalMemoryAvailable:\n\t\t\tmemoryCapacity := capacity[v1.ResourceMemory]\n\t\t\tvalue := evictionapi.GetThresholdQuantity(threshold.Value, &memoryCapacity)\n\t\t\tret[v1.ResourceMemory] = *value\n\t\tcase evictionapi.SignalNodeFsAvailable:\n\t\t\tstorageCapacity := capacity[v1.ResourceEphemeralStorage]\n\t\t\tvalue := evictionapi.GetThresholdQuantity(threshold.Value, &storageCapacity)\n\t\t\tret[v1.ResourceEphemeralStorage] = *value\n\t\t}\n\t}\n\treturn ret\n}\n```"
  }
]