Full Code of hust-open-atom-club/linux-insides-zh for AI

master a4da605128e8 cached

87 files

1.2 MB

421.2k tokens

2 symbols

1 requests

Download .txt

Showing preview only (1,263K chars total). Download the full file or copy to clipboard to get everything.

Repository: hust-open-atom-club/linux-insides-zh
Branch: master
Commit: a4da605128e8
Files: 87
Total size: 1.2 MB

Directory structure:
gitextract_e1uf8jnz/

├── .github/
│   └── ISSUE_TEMPLATE/
│       └── 1_translation_request.yaml
├── Booting/
│   ├── README.md
│   ├── linux-bootstrap-1.md
│   ├── linux-bootstrap-2.md
│   ├── linux-bootstrap-3.md
│   ├── linux-bootstrap-4.md
│   ├── linux-bootstrap-5.md
│   └── linux-bootstrap-6.md
├── CONTRIBUTING.md
├── CONTRIBUTORS.md
├── Cgroups/
│   ├── README.md
│   └── linux-cgroups-1.md
├── Concepts/
│   ├── README.md
│   ├── linux-cpu-1.md
│   ├── linux-cpu-2.md
│   ├── linux-cpu-3.md
│   └── linux-cpu-4.md
├── DataStructures/
│   ├── README.md
│   ├── linux-datastructures-1.md
│   ├── linux-datastructures-2.md
│   └── linux-datastructures-3.md
├── Initialization/
│   ├── README.md
│   ├── linux-initialization-1.md
│   ├── linux-initialization-10.md
│   ├── linux-initialization-2.md
│   ├── linux-initialization-3.md
│   ├── linux-initialization-4.md
│   ├── linux-initialization-5.md
│   ├── linux-initialization-6.md
│   ├── linux-initialization-7.md
│   ├── linux-initialization-8.md
│   └── linux-initialization-9.md
├── Interrupts/
│   ├── README.md
│   ├── linux-interrupts-1.md
│   ├── linux-interrupts-10.md
│   ├── linux-interrupts-2.md
│   ├── linux-interrupts-3.md
│   ├── linux-interrupts-4.md
│   ├── linux-interrupts-5.md
│   ├── linux-interrupts-6.md
│   ├── linux-interrupts-7.md
│   ├── linux-interrupts-8.md
│   └── linux-interrupts-9.md
├── KernelStructures/
│   ├── README.md
│   └── linux-kernelstructure-1.md
├── LICENSE
├── LINKS.md
├── MM/
│   ├── README.md
│   ├── linux-mm-1.md
│   ├── linux-mm-2.md
│   └── linux-mm-3.md
├── Misc/
│   ├── README.md
│   ├── linux-misc-1.md
│   ├── linux-misc-2.md
│   ├── linux-misc-3.md
│   └── linux-misc-4.md
├── README.md
├── SUMMARY.md
├── Scripts/
│   └── validate_markdown_links.py
├── SyncPrim/
│   ├── README.md
│   ├── linux-sync-1.md
│   ├── linux-sync-2.md
│   ├── linux-sync-3.md
│   ├── linux-sync-4.md
│   ├── linux-sync-5.md
│   └── linux-sync-6.md
├── SysCall/
│   ├── README.md
│   ├── linux-syscall-1.md
│   ├── linux-syscall-2.md
│   ├── linux-syscall-3.md
│   ├── linux-syscall-4.md
│   ├── linux-syscall-5.md
│   └── linux-syscall-6.md
├── TRANSLATION_NOTES.md
├── TRANSLATION_STATUS.md
├── Theory/
│   ├── README.md
│   ├── linux-theory-1.md
│   ├── linux-theory-2.md
│   └── linux-theory-3.md
└── Timers/
    ├── README.md
    ├── linux-timers-1.md
    ├── linux-timers-2.md
    ├── linux-timers-3.md
    ├── linux-timers-4.md
    ├── linux-timers-5.md
    ├── linux-timers-6.md
    └── linux-timers-7.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/ISSUE_TEMPLATE/1_translation_request.yaml
================================================
name: 请求翻译/Request Translation
description: 请求翻译一篇文章/Request 
title: 请求翻译
labels: [request, translation]
body:
  - type: markdown
    id: introduction
    attributes:
      value: |
        感谢您对Linux-Insides-ZN的兴趣。请填写以下信息以帮助我们了解您的请求。
        Thank you for your interest in Linux-Insides-ZN project. Please fill out the following information to help us understand your request.
        
  - type: textarea
    id: name
    attributes:
      label: 翻译章节/Chapter
      placeholder: |
        请填写要翻译的章节号。
        Please fill in the chapter number you want to translate. 
    validations:
      required: true
      
  - type: textarea
    id: upstream
    attributes:
      label: 上游链接/Upstream URL
      placeholder: |
        请填写原文的链接。
        Please fill in the link to the original article.
    validations:
      required: true


================================================
FILE: Booting/README.md
================================================
# 内核引导过程

本章介绍了Linux内核引导过程。此处你将在这看到一些描述内核加载过程的整个周期的文章：

* [从引导程序到内核](linux-bootstrap-1.md) - 介绍了从启动计算机到内核执行第一条指令之前的所有阶段;
* [在内核设置代码的第一步](linux-bootstrap-2.md) - 介绍了在内核设置代码的第一个步骤。你会看到堆的初始化，查询不同的参数，如 EDD，IST 等...
* [视频模式初始化和保护模式切换](linux-bootstrap-3.md) - 介绍了内核设置代码中的视频模式初始化，并切换到保护模式。
* [切换 64 位模式](linux-bootstrap-4.md) - 介绍切换到 64 位模式的准备工作以及切换的细节。
* [内核解压缩](linux-bootstrap-5.md) - 介绍了内核解压缩之前的准备工作以及直接解压缩的细节。
* [内核地址随机化](linux-bootstrap-6.md) - 介绍了 Linux 内核加载地址随机化的细节。


================================================
FILE: Booting/linux-bootstrap-1.md
================================================
内核引导过程. 第一部分.
================================================================================

从引导加载程序内核
--------------------------------------------------------------------------------

如果看过我在这之前的[文章](http://0xax.blogspot.com/search/label/asm)，你就会知道我已经开始涉足底层的代码编写。我写了一些关于 Linux x86_64  汇编的文章。同时，我开始深入研究 Linux 源代码。底层是如何工作的，程序是如何在电脑上运行的，它们是如何在内存中定位的，内核是如何管理进程和内存，网络堆栈是如何在底层工作的等等，这些我都非常感兴趣。因此，我决定去写另外的一系列文章关于 **x86_64** 框架的 Linux 内核。

*注意这不是官方文档，只是学习和分享知识*

**需要的基础知识**

* 理解 C 代码
* 理解 汇编语言 代码 （AT&T 语法）

不管怎样，如果你才开始学一些，我会在这些文章中尝试去解释一些部分。好了，小的介绍结束，我们开始深入内核和底层。

我们的文章是基于 Linux 内核 3.18 版本进行的，如果后续的内核版本有任何改变，我将作出相应的更新。

神奇的电源按钮，接下来会发生什么？
--------------------------------------------------------------------------------

尽管这是一系列关于 Linux 内核的文章，我们在第一章并不会从内核代码开始。电脑在你按下电源开关的时候，就开始工作。主板发送信号给[电源](https://en.wikipedia.org/wiki/Power_supply)，而电源收到信号后会给电脑供应合适的电量。一旦主板收到了[电源备妥信号](https://en.wikipedia.org/wiki/Power_good_signal)，它会尝试启动 CPU 。CPU 则复位寄存器的所有数据，并设置每个寄存器的预定值。


[80386](https://en.wikipedia.org/wiki/Intel_80386) 
以及后来的 CPUs 在电脑复位后，在 CPU 寄存器中定义了如下预定义数据：

```
IP          0xfff0
CS selector 0xf000
CS base     0xffff0000
```

处理器开始在[实模式](https://en.wikipedia.org/wiki/Real_mode)工作。我们需要退回一点去理解在这种模式下的内存分段机制。从 [8086](https://en.wikipedia.org/wiki/Intel_8086)到现在的 Intel 64 位  CPU，所有 x86兼容处理器都支持实模式。8086 处理器有一个20位寻址总线，这意味着它可以对0到 2^20  位地址空间（ 1MB ）进行操作。不过它只有16位的寄存器，所以最大寻址空间是  2^16 即 0xffff （64 KB）。实模式使用[段式内存管理](http://en.wikipedia.org/wiki/Memory_segmentation) 来管理整个内存空间。所有内存被分成固定的65536字节（64 KB） 大小的小块。由于我们不能用16位寄存器寻址大于 64KB 的内存，一种替代的方法被设计出来了。一个地址包括两个部分：数据段起始地址和从该数据段起的偏移量。为了得到内存中的物理地址，我们要让数据段乘16并加上偏移量：

```
PhysicalAddress = Segment * 16 + Offset
```

举个例子，如果 `CS:IP` 是 `0x2000:0x0010`, 则对应的物理地址将会是： 

```python
>>> hex((0x2000 << 4) + 0x0010)
'0x20010'
```

不过如果我们使用16位2进制能表示的最大值进行寻址：`0xffff:0xffff`，根据上面的公式，结果将会是：

```python
>>> hex((0xffff << 4) + 0xffff)
'0x10ffef'
```

这超出 1MB 65519 字节。既然实模式下， CPU 只能访问 1MB 地址空间，禁用 [A20线](https://en.wikipedia.org/wiki/A20_line) 后 `0x10ffef` 将变为 `0x00ffef`。

我们了解了实模式和在实模式下的内存寻址方式，让我们来回头继续来看复位后的寄存器值。

`CS` 寄存器包含两个部分：可视段选择器和隐含基址。 结合之前定义的 `CS` 基址和 `IP` 值，逻辑地址应该是：

```
0xffff0000:0xfff0
```

这种形式的起始地址为EIP寄存器里的值加上基址地址：

```python
>>> 0xffff0000 + 0xfff0
'0xfffffff0'
```

得到的 `0xfffffff0` 是位于 4GB - 16 字节处的地址。 这个地方是 [复位向量(Reset vector)](http://en.wikipedia.org/wiki/Reset_vector) 。 这是CPU在重置后期望执行的第一条指令的内存地址。它包含一个 [jump](http://en.wikipedia.org/wiki/JMP_%28x86_instruction%29) 指令，这个指令通常指向BIOS入口点。举个例子，如果访问 [coreboot](http://www.coreboot.org/) 源代码，将看到：

```assembly
	.section ".reset", "ax", %progbits
	.code16
.globl	_start
_start:
	.byte  0xe9
	.int   _start16bit - ( . + 2 )
	...
```

上面的跳转指令（ [opcode](http://ref.x86asm.net/coder32.html#xE9) - 0xe9）跳转到地址  `_start16bit - ( . + 2)` 去执行代码。 `reset` 段是 `16` 字节代码段， 起始于地址 
`0xfffffff0`(`src/cpu/x86/16bit/reset16.ld`)，因此 CPU 复位之后，就会跳到这个地址来执行相应的代码 ：

```
SECTIONS {
	/* Trigger an error if I have an unuseable start address */
 	_bogus = ASSERT(_start16bit >= 0xffff0000, "_start16bit too low. Please report.");
	_ROMTOP = 0xfffffff0;
	. = _ROMTOP;
	.reset . : {
		*(.reset);
		. = 15;
		BYTE(0x00);
	}
}
```

现在BIOS已经开始工作了。在初始化和检查硬件之后，需要寻找到一个可引导设备。可引导设备列表存储在在 BIOS 配置中, BIOS 将根据其中配置的顺序，尝试从不同的设备上寻找引导程序。对于硬盘，BIOS  将尝试寻找引导扇区。如果在硬盘上存在一个MBR分区，那么引导扇区储存在第一个扇区(512字节)的头446字节，引导扇区的最后必须是 `0x55` 和 `0xaa` ，这2个字节称为魔术字节（Magic Bytes)，如果 BIOS 看到这2个字节，就知道这个设备是一个可引导设备。举个例子：

```assembly
;
; Note: this example is written in Intel Assembly syntax
;
[BITS 16]
[ORG  0x7c00]

boot:
    mov al, '!'
    mov ah, 0x0e
    mov bh, 0x00
    mov bl, 0x07

    int 0x10
    jmp $

times 510-($-$$) db 0

db 0x55
db 0xaa
```

构建并运行：

```
nasm -f bin boot.nasm && qemu-system-x86_64 boot
```

这让 [QEMU](http://qemu.org) 使用刚才新建的 `boot` 二进制文件作为磁盘镜像。由于这个二进制文件是由上述汇编语言产生，它满足引导扇区(起始设为 `0x7c00`, 用Magic Bytes结束)的需求。QEMU将这个二进制文件作为磁盘镜像的主引导记录(MBR)。

将看到:

![Simple bootloader which prints only `!`](images/simple_bootloader.png)

在这个例子中，这段代码被执行在16位的实模式，起始于内存0x7c00。之后调用 [0x10](http://www.ctyme.com/intr/rb-0106.htm) 中断打印 `!` 符号。用0填充剩余的510字节并用两个Magic Bytes `0xaa` 和 `0x55` 结束。

可以使用 `objdump` 工具来查看转储信息：

```
nasm -f bin boot.nasm
objdump -D -b binary -mi386 -Maddr16,data16,intel boot
```

一个真实的启动扇区包含了分区表，以及用来启动系统的指令，而不是像我们上面的程序，只是输出了一个感叹号就结束了。从启动扇区的代码被执行开始，BIOS 就将系统的控制权转移给了引导程序，让我们继续往下看看引导程序都做了些什么。

**NOTE**: 强调一点，上面的引导程序是运行在实模式下的，因此 CPU 是使用下面的公式进行物理地址的计算的：

```
PhysicalAddress = Segment * 16 + Offset
```

而且正如我前面所说的，在实模式下，CPU 只能使用16位的通用寄存器。16位寄存器能够表达的最大数值是：`0xffff` ，所以按照上面的公式计算出的最大物理地址是：

```python
>>> hex((0xffff * 16) + 0xffff)
'0x10ffef'
```

这个地址在 [8086](https://en.wikipedia.org/wiki/Intel_8086) 处理器下，将被转换成地址 `0x0ffef`, 原因是因为，8086 cpu 只有20位地址线，只能表示 `2^20 = 1MB` 的地址，而上面这个地址已经超出了 1MB 地址的范围，所以 CPU 就舍弃了最高位。

实模式下的 1MB 地址空间分配表：

```
0x00000000 - 0x000003FF - Real Mode Interrupt Vector Table
0x00000400 - 0x000004FF - BIOS Data Area
0x00000500 - 0x00007BFF - Unused
0x00007C00 - 0x00007DFF - Our Bootloader
0x00007E00 - 0x0009FFFF - Unused
0x000A0000 - 0x000BFFFF - Video RAM (VRAM) Memory
0x000B0000 - 0x000B7777 - Monochrome Video Memory
0x000B8000 - 0x000BFFFF - Color Video Memory
0x000C0000 - 0x000C7FFF - Video ROM BIOS
0x000C8000 - 0x000EFFFF - BIOS Shadow Area
0x000F0000 - 0x000FFFFF - System BIOS
```

如果你的记性不错，在看到这张表的时候，一定会跳出来一个问题。在上面的章节中，我说了 CPU 执行的第一条指令是在地址 `0xFFFFFFF0` 处，这个地址远远大于 `0xFFFFF` ( 1MB )。那么实模式下的 CPU 是如何访问到这个地址的呢？文档 [coreboot](http://www.coreboot.org/Developer_Manual/Memory_map) 给出了答案:

```
0xFFFE_0000 - 0xFFFF_FFFF: 128 kilobyte ROM mapped into address space
```

`0xFFFFFFF0` 这个地址被映射到了 ROM，因此 CPU 执行的第一条指令来自于 ROM，而不是 RAM。

引导程序
--------------------------------------------------------------------------------

在现实世界中，要启动 Linux 系统，有多种引导程序可以选择。比如 [GRUB 2](https://www.gnu.org/software/grub/) 和 [syslinux](http://www.syslinux.org/wiki/index.php/The_Syslinux_Project)。Linux内核通过 [Boot protocol](http://lxr.free-electrons.com/source/Documentation/x86/boot.txt?v=3.18) 来定义应该如何实现引导程序。在这里我们将只介绍 GRUB 2。

现在 BIOS 已经选择了一个启动设备，并且将控制权转移给了启动扇区中的代码，在我们的例子中，启动扇区代码是 [boot.img](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/boot/i386/pc/boot.S;hb=HEAD)。因为这段代码只能占用一个扇区，因此非常简单，只做一些必要的初始化，然后就跳转到 GRUB 2's core image 去执行。 Core image 的代码请参考 [diskboot.img](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/boot/i386/pc/diskboot.S;hb=HEAD)，一般来说 core image 在磁盘上存储在启动扇区之后到第一个可用分区之前。core image 的初始化代码会把整个 core image （包括 GRUB 2的内核代码和文件系统驱动） 引导到内存中。 引导完成之后，[grub_main](http://git.savannah.gnu.org/gitweb/?p=grub.git;a=blob;f=grub-core/kern/main.c)将被调用。

`grub_main` 初始化控制台，计算模块基地址，设置 root 设备，读取 grub 配置文件，加载模块。最后，将 GRUB 置于 normal 模式，在这个模式中，`grub_normal_execute` (from `grub-core/normal/main.c`) 将被调用以完成最后的准备工作，然后显示一个菜单列出所用可用的操作系统。当某个操作系统被选择之后，`grub_menu_execute_entry` 开始执行，它将调用 GRUB 的 `boot` 命令，来引导被选中的操作系统。

就像 kernel boot protocol 所描述的，引导程序必须填充 kernel setup header （位于 kernel setup code 偏移 `0x01f1` 处）  的必要字段。kernel setup header的定义开始于 [arch/x86/boot/header.S](http://lxr.free-electrons.com/source/arch/x86/boot/header.S?v=3.18)：

```assembly
	.globl hdr
hdr:
	setup_sects: .byte 0
	root_flags:  .word ROOT_RDONLY
	syssize:     .long 0
	ram_size:    .word 0
	vid_mode:    .word SVGA_MODE
	root_dev:    .word 0
	boot_flag:   .word 0xAA55
```

bootloader必须填充在 Linux boot protocol 中标记为 `write` 的头信息，比如 [type_of_loader](http://lxr.free-electrons.com/source/Documentation/x86/boot.txt?v=3.18#L354)，这些头信息可能来自命令行，或者通过计算得到。在这里我们不会详细介绍所有的 kernel setup header，我们将在需要的时候逐个介绍。不过，你可以自己通过 [boot protocol](http://lxr.free-electrons.com/source/Documentation/x86/boot.txt?v=3.18#L156) 来了解这些设置。

通过阅读 kernel boot protocol，在内核被引导入内存后，内存使用情况将如下表所示：

```shell
         | Protected-mode kernel  |
100000   +------------------------+
         | I/O memory hole        |
0A0000   +------------------------+
         | Reserved for BIOS      | Leave as much as possible unused
         ~                        ~
         | Command line           | (Can also be below the X+10000 mark)
X+10000  +------------------------+
         | Stack/heap             | For use by the kernel real-mode code.
X+08000  +------------------------+
         | Kernel setup           | The kernel real-mode code.
         | Kernel boot sector     | The kernel legacy boot sector.
       X +------------------------+
         | Boot loader            | <- Boot sector entry point 0x7C00
001000   +------------------------+
         | Reserved for MBR/BIOS  |
000800   +------------------------+
         | Typically used by MBR  |
000600   +------------------------+
         | BIOS use only          |
000000   +------------------------+

```

所以当 bootloader 完成任务，将执行权移交给 kernel，kernel 的代码从以下地址开始执行：

```
0x1000 + X + sizeof(KernelBootSector) + 1
个人以为应该是 X + sizeof(KernelBootSector) + 1 因为 X 已经是一个具体的物理地址了，不是一个偏移
```

上面的公式中， `X` 是 kernel bootsector 被引导入内存的位置。在我的机器上， `X` 的值是 `0x10000`，我们可以通过 memory dump 来检查这个地址：

![kernel first address](images/kernel_first_address.png)

到这里，引导程序完成它的使命，并将控制权移交给了 Linux kernel。下面我们就来看看 kernel setup code 都做了些什么。

内核设置
--------------------------------------------------------------------------------

经过上面的一系列操作，我们终于进入到内核了。不过从技术上说，内核还没有被运行起来，因为首先我们需要正确设置内核，启动内存管理，进程管理等等。内核设置代码的运行起点是 [arch/x86/boot/header.S](http://lxr.free-electrons.com/source/arch/x86/boot/header.S?v=3.18) 中定义的 [_start](http://lxr.free-electrons.com/source/arch/x86/boot/header.S?v=3.18#L293) 函数。 在 `_start` 函数开始之前，还有很多的代码，那这些代码是做什么的呢？

实际上 `_start` 开始之前的代码是 kernel 自带的 bootloader。在很久以前，是可以使用这个 bootloader 来启动 Linux 的。不过在新的 Linux 中，这个 bootloader 代码已经不再启动 Linux 内核，而只是输出一个错误信息。 如果你运行下面的命令，直接使用 Linux 内核来启动，你会看到下图所示的错误：

```
qemu-system-x86_64 vmlinuz-3.18-generic
```

![Try vmlinuz in qemu](images/try_vmlinuz_in_qemu.png)

为了能够作为 bootloader 来使用, `header.S` 开始处定义了 [MZ] [MZ](https://en.wikipedia.org/wiki/DOS_MZ_executable) 魔术数字, 并且定义了  [PE](https://en.wikipedia.org/wiki/Portable_Executable) 头，在 PE 头中定义了输出的字符串：

```assembly
#ifdef CONFIG_EFI_STUB
# "MZ", MS-DOS header
.byte 0x4d
.byte 0x5a
#endif
...
...
...
pe_header:
	.ascii "PE"
	.word 0
```

之所以代码需要这样写，这个是因为遵从 [UEFI](https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface) 的硬件需要这样的结构才能正常引导操作系统。

去除这些作为 bootloader 使用的代码，真正的内核代码就从 `_start` 开始了：

```
// header.S line 292
.globl _start
_start:
```

其他的 bootloader (grub2 and others) 知道 _start 所在的位置（ 从 `MZ` 头开始偏移 `0x200` 字节 ），所以这些 bootloader 就会忽略所有在这个位置前的代码（这些之前的代码位于 `.bstext` 段中）， 直接跳转到这个位置启动内核。

```
//
// arch/x86/boot/setup.ld
//
. = 0;                    // current position
.bstext : { *(.bstext) }  // put .bstext section to position 0
.bsdata : { *(.bsdata) }
```

```assembly
	.globl _start
_start:
	.byte 0xeb
	.byte start_of_setup-1f
1:
	//
	// rest of the header
	//
```

`_start` 开始就是一个 `jmp` 语句（`jmp` 语句的 opcode 是 `0xeb` ），这个跳转语句是一个短跳转，跟在后面的是一个相对地址 （ `start_of_setup - 1f ` ）。在汇编代码中 `Nf` 代表了当前代码之后第一个标号为 `N` 的代码段的地址。回到我们的代码，在 `_start` 标号之后的第一个标号为 `1` 的代码段中包含了剩下的 setup header 结构。在标号为 `1` 的代码段结束之后，紧接着就是标号为 `start_of_setup` 的代码段 （这个代码段位于 `.entrytext` 代码区，这个代码段中的第一条指令实际上是内核开始执行之后的第一条指令） 。

下面让我们来看一下 GRUB2 的代码是如何跳转到 `_start` 标号处的。从 Linux 内核代码中，我们知道 `_start` 标号的代码位于偏移 `0x200` 处。在 GRUB2 的源代码中我们可以看到下面的代码：

```C
  state.gs = state.fs = state.es = state.ds = state.ss = segment;
  state.cs = segment + 0x20;
```

在我的机器上，因为我的内核代码被加载到了内存地址 `0x10000` 处，所以在上面的代码执行完成之后 `cs = 0x1020` （ 因此第一条指令的内存地址将是 `cs << 4 + 0 = 0x10200`，刚好是 `0x10000` 开始后的 `0x200` 处的指令）：

```
fs = es = ds = ss = 0x1000
cs = 0x1020
```

从 `start_of_setup` 标号开始的代码需要完成下面这些事情：

* 将所有段寄存器的值设置成一样的内容
* 设置堆栈
* 设置 [bss](https://en.wikipedia.org/wiki/.bss) （静态变量区）
* 跳转到 [main.c](http://lxr.free-electrons.com/source/arch/x86/boot/main.c?v=3.18) 开始执行代码

段寄存器设置
--------------------------------------------------------------------------------

首先，内核保证将 `ds` 和 `es` 段寄存器指向相同地址，随后，使用 `cld` 指令来清理方向标志位：

```assembly
	movw	%ds, %ax
	movw	%ax, %es
	cld
```

就像我在上面一节中所写的， 为了能够跳转到 `_start` 标号出执行代码，grub2 将 `cs` 段寄存器的值设置成了 `0x1020`，这个值和其他段寄存器都是不一样的，因此下面的代码就是将 `cs` 段寄存器的值和其他段寄存器一致：

```assembly
	pushw	%ds
	pushw	$6f
	lretw
```

上面的代码使用了一个小小的技巧来重置 `cs` 寄存器的内容，下面我们就来仔细分析。 这段代码首先将 `ds`寄存器的值入栈，然后将标号为 [6](http://lxr.free-electrons.com/source/arch/x86/boot/header.S?v=3.18#L494) 的代码段地址入栈 ，接着执行 `lretw` 指令，这条指令，将把标号为 `6` 的内存地址放入 `ip` 寄存器 （[instruction pointer](https://en.wikipedia.org/wiki/Program_counter)），将 `ds` 寄存器的值放入 `cs` 寄存器。 这样一来 `ds` 和 `cs` 段寄存器就拥有了相同的值。

设置堆栈
--------------------------------------------------------------------------------

绝大部分的 setup 代码都是为 C 语言运行环境做准备。在设置了 `ds` 和 `es` 寄存器之后，接下来 [step](http://lxr.free-electrons.com/source/arch/x86/boot/header.S?v=3.18#L467) 的代码将检查 `ss` 寄存器的内容，如果寄存器的内容不对，那么将进行更正：

```assembly
	movw	%ss, %dx
	cmpw	%ax, %dx
	movw	%sp, %dx
	je	2f
```

当进入这段代码的时候， `ss` 寄存器的值可能是一下三种情况之一：

* `ss` 寄存器的值是 0x10000 ( 和其他除了 `cs` 寄存器之外的所有寄存器的一样）
* `ss` 寄存器的值不是 0x10000，但是 `CAN_USE_HEAP` 标志被设置了
* `ss` 寄存器的值不是 0x10000，同时 `CAN_USE_HEAP` 标志没有被设置

下面我们就来分析在这三中情况下，代码都是如何工作的：

* `ss` 寄存器的值是 0x10000，在这种情况下，代码将直接跳转到标号为 `2` 的代码处执行:

```
2: 	andw	$~3, %dx
	jnz	3f
	movw	$0xfffc, %dx
3:  movw	%ax, %ss
	movzwl %dx, %esp
	sti
```

这段代码首先将 `dx` 寄存器的值（就是当前`sp` 寄存器的值）4字节对齐，然后检查是否为0（如果是0，堆栈就不对了，因为堆栈是从大地址向小地址发展的），如果是0，那么就将 `dx` 寄存器的值设置成 `0xfffc` （64KB地址段的最后一个4字节地址）。如果不是0，那么就保持当前值不变。接下来，就将 `ax` 寄存器的值（ 0x10000 ）设置到 `ss` 寄存器，并根据 `dx` 寄存器的值设置正确的 `sp`。这样我们就得到了正确的堆栈设置，具体请参考下图：

![stack](images/stack1.png)

* 下面让我们来看 `ss` != `ds`的情况，首先将 setup code 的结束地址 [_end](http://lxr.free-electrons.com/source/arch/x86/boot/setup.ld?v=3.18#L52) 写入 `dx` 寄存器。然后检查 `loadflags` 中是否设置了 `CAN_USE_HEAP` 标志。   根据 kernel boot protocol 的定义，[loadflags](http://lxr.free-electrons.com/source/arch/x86/boot/header.S?v=3.18#L321) 是一个标志字段。这个字段的 `Bit 7` 就是 `CAN_USE_HEAP` 标志：

```
Field name:	loadflags

  This field is a bitmask.

  Bit 7 (write): CAN_USE_HEAP
	Set this bit to 1 to indicate that the value entered in the
	heap_end_ptr is valid.  If this field is clear, some setup code
	functionality will be disabled.
```

`loadflags` 字段其他可以设置的标志包括：

```C
#define LOADED_HIGH	    (1<<0)
#define QUIET_FLAG	    (1<<5)
#define KEEP_SEGMENTS	(1<<6)
#define CAN_USE_HEAP	(1<<7)
```

如果 `CAN_USE_HEAP` 被置位，那么将 `heap_end_ptr` 放入 `dx` 寄存器，然后加上 `STACK_SIZE` （最小堆栈大小是 512 bytes）。在加法完成之后，如果结果没有溢出（CF flag 没有置位，如果置位那么程序就出错了），那么就跳转到标号为 `2` 的代码处继续执行（这段代码的逻辑在1中已经详细介绍了），接着我们就得到了如下图所示的堆栈：

![stack](images/stack2.png)

* 最后一种情况就是 `CAN_USE_HEAP` 没有置位， 那么我们就将 `dx` 寄存器的值加上 `STACK_SIZE`，然后跳转到标号为 `2` 的代码处继续执行，接着我们就得到了如下图所示的堆栈：

![minimal stack](images/minimal_stack.png)

BSS段设置
--------------------------------------------------------------------------------

在我们正式执行 C 代码之前，我们还有2件事情需要完成。1）设置正确的 [BSS](https://en.wikipedia.org/wiki/.bss)段 ；2）检查 `magic` 签名。接下来的代码，首先检查 `magic` 签名 [setup_sig](http://lxr.free-electrons.com/source/arch/x86/boot/setup.ld?v=3.18#L39)，如果签名不对，直接跳转到 `setup_bad` 部分执行代码：

```assembly
cmpl	$0x5a5aaa55, setup_sig
jne	setup_bad
```

如果 `magic` 签名是对的， 那么我们只要设置好 `BSS` 段，就可以开始执行 C 代码了。

BSS 段用来存储那些没有被初始化的静态变量。对于这个段使用的内存， Linux 首先使用下面的代码将其全部清零：

```assembly
	movw	$__bss_start, %di
	movw	$_end+3, %cx
	xorl	%eax, %eax
	subw	%di, %cx
	shrw	$2, %cx
	rep; stosl
```

在这段代码中，首先将 [__bss_start](http://lxr.free-electrons.com/source/arch/x86/boot/setup.ld?v=3.18#L47) 地址放入 `di` 寄存器，然后将 `_end + 3` （4字节对齐） 地址放入 `cx`，接着使用 `xor` 指令将 `ax` 寄存器清零，接着计算 BSS 段的大小 （ `cx` - `di` ），然后将大小放入 `cx` 寄存器。接下来将 `cx` 寄存器除4，最后使用 `rep; stosl` 指令将 `ax` 寄存器的值（0）写入 寄存器整个 BSS 段。 代码执行完成之后，我们将得到如下图所示的 BSS 段:

![bss](images/bss.png)

跳转到 main 函数
--------------------------------------------------------------------------------

到目前为止，我们完成了堆栈和 BSS 的设置，现在我们可以正式跳入 `main()` 函数来执行 C 代码了：

```assembly
	call main
```

`main()` 函数定义在 [arch/x86/boot/main.c](http://lxr.free-electrons.com/source/arch/x86/boot/main.c?v=3.18)，我们将在下一章详细介绍这个函数做了什么事情。

结束语
--------------------------------------------------------------------------------

本章到此结束了，在下一章中我们将详细介绍在 Linux 内核设置过程中调用的第一个 C 代码（ `main()` ），也将介绍诸如 `memset`, `memcpy`, `earlyprintk` 这些底层函数的实现，敬请期待。

如果你有任何的问题或者建议，你可以留言，也可以直接发消息给我[twitter](https://twitter.com/0xAX)。

**如果你发现文中描述有任何问题，请提交一个 PR 到 [linux-insides-zh](https://github.com/hust-open-atom-club/linux-insides-zh) 。**

相关链接
--------------------------------------------------------------------------------

  * [Intel 80386 programmer's reference manual 1986](http://css.csail.mit.edu/6.858/2014/readings/i386.pdf)
  * [Minimal Boot Loader for Intel® Architecture](https://www.cs.cmu.edu/~410/doc/minimal_boot.pdf)
  * [8086](http://en.wikipedia.org/wiki/Intel_8086)
  * [80386](http://en.wikipedia.org/wiki/Intel_80386)
  * [Reset vector](http://en.wikipedia.org/wiki/Reset_vector)
  * [Real mode](http://en.wikipedia.org/wiki/Real_mode)
  * [Linux kernel boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt)
  * [CoreBoot developer manual](http://www.coreboot.org/Developer_Manual)
  * [Ralf Brown's Interrupt List](http://www.ctyme.com/intr/int.htm)
  * [Power supply](http://en.wikipedia.org/wiki/Power_supply)
  * [Power good signal](http://en.wikipedia.org/wiki/Power_good_signal)


================================================
FILE: Booting/linux-bootstrap-2.md
================================================
# 在内核安装代码的第一步

内核启动的第一步  
--------------------------------------------------------------------------------

在[上一节中](/Booting/linux-bootstrap-1.md)我们开始接触到内核启动代码，并且分析了初始化部分，最后我们停在了对`main`函数（`main`函数是第一个用C写的函数）的调用（`main`函数位于[arch/x86/boot/main.c](http://lxr.free-electrons.com/source/arch/x86/boot/main.c?v=3.18)）。

在这一节中我们将继续对内核启动过程的研究，我们将  
* 认识`保护模式` 
* 如何从实模式进入保护模式 
* 堆和控制台初始化 
* 内存检测，cpu验证，键盘初始化 
* 还有更多 

现在让我们开始我们的旅程

保护模式 
--------------------------------------------------------------------------------
在操作系统可以使用Intel 64位CPU的[长模式](http://en.wikipedia.org/wiki/Long_mode)之前，内核必须首先将CPU切换到保护模式运行。

什么是[保护模式](https://en.wikipedia.org/wiki/Protected_mode)？保护模式于1982年被引入到Intel CPU家族，并且从那之后，直到Intel 64出现，保护模式都是Intel CPU的主要运行模式。

淘汰[实模式](http://wiki.osdev.org/Real_Mode)的主要原因是因为在实模式下，系统能够访问的内存非常有限。如果你还记得我们在上一节说的，在实模式下，系统最多只能访问1M内存，而且在很多时候，实际能够访问的内存只有640K。

保护模式带来了很多的改变，不过主要的改变都集中在内存管理方法。在保护模式中，实模式的20位地址线被替换成32位地址线，因此系统可以访问多达4GB的地址空间。另外，在保护模式中引入了[内存分页](http://en.wikipedia.org/wiki/Paging)功能，在后面的章节中我们将介绍这个功能。

保护模式提供了2种完全不同的内存管理机制：

* 段式内存管理
* 内存分页

在这一节中，我们只介绍段式内存管理，内存分页我们将在后面的章节进行介绍。

在上一节中我们说过，在实模式下，一个物理地址是由2个部分组成的：

* 内存段的基地址 
* 从基地址开始的偏移 
 
使用这2个信息，我们可以通过下面的公式计算出对应的物理地址 

```
PhysicalAddress = Segment * 16 + Offset
```

在保护模式中，内存段的定义和实模式完全不同。在保护模式中，每个内存段不再是64K大小，段的大小和起始位置是通过一个叫做`段描述符`的数据结构进行描述。所有内存段的段描述符存储在一个叫做`全局描述符表`(GDT)的内存结构中。

`全局描述符表`这个内存数据结构在内存中的位置并不是固定的，它的地址保存在一个特殊寄存器 `GDTR` 中。在后面的章节中，我们将在Linux内核代码中看到全局描述符表的地址是如何被保存到 `GDTR` 中的。具体的汇编代码看起来是这样的：

```assembly
lgdt gdt
```

`lgdt` 汇编代码将把全局描述符表的基地址和大小保存到 `GDTR` 寄存器中。`GDTR` 是一个48位的寄存器，这个寄存器中的保存了2部分的内容:

* 全局描述符表的大小 (16位）
* 全局描述符表的基址 (32位)
 
就像前面的段落说的，全局描述符表包含了所有内存段的`段描述符`。每个段描述符长度是64位，结构如下图描述：

```
31          24        19      16              7            0
------------------------------------------------------------
|             | |B| |A|       | |   | |0|E|W|A|            |
| BASE 31:24  |G|/|L|V| LIMIT |P|DPL|S|  TYPE | BASE 23:16 | 4
|             | |D| |L| 19:16 | |   | |1|C|R|A|            |
------------------------------------------------------------
|                             |                            |
|        BASE 15:0            |       LIMIT 15:0           | 0
|                             |                            |
------------------------------------------------------------
```

粗粗一看，上面的结构非常吓人，不过实际上这个结构是非常容易理解的。比如在上图中的 LIMIT 15:0 表示这个数据结构的0到15位保存的是内存段的大小的0到15位。相似的 LIMITE 19:16 表示上述数据结构的16到19位保存的是内存段大小的16到19位。从这个分析中，我们可以看出每个内存段的大小是通过20位进行描述的。下面我们将对这个数据结构进行仔细分析：

1. Limit[20位] 被保存在上述内存结构的0-15和16-19位。根据上述内存结构中`G`位的设置，这20位内存定义的内存长度是不一样的。下面是一些具体的例子：

  * 如果`G` = 0, 并且Limit = 0， 那么表示段长度是1 byte
  * 如果`G` = 1, 并且Limit = 0, 那么表示段长度是4K bytes
  * 如果`G` = 0，并且Limit = 0xfffff，那么表示段长度是1M bytes
  * 如果`G` = 1，并且Limit = 0xfffff，那么表示段长度是4G bytes

  从上面的例子我们可以看出：
  
  * 如果G = 0, 那么内存段的长度是按照1 byte进行增长的 ( Limit每增加1，段长度增加1 byte )，最大的内存段长度将是1M bytes；
  * 如果G = 1, 那么内存段的长度是按照4K bytes进行增长的 ( Limit每增加1，段长度增加4K bytes )，最大的内存段长度将是4G bytes;
  * 段长度的计算公式是 base_seg_length * ( LIMIT + 1)。
   
2. Base[32-bits] 被保存在上述地址结构的0-15， 32-39以及56-63位。Base定义了段基址。

3. Type/Attribute (40-47 bits) 定义了内存段的类型以及支持的操作。
  * `S` 标记（ 第44位 ）定义了段的类型，`S` = 0说明这个内存段是一个系统段；`S` = 1说明这个内存段是一个代码段或者是数据段（ 堆栈段是一种特殊类型的数据段，堆栈段必须是可以进行读写的段 ）。
   
在`S` = 1的情况下，上述内存结构的第43位决定了内存段是数据段还是代码段。如果43位 = 0，说明是一个数据段，否则就是一个代码段。

对于数据段和代码段，下面的表格给出了段类型定义

```
|           Type Field        | Descriptor Type | Description
|-----------------------------|-----------------|------------------
| Decimal                     |                 |
|             0    E    W   A |                 |
| 0           0    0    0   0 | Data            | Read-Only
| 1           0    0    0   1 | Data            | Read-Only, accessed
| 2           0    0    1   0 | Data            | Read/Write
| 3           0    0    1   1 | Data            | Read/Write, accessed
| 4           0    1    0   0 | Data            | Read-Only, expand-down
| 5           0    1    0   1 | Data            | Read-Only, expand-down, accessed
| 6           0    1    1   0 | Data            | Read/Write, expand-down
| 7           0    1    1   1 | Data            | Read/Write, expand-down, accessed
|                  C    R   A |                 |
| 8           1    0    0   0 | Code            | Execute-Only
| 9           1    0    0   1 | Code            | Execute-Only, accessed
| 10          1    0    1   0 | Code            | Execute/Read
| 11          1    0    1   1 | Code            | Execute/Read, accessed
| 12          1    1    0   0 | Code            | Execute-Only, conforming
| 14          1    1    0   1 | Code            | Execute-Only, conforming, accessed
| 13          1    1    1   0 | Code            | Execute/Read, conforming
| 15          1    1    1   1 | Code            | Execute/Read, conforming, accessed
```

从上面的表格我们可以看出，当第43位是`0`的时候，这个段描述符对应的是一个数据段，如果该位是`1`，那么表示这个段描述符对应的是一个代码段。对于数据段，第42，41，40位表示的是(*E*扩展，*W*可写，*A*可访问）；对于代码段，第42，41，40位表示的是(*C*一致，*R*可读，*A*可访问）。 

  * 如果`E` = 0，数据段是向上扩展数据段，反之为向下扩展数据段。关于向上扩展和向下扩展数据段，可以参考下面的[链接](http://www.sudleyplace.com/dpmione/expanddown.html)。在一般情况下，应该是不会使用向下扩展数据段的。
  * 如果`W` = 1，说明这个数据段是可写的，否则不可写。所有数据段都是可读的。
  * A位表示该内存段是否已经被CPU访问。
  * 如果`C` = 1，说明这个代码段可以被低优先级的代码访问，比如可以被用户态代码访问。反之如果`C` = 0，说明只能同优先级的代码段可以访问。
  * 如果`R` = 1，说明该代码段可读。代码段是永远没有写权限的。

4. DPL（2-bits, bit 45 和 46）定义了该段的优先级。具体数值是0-3。

5. P 标志(bit 47) - 说明该内存段是否已经存在于内存中。如果`P` = 0，那么在访问这个内存段的时候将报错。

6. AVL 标志(bit 52) - 这个位在Linux内核中没有被使用。

7. L 标志(bit 53) - 只对代码段有意义，如果`L` = 1，说明该代码段需要运行在64位模式下。

8. D/B flag(bit 54) - 根据段描述符描述的是一个可执行代码段、下扩数据段还是一个堆栈段，这个标志具有不同的功能。（对于32位代码和数据段，这个标志应该总是设置为1；对于16位代码和数据段，这个标志被设置为0。）。

  * 可执行代码段。此时这个标志称为D标志并用于指出该段中的指令引用有效地址和操作数的默认长度。如果该标志置位，则默认值是32位地址和32位或8位的操作数；如果该标志为0，则默认值是16位地址和16位或8位的操作数。指令前缀0x66可以用来选择非默认值的操作数大小；前缀0x67可用来选择非默认值的地址大小。
  * 栈段（由SS寄存器指向的数据段）。此时该标志称为B（Big）标志，用于指明隐含堆栈操作（如PUSH、POP或CALL）时的栈指针大小。如果该标志置位，则使用32位栈指针并存放在ESP寄存器中；如果该标志为0，则使用16位栈指针并存放在SP寄存器中。如果堆栈段被设置成一个下扩数据段，这个B标志也同时指定了堆栈段的上界限。
  * 下扩数据段。此时该标志称为B标志，用于指明堆栈段的上界限。如果设置了该标志，则堆栈段的上界限是0xFFFFFFFF（4GB）；如果没有设置该标志，则堆栈段的上界限是0xFFFF（64KB）。

在保护模式下，段寄存器保存的不再是一个内存段的基地址，而是一个称为`段选择子`的结构。每个段描述符都对应一个`段选择子`。`段选择子`是一个16位的数据结构，下图显示了这个数据结构的内容：

```
-----------------------------
|       Index    | TI | RPL |
-----------------------------
```

其中，
* **Index** 表示在GDT中，对应段描述符的索引号。
* **TI** 表示要在GDT还是LDT中查找对应的段描述符
* **RPL** 表示请求者优先级。这个优先级将和段描述符中的优先级协同工作，共同确定访问是否合法。

在保护模式下，每个段寄存器实际上包含下面2部分内容：
* 可见部分 - 段选择子
* 隐藏部分 - 段描述符

在保护模式中，cpu是通过下面的步骤来找到一个具体的物理地址的：

* 代码必须将相应的`段选择子`装入某个段寄存器
* CPU根据`段选择子`从GDT中找到一个匹配的段描述符，然后将段描述符放入段寄存器的隐藏部分
* 在没有使用向下扩展段的时候，那么内存段的基地址就是`段描述符中的基地址`，段描述符的`limit + 1`就是内存段的长度。如果你知道一个内存地址的`偏移`，那么在没有开启分页机制的情况下，这个内存的物理地址就是`基地址+偏移`

![linear address](images/linear_address.png)

当代码要从实模式进入保护模式的时候，需要执行下面的操作：

* 禁止中断发生
* 使用命令 `lgdt` 将GDT表装入 `GDTR` 寄存器
* 设置CR0寄存器的PE位为1，使CPU进入保护模式
* 跳转开始执行保护模式代码

在后面的章节中，我们将看到Linux 内核中完整的转换代码。不过在系统进入保护模式之前，内核有很多的准备工作需要进行。

让我们打开C文件 [arch/x86/boot/main.c](http://lxr.free-electrons.com/source/arch/x86/boot/main.c?v=3.18)。这个文件包含了很多的函数，这些函数分别会执行键盘初始化，内存堆初始化等等操作...，下面让我们来具体看一些重要的函数。

将启动参数拷贝到"zeropage"
--------------------------------------------------------------------------------

让我们从`main`函数开始看起，这个函数中，首先调用了[`copy_boot_params(void)`](http://lxr.free-electrons.com/source/arch/x86/boot/main.c?v=3.18#L30)。 

这个函数将内核设置信息拷贝到`boot_params`结构的相应字段。大家可以在[arch/x86/include/uapi/asm/bootparam.h](http://lxr.free-electrons.com/source/arch/x86/include/uapi/asm/bootparam.h?v=3.18#L113)找到`boot_params`结构的定义。

`boot_params`结构中包含`struct setup_header hdr`字段。这个结构包含了[linux boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt)中定义的相同字段，并且由boot loader填写。在内核编译的时候`copy_boot_params`完成两个工作：

1. 将[header.S](http://lxr.free-electrons.com/source/arch/x86/boot/header.S?v=3.18#L281)中定义的 `hdr` 结构中的内容拷贝到 `boot_params` 结构的字段 `struct setup_header hdr` 中。

2. 如果内核是通过老的命令行协议运行起来的，那么就更新内核的命令行指针。

这里需要注意的是拷贝 `hdr` 数据结构的 `memcpy` 函数不是C语言中的函数，而是定义在 [copy.S](http://lxr.free-electrons.com/source/arch/x86/boot/copy.S?v=3.18)。让我们来具体分析一下这段代码：

```assembly
GLOBAL(memcpy)
	pushw	%si          ;push si to stack
	pushw	%di          ;push di to stack
	movw	%ax, %di     ;move &boot_param.hdr to di
	movw	%dx, %si     ;move &hdr to si
	pushw	%cx          ;push cx to stack ( sizeof(hdr) )
	shrw	$2, %cx    
	rep; movsl           ;copy based on 4 bytes
	popw	%cx          ;pop cx
	andw	$3, %cx      ;cx = cx % 4
	rep; movsb           ;copy based on one byte
	popw	%di
	popw	%si
	retl
ENDPROC(memcpy)
```

在`copy.S`文件中，你可以看到所有的方法都开始于 `GLOBAL` 宏定义，而结束于 `ENDPROC` 宏定义。 

你可以在 [arch/x86/include/asm/linkage.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/linkage.h?v=3.18)中找到 `GLOBAL` 宏定义。这个宏给代码段分配了一个名字标签，并且让这个名字全局可用。 

```assembly
#define GLOBAL(name)	\
	.globl name;	\
	name:
```

你可以在[include/linux/linkage.h](http://lxr.free-electrons.com/source/include/linux/linkage.h?v=3.18)中找到 `ENDPROC` 宏的定义。 这个宏通过 `END(name)` 代码标识了汇编函数的结束，同时将函数名输出，从而静态分析工具可以找到这个函数。

```assembly
#define ENDPROC(name) \
	.type name, @function ASM_NL \
	END(name)
```

`memcpy` 的实现代码是很容易理解的。首先，代码将 `si` 和 `di` 寄存器的值压入堆栈进行保存，这么做的原因是因为后续的代码将修改 `si` 和 `di` 寄存器的值。`memcpy` 函数（也包括其他定义在copy.s中的其他函数）使用了 `fastcall` 调用规则，意味着所有的函数调用参数是通过 `ax`, `dx`,  `cx`寄存器传入的，而不是传统的通过堆栈传入。因此在使用下面的代码调用 `memcpy` 函数的时候

```c
memcpy(&boot_params.hdr, &hdr, sizeof hdr);
```

函数的参数是这样传递的

* `ax` 寄存器指向 `boot_param.hdr` 的内存地址
* `dx` 寄存器指向 `hdr` 的内存地址
* `cx` 寄存器包含 `hdr` 结构的大小
 
`memcpy` 函数在将 `si` 和 `di` 寄存器压栈之后，将 `boot_param.hdr` 的地址放入 `di` 寄存器，将 `hdr` 的地址放入 `si` 寄存器，并且将 `hdr` 数据结构的大小压栈。 接下来代码首先以4个字节为单位，将 `si` 寄存器指向的内存内容拷贝到 `di` 寄存器指向的内存。当剩下的字节数不足4字节的时候，代码将原始的 `hdr` 数据结构大小出栈放入 `cx` ，然后对 `cx` 的值对4求模，接下来就是根据 `cx` 的值，以字节为单位将 `si` 寄存器指向的内存内容拷贝到 `di` 寄存器指向的内存。当拷贝操作完成之后，将保留的 `si` 以及 `di` 寄存器值出栈，函数返回。

控制台初始化
--------------------------------------------------------------------------------

在 `hdr` 结构体被拷贝到 `boot_params.hdr` 成员之后，系统接下来将进行控制台的初始化。控制台初始化时通过调用[arch/x86/boot/early_serial_console.c](http://lxr.free-electrons.com/source/arch/x86/boot/early_serial_console.c?v=3.18)中定义的 `console_init` 函数实现的。

这个函数首先查看命令行参数是否包含 `earlyprintk` 选项。如果命令行参数包含该选项，那么函数将分析这个选项的内容。得到控制台将使用的串口信息，然后进行串口的初始化。以下是 `earlyprintk` 选项可能的取值：

* serial,0x3f8,115200
* serial,ttyS0,115200
* ttyS0,115200

当串口初始化成功之后，如果命令行参数包含 `debug` 选项，我们将看到如下的输出。

```C
if (cmdline_find_option_bool("debug"))
    puts("early console in setup code\n");
```

`puts` 函数定义在[tty.c](http://lxr.free-electrons.com/source/arch/x86/boot/tty.c?v=3.18)。这个函数只是简单的调用 `putchar` 函数将输入字符串中的内容按字节输出。下面让我们来看看  `putchar`函数的实现：

```C
void __attribute__((section(".inittext"))) putchar(int ch)
{
    if (ch == '\n')
        putchar('\r');

    bios_putchar(ch);

    if (early_serial_base != 0)
        serial_putchar(ch);
}
```

`__attribute__((section(".inittext")))` 说明这段代码将被放入 `.inittext` 代码段。关于 `.inittext` 代码段的定义你可以在 [setup.ld](http://lxr.free-electrons.com/source/arch/x86/boot/setup.ld?v=3.18#L19)中找到。

如果需要输出的字符是 `\n` ，那么 `putchar` 函数将调用自己首先输出一个字符 `\r`。接下来，就调用 `bios_putchar` 函数将字符输出到显示器（使用bios int10中断）：

```C
static void __attribute__((section(".inittext"))) bios_putchar(int ch)
{
    struct biosregs ireg;

    initregs(&ireg);
    ireg.bx = 0x0007;
    ireg.cx = 0x0001;
    ireg.ah = 0x0e;
    ireg.al = ch;
    intcall(0x10, &ireg, NULL);
}
```

在上面的代码中 `initreg` 函数接受一个 `biosregs` 结构的地址作为输入参数，该函数首先调用 `memset` 函数将 `biosregs` 结构体所有成员清0。

```C
    memset(reg, 0, sizeof *reg);
    reg->eflags |= X86_EFLAGS_CF;
    reg->ds = ds();
    reg->es = ds();
    reg->fs = fs();
    reg->gs = gs();
```

下面让我们来看看[memset](http://lxr.free-electrons.com/source/arch/x86/boot/copy.S?v=3.18#L36)函数的实现 :

```assembly
GLOBAL(memset)
    pushw   %di
    movw    %ax, %di
    movzbl  %dl, %eax
    imull   $0x01010101,%eax
    pushw   %cx
    shrw    $2, %cx
    rep; stosl
    popw    %cx
    andw    $3, %cx
    rep; stosb
    popw    %di
    retl
ENDPROC(memset)
```

首先你会发现，`memset` 函数和 `memcpy` 函数一样使用了 `fastcall` 调用规则，因此函数的参数是通过 `ax`，`dx` 以及 `cx` 寄存器传入函数内部的。

就像memcpy函数一样，`memset` 函数一开始将 `di` 寄存器入栈，然后将 `biosregs` 结构的地址从 `ax` 寄存器拷贝到`di`寄存器。接下来，使用 `movzbl` 指令将 `dl` 寄存器的内容拷贝到 `ax` 寄存器的低字节，到这里 `ax` 寄存器就包含了需要拷贝到 `di` 寄存器所指向的内存的值。

接下来的 `imull` 指令将 `eax` 寄存器的值乘上 `0x01010101`。这么做的原因是代码每次将尝试拷贝4个字节内存的内容。下面让我们来看一个具体的例子，假设我们需要将 `0x7` 这个数值放到内存中，在执行 `imull` 指令之前，`eax` 寄存器的值是 `0x7`，在 `imull` 指令被执行之后，`eax` 寄存器的内容变成了 `0x07070707`（4个字节的 `0x7`）。在 `imull` 指令之后，代码使用 `rep; stosl` 指令将 `eax` 寄存器的内容拷贝到 `es:di` 指向的内存。

在 `bisoregs` 结构体被 `initregs` 函数正确填充之后，`bios_putchar` 调用中断 [0x10](http://www.ctyme.com/intr/rb-0106.htm) 在显示器上输出一个字符。接下来 `putchar` 函数检查是否初始化了串口，如果串口被初始化了，那么将调用[serial_putchar](http://lxr.free-electrons.com/source/arch/x86/boot/tty.c?v=3.18#L30)将字符输出到串口。

堆初始化
--------------------------------------------------------------------------------

当堆栈和bss段在[header.S](http://lxr.free-electrons.com/source/arch/x86/boot/header.S?v=3.18)中被初始化之后 (细节请参考上一篇[part](linux-bootstrap-1.md))， 内核需要初始化全局堆，全局堆的初始化是通过 [`init_heap`](http://lxr.free-electrons.com/source/arch/x86/boot/main.c?v=3.18#L116) 函数实现的。

代码首先检查内核设置头中的[`loadflags`](http://lxr.free-electrons.com/source/arch/x86/boot/header.S?v=3.18#L321)是否设置了 [`CAN_USE_HEAP`](http://lxr.free-electrons.com/source/arch/x86/include/uapi/asm/bootparam.h?v=3.18#L21)标志。 如果该标记被设置了，那么代码将计算堆栈的结束地址：:

```C
    char *stack_end;
    
    //%P1 is (-STACK_SIZE)
    if (boot_params.hdr.loadflags & CAN_USE_HEAP) {
        asm("leal %P1(%%esp),%0"
            : "=r" (stack_end) : "i" (-STACK_SIZE));
```

换言之`stack_end = esp - STACK_SIZE`.

在计算了堆栈结束地址之后，代码计算了堆的结束地址：

```c

    //heap_end = heap_end_ptr + 512
    heap_end = (char *)((size_t)boot_params.hdr.heap_end_ptr + 0x200);
```

接下来代码判断 `heap_end` 是否大于 `stack_end`，如果条件成立，将 `stack_end` 设置成 `heap_end`（这么做是因为在大部分系统中全局堆和堆栈是相邻的，但是增长方向是相反的）。

到这里为止，全局堆就被正确初始化了。在全局堆被初始化之后，我们就可以使用 `GET_HEAP` 方法。至于这个函数的实现和使用，我们将在后续的章节中看到。

检查CPU类型
--------------------------------------------------------------------------------

在堆栈初始化之后，内核代码通过调用[arch/x86/boot/cpu.c](http://lxr.free-electrons.com/source/arch/x86/boot/cpu.c?v=3.18)提供的 `validate_cpu` 方法检查CPU级别以确定系统是否能够在当前的CPU上运行。

`validate_cpu` 调用了[`check_cpu`](http://lxr.free-electrons.com/source/arch/x86/boot/cpucheck.c?v=3.18#L102)方法得到当前系统的CPU级别，并且和系统预设的最低CPU级别进行比较。如果不满足条件，则不允许系统运行。

```c
/*from cpu.c*/
check_cpu(&cpu_level, &req_level, &err_flags);
/*after check_cpu call, req_level = req_level defined in cpucheck.c*/
if (cpu_level < req_level) {
    printf("This kernel requires an %s CPU, ", cpu_name(req_level)); 
    printf("but only detected an %s CPU.\n", cpu_name(cpu_level));
    return -1;
}
```

除此之外，`check_cpu` 方法还做了大量的其他检测和设置工作，下面就简单介绍一些：1）检查cpu标志，如果cpu是64位cpu，那么就设置[long mode](http://en.wikipedia.org/wiki/Long_mode), 2) 检查CPU的制造商，根据制造商的不同，设置不同的CPU选项。比如对于AMD出厂的cpu，如果不支持 `SSE+SSE2`，那么就禁止这些选项。

内存分布侦测
--------------------------------------------------------------------------------

接下来，内核调用 `detect_memory` 方法进行内存侦测，以得到系统当前内存的使用分布。该方法使用多种编程接口，包括 `0xe820`（获取全部内存分配），`0xe801` 和 `0x88`（获取临近内存大小），进行内存分布侦测。在这里我们只介绍[arch/x86/boot/memory.c](http://lxr.free-electrons.com/source/arch/x86/boot/memory.c?v=3.18)中提供的 `detect_memory_e820` 方法。

该方法首先调用 `initregs` 方法初始化 `biosregs` 数据结构，然后向该数据结构填入 `0xe820` 编程接口所要求的参数：

```assembly
    initregs(&ireg);
    ireg.ax  = 0xe820;
    ireg.cx  = sizeof buf;
    ireg.edx = SMAP;
    ireg.di  = (size_t)&buf;
```

* `ax` 固定为 `0xe820`
* `cx` 包含数据缓冲区的大小，该缓冲区将包含系统内存的信息数据
* `edx` 必须是 `SMAP` 这个魔术数字，就是 `0x534d4150`
* `es:di` 包含数据缓冲区的地址
* `ebx` 必须为0.

接下来就是通过一个循环来收集内存信息了。每个循环都开始于一个 `0x15` 中断调用，这个中断调用返回地址分配表中的一项，接着程序将返回的 `ebx` 设置到 `biosregs` 数据结构中，然后进行下一次的 `0x15` 中断调用。那么循环什么时候结束呢？直到 `0x15` 调用返回的eflags包含标志 `X86_EFLAGS_CF`:

```C
    intcall(0x15, &ireg, &oreg);
    ireg.ebx = oreg.ebx;
```

在循环结束之后，整个内存分配信息将被写入到 `e820entry` 数组中，这个数组的每个元素包含下面3个信息:

* 内存段的起始地址
* 内存段的大小
* 内存段的类型（类型可以是reserved, usable等等)。

你可以在 `dmesg` 输出中看到这个数组的内容：

```
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003ffdffff] usable
[    0.000000] BIOS-e820: [mem 0x000000003ffe0000-0x000000003fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
```

键盘初始化
--------------------------------------------------------------------------------

接下来内核调用[`keyboard_init()`](http://lxr.free-electrons.com/source/arch/x86/boot/main.c?v=3.18#L65) 方法进行键盘初始化操作。 首先，方法调用`initregs`初始化寄存器结构，然后调用[0x16](http://www.ctyme.com/intr/rb-1756.htm)中断来获取键盘状态。

```c
    initregs(&ireg);
    ireg.ah = 0x02;     /* Get keyboard status */
    intcall(0x16, &ireg, &oreg);
    boot_params.kbd_status = oreg.al;
```

在获取了键盘状态之后，代码再次调用[0x16](http://www.ctyme.com/intr/rb-1757.htm)中断来设置键盘的按键检测频率。

```c
    ireg.ax = 0x0305;   /* Set keyboard repeat rate */
    intcall(0x16, &ireg, NULL);
```

系统参数查询
--------------------------------------------------------------------------------

接下来内核将进行一系列的参数查询。我们在这里将不深入介绍所有这些查询，我们将在后续章节中再进行详细介绍。在这里我们将简单介绍一些系统参数查询:

[query_mca](http://lxr.free-electrons.com/source/arch/x86/boot/mca.c?v=3.18#L18) 方法调用[0x15](http://www.ctyme.com/intr/rb-1594.htm)中断来获取机器的型号信息，BIOS版本以及其他一些硬件相关的属性：

```c
int query_mca(void)
{
    struct biosregs ireg, oreg;
    u16 len;

    initregs(&ireg);
    ireg.ah = 0xc0;
    intcall(0x15, &ireg, &oreg);

    if (oreg.eflags & X86_EFLAGS_CF)
        return -1;  /* No MCA present */

    set_fs(oreg.es);
    len = rdfs16(oreg.bx);

    if (len > sizeof(boot_params.sys_desc_table))
        len = sizeof(boot_params.sys_desc_table);

    copy_from_fs(&boot_params.sys_desc_table, oreg.bx, len);
    return 0;
}
```

这个方法设置 `ah` 寄存器的值为 `0xc0`，然后调用 `0x15` BIOS中断。中断返回之后代码检查 [carry flag](http://en.wikipedia.org/wiki/Carry_flag)。如果它被置位，说明BIOS不支持[**MCA**](https://en.wikipedia.org/wiki/Micro_Channel_architecture)。如果CF被设置成0，那么 `ES:BX` 指向系统信息表。这个表的内容如下所示：

```
Offset  Size    Description
 00h    WORD    number of bytes following
 02h    BYTE    model (see #00515)
 03h    BYTE    submodel (see #00515)
 04h    BYTE    BIOS revision: 0 for first release, 1 for 2nd, etc.
 05h    BYTE    feature byte 1 (see #00510)
 06h    BYTE    feature byte 2 (see #00511)
 07h    BYTE    feature byte 3 (see #00512)
 08h    BYTE    feature byte 4 (see #00513)
 09h    BYTE    feature byte 5 (see #00514)
---AWARD BIOS---
 0Ah  N BYTEs   AWARD copyright notice
---Phoenix BIOS---
 0Ah    BYTE    ??? (00h)
 0Bh    BYTE    major version
 0Ch    BYTE    minor version (BCD)
 0Dh  4 BYTEs   ASCIZ string "PTL" (Phoenix Technologies Ltd)
---Quadram Quad386---
 0Ah 17 BYTEs   ASCII signature string "Quadram Quad386XT"
---Toshiba (Satellite Pro 435CDS at least)---
 0Ah  7 BYTEs   signature "TOSHIBA"
 11h    BYTE    ??? (8h)
 12h    BYTE    ??? (E7h) product ID??? (guess)
 13h  3 BYTEs   "JPN"
 ```

接下来代码调用 `set_fs` 方法，将 `es` 寄存器的值写入 `fs` 寄存器:

```c
static inline void set_fs(u16 seg)
{
    asm volatile("movw %0,%%fs" : : "rm" (seg));
}
```

在[boot.h](http://lxr.free-electrons.com/source/arch/x86/boot/boot.h?v=3.18) 存在很多类似于 `set_fs` 的方法, 比如 `set_gs`。

在 `query_mca` 的最后，代码将 `es:bx` 指向的内存地址的内容拷贝到 `boot_params.sys_desc_table`。

接下来，内核调用 `query_ist` 方法获取[Intel SpeedStep](http://en.wikipedia.org/wiki/SpeedStep)信息。这个方法首先检查CPU类型，然后调用 `0x15` 中断获得这个信息并放入 `boot_params` 中。

接下来，内核会调用[query_apm_bios](http://lxr.free-electrons.com/source/arch/x86/boot/apm.c?v=3.18#L21) 方法从BIOS获得 [高级电源管理](http://en.wikipedia.org/wiki/Advanced_Power_Management) 信息。`query_apm_bios` 也是调用 `0x15` 中断，只不过将 `ax` 设置成 `0x5300` 以得到APM设置信息。中断调用返回之后，代码将检查 `bx` 和 `cx` 的值，如果 `bx` 不是 `0x504d` ( PM 标记 )，或者 `cx` 不是 `0x02` (0x02，表示支持32位模式)，那么代码直接返回错误。否则，将进行下面的步骤。

接下来，代码使用 `ax = 0x5304` 来调用 `0x15` 中断，以断开 `APM` 接口；然后使用 `ax = 0x5303` 调用 `0x15` 中断，使用32位接口重新连接 `APM`；最后使用 `ax = 0x5300` 调用 `0x15` 中断再次获取APM设置，然后将信息写入 `boot_params.apm_bios_info`。

需要注意的是，只有在 `CONFIG_APM` 或者 `CONFIG_APM_MODULE` 被设置的情况下，`query_apm_bios` 方法才会被调用：

```C
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
    query_apm_bios();
#endif
```

最后是[`query_edd`](http://lxr.free-electrons.com/source/arch/x86/boot/edd.c?v=3.18#L122) 方法调用, 这个方法从BIOS中查询 `Enhanced Disk Drive` 信息。下面让我们看看 `query_edd` 方法的实现。

首先，代码检查内核命令行参数是否设置了[edd](http://lxr.free-electrons.com/source/Documentation/kernel-parameters.txt?v=3.18#L1023) 选项，如果edd选项设置成 `off`，`query_edd` 不做任何操作，直接返回。

如果EDD被激活了，`query_edd` 遍历所有BIOS支持的硬盘，并获取相应硬盘的EDD信息：

```C
for (devno = 0x80; devno < 0x80+EDD_MBR_SIG_MAX; devno++) {
    if (!get_edd_info(devno, &ei) && boot_params.eddbuf_entries < EDDMAXNR) {
        memcpy(edp, &ei, sizeof ei);
        edp++;
        boot_params.eddbuf_entries++;
    }
    ...
    ...
    ...
```

在代码中 `0x80` 是第一块硬盘，`EDD_MBR_SIG_MAX` 是一个宏，值为16。代码把获得的信息放入数组[edd_info](http://lxr.free-electrons.com/source/include/uapi/linux/edd.h?v=3.18#L172)中。`get_edd_info` 方法通过调用 `0x13` 中断调用（设置 `ah = 0x41` ) 来检查EDD是否被硬盘支持。如果EDD被支持，代码将再次调用 `0x13` 中断，在这次调用中 `ah = 0x48`，并且 `si` 指向一个数据缓冲区地址。中断调用之后，EDD信息将被保存到 `si` 指向的缓冲区地址。

结束语
--------------------------------------------------------------------------------

本章到此就结束了，在下一章我们将讲解显示模式设置，以及在进入保护模式之前的其他准备工作，在下一章的最后我们将成功进入保护模式。

如果你有任何的问题或者建议，你可以留言，也可以直接发消息给我[twitter](https://twitter.com/0xAX).

**如果你发现文中描述有任何问题，请提交一个 PR 到 [linux-insides-zh](https://github.com/hust-open-atom-club/linux-insides-zh) 。**

相关链接
--------------------------------------------------------------------------------

* [Protected mode](http://en.wikipedia.org/wiki/Protected_mode)
* [Protected mode](http://wiki.osdev.org/Protected_Mode)
* [Long mode](http://en.wikipedia.org/wiki/Long_mode)
* [Nice explanation of CPU Modes with code](http://www.codeproject.com/Articles/45788/The-Real-Protected-Long-mode-assembly-tutorial-for)
* [How to Use Expand Down Segments on Intel 386 and Later CPUs](http://www.sudleyplace.com/dpmione/expanddown.html)
* [earlyprintk documentation](http://lxr.free-electrons.com/source/Documentation/x86/earlyprintk.txt?v=3.18)
* [Kernel Parameters](http://lxr.free-electrons.com/source/Documentation/kernel-parameters.txt?v=3.18)
* [Serial console](http://lxr.free-electrons.com/source/Documentation/serial-console.txt?v=3.18)
* [Intel SpeedStep](http://en.wikipedia.org/wiki/SpeedStep)
* [APM](https://en.wikipedia.org/wiki/Advanced_Power_Management)
* [EDD specification](http://www.t13.org/documents/UploadedDocuments/docs2004/d1572r3-EDD3.pdf)
* [TLDP documentation for Linux Boot Process](http://www.tldp.org/HOWTO/Linux-i386-Boot-Code-HOWTO/setup.html) (old)
* [Previous Part](linux-bootstrap-1.md)
* [BIOS Interrupt](http://wiki.osdev.org/BIOS)


================================================
FILE: Booting/linux-bootstrap-3.md
================================================
内核启动过程，第三部分
================================================================================

显示模式初始化和进入保护模式
--------------------------------------------------------------------------------

这一章是`内核启动过程`的第三部分，在[前一章](linux-bootstrap-2.md#kernel-booting-process-part-2)中，我们的内核启动过程之旅停在了对 `set_video` 函数的调用（这个函数定义在 [main.c](http://lxr.free-electrons.com/source/arch/x86/boot/main.c?v=3.18#L181)）。在这一章中，我们将接着上一章继续我们的内核启动之旅。在这一章你将读到下面的内容：
- 显示模式的初始化，
- 在进入保护模式之前的准备工作，
- 正式进入保护模式

**注意** 如果你对保护模式一无所知，你可以查看[前一章](linux-bootstrap-2.md#protected-mode) 的相关内容。另外，你也可以查看下面这些[链接](linux-bootstrap-2.md#links) 以了解更多关于保护模式的内容。

就像我们前面所说的，我们将从 `set_video` 函数开始我们这章的内容，你可以在 [arch/x86/boot/video.c](http://lxr.free-electrons.com/source/arch/x86/boot/video.c?v=3.18#L315) 找到这个函数的定义。 这个函数首先从 `boot_params.hdr` 数据结构获取显示模式设置：

```C
u16 mode = boot_params.hdr.vid_mode;
```

至于 `boot_params.hdr` 数据结构中的内容，是通过 `copy_boot_params` 函数实现的 （关于这个函数的实现细节请查看上一章的内容），`boot_params.hdr` 中的 `vid_mode` 是引导程序必须填入的字段。你可以在 `kernel boot protocol` 文档中找到关于 `vid_mode` 的详细信息：

```
Offset	Proto	Name		Meaning
/Size
01FA/2	ALL	    vid_mode	Video mode control
```

而在 `linux kernel boot protocol` 文档中定义了如何通过命令行参数的方式为 `vid_mode` 字段传入相应的值：

```
**** SPECIAL COMMAND LINE OPTIONS
vga=<mode>
	<mode> here is either an integer (in C notation, either
	decimal, octal, or hexadecimal) or one of the strings
	"normal" (meaning 0xFFFF), "ext" (meaning 0xFFFE) or "ask"
	(meaning 0xFFFD).  This value should be entered into the
	vid_mode field, as it is used by the kernel before the command
	line is parsed.
```

根据上面的描述，我们可以通过将 `vga` 选项写入 grub 或者写到引导程序的配置文件，从而让内核命令行得到相应的显示模式设置信息。这个选项可以接受不同类型的值来表示相同的意思。比如你可以传入 0XFFFD 或者 ask，这2个值都表示需要显示一个菜单让用户选择想要的显示模式。下面的链接就给出了这个菜单：

![video mode setup menu](images/video_mode_setup_menu.png)

通过这个菜单，用户可以选择想要进入的显示模式。不过在我们进一步了解显示模式的设置过程之前，让我们先回头了解一些重要的概念。

内核数据类型
--------------------------------------------------------------------------------

在前面的章节中，我们已经接触到了一个类似于 `u16` 的内核数据类型。下面列出了更多内核支持的数据类型：


| Type | char | short | int | long | u8 | u16 | u32 | u64 |
|------|------|-------|-----|------|----|-----|-----|-----|
| Size |  1   |   2   |  4  |   8  |  1 |  2  |  4  |  8  |

如果你尝试阅读内核代码，最好能够牢记这些数据类型。

堆操作 API
--------------------------------------------------------------------------------

在 `set_video` 函数将 `vid_mod` 的值设置完成之后，将调用 `RESET_HEAP` 宏将 HEAP 头指向 `_end` 符号。`RESET_HEAP` 宏定义在  [boot.h](http://lxr.free-electrons.com/source/arch/x86/boot/boot.h?v=3.18#L199)：

```C
#define RESET_HEAP() ((void *)( HEAP = _end ))
```

如果你阅读过第二部分，你应该还记得在第二部分中，我们通过 [`init_heap`](http://lxr.free-electrons.com/source/arch/x86/boot/main.c?v=3.18#L116) 函数完成了 HEAP 的初始化。在 `boot.h` 中定义了一系列的方法来操作被初始化之后的 HEAP。这些操作包括：

```C
#define RESET_HEAP() ((void *)( HEAP = _end ))
```

就像我们在前面看到的，这个宏只是简单的将 HEAP 头设置到 `_end` 标号。在上一章中我们已经说明了 `_end` 标号，在 `boot.h` 中通过 `extern char _end[];` 来引用（从这里可以看出，在内核初始化的时候堆和栈是共享内存空间的，详细的信息可以查看第一章的堆栈初始化和第二章的堆初始化）：

下面一个是 `GET_HEAP` 宏：

```C
#define GET_HEAP(type, n) \
	((type *)__get_heap(sizeof(type),__alignof__(type),(n)))
```

可以看出这个宏调用了 `__get_heap` 函数来进行内存的分配。`__get_heap` 需要下面3个参数来进行内存分配操作：

* 某个数据类型所占用的字节数
* `__alignof__(type)` 返回对于请求的数据类型需要怎样的对齐方式 ( 根据我的了解这个是 gcc 提供的一个功能 ）
* `n` 需要分配多少个对应数据类型的对象

下面是 `__get_heap` 函数的实现：

```C
static inline char *__get_heap(size_t s, size_t a, size_t n)
{
	char *tmp;

	HEAP = (char *)(((size_t)HEAP+(a-1)) & ~(a-1));
	tmp = HEAP;
	HEAP += s*n;
	return tmp;
}
```

现在让我们来了解这个函数是如何工作的。 这个函数首先根据对齐方式要求（参数 `a` ）调整 `HEAP` 的值，然后将 `HEAP` 值赋值给一个临时变量 `tmp`。接下来根据需要分配的对象的个数（参数 `n` ），预留出所需要的内存，然后将 `tmp` 返回给调用端。

最后一个关于 HEAP 的操作是：

```C
static inline bool heap_free(size_t n)
{
	return (int)(heap_end - HEAP) >= (int)n;
}
```

这个函数简单做了一个减法 `heap_end - HEAP`，如果相减的结果大于请求的内存，那么就返回真，否则返回假。

我们已经看到了所有可以对 HEAP 进行操作，下面让我们继续显示模式设置过程。

设置显示模式
--------------------------------------------------------------------------------

在我们分析了内核数据类型以及和 HEAP 相关的操作之后，让我们回来继续分析显示模式的初始化。在 `RESET_HEAP()` 函数被调用之后，`set_video` 函数接着调用 `store_mode_params` 函数将对应显示模式的相关参数写入 `boot_params.screen_info` 字段。这个字段的结构定义可以在 [include/uapi/linux/screen_info.h](https://github.com/0xAX/linux/blob/master/include/uapi/linux/screen_info.h) 中找到。

`store_mode_params` 函数将调用 `store_cursor_position` 函数将当前屏幕上光标的位置保存起来。下面让我们来看 `store_cursor_poistion` 函数是如何实现的。

首先函数初始化一个类型为 `biosregs` 的变量，将其中的 `AH` 寄存器内容设置成 `0x3`，然后调用 `0x10` BIOS 中断。当中断调用返回之后，`DL` 和 `DH` 寄存器分别包含了当前光标的行和列信息。接着，这2个信息将被保存到 `boot_params.screen_info` 字段的 `orig_x` 和 `orig_y`字段。

在 `store_cursor_position` 函数执行完毕之后，`store_mode_params` 函数将调用 `store_video_mode` 函数将当前使用的显示模式保存到 `boot_params.screen_info.orig_video_mode`。

接下來 `store_mode_params` 函数将根据当前显示模式的设定，给 `video_segment` 变量设置正确的值（实际上就是设置显示内存的起始地址）。在 BIOS 将控制权转移到引导扇区的时候，显示内存地址和显示模式的对应关系如下表所示：

```
0xB000:0x0000 	32 Kb 	Monochrome Text Video Memory
0xB800:0x0000 	32 Kb 	Color Text Video Memory
```

根据上表，如果当前显示模式是 MDA, HGC 或者单色 VGA 模式，那么 `video_sgement` 的值将被设置成 `0xB000`；如果当前显示模式是彩色模式，那么 `video_segment` 的值将被设置成 `0xB800`。在这之后，`store_mode_params` 函数将保存字体大小信息到 `boot_params.screen_info.orig_video_points`：

```C
//保存字体大小信息
set_fs(0);
font_size = rdfs16(0x485);
boot_params.screen_info.orig_video_points = font_size;
```

这段代码首先调用 `set_fs` 函数（在 [boot.h](https://github.com/0xAX/linux/blob/master/arch/x86/boot/boot.h) 中定义了许多类似的函数进行寄存器操作）将数字 `0` 放入 `FS` 寄存器。接着从内存地址 `0x485` 处获取字体大小信息并保存到 `boot_params.screen_info.orig_video_points`。

```
 x = rdfs16(0x44a);
 y = (adapter == ADAPTER_CGA) ? 25 : rdfs8(0x484)+1;
```

接下来代码将从地址 `0x44a` 处获得屏幕列信息，从地址 `0x484` 处获得屏幕行信息，并将它们保存到 `boot_params.screen_info.orig_video_cols` 和 `boot_params.screen_info.orig_video_lines`。到这里，`store_mode_params` 的执行就结束了。

接下来，`set_video` 函数将调用 `save_screen` 函数将当前屏幕上的所有信息保存到 HEAP 中。这个函数首先获得当前屏幕的所有信息（包括屏幕大小，当前光标位置，屏幕上的字符信息），并且保存到 `saved_screen` 结构体中。这个结构体的定义如下所示：

```C
static struct saved_screen {
	int x, y;
	int curx, cury;
	u16 *data;
} saved;
```

接下来函数将检查 HEAP 中是否有足够的空间保存这个结构体的数据：

```C
if (!heap_free(saved.x*saved.y*sizeof(u16)+512))
		return;
```

如果 HEAP 有足够的空间，代码将在 HEAP 中分配相应的空间并且将 `saved_screen` 保存到 HEAP。

接下来 `set_video` 函数将调用 `probe_cards(0)`（这个函数定义在  [arch/x86/boot/video-mode.c](https://github.com/0xAX/linux/blob/master/arch/x86/boot/video-mode.c#L33)）。 这个函数简单遍历所有的显卡，并通过调用驱动程序设置显卡所支持的显示模式：

```C
for (card = video_cards; card < video_cards_end; card++) {
		if (card->unsafe == unsafe) {
			if (card->probe)
				card->nmodes = card->probe();
			else
				card->nmodes = 0;
		}
}
```

如果你仔细看上面的代码，你会发现 `video_cards` 这个变量并没有被声明，那么程序怎么能够正常编译执行呢？实际上很简单，它指向了一个在 [arch/x86/boot/setup.ld](https://github.com/0xAX/linux/blob/master/arch/x86/boot/setup.ld) 中定义的叫做 `.videocards` 的内存段：
```
	.videocards	: {
		video_cards = .;
		*(.videocards)
		video_cards_end = .;
	}
```
那么这段内存里面存放的数据是什么呢，下面我们就来详细分析。在内核初始化代码中，对于每个支持的显示模式都是使用下面的代码进行定义的：

```C
static __videocard video_vga = {
	.card_name	= "VGA",
	.probe		= vga_probe,
	.set_mode	= vga_set_mode,
};
```

`__videocard` 是一个宏定义，如下所示：

```C
#define __videocard struct card_info __attribute__((used,section(".videocards")))
```

因此 `__videocard` 是一个 `card_info` 结构，这个结构定义如下：

```C
struct card_info {
	const char *card_name;
	int (*set_mode)(struct mode_info *mode);
	int (*probe)(void);
	struct mode_info *modes;
	int nmodes;
	int unsafe;
	u16 xmode_first;
	u16 xmode_n;
};
```

在 `.videocards` 内存段实际上存放的就是所有被内核初始化代码定义的 `card_info` 结构（可以看成是一个数组），所以 `probe_cards` 函数可以使用 `video_cards`，通过循环遍历所有的 `card_info`。

在 `probe_cards` 执行完成之后，我们终于进入 `set_video` 函数的主循环了。在这个循环中，如果 `vid_mode=ask`，那么将显示一个菜单让用户选择想要的显示模式，然后代码将根据用户的选择或者 `vid_mod` 的值 ，通过调用 `set_mode` 函数来设置正确的显示模式。如果设置成功，循环结束，否则显示菜单让用户选择显示模式，继续进行设置显示模式的尝试。

```c
for (;;) {
      if (mode == ASK_VGA)
          mode = mode_menu();

      if (!set_mode(mode))
          break;

      printf("Undefined video mode number: %x\n", mode);
      mode = ASK_VGA;
  }
```

你可以在 [video-mode.c](https://github.com/0xAX/linux/blob/master/arch/x86/boot/video-mode.c#L147) 中找到 `set_mode` 函数的定义。这个函数只接受一个参数，这个参数是对应的显示模式的数字表示（这个数字来自于显示模式选择菜单，或者从内核命令行参数获得）。

`set_mode` 函数首先检查传入的 `mode` 参数，然后调用 `raw_set_mode` 函数。而后者将遍历内核知道的所有 `card_info` 信息，如果发现某张显卡支持传入的模式，这调用 `card_info` 结构中保存的 `set_mode` 函数地址进行显卡显示模式的设置。以 `video_vga` 这个 `card_info` 结构来说，保存在其中的 `set_mode` 函数就指向了 `vga_set_mode` 函数。下面的代码就是 `vga_set_mode` 函数的实现，这个函数根据输入的 vga 显示模式，调用不同的函数完成显示模式的设置：

```C
static int vga_set_mode(struct mode_info *mode)
{
	vga_set_basic_mode();

	force_x = mode->x;
	force_y = mode->y;

	switch (mode->mode) {
	case VIDEO_80x25:
		break;
	case VIDEO_8POINT:
		vga_set_8font();
		break;
	case VIDEO_80x43:
		vga_set_80x43();
		break;
	case VIDEO_80x28:
		vga_set_14font();
		break;
	case VIDEO_80x30:
		vga_set_80x30();
		break;
	case VIDEO_80x34:
		vga_set_80x34();
		break;
	case VIDEO_80x60:
		vga_set_80x60();
		break;
	}
	return 0;
}
```

在上面的代码中，每个 `vga_set***` 函数只是简单调用 `0x10` BIOS 中断来进行显示模式的设置。

在显卡的显示模式被正确设置之后，这个最终的显示模式被写回  `boot_params.hdr.vid_mode`。

接下来 `set_video` 函数将调用 `vesa_store_edid` 函数， 这个函数只是简单的将  [EDID](https://en.wikipedia.org/wiki/Extended_Display_Identification_Data) (**E**xtended **D**isplay **I**dentification **D**ata) 写入内存，以便于内核访问。最后， `set_video` 将调用 `do_restore` 函数将前面保存的当前屏幕信息还原到屏幕上。

到这里为止，显示模式的设置完成，接下来我们可以切换到保护模式了。

在切换到保护模式之前的最后的准备工作
--------------------------------------------------------------------------------

在进入保护模式之前的最后一个函数调用发生在 [main.c](http://lxr.free-electrons.com/source/arch/x86/boot/main.c?v=3.18#L184) 中的 `go_to_protected_mode` 函数，就像这个函数的注释说的，这个函数将进行最后的准备工作然后进入保护模式，下面就让我们来具体看看最后的准备工作是什么，以及系统是如何切换到保护模式的。

`go_to_protected_mode` 函数本身定义在 [arch/x86/boot/pm.c](http://lxr.free-electrons.com/source/arch/x86/boot/pm.c?v=3.18#L104)。 这个函数调用了一些其他的函数进行最后的准备工作，下面就让我们来具体看看这些函数。

`go_to_protected_mode` 函数首先调用的是 `realmode_switch_hook` 函数，后者如果发现 `realmode_switch` hook， 那么将调用它并禁止 [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt) 中断，反之将直接禁止 NMI 中断。只有当 bootloader 运行在宿主环境下（比如在 DOS 下运行 ）， hook 才会被使用。你可以在 [boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt) (see **ADVANCED BOOT LOADER HOOKS**) 中详细了解 hook 函数的信息。

```c
/*
 * Invoke the realmode switch hook if present; otherwise
 * disable all interrupts.
 */
static void realmode_switch_hook(void)
{
	if (boot_params.hdr.realmode_swtch) {
		asm volatile("lcallw *%0"
			     : : "m" (boot_params.hdr.realmode_swtch)
			     : "eax", "ebx", "ecx", "edx");
	} else {
		asm volatile("cli");
		outb(0x80, 0x70); /* Disable NMI */
		io_delay();
	}
}
```

`realmode_switch` 指向了一个16 位实模式代码地址（远跳转指针），这个16位代码将禁止 NMI 中断。所以在上述代码中，如果 `realmode_swtch` hook 存在，代码是用了 `lcallw` 指令进行远函数调用。在我的环境中，因为不存在这个 hook ，所以代码是直接进入 `else` 部分进行了 NMI 的禁止：

```assembly
asm volatile("cli");
outb(0x80, 0x70);	/* Disable NMI */
io_delay();
```

上面的代码首先调用 `cli` 汇编指令清除了中断标志 `IF`，这条指令执行之后，外部中断就被禁止了，紧接着的下一行代码就禁止了 NMI 中断。

这里简单介绍一下中断。中断是由硬件或者软件产生的，当中断产生的时候， CPU 将得到通知。这个时候， CPU 将停止当前指令的执行，保存当前代码的环境，然后将控制权移交到中断处理程序。当中断处理程序完成之后，将恢复中断之前的运行环境，从而被中断的代码将继续运行。 NMI 中断是一类特殊的中断，往往预示着系统发生了不可恢复的错误，所以在正常运行的操作系统中，NMI 中断是不会被禁止的，但是在进入保护模式之前，由于特殊需求，代码禁止了这类中断。我们将在后续的章节中对中断做更多的介绍，这里就不展开了。

现在让我们回到上面的代码，在 NMI 中断被禁止之后（通过写 `0x80` 进 CMOS 地址寄存器 `0x70` ），函数接着调用了 `io_delay` 函数进行了短暂的延时以等待 I/O 操作完成。下面就是 `io_delay` 函数的实现：

```C
static inline void io_delay(void)
{
	const u16 DELAY_PORT = 0x80;
	asm volatile("outb %%al,%0" : : "dN" (DELAY_PORT));
}
```

对 I/O 端口 `0x80` 写入任何的字节都将得到 1 ms 的延时。在上面的代码中，代码将 `al` 寄存器中的值写到了这个端口。在这个 `io_delay` 调用完成之后， `realmode_switch_hook` 函数就完成了所有工作，下面让我们进入下一个函数。

下一个函数调用是 `enable_a20`，这个函数使能 [A20 line](http://en.wikipedia.org/wiki/A20_line)，你可以在 [arch/x86/boot/a20.c](http://lxr.free-electrons.com/source/arch/x86/boot/a20.c?v=3.18) 找到这个函数的定义，这个函数会尝试使用不同的方式来使能 A20 地址线。首先这个函数将调用 `a20_test_short`（该函数将调用 `a20_test` 函数） 来检测 A20 地址线是否已经被激活了：

```C
static int a20_test(int loops)
{
	int ok = 0;
	int saved, ctr;

	set_fs(0x0000);
	set_gs(0xffff);

	saved = ctr = rdfs32(A20_TEST_ADDR);

    while (loops--) {
		wrfs32(++ctr, A20_TEST_ADDR);
		io_delay();	/* Serialize and make delay constant */
		ok = rdgs32(A20_TEST_ADDR+0x10) ^ ctr;
		if (ok)
			break;
	}

	wrfs32(saved, A20_TEST_ADDR);
	return ok;
}
```

这个函数首先将 `0x0000` 放入 `FS` 寄存器，将 `0xffff` 放入 `GS` 寄存器。然后通过 `rdfs32` 函数调用，将 `A20_TEST_ADDR` 内存地址的内容放入 `saved` 和 `ctr` 变量。

接下来我们使用 `wrfs32` 函数将更新过的 `ctr` 的值写入 `fs:gs` ，接着延时 1ms， 然后从 `GS:A20_TEST_ADDR+0x10` 读取内容，如果该地址内容不为0，那么 A20 已经被激活。如果 A20 没有被激活，代码将尝试使用多种方法进行 A20 地址激活。其中的一种方法就是调用 BIOS `0X15` 中断激活 A20 地址线。

如果 `enabled_a20` 函数调用失败，显示一个错误消息并且调用 `die` 函数结束操作系统运行。`die` 函数定义在 [arch/x86/boot/header.S](http://lxr.free-electrons.com/source/arch/x86/boot/header.S?v=3.18):

```assembly
die:
	hlt
	jmp	die
	.size	die, .-die
```

A20 地址线被激活之后，`reset_coprocessor` 函数被调用：

 ```C
outb(0, 0xf0);
outb(0, 0xf1);
```

这个函数非常简单，通过将 `0` 写入 I/O 端口 `0xf0` 和 `0xf1` 以复位数字协处理器。

接下来 `mask_all_interrupts` 函数将被调用：

```C
outb(0xff, 0xa1);       /* Mask all interrupts on the secondary PIC */
outb(0xfb, 0x21);       /* Mask all but cascade on the primary PIC */
```

这个函数调用屏蔽了从中断控制器 (注：中断控制器的原文是 Programmable Interrupt Controller) 的所有中断，和主中断控制器上除IRQ2以外的所有中断（IRQ2是主中断控制器上的级联中断，所有从中断控制器的中断将通过这个级联中断报告给 CPU ）。

到这里位置，我们就完成了所有的准备工作，下面我们就将正式开始从实模式转换到保护模式。

设置中断描述符表
--------------------------------------------------------------------------------

现在内核将调用 `setup_idt` 方法来设置中断描述符表（ IDT ）：

```C
static void setup_idt(void)
{
	static const struct gdt_ptr null_idt = {0, 0};
	asm volatile("lidtl %0" : : "m" (null_idt));
}
```

上面的代码使用 `lidtl` 指令将 `null_idt` 所指向的中断描述符表引入寄存器 IDT。由于 `null_idt` 没有设定中断描述符表的长度（长度为 0 ），所以这段指令执行之后，实际上没有任何中断调用被设置成功（所有中断调用都是空的），在后面的章节中我们将看到正确的设置。`null_idt` 是一个 `gdt_ptr` 结构的数据，这个结构的定义如下所示：

```C
struct gdt_ptr {
	u16 len;
	u32 ptr;
} __attribute__((packed));
```

在上面的定义中，我们可以看到上面这个结构包含一个 16 bit 的长度字段，和一个 32 bit 的指针字段。`__attribute__((packed))` 意味着这个结构就只包含 48 bit 信息（没有字节对齐优化）。在下面一节中，我们将看到相同的结构将被导入 `GDTR` 寄存器（如果你还记得上一章的内容，应该记得 GDTR 寄存器是 48 bit 长度的）。

设置全局描述符表
--------------------------------------------------------------------------------

在设置完中断描述符表之后，我们将使用 `setup_gdt` 函数来设置全局描述符表（关于全局描述符表，大家可以参考[上一章](linux-bootstrap-2.md#protected-mode) 的内容）。在 `setup_gdt` 函数中，使用 `boot_gdt` 数组定义了需要引入 GDTR 寄存器的段描述符信息：

```C
   //GDT_ENTRY_BOOT_CS 定义在http://lxr.free-electrons.com/source/arch/x86/include/asm/segment.h#L19 = 2
	static const u64 boot_gdt[] __attribute__((aligned(16))) = {
		[GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
		[GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
		[GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103),
	};
```

在上面的 `boot_gdt` 数组中，我们定义了代码，数据和 TSS 段(Task State Segment, 任务状态段)的段描述符，因为我们并没有设置任何的中断调用（记得上面说的 `null_idt`吗？），所以 TSS 段并不会被使用到。TSS 段存在的唯一目的就是让 Intel 处理器能够正确进入保护模式。下面让我们详细了解一下 `boot_gdt` 这个数组，首先，这个数组被 `__attribute__((aligned(16)))` 修饰，这就意味着这个数组将以 16 字节为单位对齐。让我们通过下面的例子来了解一下什么叫 16 字节对齐：

```C
#include <stdio.h>

struct aligned {
	int a;
}__attribute__((aligned(16)));

struct nonaligned {
	int b;
};

int main(void)
{
	struct aligned    a;
	struct nonaligned na;

	printf("Not aligned - %zu \n", sizeof(na));
	printf("Aligned - %zu \n", sizeof(a));

	return 0;
}
```

上面的代码可以看出，一旦指定了 16 字节对齐，即使结构中只有一个 `int` 类型的字段，整个结构也将占用 16 个字节：

```
$ gcc test.c -o test && test
Not aligned - 4
Aligned - 16
```

因为在 `boot_gdt` 的定义中， `GDT_ENTRY_BOOT_CS = 2`，所以在数组中有2个空项，第一项是一个空的描述符，第二项在代码中没有使用。在没有 `align 16` 之前，整个结构占用了（8*5=40）个字节，加了 `align 16` 之后，结构就占用了 48 字节 。

上面代码中出现的 `GDT_ENTRY` 是一个宏定义，这个宏接受 3 个参数（标志，基地址，段长度）来产生段描述符结构。让我们来具体分析上面数组中的代码段描述符（ `GDT_ENTRY_BOOT_CS` ）来看看这个宏是如何工作的，对于这个段，`GDT_ENTRY` 接受了下面 3 个参数：

* 基地址  - 0
* 段长度 - 0xfffff
* 标志 - 0xc09b

上面这些数字表明，这个段的基地址是 0， 段长度是 `0xfffff` （ 1 MB ），而标志字段展开之后是下面的二进制数据：

```
1100 0000 1001 1011
```

这些二进制数据的具体含义如下:

* 1    - (G) 这里为 1，表示段的实际长度是 `0xfffff * 4kb ` = `4GB`
* 1    - (D) 表示这个段是一个32位段
* 0    - (L) 这个代码段没有运行在 long mode
* 0    - (AVL) Linux 没有使用
* 0000 - 段长度的4个位
* 1    - (P) 段已经位于内存中
* 00   - (DPL) - 段优先级为0
* 1    - (S) 说明这个段是一个代码或者数据段
* 101  - 段类型为可执行/可读
* 1    - 段可访问

关于段描述符的更详细的信息你可以从上一章中获得 [上一章](linux-bootstrap-2.md)，你也可以阅读 [Intel® 64 and IA-32 Architectures Software Developer's Manuals 3A](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html)获取全部信息。

在定义了数组之后，代码将获取 GDT 的长度：

```C
gdt.len = sizeof(boot_gdt)-1;
```

接下来是将 GDT 的地址放入 gdt.ptr 中：

```C
gdt.ptr = (u32)&boot_gdt + (ds() << 4);
```

这里的地址计算很简单，因为我们还在实模式，所以就是 （ ds << 4 + 数组起始地址）。

最后通过执行 `lgdtl` 指令将 GDT 信息写入 GDTR 寄存器：

```C
asm volatile("lgdtl %0" : : "m" (gdt));
```

切换进入保护模式
--------------------------------------------------------------------------------

`go_to_protected_mode` 函数在完成 IDT, GDT 初始化，并禁止了 NMI 中断之后，将调用 `protected_mode_jump` 函数完成从实模式到保护模式的跳转：

```C
protected_mode_jump(boot_params.hdr.code32_start, (u32)&boot_params + (ds() << 4));
```

`protected_mode_jump` 函数定义在 [arch/x86/boot/pmjump.S](http://lxr.free-electrons.com/source/arch/x86/boot/pmjump.S?v=3.18#L26)，它接受下面2个参数:

* 保护模式代码的入口
* `boot_params` 结构的地址

第一个参数保存在 `eax` 寄存器，而第二个参数保存在 `edx` 寄存器。

代码首先在 `boot_params` 地址放入 `esi` 寄存器，然后将 `cs` 寄存器内容放入 `bx` 寄存器，接着执行 `bx << 4 + 标号为2的代码的地址`，这样一来 `bx` 寄存器就包含了标号为2的代码的地址。接下来代码将把数据段索引放入 `cx` 寄存器，将  TSS 段索引放入 `di` 寄存器：

```assembly
movw	$__BOOT_DS, %cx
movw	$__BOOT_TSS, %di
```

就像前面我们看到的 `GDT_ENTRY_BOOT_CS` 的值为2，每个段描述符都是 8 字节，所以 `cx` 寄存器的值将是 `2*8 = 16`，`di` 寄存器的值将是 `4*8 =32`。

接下来，我们通过设置 `CR0` 寄存器相应的位使 CPU 进入保护模式：

```assembly
movl	%cr0, %edx
orb	$X86_CR0_PE, %dl
movl	%edx, %cr0
```

在进入保护模式之后，通过一个长跳转进入 32 位代码：

```assembly
	.byte	0x66, 0xea
2:	.long	in_pm32
	.word	__BOOT_CS ;(GDT_ENTRY_BOOT_CS*8) = 16，段描述符表索引
```

这段代码中
* `0x66` 操作符前缀允许我们混合执行 16 位和 32 位代码
* `0xea` - 跳转指令的操作符
* `in_pm32` 跳转地址偏移
* `__BOOT_CS` 代码段描述符索引

在执行了这个跳转命令之后，我们就在保护模式下执行代码了：

```assembly
.code32
.section ".text32","ax"
```

保护模式代码的第一步就是重置所有的段寄存器（除了 `CS` 寄存器）:

```assembly
GLOBAL(in_pm32)
movl	%ecx, %ds
movl	%ecx, %es
movl	%ecx, %fs
movl	%ecx, %gs
movl	%ecx, %ss
```

还记得我们在实模式代码中将 `$__BOOT_DS` （数据段描述符索引）放入了 `cx` 寄存器，所以上面的代码设置所有段寄存器（除了 `CS` 寄存器）指向数据段。接下来代码将所有的通用寄存器清 0 ：

```assembly
xorl	%ecx, %ecx
xorl	%edx, %edx
xorl	%ebx, %ebx
xorl	%ebp, %ebp
xorl	%edi, %edi
```

最后使用长跳转跳入正在的 32 位代码（通过参数传入的地址）

```
jmpl	*%eax ;?jmpl cs:eax?
```

到这里，我们就进入了保护模式开始执行代码了，下一章我们将分析这段 32 位代码到底做了些什么。

结论
--------------------------------------------------------------------------------

这章到这里就结束了，在下一章中我们将具体介绍这章最后跳转到的 32 位代码，并且了解系统是如何进入  [long mode](http://en.wikipedia.org/wiki/Long_mode)的。

如果你有任何的问题或者建议，你可以留言，也可以直接发消息给我[twitter](https://twitter.com/0xAX).

**如果你发现文中描述有任何问题，请提交一个 PR 到 [linux-insides-zh](https://github.com/hust-open-atom-club/linux-insides-zh) 。**

链接
--------------------------------------------------------------------------------

* [VGA](http://en.wikipedia.org/wiki/Video_Graphics_Array)
* [VESA BIOS Extensions](http://en.wikipedia.org/wiki/VESA_BIOS_Extensions)
* [Data structure alignment](http://en.wikipedia.org/wiki/Data_structure_alignment)
* [Non-maskable interrupt](http://en.wikipedia.org/wiki/Non-maskable_interrupt)
* [A20](http://en.wikipedia.org/wiki/A20_line)
* [GCC designated inits](https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Designated-Inits.html)
* [GCC type attributes](https://gcc.gnu.org/onlinedocs/gcc/Type-Attributes.html)
* [Previous part](linux-bootstrap-2.md)



================================================
FILE: Booting/linux-bootstrap-4.md
================================================
内核引导过程. Part 4.
================================================================================

切换到64位模式
--------------------------------------------------------------------------------

这是 `内核引导过程` 的第四部分，我们将会看到在[保护模式](https://zh.wikipedia.org/wiki/%E4%BF%9D%E8%AD%B7%E6%A8%A1%E5%BC%8F)中的最初几步，比如确认CPU是否支持[长模式](https://zh.wikipedia.org/wiki/%E9%95%BF%E6%A8%A1%E5%BC%8F)，[SSE](https://zh.wikipedia.org/wiki/SSE)和[分页](https://zh.wikipedia.org/wiki/%E5%88%86%E9%A0%81)以及页表的初始化，在这部分的最后我们还将讨论如何切换到[长模式](https://zh.wikipedia.org/wiki/%E9%95%BF%E6%A8%A1%E5%BC%8F)。

**注意：这部分将会有大量的汇编代码，如果你不熟悉汇编，建议你找本书参考一下。**

在[前一章节](linux-bootstrap-3.md)，我们停在了跳转到位于 [arch/x86/boot/pmjump.S](http://lxr.free-electrons.com/source/arch/x86/boot/pmjump.S?v=3.18) 的 32 位入口点这一步：

```assembly
jmpl	*%eax
```

回忆一下， `eax` 寄存器包含了 32 位入口点的地址。我们可以在 [x86 linux 内核引导协议](https://www.kernel.org/doc/Documentation/x86/boot.txt) 中找到相关内容：

```
When using bzImage, the protected-mode kernel was relocated to 0x100000
```

```
当使用 bzImage 时，保护模式下的内核被重定位至 0x100000
```


让我们检查一下 32 位入口点的寄存器值来确保这是对的：

```
eax            0x100000	1048576
ecx            0x0	    0
edx            0x0	    0
ebx            0x0	    0
esp            0x1ff5c	0x1ff5c
ebp            0x0	    0x0
esi            0x14470	83056
edi            0x0	    0
eip            0x100000	0x100000
eflags         0x46	    [ PF ZF ]
cs             0x10	16
ss             0x18	24
ds             0x18	24
es             0x18	24
fs             0x18	24
gs             0x18	24
```

我们在这里可以看到 `cs` 寄存器包含了 - `0x10` （回忆前一章节，这代表了全局描述符表中的第二个索引项）， `eip` 寄存器的值是 `0x100000`，并且包括代码段在内的所有内存段的基地址都为0。所以我们可以得到物理地址： `0:0x100000` 或者 `0x100000`，这和协议规定的一样。现在让我们从 32 位入口点开始。

32 位入口点
--------------------------------------------------------------------------------

我们可以在汇编源码 [arch/x86/boot/compressed/head_64.S](http://lxr.free-electrons.com/source/arch/x86/boot/compressed/head_64.S?v=3.18) 中找到 32 位入口点的定义。

```assembly
	__HEAD
	.code32
ENTRY(startup_32)
....
....
....
ENDPROC(startup_32)
```

首先，为什么目录名叫做 `被压缩的 (compressed)` ？实际上 `bzimage` 是由 `vmlinux + 头文件 + 内核启动代码` 被 gzip 压缩之后获得的。我们在前几个章节已经看到了启动内核的代码。所以， `head_64.S` 的主要目的就是做好进入长模式的准备之后进入长模式，进入以后再解压内核。在这一章节，我们将会看到直到内核解压缩之前的所有步骤。

在 `arch/x86/boot/compressed` 目录下有两个文件：

* [head_32.S](http://lxr.free-electrons.com/source/arch/x86/boot/compressed/head_32.S?v=3.18)
* [head_64.S](http://lxr.free-electrons.com/source/arch/x86/boot/compressed/head_64.S?v=3.18)

但是，你可能还记得我们这本书只和 `x86_64` 有关，所以我们只会关注 `head_64.S` ；在我们这里 `head_32.S` 没有被用到。让我们看一下 [arch/x86/boot/compressed/Makefile](http://lxr.free-electrons.com/source/arch/x86/boot/compressed/Makefile?v=3.18)。在那里我们可以看到以下目标：

```Makefile
vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/head_$(BITS).o $(obj)/misc.o \
	$(obj)/string.o $(obj)/cmdline.o \
	$(obj)/piggy.o $(obj)/cpuflags.o
```

注意 `$(obj)/head_$(BITS).o` 。这意味着我们将会选择基于 `$(BITS)` 所设置的文件执行链接操作，即 head_32.o 或者 head_64.o。`$(BITS)` 在 [arch/x86/Makefile](http://lxr.free-electrons.com/source/arch/x86/Makefile?v=3.18) 之中根据 .config 文件另外定义：

```Makefile
ifeq ($(CONFIG_X86_32),y)
        BITS := 32
        ...
		...
else
        BITS := 64
		...
		...
endif
```

现在我们知道从哪里开始了，那就来吧。

必要时重新加载内存段寄存器
--------------------------------------------------------------------------------

正如上面阐述的，我们先从 [arch/x86/boot/compressed/head_64.S](http://lxr.free-electrons.com/source/arch/x86/boot/compressed/head_64.S?v=3.18) 这个汇编文件开始。首先我们看到了在 `startup_32` 之前的特殊段属性定义：

```assembly
    __HEAD
	.code32
ENTRY(startup_32)
```

这个 `__HEAD` 是一个定义在头文件 [include/linux/init.h](http://lxr.free-electrons.com/source/include/linux/init.h?v=3.18) 中的宏，展开后就是下面这个段的定义：

```C
#define __HEAD		.section	".head.text","ax"
```

其拥有 `.head.text` 的命名和 `ax` 标记。在这里，这些标记告诉我们这个段是[可执行的](https://en.wikipedia.org/wiki/Executable)或者换种说法，包含了代码。我们可以在 [arch/x86/boot/compressed/vmlinux.lds.S](http://lxr.free-electrons.com/source/arch/x86/boot/compressed/vmlinux.lds.S?v=3.18) 这个链接脚本里找到这个段的定义： 

```
SECTIONS
{
	. = 0;
	.head.text : {
		_head = . ;
		HEAD_TEXT
		_ehead = . ;
	}
```

如果你不熟悉 `GNU LD` 这个链接脚本语言的语法，你可以在[这个文档](https://sourceware.org/binutils/docs/ld/Scripts.html#Scripts)中找到更多信息。简单来说，这个 `.` 符号是一个链接器的特殊变量 - 位置计数器。其被赋值为相对于该段的偏移。在这里，我们将位置计数器赋值为0，这意味着我们的代码被链接到内存的 `0` 偏移处。此外，我们可以从注释里找到更多信息：

```
Be careful parts of head_64.S assume startup_32 is at address 0.
```

```
要小心， head_64.S 中一些部分假设 startup_32 位于地址 0。
```

好了，现在我们知道我们在哪里了，接下来就是深入 `startup_32` 函数的最佳时机。

在 `startup_32` 函数的开始，我们可以看到 `cld` 指令将[标志寄存器](http://baike.baidu.com/view/1845107.htm)的 `DF` （方向标志）位清空。当方向标志被清空，所有的串操作指令像[stos](https://x86.hust.openatom.club/html/file_module_x86_id_306.html)， [scas](https://x86.hust.openatom.club/html/file_module_x86_id_287.html)等等将会增加索引寄存器 `esi` 或者 `edi` 的值。我们需要清空方向标志是因为接下来我们会使用汇编的串操作指令来做为页表腾出空间等工作。

在我们清空 `DF` 标志后，下一步就是从内核加载头中的 `loadflags` 字段来检查 `KEEP_SEGMENTS` 标志。你是否还记得在本书的[最初一节](linux-bootstrap-1.md)，我们已经看到过 `loadflags` 。在那里我们检查了 `CAN_USE_HEAP` 标记以使用堆。现在我们需要检查 `KEEP_SEGMENTS` 标记。这些标记在 linux 的[引导协议](https://www.kernel.org/doc/Documentation/x86/boot.txt)文档中有描述：

```
Bit 6 (write): KEEP_SEGMENTS
  Protocol: 2.07+
  - If 0, reload the segment registers in the 32bit entry point.
  - If 1, do not reload the segment registers in the 32bit entry point.
    Assume that %cs %ds %ss %es are all set to flat segments with
	a base of 0 (or the equivalent for their environment).
```

```
第 6 位 (写): KEEP_SEGMENTS
  协议版本: 2.07+
  - 为0，在32位入口点重载段寄存器
  - 为1，不在32位入口点重载段寄存器。假设 %cs %ds %ss %es 都被设到基地址为0的普通段中（或者在他们的环境中等价的位置）。
```

所以，如果 `KEEP_SEGMENTS` 位在 `loadflags` 中没有被设置，我们需要重置 `ds` , `ss` 和 `es` 段寄存器到一个基地址为 `0` 的普通段中。如下：

```C
	testb $(1 << 6), BP_loadflags(%esi)
	jnz 1f

	cli
	movl	$(__BOOT_DS), %eax
	movl	%eax, %ds
	movl	%eax, %es
	movl	%eax, %ss
```

记住 `__BOOT_DS` 是 `0x18` （位于[全局描述符表](https://en.wikipedia.org/wiki/Global_Descriptor_Table)中数据段的索引）。如果设置了 `KEEP_SEGMENTS` ，我们就跳转到最近的 `1f` 标签，或者当没有 `1f` 标签，则用 `__BOOT_DS` 更新段寄存器。这非常简单，但是这是一个有趣的操作。如果你已经读了[前一章节](linux-bootstrap-3.md)，你或许还记得我们在 [arch/x86/boot/pmjump.S](http://lxr.free-electrons.com/source/arch/x86/boot/pmjump.S?v=3.18) 中切换到[保护模式](https://zh.wikipedia.org/wiki/%E4%BF%9D%E8%AD%B7%E6%A8%A1%E5%BC%8F)的时候已经更新了这些段寄存器。那么为什么我们还要去关心这些段寄存器的值呢？答案很简单，Linux 内核也有32位的引导协议，如果一个引导程序之前使用32位协议引导内核，那么在 `startup_32` 之前的代码就会被忽略。在这种情况下 `startup_32` 将会变成引导程序之后的第一个入口点，不保证段寄存器会不会处于未知状态。

在我们检查了 `KEEP_SEGMENTS` 标记并且给段寄存器设置了正确的值之后，下一步就是计算我们代码的加载和编译运行之间的位置偏差了。记住 `setup.ld.S` 包含了以下定义：在 `.head.text` 段的开始 `. = 0` 。这意味着这一段代码被编译成从 `0` 地址运行。我们可以在 `objdump` 工具的输出中看到：

```
arch/x86/boot/compressed/vmlinux:     file format elf64-x86-64


Disassembly of section .head.text:

0000000000000000 <startup_32>:
   0:   fc                      cld
   1:   f6 86 11 02 00 00 40    testb  $0x40,0x211(%rsi)
```

 `objdump` 工具告诉我们 `startup_32` 的地址是 `0` 。但实际上并不是。我们当前的目标是获知我们实际上在哪里。在[长模式](https://zh.wikipedia.org/wiki/%E9%95%BF%E6%A8%A1%E5%BC%8F)下，这非常简单，因为其支持 `rip` 相对寻址，但是我们当前处于[保护模式](https://zh.wikipedia.org/wiki/%E4%BF%9D%E8%AD%B7%E6%A8%A1%E5%BC%8F)下。我们将会使用一个常用的方法来确定 `startup_32` 的地址。我们需要定义一个标签并且跳转到它，然后把栈顶抛出到一个寄存器中：

```assembly
call label
label: pop %reg
```

在这之后，那个寄存器将会包含标签的地址，让我们看看在 Linux 内核中类似的寻找 `startup_32` 地址的代码：

```assembly
	leal	(BP_scratch+4)(%esi), %esp
	call	1f
1:  popl	%ebp
	subl	$1b, %ebp
```

回忆前一节， `esi` 寄存器包含了 [boot_params](http://lxr.free-electrons.com/source/arch/x86/include/uapi/asm/bootparam.h?v=3.18#L113) 结构的地址，这个结构在我们切换到保护模式之前已经被填充了。`bootparams` 这个结构体包含了一个特殊的字段 `scratch` ，其偏移量为 `0x1e4` 。这个 4 字节的区域将会成为 `call` 指令的临时栈。我们把 `scratch` 的地址加 4 存入 `esp` 寄存器。我们之所以在 `BP_scratch` 基础上加 `4` 是因为，如之前所说的，这将成为一个临时的栈，而在 `x86_64` 架构下，栈是自顶向下生长的。所以我们的栈指针就会指向栈顶。接下来我们就可以看到我上面描述的过程。我们跳转到 `1f` 标签并且把该标签的地址放入 `ebp` 寄存器，因为在执行 `call` 指令之后我们把返回地址放到了栈顶。那么，目前我们拥有 `1f` 标签的地址，也能够很容易得到 `startup_32` 的地址。我们只需要把我们从栈里得到的地址减去标签的地址：

```
startup_32 (0x0)     +-----------------------+
                     |                       |
                     |                       |
                     |                       |
                     |                       |
                     |                       |
                     |                       |
                     |                       |
                     |                       |
1f (0x0 + 1f offset) +-----------------------+ %ebp - 实际物理地址
                     |                       |
                     |                       |
                     +-----------------------+
```

 `startup_32` 被链接为在 `0x0` 地址运行，这意味着 `1f` 的地址为 `0x0 + 1f 的偏移量` 。实际上偏移量大概是 `0x22` 字节。 `ebp` 寄存器包含了 `1f` 标签的实际物理地址。所以如果我们从 `ebp` 中减去 `1f` ，我们就会得到 `startup_32` 的实际物理地址。Linux 内核的[引导协议](https://www.kernel.org/doc/Documentation/x86/boot.txt)描述了保护模式下的内核基地址是 `0x100000` 。我们可以用 [gdb](https://zh.wikipedia.org/wiki/GNU%E4%BE%A6%E9%94%99%E5%99%A8) 来验证。让我们启动调试器并且在 `1f` 的地址 `0x100022` 添加断点。如果这是正确的，我们将会看到在 `ebp` 寄存器中值为 `0x100022` ：

```
$ gdb
(gdb)$ target remote :1234
Remote debugging using :1234
0x0000fff0 in ?? ()
(gdb)$ br *0x100022
Breakpoint 1 at 0x100022
(gdb)$ c
Continuing.

Breakpoint 1, 0x00100022 in ?? ()
(gdb)$ i r
eax            0x18	0x18
ecx            0x0	0x0
edx            0x0	0x0
ebx            0x0	0x0
esp            0x144a8	0x144a8
ebp            0x100021	0x100021
esi            0x142c0	0x142c0
edi            0x0	0x0
eip            0x100022	0x100022
eflags         0x46	[ PF ZF ]
cs             0x10	0x10
ss             0x18	0x18
ds             0x18	0x18
es             0x18	0x18
fs             0x18	0x18
gs             0x18	0x18
```

如果我们执行下一条指令 `subl	$1b, %ebp` ，我们将会看到：

```
nexti
...
ebp            0x100000	0x100000
...
```

好了，那是对的。`startup_32` 的地址是 `0x100000` 。在我们知道了 `startup_32` 的地址之后，我们可以开始准备切换到[长模式](https://zh.wikipedia.org/wiki/%E9%95%BF%E6%A8%A1%E5%BC%8F)了。我们的下一个目标是建立栈并且确认 CPU 对长模式和 [SSE](https://zh.wikipedia.org/wiki/SSE) 的支持。

栈的建立和 CPU 的确认
--------------------------------------------------------------------------------

如果不知道 `startup_32` 标签的地址，我们就无法建立栈。我们可以把栈看作是一个数组，并且栈指针寄存器 `esp` 必须指向数组的底部。当然我们可以在自己的代码里定义一个数组，但是我们需要知道其真实地址来正确配置栈指针。让我们看一下代码：

```assembly
	movl	$boot_stack_end, %eax
	addl	%ebp, %eax
	movl	%eax, %esp
```

 `boots_stack_end` 标签被定义在同一个汇编文件 [arch/x86/boot/compressed/head_64.S](http://lxr.free-electrons.com/source/arch/x86/boot/compressed/head_64.S?v=3.18) 中，位于 [.bss](https://en.wikipedia.org/wiki/.bss) 段：

```assembly
	.bss
	.balign 4
boot_heap:
	.fill BOOT_HEAP_SIZE, 1, 0
boot_stack:
	.fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:
```

首先，我们把 `boot_stack_end` 放到 `eax` 寄存器中。那么 `eax` 寄存器将包含 `boot_stack_end` 链接后的地址或者说 `0x0 + boot_stack_end` 。为了得到 `boot_stack_end` 的实际地址，我们需要加上 `startup_32` 的实际地址。回忆一下，前面我们找到了这个地址并且把它存到了 `ebp` 寄存器中。最后，`eax` 寄存器将会包含 `boot_stack_end` 的实际地址，我们只需要将其加到栈指针上。

在外面建立了栈之后，下一步是 CPU 的确认。既然我们将要切换到 `长模式` ，我们需要检查 CPU 是否支持 `长模式` 和 `SSE`。我们将会在跳转到 `verify_cpu` 函数之后执行：

```assembly
	call	verify_cpu
	testl	%eax, %eax
	jnz	no_longmode
```

这个函数定义在 [arch/x86/kernel/verify_cpu.S](http://lxr.free-electrons.com/source/arch/x86/kernel/verify_cpu.S?v=3.18) 中，只是包含了几个对 [cpuid](https://en.wikipedia.org/wiki/CPUID) 指令的调用。该指令用于获取处理器的信息。在我们的情况下，它检查了对 `长模式` 和 `SSE` 的支持，通过 `eax` 寄存器返回0表示成功，1表示失败。

如果 `eax` 的值不是 0 ，我们就跳转到 `no_longmode` 标签，用 `hlt` 指令停止 CPU ，期间不会发生硬件中断：

```assembly
no_longmode:
1:
	hlt
	jmp     1b
```

如果 `eax` 的值为0，万事大吉，我们可以继续。

计算重定位地址
--------------------------------------------------------------------------------

下一步是在必要的时候计算解压缩之后的地址。首先，我们需要知道内核重定位的意义。我们已经知道 Linux 内核的32位入口点地址位于 `0x100000` 。但是那是一个32位的入口。默认的内核基地址由内核配置项 `CONFIG_PHYSICAL_START` 的值所确定，其默认值为 `0x1000000` 或 `16 MB` 。这里的主要问题是如果内核崩溃了，内核开发者需要一个配置于不同地址加载的 `救援内核` 来进行 [kdump](https://www.kernel.org/doc/Documentation/kdump/kdump.txt)。Linux 内核提供了特殊的配置选项以解决此问题 - `CONFIG_RELOCATABLE` 。我们可以在内核文档中找到：

```
This builds a kernel image that retains relocation information
so it can be loaded someplace besides the default 1MB.

Note: If CONFIG_RELOCATABLE=y, then the kernel runs from the address
it has been loaded at and the compile time physical address
(CONFIG_PHYSICAL_START) is used as the minimum location.
```

```
这建立了一个保留了重定向信息的内核镜像，这样就可以在默认的 1MB 位置之外加载了。

注意：如果 CONFIG_RELOCATABLE=y， 那么 内核将会从其被加载的位置运行，编译时的物理地址 (CONFIG_PHYSICAL_START) 将会被作为最低地址位置的限制。
```

简单来说，这意味着相同配置下的 Linux 内核可以从不同地址被启动。这是通过将程序以 [位置无关代码](https://zh.wikipedia.org/wiki/%E5%9C%B0%E5%9D%80%E6%97%A0%E5%85%B3%E4%BB%A3%E7%A0%81) 的形式编译来达到的。如果我们参考 [/arch/x86/boot/compressed/Makefile](http://lxr.free-electrons.com/source/arch/x86/boot/compressed/Makefile?v=3.18)，我们将会看到解压器的确是用 `-fPIC` 标记编译的：

```Makefile
KBUILD_CFLAGS += -fno-strict-aliasing -fPIC
```

当我们使用位置无关代码时，一段代码的地址是由一个控制地址加上程序计数器计算得到的。我们可以从任意一个地址加载使用这种方式寻址的代码。这就是为什么我们需要获得 `startup_32` 的实际地址。现在让我们回到 Linux 内核代码。我们目前的目标是计算出内核解压的地址。这个地址的计算取决于内核配置项 `CONFIG_RELOCATABLE` 。让我们看代码：

```assembly
#ifdef CONFIG_RELOCATABLE
	movl	%ebp, %ebx
	movl	BP_kernel_alignment(%esi), %eax
	decl	%eax
	addl	%eax, %ebx
	notl	%eax
	andl	%eax, %ebx
	cmpl	$LOAD_PHYSICAL_ADDR, %ebx
	jge	1f
#endif
	movl	$LOAD_PHYSICAL_ADDR, %ebx
1:
	addl	$z_extract_offset, %ebx
```

记住 `ebp` 寄存器的值就是 `startup_32` 标签的物理地址。如果在内核配置中 `CONFIG_RELOCATABLE` 内核配置项开启，我们就把这个地址放到 `ebx` 寄存器中，对齐到 `2M` 的整数倍 ，然后和 `LOAD_PHYSICAL_ADDR` 的值比较。 `LOAD_PHYSICAL_ADDR` 宏定义在头文件 [arch/x86/include/asm/boot.h](http://lxr.free-electrons.com/source/arch/x86/include/asm/boot.h?v=3.18) 中，如下：

```C
#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
				+ (CONFIG_PHYSICAL_ALIGN - 1)) \
				& ~(CONFIG_PHYSICAL_ALIGN - 1))
```

我们可以看到该宏只是展开成对齐的 `CONFIG_PHYSICAL_ALIGN` 值，其表示了内核加载位置的物理地址。在比较了 `LOAD_PHYSICAL_ADDR` 和 `ebx` 的值之后，我们给 `startup_32` 加上偏移来获得解压内核镜像的地址。如果 `CONFIG_RELOCATABLE` 选项在内核配置时没有开启，我们就直接将默认的地址加上 `z_extract_offset` 。

在前面的操作之后，`ebp` 包含了我们加载时的地址，`ebx` 被设为内核解压缩的目标地址。

进入长模式前的准备工作
--------------------------------------------------------------------------------

在我们得到了重定位内核镜像的基地址之后，我们需要做切换到64位模式之前的最后准备。首先，我们需要更新[全局描述符表](https://en.wikipedia.org/wiki/Global_Descriptor_Table)：

```assembly
	leal	gdt(%ebp), %eax
	movl	%eax, gdt+2(%ebp)
	lgdt	gdt(%ebp)
```

在这里我们把 `ebp` 寄存器加上 `gdt` 的偏移存到 `eax` 寄存器。接下来我们把这个地址放到 `ebp` 加上 `gdt+2` 偏移的位置上，并且用 `lgdt` 指令载入 `全局描述符表` 。为了理解这个神奇的 `gdt` 偏移量，我们需要关注 `全局描述符表` 的定义。我们可以在同一个[源文件](http://lxr.free-electrons.com/source/arch/x86/boot/compressed/head_64.S?v=3.18)中找到其定义：

```assembly
	.data
gdt:
	.word	gdt_end - gdt
	.long	gdt
	.word	0
	.quad	0x0000000000000000	/* NULL descriptor */
	.quad	0x00af9a000000ffff	/* __KERNEL_CS */
	.quad	0x00cf92000000ffff	/* __KERNEL_DS */
	.quad	0x0080890000000000	/* TS descriptor */
	.quad   0x0000000000000000	/* TS continued */
gdt_end:
```

我们可以看到其位于 `.data` 段，并且包含了5个描述符： `null` 、内核代码段、内核数据段和其他两个任务描述符。我们已经在[上一章节](linux-bootstrap-3.md)载入了 `全局描述符表` ，和我们现在做的差不多，但是将描述符改为 `CS.L = 1` `CS.D = 0` 从而在 `64` 位模式下执行。我们可以看到， `gdt` 的定义从两个字节开始： `gdt_end - gdt` ，代表了 `gdt` 表的最后一个字节，或者说表的范围。接下来的4个字节包含了 `gdt` 的基地址。记住 `全局描述符表` 保存在 `48位 GDTR-全局描述符表寄存器` 中，由两个部分组成：

* 全局描述符表的大小 (16位）
* 全局描述符表的基址 (32位)

所以，我们把 `gdt` 的地址放到 `eax` 寄存器，然后存到 `.long	gdt` 或者 `gdt+2`。现在我们已经建立了 `GDTR` 寄存器的结构，并且可以用 `lgdt` 指令载入 `全局描述符表` 了。

在我们载入 `全局描述符表` 之后，我们必须启用 [PAE](http://en.wikipedia.org/wiki/Physical_Address_Extension) 模式。方法是将 `cr4` 寄存器的值传入 `eax` ，将第5位置1，然后再写回 `cr4` 。

```assembly
	movl	%cr4, %eax
	orl	$X86_CR4_PAE, %eax
	movl	%eax, %cr4
```

现在我们已经接近完成进入64位模式前的所有准备工作了。最后一步是建立页表，但是在此之前，这里有一些关于长模式的知识。

长模式
--------------------------------------------------------------------------------

[长模式](https://zh.wikipedia.org/wiki/%E9%95%BF%E6%A8%A1%E5%BC%8F)是 [x86_64](https://en.wikipedia.org/wiki/X86-64) 系列处理器的原生模式。首先让我们看一看 `x86_64` 和 `x86` 的一些区别。

 `64位` 模式提供了一些新特性，比如：

* 从 `r8` 到 `r15` 8个新的通用寄存器，并且所有通用寄存器都是64位的了。
* 64位指令指针 - `RIP` ;
* 新的操作模式 - 长模式;
* 64位地址和操作数;
* RIP 相对寻址 (我们将会在接下来的章节看到一个例子).

长模式是一个传统保护模式的扩展，其由两个子模式构成：

* 64位模式
* 兼容模式

为了切换到 `64位` 模式，我们需要完成以下操作：

* 启用 [PAE](https://en.wikipedia.org/wiki/Physical_Address_Extension);
* 建立页表并且将顶级页表的地址放入 `cr3` 寄存器;
* 启用 `EFER.LME` ;
* 启用分页;


我们已经通过设置 `cr4` 控制寄存器中的 `PAE` 位启动 `PAE` 了。在下一个段落，我们就要建立[页表](https://zh.wikipedia.org/wiki/%E5%88%86%E9%A0%81)的结构了。

初期页表初始化
--------------------------------------------------------------------------------

现在，我们已经知道了在进入 `64位` 模式之前，我们需要先建立页表，那么就让我们看看如何建立初期的 `4G` 启动页表。

**注意：我不会在这里解释虚拟内存的理论，如果你想知道更多，查看本节最后的链接**

Linux 内核使用 `4级` 页表，通常我们会建立6个页表：

* 1 个 `PML4` 或称为 `4级页映射` 表，包含 1 个项；
* 1 个 `PDP` 或称为 `页目录指针` 表，包含 4 个项；
* 4 个 页目录表，一共包含 `2048` 个项；

让我们看看其实现方式。首先我们在内存中为页表清理一块缓存。每个表都是 `4096` 字节，所以我们需要 `24` KB 的空间：

```assembly
	leal	pgtable(%ebx), %edi
	xorl	%eax, %eax
	movl	$((4096*6)/4), %ecx
	rep	stosl
```

我们把和 `ebx` 相关的 `pgtable` 的地址放到 `edi` 寄存器中，清空 `eax` 寄存器，并将 `ecx` 赋值为 `6144` 。 `rep stosl` 指令将会把 `eax` 的值写到 `edi` 指向的地址，然后给 `edi` 加 4 ， `ecx` 减 4 ，重复直到 `ecx` 小于等于 0 。所以我们才把 `6144` 赋值给 `ecx` 。

 `pgtable` 定义在 [arch/x86/boot/compressed/head_64.S](http://lxr.free-electrons.com/source/arch/x86/boot/compressed/head_64.S?v=3.18) 的最后：

```assembly
	.section ".pgtable","a",@nobits
	.balign 4096
pgtable:
	.fill 6*4096, 1, 0
```

我们可以看到，其位于 `.pgtable` 段，大小为 `24KB` 。

在我们为 `pgtable` 分配了空间之后，我们可以开始构建顶级页表 - `PML4` ：

```assembly
	leal	pgtable + 0(%ebx), %edi
	leal	0x1007 (%edi), %eax
	movl	%eax, 0(%edi)
```

还是在这里，我们把和 `ebx` 相关的，或者说和 `startup_32` 相关的 `pgtable` 的地址放到 `edi` 寄存器。接下来我们把相对此地址偏移 `0x1007` 的地址放到 `eax` 寄存器中。 `0x1007` 是 `PML4` 的大小 `4096` 加上 `7` 。这里的 `7` 代表了 `PML4` 的项标记。在我们这里，这些标记是 `PRESENT+RW+USER` 。在最后我们把第一个 `PDP（页目录指针）` 项的地址写到 `PML4` 中。

在接下来的一步，我们将会在 `页目录指针（PDP）` 表（3级页表）建立 4 个带有 `PRESENT+RW+USE` 标记的 `Page Directory （2级页表）` 项：

```assembly
	leal	pgtable + 0x1000(%ebx), %edi
	leal	0x1007(%edi), %eax
	movl	$4, %ecx
1:  movl	%eax, 0x00(%edi)
	addl	$0x00001000, %eax
	addl	$8, %edi
	decl	%ecx
	jnz	1b
```

我们把 3 级页目录指针表的基地址（从 `pgtable` 表偏移 `4096` 或者 `0x1000` ）放到 `edi` ，把第一个 2 级页目录指针表的首项的地址放到 `eax` 寄存器。把 `4` 赋值给 `ecx` 寄存器，其将会作为接下来循环的计数器，然后将第一个页目录指针项写到 `edi` 指向的地址。之后， `edi` 将会包含带有标记 `0x7` 的第一个页目录指针项的地址。接下来我们就计算后面的几个页目录指针项的地址，每个占 8 字节，把地址赋值给 `eax` ，然后回到循环开头将其写入 `edi` 所在地址。建立页表结构的最后一步就是建立 `2048` 个 `2MB` 页的页表项。

```assembly
	leal	pgtable + 0x2000(%ebx), %edi
	movl	$0x00000183, %eax
	movl	$2048, %ecx
1:  movl	%eax, 0(%edi)
	addl	$0x00200000, %eax
	addl	$8, %edi
	decl	%ecx
	jnz	1b
```

在这里我们做的几乎和上面一样，所有的表项都带着标记 - `$0x00000183` - `PRESENT + WRITE + MBZ` 。最后我们将会拥有 `2048` 个 `2MB` 大的页，或者说：

```python
>>> 2048 * 0x00200000
4294967296
```

一个 `4G` 页表。我们刚刚完成我们的初期页表结构，其映射了 `4G` 大小的内存，现在我们可以把高级页表 `PML4` 的地址放到 `cr3` 寄存器中了：

```assembly
	leal	pgtable(%ebx), %eax
	movl	%eax, %cr3
```

这样就全部结束了。所有的准备工作都已经完成，我们可以开始看如何切换到长模式了。

切换到长模式
--------------------------------------------------------------------------------

首先我们需要设置 [MSR](http://en.wikipedia.org/wiki/Model-specific_register) 中的 `EFER.LME` 标记为 `0xC0000080` ：

```assembly
	movl	$MSR_EFER, %ecx
	rdmsr
	btsl	$_EFER_LME, %eax
	wrmsr
```

在这里我们把 `MSR_EFER` 标记（在 [arch/x86/include/uapi/asm/msr-index.h](http://lxr.free-electrons.com/source/arch/x86/include/uapi/asm/msr-index.h?v=3.18#L7) 中定义）放到 `ecx` 寄存器中，然后调用 `rdmsr` 指令读取 [MSR](http://en.wikipedia.org/wiki/Model-specific_register) 寄存器。在 `rdmsr` 执行之后，我们将会获得 `edx:eax` 中的结果值，其取决于 `ecx` 的值。我们通过 `btsl` 指令检查 `EFER_LME` 位，并且通过 `wrmsr` 指令将 `eax` 的数据写入 `MSR` 寄存器。

下一步我们将内核段代码地址入栈（我们在 GDT 中定义了），然后将 `startup_64` 的地址导入 `eax` 。

```assembly
	pushl	$__KERNEL_CS
	leal	startup_64(%ebp), %eax
```

在这之后我们把这个地址入栈然后通过设置 `cr0` 寄存器中的 `PG` 和 `PE` 启用分页：

```assembly
	movl	$(X86_CR0_PG | X86_CR0_PE), %eax
	movl	%eax, %cr0
```

然后执行：

```assembly
lret
```

指令。记住前一步我们已经将 `startup_64` 函数的地址入栈，在 `lret` 指令之后，CPU 取出了其地址跳转到那里。

这些步骤之后我们最后来到了64位模式：

```assembly
	.code64
	.org 0x200
ENTRY(startup_64)
....
....
....
```


就是这样！

总结
--------------------------------------------------------------------------------

这是 linux 内核启动流程的第4部分。如果你有任何的问题或者建议，你可以留言，也可以直接发消息给我 [twitter](https://twitter.com/0xAX) 或者创建一个 [issue](https://github.com/0xAX/linux-insides/issues/new)。

下一节我们将会看到内核解压缩流程和其他更多。

**如果你发现文中描述有任何问题，请提交一个 PR 到 [linux-insides-zh](https://github.com/hust-open-atom-club/linux-insides-zh) 。**

相关链接
--------------------------------------------------------------------------------

* [Protected mode](http://en.wikipedia.org/wiki/Protected_mode)
* [Intel® 64 and IA-32 Architectures Software Developer’s Manual 3A](http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html)
* [GNU linker](http://www.eecs.umich.edu/courses/eecs373/readings/Linker.pdf)
* [SSE](http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions)
* [Paging](http://en.wikipedia.org/wiki/Paging)
* [Model specific register](http://en.wikipedia.org/wiki/Model-specific_register)
* [.fill instruction](http://www.chemie.fu-berlin.de/chemnet/use/info/gas/gas_7.html)
* [Previous part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-3.md)
* [Paging on osdev.org](http://wiki.osdev.org/Paging)
* [Paging Systems](https://www.cs.rutgers.edu/~pxk/416/notes/09a-paging.html)
* [x86 Paging Tutorial](http://www.cirosantilli.com/x86-paging/)


================================================
FILE: Booting/linux-bootstrap-5.md
================================================
内核引导过程. Part 5.
================================================================================

内核解压
--------------------------------------------------------------------------------

这是`内核引导过程`系列文章的第五部分。在[前一部分](linux-bootstrap-4.md#transition-to-the-long-mode)我们看到了切换到64位模式的过程，在这一部分我们会从这里继续。我们会看到跳进内核代码的最后步骤：内核解压前的准备、重定位和直接内核解压。所以...让我们再次深入内核源码。

内核解压前的准备
--------------------------------------------------------------------------------

我们停在了跳转到`64位`入口点——`startup_64`的跳转之前，它在源文件 [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) 里面。在之前的部分，我们已经在`startup_32`里面看到了到`startup_64`的跳转：

```assembly
	pushl	$__KERNEL_CS
	leal	startup_64(%ebp), %eax
	...
	...
	...
	pushl	%eax
	...
	...
	...
	lret
```

由于我们加载了新的`全局描述符表`并且在其他模式有CPU的模式转换（在我们这里是`64位`模式），我们可以在`startup_64`的开头看到数据段的建立：

```assembly
	.code64
	.org 0x200
ENTRY(startup_64)
	xorl	%eax, %eax
	movl	%eax, %ds
	movl	%eax, %es
	movl	%eax, %ss
	movl	%eax, %fs
	movl	%eax, %gs
```

除`cs`之外的段寄存器在我们进入`长模式`时已经重置。

下一步是计算内核编译时的位置和它被加载的位置的差：

```assembly
#ifdef CONFIG_RELOCATABLE
	leaq	startup_32(%rip), %rbp
	movl	BP_kernel_alignment(%rsi), %eax
	decl	%eax
	addq	%rax, %rbp
	notq	%rax
	andq	%rax, %rbp
	cmpq	$LOAD_PHYSICAL_ADDR, %rbp
	jge	1f
#endif
	movq	$LOAD_PHYSICAL_ADDR, %rbp
1:
	movl	BP_init_size(%rsi), %ebx
	subl	$_end, %ebx
	addq	%rbp, %rbx
```

`rbp`包含了解压后内核的起始地址，在这段代码执行之后`rbx`会包含用于解压的重定位内核代码的地址。我们已经在`startup_32`看到类似的代码（你可以看之前的部分[计算重定位地址](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-4.md#calculate-relocation-address)），但是我们需要再做这个计算，因为引导加载器可以用64位引导协议，而`startup_32`在这种情况下不会执行。

下一步，我们可以看到栈指针的设置和标志寄存器的重置：

```assembly
	leaq	boot_stack_end(%rbx), %rsp

	pushq	$0
	popfq
```

如上所述，`rbx`寄存器包含了内核解压代码的起始地址，我们把这个地址的`boot_stack_entry`偏移地址相加放到表示栈顶指针的`rsp`寄存器。在这一步之后，栈就是正确的。你可以在汇编源码文件 [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) 的末尾找到`boot_stack_end`的定义：

```assembly
	.bss
	.balign 4
boot_heap:
	.fill BOOT_HEAP_SIZE, 1, 0
boot_stack:
	.fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:
```

它在`.bss`节的末尾，就在`.pgtable`前面。如果你查看 [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/vmlinux.lds.S) 链接脚本，你会找到`.bss`和`.pgtable`的定义。

由于我们设置了栈，在我们计算了解压了的内核的重定位地址后，我们可以复制压缩了的内核到以上地址。在查看细节之前，我们先看这段汇编代码：

```assembly
	pushq	%rsi
	leaq	(_bss-8)(%rip), %rsi
	leaq	(_bss-8)(%rbx), %rdi
	movq	$_bss, %rcx
	shrq	$3, %rcx
	std
	rep	movsq
	cld
	popq	%rsi
```

首先我们把`rsi`压进栈。我们需要保存`rsi`的值，因为这个寄存器现在存放指向`boot_params`的指针，这是包含引导相关数据的实模式结构体（你一定记得这个结构体，我们在开始设置内核的时候就填充了它）。在代码的结尾，我们会重新恢复指向`boot_params`的指针到`rsi`.

接下来两个`leaq`指令用`_bss - 8`偏移和`rip`和`rbx`计算有效地址并存放到`rsi`和`rdi`. 我们为什么要计算这些地址？实际上，压缩了的代码镜像存放在这份复制了的代码（从`startup_32`到当前的代码）和解压了的代码之间。你可以通过查看链接脚本 [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/vmlinux.lds.S) 验证：

```
	. = 0;
	.head.text : {
		_head = . ;
		HEAD_TEXT
		_ehead = . ;
	}
	.rodata..compressed : {
		*(.rodata..compressed)
	}
	.text :	{
		_text = .; 	/* Text */
		*(.text)
		*(.text.*)
		_etext = . ;
	}
```

注意`.head.text`节包含了`startup_32`. 你可以从之前的部分回忆起它：

```assembly
	__HEAD
	.code32
ENTRY(startup_32)
...
...
...
```

`.text`节包含解压代码：

```assembly
	.text
relocated:
...
...
...
/*
 * Do the decompression, and jump to the new kernel..
 */
...
```

`.rodata..compressed`包含了压缩了的内核镜像。所以`rsi`包含`_bss - 8`的绝对地址，`rdi`包含`_bss - 8`的重定位的相对地址。在我们把这些地址放入寄存器时，我们把`_bss`的地址放到了`rcx`寄存器。正如你在`vmlinux.lds.S`链接脚本中看到了一样，它和设置/内核代码一起在所有节的末尾。现在我们可以开始用`movsq`指令每次8字节地从`rsi`到`rdi`复制代码。

注意在数据复制前有`std`指令：它设置`DF`标志，意味着`rsi`和`rdi`会递减。换句话说，我们会从后往前复制这些字节。最后，我们用`cld`指令清除`DF`标志，并恢复`boot_params`到`rsi`.

现在我们有`.text`节的重定位后的地址，我们可以跳到那里：

```assembly
	leaq	relocated(%rbx), %rax
	jmp	*%rax
```

在内核解压前的最后准备
--------------------------------------------------------------------------------

在上一段我们看到了`.text`节从`relocated`标签开始。它做的第一件事是清空`.bss`节：

```assembly
	xorl	%eax, %eax
	leaq    _bss(%rip), %rdi
	leaq    _ebss(%rip), %rcx
	subq	%rdi, %rcx
	shrq	$3, %rcx
	rep	stosq
```

我们要初始化`.bss`节，因为我们很快要跳转到[C](https://en.wikipedia.org/wiki/C_%28programming_language%29)代码。这里我们就清空`eax`，把`_bss`的地址放到`rdi`，把`_ebss`放到`rcx`，然后用`rep stosq`填零。

最后，我们可以调用`extract_kernel`函数：

```assembly
	pushq	%rsi
	movq	%rsi, %rdi
	leaq	boot_heap(%rip), %rsi
	leaq	input_data(%rip), %rdx
	movl	$z_input_len, %ecx
	movq	%rbp, %r8
	movq	$z_output_len, %r9
	call	extract_kernel
	popq	%rsi
```

我们再一次设置`rdi`为指向`boot_params`结构体的指针并把它保存到栈中。同时我们设置`rsi`指向用于内核解压的区域。最后一步是准备`extract_kernel`的参数并调用这个解压内核的函数。`extract_kernel`函数在 [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c) 源文件定义并有六个参数：

* `rmode` - 指向 [boot_params](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973//arch/x86/include/uapi/asm/bootparam.h#L114) 结构体的指针，`boot_params`被引导加载器填充或在早期内核初始化时填充
* `heap` - 指向早期启动堆的起始地址 `boot_heap` 的指针
* `input_data` - 指向压缩的内核，即 `arch/x86/boot/compressed/vmlinux.bin.bz2` 的指针
* `input_len` - 压缩的内核的大小
* `output` - 解压后内核的起始地址
* `output_len` - 解压后内核的大小

所有参数根据 [System V Application Binary Interface](http://www.x86-64.org/documentation/abi.pdf) 通过寄存器传递。我们已经完成了所有的准备工作，现在我们可以看内核解压的过程。

内核解压
--------------------------------------------------------------------------------

就像我们在之前的段落中看到了那样，`extract_kernel`函数在源文件 [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c) 定义并有六个参数。正如我们在之前的部分看到的，这个函数从图形/控制台初始化开始。我们要再次做这件事，因为我们不知道我们是不是从[实模式](https://en.wikipedia.org/wiki/Real_mode)开始，或者是使用了引导加载器，或者引导加载器用了32位还是64位启动协议。

在最早的初始化步骤后，我们保存空闲内存的起始和末尾地址。

```C
free_mem_ptr     = heap;
free_mem_end_ptr = heap + BOOT_HEAP_SIZE;
```

在这里 `heap` 是我们在 [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S) 得到的 `extract_kernel` 函数的第二个参数：

```assembly
leaq	boot_heap(%rip), %rsi
```

如上所述，`boot_heap`定义为：

```assembly
boot_heap:
	.fill BOOT_HEAP_SIZE, 1, 0
```

在这里`BOOT_HEAP_SIZE`是一个展开为`0x10000`(对`bzip2`内核是`0x400000`)的宏，代表堆的大小。

在堆指针初始化后，下一步是从 [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/kaslr.c#L425) 调用`choose_random_location`函数。我们可以从函数名猜到，它选择内核镜像解压到的内存地址。看起来很奇怪，我们要寻找甚至是`选择`内核解压的地址，但是Linux内核支持[kASLR](https://en.wikipedia.org/wiki/Address_space_layout_randomization)，为了安全，它允许解压内核到随机的地址。

在这一部分，我们不会考虑Linux内核的加载地址的随机化，我们会在下一部分讨论。

现在我们回头看 [misc.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/misc.c#L404). 在获得内核镜像的地址后，需要有一些检查以确保获得的随机地址是正确对齐的，并且地址没有错误：

```C
if ((unsigned long)output & (MIN_KERNEL_ALIGN - 1))
	error("Destination physical address inappropriately aligned");

if (virt_addr & (MIN_KERNEL_ALIGN - 1))
	error("Destination virtual address inappropriately aligned");

if (heap > 0x3fffffffffffUL)
	error("Destination address too large");

if (virt_addr + max(output_len, kernel_total_size) > KERNEL_IMAGE_SIZE)
	error("Destination virtual address is beyond the kernel mapping area");

if ((unsigned long)output != LOAD_PHYSICAL_ADDR)
    error("Destination address does not match LOAD_PHYSICAL_ADDR");

if (virt_addr != LOAD_PHYSICAL_ADDR)
	error("Destination virtual address changed when not relocatable");
```

在所有这些检查后，我们可以看到熟悉的消息：

```
Decompressing Linux... 
```

然后调用解压内核的`__decompress`函数：

```C
__decompress(input_data, input_len, NULL, NULL, output, output_len, NULL, error);
```

`__decompress`函数的实现取决于在内核编译期间选择什么压缩算法：

```C
#ifdef CONFIG_KERNEL_GZIP
#include "../../../../lib/decompress_inflate.c"
#endif

#ifdef CONFIG_KERNEL_BZIP2
#include "../../../../lib/decompress_bunzip2.c"
#endif

#ifdef CONFIG_KERNEL_LZMA
#include "../../../../lib/decompress_unlzma.c"
#endif

#ifdef CONFIG_KERNEL_XZ
#include "../../../../lib/decompress_unxz.c"
#endif

#ifdef CONFIG_KERNEL_LZO
#include "../../../../lib/decompress_unlzo.c"
#endif

#ifdef CONFIG_KERNEL_LZ4
#include "../../../../lib/decompress_unlz4.c"
#endif
```

在内核解压之后，最后两个函数是`parse_elf`和`handle_relocations`.这些函数的主要用途是把解压后的内核移动到正确的位置。事实上，解压过程会[原地](https://en.wikipedia.org/wiki/In-place_algorithm)解压，我们还是要把内核移动到正确的地址。我们已经知道，内核镜像是一个[ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format)可执行文件，所以`parse_elf`的主要目标是移动可加载的段到正确的地址。我们可以在`readelf`的输出看到可加载的段：

```
readelf -l vmlinux

Elf file type is EXEC (Executable file)
Entry point 0x1000000
There are 5 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000200000 0xffffffff81000000 0x0000000001000000
                 0x0000000000893000 0x0000000000893000  R E    200000
  LOAD           0x0000000000a93000 0xffffffff81893000 0x0000000001893000
                 0x000000000016d000 0x000000000016d000  RW     200000
  LOAD           0x0000000000c00000 0x0000000000000000 0x0000000001a00000
                 0x00000000000152d8 0x00000000000152d8  RW     200000
  LOAD           0x0000000000c16000 0xffffffff81a16000 0x0000000001a16000
                 0x0000000000138000 0x000000000029b000  RWE    200000
```

`parse_elf`函数的目标是加载这些段到从`choose_random_location`函数得到的`output`地址。这个函数从检查ELF签名标志开始：

```C
Elf64_Ehdr ehdr;
Elf64_Phdr *phdrs, *phdr;

memcpy(&ehdr, output, sizeof(ehdr));

if (ehdr.e_ident[EI_MAG0] != ELFMAG0 ||
    ehdr.e_ident[EI_MAG1] != ELFMAG1 ||
    ehdr.e_ident[EI_MAG2] != ELFMAG2 ||
    ehdr.e_ident[EI_MAG3] != ELFMAG3) {
        error("Kernel is not a valid ELF file");
        return;
}
```

如果是无效的，它会打印一条错误消息并停机。如果我们得到一个有效的`ELF`文件，我们从给定的`ELF`文件遍历所有程序头，并用正确的地址复制所有可加载的段到输出缓冲区：

```C
	for (i = 0; i < ehdr.e_phnum; i++) {
		phdr = &phdrs[i];

		switch (phdr->p_type) {
		case PT_LOAD:
#ifdef CONFIG_RELOCATABLE
			dest = output;
			dest += (phdr->p_paddr - LOAD_PHYSICAL_ADDR);
#else
			dest = (void *)(phdr->p_paddr);
#endif
			memmove(dest, output + phdr->p_offset, phdr->p_filesz);
			break;
		default:
			break;
		}
	}
```

这就是全部的工作。

从现在开始，所有可加载的段都在正确的位置。

在`parse_elf`函数之后是调用`handle_relocations`函数。这个函数的实现依赖于`CONFIG_X86_NEED_RELOCS`内核配置选项，如果它被启用，这个函数调整内核镜像的地址，只有在内核配置时启用了`CONFIG_RANDOMIZE_BASE`配置选项才会调用。`handle_relocations`函数的实现足够简单。这个函数从基准内核加载地址的值减掉`LOAD_PHYSICAL_ADDR`的值，从而我们获得内核链接后要加载的地址和实际加载地址的差值。在这之后我们可以进行内核重定位，因为我们知道内核加载的实际地址、它被链接的运行的地址和内核镜像末尾的重定位表。

在内核重定位后，我们从`extract_kernel`回来，到 [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/arch/x86/boot/compressed/head_64.S).

内核的地址在`rax`寄存器，我们跳到那里：

```assembly
jmp	*%rax
```

就是这样。现在我们就在内核里！

结论
--------------------------------------------------------------------------------

这是关于内核引导过程的第五部分的结尾。我们不会再看到关于内核引导的文章（可能有这篇和前面的文章的更新），但是会有关于其他内核内部细节的很多文章。

下一章会描述更高级的关于内核引导过程的细节，如加载地址随机化等等。

如果你有什么问题或建议，写个评论或在 [twitter](https://twitter.com/0xAX) 找我。

**如果你发现文中描述有任何问题，请提交一个 PR 到 [linux-insides-zh](https://github.com/hust-open-atom-club/linux-insides-zh) 。**

链接
--------------------------------------------------------------------------------

* [address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization)
* [initrd](https://en.wikipedia.org/wiki/Initrd)
* [long mode](https://en.wikipedia.org/wiki/Long_mode)
* [bzip2](http://www.bzip.org/)
* [RDRand instruction](https://en.wikipedia.org/wiki/RdRand)
* [Time Stamp Counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
* [Programmable Interval Timers](https://en.wikipedia.org/wiki/Intel_8253)
* [Previous part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-4.md)


================================================
FILE: Booting/linux-bootstrap-6.md
================================================
内核引导过程. Part 6.
================================================================================

简介
--------------------------------------------------------------------------------

这是`内核引导过程`系列文章的第六部分。在[前一部分](linux-bootstrap-5.md)，我们已经看到了内核引导过程的结尾，但是我们跳过了一些高级部分。

你可能还记得，Linux内核的入口点是 [main.c](https://github.com/torvalds/linux/blob/master/init/main.c) 的`start_kernel`函数，它在`LOAD_PHYSICAL_ADDR`地址开始执行。这个地址依赖于`CONFIG_PHYSICAL_START`内核配置选项，默认为`0x1000000`:

```
config PHYSICAL_START
	hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
	default "0x1000000"
	---help---
	  This gives the physical address where the kernel is loaded.
      ...
      ...
      ...
```

这个选项在内核配置时可以修改，但是加载地址可以选择为一个随机值。为此，`CONFIG_RANDOMIZE_BASE`内核配置选项在内核配置时应该启用。

在这种情况下，Linux内核镜像解压和加载的物理地址会被随机化。我们在这一部分考虑这个选项被启用，并且为了[安全原因](https://en.wikipedia.org/wiki/Address_space_layout_randomization)，内核镜像的加载地址被随机化的情况。

页表的初始化
--------------------------------------------------------------------------------

在内核解压器要开始找随机的内核解压和加载地址之前，应该初始化恒等映射（identity mapped,虚拟地址和物理地址相同）页表。如果[引导加载器](https://en.wikipedia.org/wiki/Booting)使用[16位或32位引导协议](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt)，那么我们已经有了页表。但在任何情况下，如果内核解压器选择它们之外的内存区域，我们需要新的页。这就是为什么我们需要建立新的恒等映射页表。

是的，建立恒等映射页表是随机化加载地址的最早的步骤之一。但是在此之前，让我们回忆一下我们是怎么来到这里的。

在[前一部分](linux-bootstrap-5.md)，我们看到了到[长模式](https://en.wikipedia.org/wiki/Long_mode)的转换，并跳转到了内核解压器的入口点——`extract_kernel`函数。随机化从调用这个函数开始：

```C
void choose_random_location(unsigned long input,
                            unsigned long input_size,
			                unsigned long *output,
                            unsigned long output_size,
			                unsigned long *virt_addr)
{}
```

你可以看到，这个函数有五个参数：

  * `input`;
  * `input_size`;
  * `output`;
  * `output_isze`;
  * `virt_addr`.

让我们试着理解一下这些参数是什么。第一个`input`参数来自源文件 [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c) 里的`extract_kernel`函数：

```C
asmlinkage __visible void *extract_kernel(void *rmode, memptr heap,
				                          unsigned char *input_data,
				                          unsigned long input_len,
				                          unsigned char *output,
				                          unsigned long output_len)
{
  ...
  ...
  ...
  choose_random_location((unsigned long)input_data, input_len,
                         (unsigned long *)&output,
				         max(output_len, kernel_total_size),
				         &virt_addr);
  ...
  ...
  ...
}
```

这个参数由 [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) 的汇编代码传递：

```C
leaq	input_data(%rip), %rdx
```

`input_data`由 [mkpiggy](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/mkpiggy.c) 程序生成。如果你亲手编译过Linux内核源码，你会找到这个程序生成的文件，它应该位于 `linux/arch/x86/boot/compressed/piggy.S`. 在我这里，这个文件是这样的：

```assembly
.section ".rodata..compressed","a",@progbits
.globl z_input_len
z_input_len = 6988196
.globl z_output_len
z_output_len = 29207032
.globl input_data, input_data_end
input_data:
.incbin "arch/x86/boot/compressed/vmlinux.bin.gz"
input_data_end:
```

你能看到它有四个全局符号。前两个`z_input_len`和`z_output_len`是压缩的和解压后的`vmlinux.bin.gz`的大小。第三个是我们的`input_data`，你可以看到，它指向二进制格式（去掉所有调试符号、注释和重定位信息）的Linux内核镜像。最后的`input_data_end`指向压缩的Linux镜像的末尾。

所以我们`choose_random_location`函数的第一个参数是指向嵌入在`piggy.o`目标文件的压缩的内核镜像的指针。

`choose_random_location`函数的第二个参数是我们刚刚看到的`z_input_len`.

`choose_random_location`函数的第三和第四个参数分别是解压后的内核镜像的位置和长度。放置解压后内核的地址来自 [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S)，并且它是`startup_32`对齐到 2MB 边界的地址。解压后的内核的大小来自同样的`piggy.S`，并且它是`z_output_len`.

`choose_random_location`函数的最后一个参数是内核加载地址的虚拟地址。我们可以看到，它和默认的物理加载地址相同：

```C
unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
```

它依赖于内核配置：

```C
#define LOAD_PHYSICAL_ADDR ((CONFIG_PHYSICAL_START \
				+ (CONFIG_PHYSICAL_ALIGN - 1)) \
				& ~(CONFIG_PHYSICAL_ALIGN - 1))
```

现在，由于我们考虑`choose_random_location`函数的参数，让我们看看它的实现。这个函数从检查内核命令行的`nokaslr`选项开始：

```C
if (cmdline_find_option_bool("nokaslr")) {
	warn("KASLR disabled: 'nokaslr' on cmdline.");
	return;
}
```

如果有这个选项，那么我们就退出`choose_random_location`函数，并且内核的加载地址不会随机化。相关的命令行选项可以在[内核文档](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/Documentation/kernel-parameters.txt)找到：

```
kaslr/nokaslr [X86]

Enable/disable kernel and module base offset ASLR
(Address Space Layout Randomization) if built into
the kernel. When CONFIG_HIBERNATION is selected,
kASLR is disabled by default. When kASLR is enabled,
hibernation will be disabled.
```

假设我们没有把`nokaslr`传到内核命令行，并且`CONFIG_RANDOMIZE_BASE`启用了内核配置选项。

下一步是以下函数的调用：

```C
initialize_identity_maps();
```

它在 [arch/x86/boot/compressed/pagetable.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/pagetable.c) 源码文件定义。这个函数从初始化`mapping_info`,`x86_mapping_info`结构体的一个实例开始。

```C
mapping_info.alloc_pgt_page = alloc_pgt_page;
mapping_info.context = &pgt_data;
mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sev_me_mask;
mapping_info.kernpg_flag = _KERNPG_TABLE | sev_me_mask;
```

`x86_mapping_info`结构体在 [arch/x86/include/asm/init.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/init.h) 头文件定义：

```C
struct x86_mapping_info {
	void *(*alloc_pgt_page)(void *);
	void *context;
	unsigned long page_flag;
	unsigned long offset;
	bool direct_gbpages;
	unsigned long kernpg_flag;
};
```

这个结构体提供了关于内存映射的信息。你可能还记得，在前面的部分，我们已经建立了初始的从0到`4G`的页表。现在我们可能需要访问`4G`以上的内存来在随机的位置加载内核。所以，`initialize_identity_maps`函数初始化一个内存区域，它用于可能需要的新页表。首先，让我们尝试查看`x86_mapping_info`结构体的定义。

`alloc_pgt_page`是一个会在为一个页表项分配空间时调用的回调函数。`context`域是一个用于跟踪已分配页表的`alloc_pgt_data`结构体的实例。`page_flag`和`kernpg_flag`是页标志。第一个代表`PMD`或`PUD`表项的标志。第二个`kernpg_flag`域代表会在之后被覆盖的内核页的标志。`direct_gbpages`域代表对大页的支持。最后的`offset`域代表内核虚拟地址到`PMD`级物理地址的偏移。

`alloc_pgt_page`回调函数检查有一个新页的空间，从缓冲区分配新页并返回新页的地址：


```C
entry = pages->pgt_buf + pages->pgt_buf_offset;
pages->pgt_buf_offset += PAGE_SIZE;
```

缓冲区在此结构体中：

```C
struct alloc_pgt_data {
	unsigned char *pgt_buf;
	unsigned long pgt_buf_size;
	unsigned long pgt_buf_offset;
};
```

`initialize_identity_maps`函数最后的目标是初始化`pgdt_buf_size`和`pgt_buf_offset`. 由于我们只是在初始化阶段，`initialize_identity_maps`函数设置`pgt_buf_offset`为0:

```C
pgt_data.pgt_buf_offset = 0;
```

而`pgt_data.pgt_buf_size`会根据引导加载器所用的引导协议（64位或32位）被设置为`77824`或`69632`. `pgt_data.pgt_buf`也是一样。如果引导加载器在`startup_32`引导内核，`pgdt_data.pgdt_buf`会指向已经在 [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) 初始化的页表的末尾：

```C
pgt_data.pgt_buf = _pgtable + BOOT_INIT_PGT_SIZE;
```

其中`_pgtable`指向这个页表 [_pgtable](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/vmlinux.lds.S) 的开头。另一方面，如果引导加载器用64位引导协议并在`startup_64`加载内核，早期页表应该由引导加载器建立，并且`_pgtable`会被重写：

```C
pgt_data.pgt_buf = _pgtable
```

在新页表的缓冲区被初始化之下，我们回到`choose_random_location`函数。

避开保留的内存范围
--------------------------------------------------------------------------------

在恒等映射页表相关的数据被初始化之后，我们可以开始选择放置解压后内核的随机位置。但是正如你猜的那样，我们不能选择任意地址。在内存的范围中，有一些保留的地址。这些地址被重要的东西占用，如[initrd](https://en.wikipedia.org/wiki/Initial_ramdisk), 内核命令行等等。这个函数：

```C
mem_avoid_init(input, input_size, *output);
```

会帮我们做这件事。所有不安全的内存区域会收集到：

```C
struct mem_vector {
	unsigned long long start;
	unsigned long long size;
};

static struct mem_vector mem_avoid[MEM_AVOID_MAX];
```

数组。其中`MEM_AVOID_MAX`来自[枚举类型](https://en.wikipedia.org/wiki/Enumerated_type#C)`mem_avoid_index`, 它代表不同类型的保留内存区域：

```C
enum mem_avoid_index {
	MEM_AVOID_ZO_RANGE = 0,
	MEM_AVOID_INITRD,
	MEM_AVOID_CMDLINE,
	MEM_AVOID_BOOTPARAMS,
	MEM_AVOID_MEMMAP_BEGIN,
	MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
	MEM_AVOID_MAX,
};
```

它们都定义在源文件 [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c) 中。

让我们看看`mem_avoid_init`函数的实现。这个函数的主要目标是在`mem_avoid`数组存放关于被`mem_avoid_index`枚举类型描述的保留内存区域的信息，并且在我们新的恒等映射缓冲区为这样的区域创建新页。`mem_avoid_index`函数的几个部分很相似，但是先看看其中一个：

```C
mem_avoid[MEM_AVOID_ZO_RANGE].start = input;
mem_avoid[MEM_AVOID_ZO_RANGE].size = (output + init_size) - input;
add_identity_map(mem_avoid[MEM_AVOID_ZO_RANGE].start,
		 mem_avoid[MEM_AVOID_ZO_RANGE].size);
```

`mem_avoid_init`函数的开头尝试避免用于当前内核解压的内存区域。我们用这个区域的起始地址和大小填写`mem_avoid`数组的一项，并调用`add_identity_map`函数，它会为这个区域建立恒等映射页。`add_identity_map`函数在源文件 [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c) 定义：

```C
void add_identity_map(unsigned long start, unsigned long size)
{
	unsigned long end = start + size;

	start = round_down(start, PMD_SIZE);
	end = round_up(end, PMD_SIZE);
	if (start >= end)
		return;

	kernel_ident_mapping_init(&mapping_info, (pgd_t *)top_level_pgt,
				  start, end);
}
```

你可以看到，它对齐内存到 2MB 边界并检查给定的起始地址和终止地址。

最后它调用`kernel_ident_mapping_init`函数，它在源文件 [arch/x86/mm/ident_map.c](https://github.com/torvalds/linux/blob/master/arch/x86/mm/ident_map.c) 中，并传入以上初始化好的`mapping_info`实例、顶层页表的地址和建立新的恒等映射的内存区域的地址。

`kernel_ident_mapping_init`函数为新页设置默认的标志，如果它们没有被给出：

```C
if (!info->kernpg_flag)
	info->kernpg_flag = _KERNPG_TABLE;
```

并且开始建立新的2MB (因为`mapping_info.page_flag`中的`PSE`位) 给定地址相关的页表项（[五级页表](https://lwn.net/Articles/717293/)中的`PGD -> P4D -> PUD -> PMD`或者[四级页表](https://lwn.net/Articles/117749/)中的`PGD -> PUD -> PMD`）。

```C
for (; addr < end; addr = next) {
	p4d_t *p4d;

	next = (addr & PGDIR_MASK) + PGDIR_SIZE;
	if (next > end)
		next = end;

    p4d = (p4d_t *)info->alloc_pgt_page(info->context);
	result = ident_p4d_init(info, p4d, addr, next);

    return result;
}
```

首先我们找给定地址在 `页全局目录` 的下一项，如果它大于给定的内存区域的末地址`end`，我们把它设为`end`.之后，我们用之前看过的`x86_mapping_info`回调函数分配一个新页，然后调用`ident_p4d_init`函数。`ident_p4d_init`函数做同样的事情，但是用于低层的页目录 (`p4d` -> `pud` -> `pmd`).

就是这样。

和保留地址相关的新页表项已经在我们的页表中。这不是`mem_avoid_init`函数的末尾，但是其他部分类似。它建立用于 [initrd](https://en.wikipedia.org/wiki/Initial_ramdisk)、内核命令行等数据的页。

现在我们可以回到`choose_random_location`函数。

物理地址随机化
--------------------------------------------------------------------------------

在保留内存区域存储在`mem_avoid`数组并且为它们建立了恒等映射页之后，我们选择最小可用的地址作为解压内核的随机内存区域：

```C
min_addr = min(*output, 512UL << 20);
```

你可以看到，它应该小于512MB. 选择这个512MB的值只是避免低内存区域中未知的东西。

下一步是选择随机的物理和虚拟地址来加载内核。首先是物理地址：

```C
random_addr = find_random_phys_addr(min_addr, output_size);
```

`find_random_phys_addr`函数在[同一个](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c)源文件中定义：

```
static unsigned long find_random_phys_addr(unsigned long minimum,
                                           unsigned long image_size)
{
	minimum = ALIGN(minimum, CONFIG_PHYSICAL_ALIGN);

	if (process_efi_entries(minimum, image_size))
		return slots_fetch_random();

	process_e820_entries(minimum, image_size);
	return slots_fetch_random();
}
```

`process_efi_entries`函数的主要目标是在整个可用的内存找到所有的合适的内存区域来加载内核。如果内核没有在支持[EFI](https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface)的系统中编译和运行，我们继续在[e820](https://en.wikipedia.org/wiki/E820)区域中找这样的内存区域。所有找到的内存区域会存储在

```C
struct slot_area {
	unsigned long addr;
	int num;
};

#define MAX_SLOT_AREA 100

static struct slot_area slot_areas[MAX_SLOT_AREA];
```

数组中。内核解压器应该选择这个数组随机的索引，并且它会是内核解压的随机位置。这个选择会被`slots_fetch_random`函数执行。`slots_fetch_random`函数的主要目标是通过`kaslr_get_random_long`函数从`slot_areas`数组选择随机的内存范围：

```C
slot = kaslr_get_random_long("Physical") % slot_max;
```

`kaslr_get_random_long`函数在源文件 [arch/x86/lib/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/lib/kaslr.c) 中定义，它返回一个随机数。注意这个随机数会通过不同的方式得到，取决于内核配置、系统机会（基于[时间戳计数器](https://en.wikipedia.org/wiki/Time_Stamp_Counter)的随机数、[rdrand](https://en.wikipedia.org/wiki/RdRand)等等）。

这就是随机内存范围的选择方法。

虚拟地址随机化
--------------------------------------------------------------------------------

在内核解压器选择了随机内存区域后，新的恒等映射页会为这个区域按需建立：

```C
random_addr = find_random_phys_addr(min_addr, output_size);

if (*output != random_addr) {
		add_identity_map(random_addr, output_size);
		*output = random_addr;
}
```

这时，`output`会存放内核将会解压的一个内存区域的基地址。但是现在，正如你还记得的那样，我们只是随机化了物理地址。在[x86_64](https://en.wikipedia.org/wiki/X86-64)架构，虚拟地址也应该被随机化：

```C
if (IS_ENABLED(CONFIG_X86_64))
	random_addr = find_random_virt_addr(LOAD_PHYSICAL_ADDR, output_size);

*virt_addr = random_addr;
```

正如你所看到的，对于非`x86_64`架构，随机化的虚拟地址和随机化的物理地址相同。`find_random_virt_addr`函数计算可以保存内存镜像的虚拟内存范围的数量并且调用我们在尝试找到随机的`物理`地址的时候，之前已经看到的`kaslr_get_random_long`函数。

这时，我们同时有了用于解压内核的随机化的物理(`*output`)和虚拟(`*virt_addr`)基地址。

就是这样。

结论
--------------------------------------------------------------------------------

这是关于Linux内核引导过程的第六，并且是最后一部分的结尾。我们不再会看到关于内核引导的帖子（可能有对这篇和之前文章的更新），但是会有很多关于其他内核内部细节的文章。

下一章是关于内核初始化的，我们会看到Linux内核初始化代码的早期步骤。

如果你有什么问题或建议，写个评论或在 [twitter](https://twitter.com/0xAX) 找我。

**如果你发现文中描述有任何问题，请提交一个 PR 到 [linux-insides-zh](https://github.com/hust-open-atom-club/linux-insides-zh) 。**

Links
--------------------------------------------------------------------------------

* [Address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization)
* [Linux kernel boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.txt)
* [long mode](https://en.wikipedia.org/wiki/Long_mode)
* [initrd](https://en.wikipedia.org/wiki/Initial_ramdisk)
* [Enumerated type](https://en.wikipedia.org/wiki/Enumerated_type#C)
* [four-level page tables](https://lwn.net/Articles/117749/)
* [five-level page tables](https://lwn.net/Articles/717293/)
* [EFI](https://en.wikipedia.org/wiki/Unified_Extensible_Firmware_Interface)
* [e820](https://en.wikipedia.org/wiki/E820)
* [time stamp counter](https://en.wikipedia.org/wiki/Time_Stamp_Counter)
* [rdrand](https://en.wikipedia.org/wiki/RdRand)
* [x86_64](https://en.wikipedia.org/wiki/X86-64)
* [Previous part](https://github.com/0xAX/linux-insides/blob/master/Booting/linux-bootstrap-5.md)


================================================
FILE: CONTRIBUTING.md
================================================
贡献
================================================================================

如果你想要给 [linux-insides-zh](https://github.com/hust-open-atom-club/linux-insides-zh) 做贡献，请遵照如下规则：

1. 点击 `fork` 按钮：

    ![fork](./Assets/fork_button.png)

2. 通过如下命令 `clone` 你 Github 帐号中的 `linux-insides-zh` 仓库：

    ```
    git clone git@github.com:your_github_username/linux-insides-zh.git
    ```

3. 通过如下命令创建分支 (`branch`)：

    ```
    git checkout -b "linux-insides-zh-fix"
    ```
   其中，`linux-insides-zh-fix` 仅是一个样例。

4. 对本地仓库进行修改。

5. 提交自己的修改，然后推送 (`push`) 到远端。

    ```
    git add your_changed_files
	git commit -m "your comment"
    git push --set-upstream origin linux-insides-zh-fix
    ```

6. 点击 `New Pull Request` 按钮，将你 Github 帐号中的 `linux-insides-zh` 仓库的 linux-insides-zh-fix 分支的修改提交到当前 `linux-insides-zh` 库中。

十分感谢！


================================================
FILE: CONTRIBUTORS.md
================================================
## 翻译人员 (排名不分先后)

[@xinqiu](https://github.com/xinqiu)

[@lijiangsheng1](https://github.com/lijiangsheng1)

[@littleneko](https://github.com/littleneko)

[@qianmoke](https://github.com/qianmoke)

[@icecoobe](https://github.com/icecoobe)

[@choleraehyq](http://github.com/choleraehyq)

[@mudongliang](https://github.com/mudongliang)

[@oska874](https://github.com/oska874)

[@cloudusers](https://github.com/cloudusers)

[@hailincai](https://github.com/hailincai)

[@zmj1316](https://github.com/zmj1316)

[@zhangyangjing](https://github.com/zhangyangjing)

[@huxq](https://github.com/huxq)

[@worldwar](https://github.com/worldwar)

[@keltoy](https://github.com/keltoy)

[@a1ickgu0](https://github.com/a1ickgu0)

[@hao-lee](https://github.com/hao-lee)

[@woodpenker](http://github.com/woodpenker)

@tjm-1990

[@up2wing](https://github.com/up2wing)

[@NeoCui](https://github.com/NeoCui)

[@narcijie](https://github.com/narcijie)

[@biopuppet](https://github.com/biopuppet)

[@Albertchamberlain](https://github.com/Albertchamberlain)

@nannxnann

[@chenhr56](https://github.com/chenhr56)


================================================
FILE: Cgroups/README.md
================================================
# 控制组

这个章节描述了 Linux 内核中的控制组机制。

* [简介](linux-cgroups-1.md)


================================================
FILE: Cgroups/linux-cgroups-1.md
================================================
控制组
================================================================================

简介
--------------------------------------------------------------------------------

这是 [linux 内核揭秘](/) 的新一章的第一部分。你可以根据这部分的标题猜测 - 这一部分将涉及 Linux 内核中的 [`控制组`](https://en.wikipedia.org/wiki/Cgroups) 或 `cgroups` 机制。

`Cgroups` 是由 Linux 内核提供的一种机制，它允许我们分配诸如处理器时间、每组进程的数量、每个 `cgroup` 的内存大小，或者针对一个或一组进程的上述资源的组合。`Cgroups` 是按照层级结构组织的，这种机制类似于通常的进程，他们也是层级结构，并且子 `cgroups` 会继承其上级的一些属性。但实际上他们还是有区别的。`cgroups` 和进程之间的主要区别在于，多个不同层级的 `cgroup` 可以同时存在，而进程树则是单一的。同时存在的多个不同层级的 `cgroup` 并不是任意的，因为每个 `cgroup` 层级都要附加到一组 `cgroup` "子系统"中。

每个 `cgroup` 子系统代表一种资源，如针对某个 `cgroup` 的处理器时间或者 [pid](https://en.wikipedia.org/wiki/Process_identifier)  的数量，也叫进程数。Linux 内核提供对以下 12 种 `cgroup` 子系统的支持：

* `cpuset` - 为 `cgroup` 内的任务分配独立的处理器和内存节点；
* `cpu` - 使用调度程序对 `cgroup` 内的任务提供 CPU 资源的访问；
* `cpuacct` - 生成 `cgroup` 中所有任务的处理器使用情况报告；
* `io` - 限制对[块设备](https://en.wikipedia.org/wiki/Device_file)的读写操作;
* `memory` - 限制 `cgroup` 中的一组任务的内存使用;
* `devices` - 限制 `cgroup` 中的一组任务访问设备；
* `freezer` - 允许 `cgroup` 中的一组任务挂起/恢复；
* `net_cls` - 允许对 `cgroup` 中的任务产生的网络数据包进行标记；
* `net_prio` - 针对 `cgroup` 中的每个网络接口提供一种动态修改网络流量优先级的方法；
* `perf_event` - 支持访问 `cgroup` 中的[性能事件](https://en.wikipedia.org/wiki/Perf_\(Linux\));
* `hugetlb` - 为 `cgroup` 开启对[大页内存](https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt)的支持;
* `pid` - 限制 `cgroup` 中的进程数量。

每个 `cgroup` 子系统是否被支持均与相关配置选项有关。例如，`cpuset` 子系统应该通过 `CONFIG_CPUSETS` 内核配置选项启用，`io` 子系统通过 `CONFIG_BLK_CGROUP` 内核配置选项等。所有这些内核配置选项都可以在 `General setup → Control Group support` 菜单里找到：

![menuconfig](images/menuconfig.png)

你可以通过 [proc](https://en.wikipedia.org/wiki/Procfs) 虚拟文件系统在计算机上查看已经启用的 `cgroup`：

```
$ cat /proc/cgroups 
#subsys_name	hierarchy	num_cgroups	enabled
cpuset	8	1	1
cpu	7	66	1
cpuacct	7	66	1
blkio	11	66	1
memory	9	94	1
devices	6	66	1
freezer	2	1	1
net_cls	4	1	1
perf_event	3	1	1
net_prio	4	1	1
hugetlb	10	1	1
pids	5	69	1
```

或者通过 [sysfs](https://en.wikipedia.org/wiki/Sysfs) 虚拟文件系统查看:

```
$ ls -l /sys/fs/cgroup/
total 0
dr-xr-xr-x 5 root root  0 Dec  2 22:37 blkio
lrwxrwxrwx 1 root root 11 Dec  2 22:37 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Dec  2 22:37 cpuacct -> cpu,cpuacct
dr-xr-xr-x 5 root root  0 Dec  2 22:37 cpu,cpuacct
dr-xr-xr-x 2 root root  0 Dec  2 22:37 cpuset
dr-xr-xr-x 5 root root  0 Dec  2 22:37 devices
dr-xr-xr-x 2 root root  0 Dec  2 22:37 freezer
dr-xr-xr-x 2 root root  0 Dec  2 22:37 hugetlb
dr-xr-xr-x 5 root root  0 Dec  2 22:37 memory
lrwxrwxrwx 1 root root 16 Dec  2 22:37 net_cls -> net_cls,net_prio
dr-xr-xr-x 2 root root  0 Dec  2 22:37 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Dec  2 22:37 net_prio -> net_cls,net_prio
dr-xr-xr-x 2 root root  0 Dec  2 22:37 perf_event
dr-xr-xr-x 5 root root  0 Dec  2 22:37 pids
dr-xr-xr-x 5 root root  0 Dec  2 22:37 systemd
```

正如你所猜测的那样，`cgroup` 机制不只是针对 Linux 内核的需求而创建的，更多的是用户空间层面的需求。要使用 `cgroup` ，需要先创建它。我们可以通过两种方式来创建。

第一种方法是在 `/sys/fs/cgroup` 目录下的任意子系统中创建子目录，并将任务的 pid 添加到 `tasks` 文件中，这个文件在我们创建子目录后会自动创建。

第二种方法是使用 `libcgroup` 库提供的工具集来创建/销毁/管理 `cgroups`(在 Fedora 中是 `libcgroup-tools`)。

我们来看一个简单的例子。下面的 [bash](https://www.gnu.org/software/bash/) 脚本会持续把一行信息输出到代表当前进程的控制终端的设备：

```shell
#!/bin/bash

while :
do
    echo "print line" > /dev/tty
    sleep 5
done
```

因此，如果我们运行这个脚本，将看到下面的结果：

```
$ sudo chmod +x cgroup_test_script.sh
~$ ./cgroup_test_script.sh 
print line
print line
print line
...
...
...
```

现在让我们进入系统中 `cgroupfs` 的挂载点。前面说到，它位于 `/sys/fs/cgroup` 目录，但你可以将它挂载到任何你希望的地方。

```
$ cd /sys/fs/cgroup
```

接着我们进入 `devices` 子目录，这个子目录表示允许或拒绝 `cgroup` 中的任务访问的设备：

```
# cd devices
```

然后在这里创建 `cgroup_test_group` 目录：

```
# mkdir cgroup_test_group
```

创建 `cgroup_test_group` 目录之后，会在目录下生成以下文件：

```
/sys/fs/cgroup/devices/cgroup_test_group$ ls -l
total 0
-rw-r--r-- 1 root root 0 Dec  3 22:55 cgroup.clone_children
-rw-r--r-- 1 root root 0 Dec  3 22:55 cgroup.procs
--w------- 1 root root 0 Dec  3 22:55 devices.allow
--w------- 1 root root 0 Dec  3 22:55 devices.deny
-r--r--r-- 1 root root 0 Dec  3 22:55 devices.list
-rw-r--r-- 1 root root 0 Dec  3 22:55 notify_on_release
-rw-r--r-- 1 root root 0 Dec  3 22:55 tasks
```

现在我们重点关注 `tasks` 和 `devices.deny` 这两个文件。第一个文件 `tasks` 包含的是要附加到 `cgroup_test_group` `cgroup` 的 pid，第二个文件 `devices.deny` 包含的是拒绝访问的设备列表。新创建的 `cgroup` 默认对设备没有任何访问限制。为了禁止访问某个设备(在我们的示例中是 `/dev/tty`)，我们应该向 `devices.deny` 写入下面这行：

```
# echo "c 5:0 w" > devices.deny
```

我们来对这行进行详细解读。第一个字符 `c` 表示一种设备类型，我们示例中的 `/dev/tty` 是“字符设备”，我们可以通过 `ls` 命令的输出对此进行验证：

```
~$ ls -l /dev/tty
crw-rw-rw- 1 root tty 5, 0 Dec  3 22:48 /dev/tty
```

可以看到权限列表中的第一个字符是 `c`。第二部分的 `5:0` 是设备的主次设备号，你也可以在 `ls` 命令的输出中看到。最后的字符 `w` 表示禁止 `cgroups` 中的任务对指定的设备执行写入操作。现在让我们再次运行 `cgroup_test_script.sh` 脚本：

```
~$ ./cgroup_test_script.sh 
print line
print line
print line
...
...
```

没有任何效果。再把这个进程的 pid 加到我们 `cgroup` 的 `devices/tasks` 文件：

```
# echo $(pidof -x cgroup_test_script.sh) > /sys/fs/cgroup/devices/cgroup_test_group/tasks
```

现在，脚本的运行结果和预期的一样：

```
~$ ./cgroup_test_script.sh 
print line
print line
print line
print line
print line
print line
./cgroup_test_script.sh: line 5: /dev/tty: Operation not permitted
```

在你运行 [docker](https://en.wikipedia.org/wiki/Docker_\(software\)) 容器的时候也会出现类似的情况：

```
~$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                    NAMES
fa2d2085cd1c        mariadb:10          "docker-entrypoint..."   12 days ago         Up 4 minutes        0.0.0.0:3306->3306/tcp   mysql-work

~$ cat /sys/fs/cgroup/devices/docker/fa2d2085cd1c8d797002c77387d2061f56fefb470892f140d0dc511bd4d9bb61/tasks | head -3
5501
5584
5585
...
...
...
```

因此，在 `docker` 容器的启动过程中，`docker` 会为这个容器中的进程创建一个 `cgroup`：

```
$ docker exec -it mysql-work /bin/bash
$ top
 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                   1 mysql     20   0  963996 101268  15744 S   0.0  0.6   0:00.46 mysqld                                                                                  71 root      20   0   20248   3028   2732 S   0.0  0.0   0:00.01 bash                                                                                    77 root      20   0   21948   2424   2056 R   0.0  0.0   0:00.00 top                                                                                  
```

我们可以在宿主机上看到这个 `cgroup`：

```C
$ systemd-cgls

Control group /:
-.slice
├─docker
│ └─fa2d2085cd1c8d797002c77387d2061f56fefb470892f140d0dc511bd4d9bb61
│   ├─5501 mysqld
│   └─6404 /bin/bash
```

现在我们了解了一些关于 `cgroup` 的机制，如何手动使用它，以及这个机制的用途。是时候深入 Linux 内核源码来了解这个机制的实现了。

`cgroup` 的早期初始化
--------------------------------------------------------------------------------

现在，在我们刚刚看到关于 Linux 内核的 `cgroup` 机制的一些理论之后，我们可以开始深入到 Linux 的内核源码，以便更深入的了解这种机制。
与往常一样，我们将从 `cgroup` 的初始化开始。在 Linux 内核中，`cgroups` 的初始化分为两个部分：早期和晚期。在这部分我们只考虑“早期”的部分，“晚期”的部分会在下一部分考虑。

`Cgroups` 的早期初始化是在 Linux 内核的早期初始化期间从 [init/main.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/init/main.c) 中调用：

```C
cgroup_init_early();
```
函数开始的。这个函数定义在源文件 [kernel/cgroup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/cgroup.c) 中，从下面两个局部变量的定义开始：

```C
int __init cgroup_init_early(void)
{
	static struct cgroup_sb_opts __initdata opts;
	struct cgroup_subsys *ss;
    ...
    ...
    ...
}
```

`cgroup_sb_opts` 结构体的定义也可以在这个源文件中找到：

```C
struct cgroup_sb_opts {
	u16 subsys_mask;
	unsigned int flags;
	char *release_agent;
	bool cpuset_clone_children;
	char *name;
	bool none;
};
```

用来表示 `cgroupfs` 的挂载选项。例如，我们可以使用 `name=` 选项创建指定名称的 cgroup 层级(本示例中以 `my_cgrp` 命名)，不附加到任何子系统：

```
$ mount -t cgroup -oname=my_cgrp,none /mnt/cgroups
```

第二个变量 - `ss` 是 `cgroup_subsys` 结构体，这个结构体定义在 [include/linux/cgroup-defs.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/cgroup-defs.h) 头文件中。你可以从这个结构体的名称中猜到，这个变量表示一个 `cgroup` 子系统。这个结构体包含多个字段和回调函数，如：

```C
struct cgroup_subsys {
    int (*css_online)(struct cgroup_subsys_state *css);
    void (*css_offline)(struct cgroup_subsys_state *css);
    ...
    ...
    ...
    bool early_init:1;
    int id;
    const char *name;
    struct cgroup_root *root;
    ...
    ...
    ...
}
```

例如，`css_online` 和 `css_offline` 回调分别在 cgroup 成功完成所有分配之后和 cgroup 释放之前调用，`early_init` 标志位用来标记子系统是否要提前初始化，`id` 和 `name` 字段分别表示在 cgroup 中已注册的子系统的唯一标识和子系统的”名称“。最后的 `root` 字段指向 cgroup 层级结构的根。

当然，`cgroup_subsys` 结构体还有一些其他字段，比上面展示的要多，不过目前了解这么多已经够了。现在我们了解了与 `cgroups` 机制有关的重要结构体，让我们再回到 `cgroup_init_early` 函数。这个函数的主要目的是对一些子系统进行早期初始化。你可能已经猜到了，这些需要”早期“初始化的子系统的 `cgroup_subsys->early_init` 字段应该为 `1`。来看看哪些子系统可以提前初始化吧。

在两个局部变量定义之后，我们可以看到下面几行代码：

```C
init_cgroup_root(&cgrp_dfl_root, &opts);
cgrp_dfl_root.cgrp.self.flags |= CSS_NO_REF;
```

这里我们可以看到 `init_cgroup_root` 函数的调用，它会使用缺省的层级结构进行初始化。接着我们在缺省的 `cgroup` 中设置 `CSS_NO_REF` 标志来禁止这个 css 的引用计数。`cgrp_dfl_root` 的定义也在这个文件中：

```C
struct cgroup_root cgrp_dfl_root;
```

这里的 `cgrp` 字段是 `cgroup` 结构体，你也许已经猜到了，它表示一个 `cgroup`，`cgroup` 定义在 [include/linux/cgroup-defs.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/cgroup-defs.h) 头文件中。我们知道一个进程在 Linux 内核中是用 `task_struct` 结构体表示的， `task_struct` 并不包含直接访问这个任务所属的 `cgroup` 的链接，但是可以通过 `task_struct` 的 `css_set` 字段访问。这个 `css_set` 结构体拥有指向子系统状态数组的指针：

```C
struct css_set {
    ...
    ...
    ....
    struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
    ...
    ...
    ...
}
```

通过 `cgroup_subsys_state` 结构体，一个进程可以找到其所属的 `cgroup`：
```C
struct cgroup_subsys_state {
    ...
    ...
    ...
    struct cgroup *cgroup;
    ...
    ...
    ...
}
```

所以，`cgroups` 相关数据结构的整体情况如下：

```                                                 
+-------------+         +---------------------+    +------------->+---------------------+          +----------------+
| task_struct |         |       css_set       |    |              | cgroup_subsys_state |          |     cgroup     |
+-------------+         |                     |    |              +---------------------+          +----------------+
|             |         |                     |    |              |                     |          |     flags      |
|             |         |                     |    |              +---------------------+          |  cgroup.procs  |
|             |         |                     |    |              |        cgroup       |--------->|       id       |
|             |         |                     |    |              +---------------------+          |      ....      | 
|-------------+         |---------------------+----+                                               +----------------+
|   cgroups   | ------> | cgroup_subsys_state | array of cgroup_subsys_state
|-------------+         +---------------------+------------------>+---------------------+          +----------------+
|             |         |                     |                   | cgroup_subsys_state |          |      cgroup    |
+-------------+         +---------------------+                   +---------------------+          +----------------+
                                                                  |                     |          |      flags     |
                                                                  +---------------------+          |   cgroup.procs |
                                                                  |        cgroup       |--------->|        id      |
                                                                  +---------------------+          |       ....     |
                                                                  |    cgroup_subsys    |          +----------------+
                                                                  +---------------------+
                                                                             |
                                                                             |
                                                                             ↓
                                                                  +---------------------+
                                                                  |    cgroup_subsys    |
                                                                  +---------------------+
                                                                  |         id          |
                                                                  |        name         |
                                                                  |      css_online     |
                                                                  |      css_ofline     |
                                                                  |        attach       |
                                                                  |         ....        |
                                                                  +---------------------+
```



因此，`init_cgroup_root` 函数使用默认值设置 `cgrp_dfl_root`。接下来的工作是把初始化的 `css_set` 分配给 `init_task`，它表示系统中的第一个进程：

```C
RCU_INIT_POINTER(init_task.cgroups, &init_css_set);
```

`cgroup_init_early` 函数里最后一件重要的任务是 `early cgroups` 的初始化。在这里，我们遍历所有已注册的子系统，给子系统分配一个唯一的标识号和名称，并且对标记为早期的子系统调用 `cgroup_init_subsys` 函数：

```C
for_each_subsys(ss, i) {
		ss->id = i;
		ss->name = cgroup_subsys_name[i];

        if (ss->early_init)
			cgroup_init_subsys(ss, true);
}
```

这里的 `for_each_subsys` 是 [kernel/cgroup.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/cgroup.c) 源文件中的一个宏定义，正好扩展成基于 `cgroup_subsys` 数组的 for 循环。这个数组的定义可以在该源文件中找到，它看起来有点不寻常：

```C
#define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
    static struct cgroup_subsys *cgroup_subsys[] = {
        #include <linux/cgroup_subsys.h>
};
#undef SUBSYS
```

它被定义为 `SUBSYS` 宏，它接受一个参数(子系统名称)，并定义了 cgroup 子系统的 `cgroup_subsys`数组。另外，我们可以看到这个数组是使用 [linux/cgroup_subsys.h](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/include/linux/cgroup_subsys.h) 头文件的内容进行初始化。如果我们看一下这个头文件，就会发现一组具有给定子系统名称的 `SUBSYS` 宏：

```C
#if IS_ENABLED(CONFIG_CPUSETS)
SUBSYS(cpuset)
#endif

#if IS_ENABLED(CONFIG_CGROUP_SCHED)
SUBSYS(cpu)
#endif
...
...
...
```

可以这样定义是因为第一个 `SUBSYS` 的宏定义后面的 `#undef` 语句。来看看 `&_x ## _cgrp_subsys` 表达式，在 `C` 语言的宏定义中，`##` 操作符连接左右两边的表达式，所以当我们把 `cpuset`、`cpu` 等参数传给 `SUBSYS` 宏时，其实是在定义 `cpuset_cgrp_subsys`、`cp_cgrp_subsys`。确实如此，在 [kernel/cpuset.c](https://github.com/torvalds/linux/blob/16f73eb02d7e1765ccab3d2018e0bd98eb93d973/kernel/cpuset.c) 源文件中你可以看到这些结构体的定义：

```C
struct cgroup_subsys cpuset_cgrp_subsys = {
    ...
    ...
    ...
	.early_init	= true,
};
```

因此，`cgroup_init_early` 函数中的最后一步是调用 `cgroup_init_subsys` 函数完成早期子系统的初始化，下面的早期子系统将被初始化：

* `cpuset`;
* `cpu`;
* `cpuacct`.

`cgroup_init_subsys` 函数使用缺省值对指定的子系统进行初始化。比如，设置层级结构的根，使用 `css_alloc` 回调函数为指定的子系统分配空间，将一个子系统链接到一个已经存在的子系统，为初始进程分配子系统等。

至此，早期子系统就初始化结束了。

结束语
--------------------------------------------------------------------------------

这是第一部分的结尾，它描述了 Linux 内核中 `cgroup` 机制的引入，我们讨论了与 `cgroup` 机制相关的一些理论和初始化步骤，在接下来的部分中，我们将继续深入讨论 `cgroup` 更实用的方面。

如果你有任何问题或建议，可以写评论给我，也可以在 [twitter](https://twitter.com/0xAX) 上联系我。

**请注意，英语不是我的第一语言，对于任何不便，我深表歉意。如果你发现任何错误，请给我发送一个 PR 到 [linux-insides](https://github.com/0xAX/linux-insides).**

链接
--------------------------------------------------------------------------------

* [control groups](https://en.wikipedia.org/wiki/Cgroups)
* [PID](https://en.wikipedia.org/wiki/Process_identifier)
* [cpuset](https://man7.org/linux/man-pages/man7/cpuset.7.html)
* [block devices](https://en.wikipedia.org/wiki/Device_file)
* [huge pages](https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt)
* [sysfs](https://en.wikipedia.org/wiki/Sysfs)
* [proc](https://en.wikipedia.org/wiki/Procfs)
* [cgroups kernel documentation](https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt)
* [cgroups v2](https://www.kernel.org/doc/Documentation/cgroup-v2.txt)
* [bash](https://www.gnu.org/software/bash/)
* [docker](https://en.wikipedia.org/wiki/Docker_\(software\))
* [perf events](https://en.wikipedia.org/wiki/Perf_\(Linux\))
* [Previous chapter](/MM/linux-mm-1.md)


================================================
FILE: Concepts/README.md
================================================
# Linux 内核概念

本章描述内核中使用到的各种各样的概念。

* [每个 CPU 的变量](linux-cpu-1.md)
* [CPU 掩码](linux-cpu-2.md)
* [initcall 机制](linux-cpu-3.md)
* [Linux 内核的通知链](linux-cpu-4.md)


================================================
FILE: Concepts/linux-cpu-1.md
================================================
Per-cpu 变量
================================================================================

Per-cpu 变量是一项内核特性。从它的名字你就可以理解这项特性的意义了。我们可以创建一个变量，然后每个 CPU 上都会有一个此变量的拷贝。本节我们来看下这个特性，并试着去理解它是如何实现以及工作的。

内核提供了一个创建 per-cpu 变量的 API - `DEFINE_PER_CPU` 宏：

```C
#define DEFINE_PER_CPU(type, name) \
        DEFINE_PER_CPU_SECTION(type, name, "")
```

正如其它许多处理 per-cpu 变量的宏一样，这个宏定义在 [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h) 中。现在我们来看下这个特性是如何实现的。

看下 `DECLARE_PER_CPU` 的定义，可以看到它使用了 2 个参数：`type` 和 `name`，因此我们可以这样创建 per-cpu 变量：

```C
DEFINE_PER_CPU(int, per_cpu_n)
```

我们传入要创建变量的类型和名字，`DEFINE_PER_CPU` 调用 `DEFINE_PER_CPU_SECTION`，将两个参数和空字符串传递给后者。让我们来看下 `DEFINE_PER_CPU_SECTION` 的定义：

```C
#define DEFINE_PER_CPU_SECTION(type, name, sec)    \
         __PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES  \
         __typeof__(type) name
```

```C
#define __PCPU_ATTRS(sec)                                                \
         __percpu __attribute__((section(PER_CPU_BASE_SECTION sec)))     \
         PER_CPU_ATTRIBUTES
```

其中 `section` 是:

```C
#define PER_CPU_BASE_SECTION ".data..percpu"
```

当所有的宏展开之后，我们得到一个全局的 per-cpu 变量：

```C
__attribute__((section(".data..percpu"))) int per_cpu_n
```

这意味着我们在 `.data..percpu` 段有了一个 `per_cpu_n` 变量，可以在 `vmlinux` 中找到它：

```
.data..percpu 00013a58  0000000000000000  0000000001a5c000  00e00000  2**12
              CONTENTS, ALLOC, LOAD, DATA
```

好，现在我们知道了，当我们使用 `DEFINE_PER_CPU` 宏时，一个在 `.data..percpu` 段中的 per-cpu 变量就被创建了。内核初始化时，调用 `setup_per_cpu_areas` 函数多次加载 `.data..percpu` 段，每个 CPU 一次。

让我们来看下 per-cpu 区域初始化流程。它从 [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) 中调用 `setup_per_cpu_areas` 函数开始，这个函数定义在 [arch/x86/kernel/setup_percpu.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/setup_percpu.c) 中。

```C
pr_info("NR_CPUS:%d nr_cpumask_bits:%d nr_cpu_ids:%d nr_node_ids:%d\n",
        NR_CPUS, nr_cpumask_bits, nr_cpu_ids, nr_node_ids);
```

 `setup_per_cpu_areas` 开始输出在内核配置中以 `CONFIG_NR_CPUS` 配置项设置的最大 CPUs 数，实际的 CPU 个数，`nr_cpumask_bits`（对于新的 `cpumask` 操作来说和 `NR_CPUS` 是一样的），还有 `NUMA` 节点个数。

我们可以在 `dmesg` 中看到这些输出：

```
$ dmesg | grep percpu
[    0.000000] setup_percpu: NR_CPUS:8 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
```

然后我们检查 `per-cpu` 第一个块分配器。所有的 per-cpu 区域都是以块进行分配的。第一个块用于静态 per-cpu 变量。Linux 内核提供了决定第一个块分配器类型的命令行：`percpu_alloc` 。我们可以在内核文档中读到它的说明。

```
percpu_alloc=	选择要使用哪个 per-cpu 第一个块分配器。
		当前支持的类型是 "embed" 和 "page"。
        不同架构支持这些类型的子集或不支持。
        更多分配器的细节参考 mm/percpu.c 中的注释。
        这个参数主要是为了调试和性能比较的。
```

[mm/percpu.c](https://github.com/torvalds/linux/blob/master/mm/percpu.c) 包含了这个命令行选项的处理函数：

```C
early_param("percpu_alloc", percpu_alloc_setup);
```

其中 `percpu_alloc_setup` 函数根据 `percpu_alloc` 参数值设置 `pcpu_chosen_fc` 变量。默认第一个块分配器是 `auto`：

```C
enum pcpu_fc pcpu_chosen_fc __initdata = PCPU_FC_AUTO;
```

如果内核命令行中没有设置 `percpu_alloc` 参数，就会使用 `embed` 分配器，将第一个 per-cpu 块嵌入进带 [memblock](/MM/linux-mm-1.md) 的 bootmem。最后一个分配器和第一个块 `page` 分配器一样，只是将第一个块使用 `PAGE_SIZE` 页进行了映射。

如我上面所写，首先我们在 `setup_per_cpu_areas` 中对第一个块分配器检查，检查到第一个块分配器不是 page 分配器：

```C
if (pcpu_chosen_fc != PCPU_FC_PAGE) {
    ...
    ...
    ...
}
```

如果不是 `PCPU_FC_PAGE`，我们就使用 `embed` 分配器并使用 `pcpu_embed_first_chunk` 函数分配第一块空间。

```C
rc = pcpu_embed_first_chunk(PERCPU_FIRST_CHUNK_RESERVE,
					    dyn_size, atom_size,
					    pcpu_cpu_distance,
					    pcpu_fc_alloc, pcpu_fc_free);
```

如前所述，函数 `pcpu_embed_first_chunk` 将第一个 per-cpu 块嵌入 bootmen，因此我们传递一些参数给 `pcpu_embed_first_chunk`。参数如下：

* `PERCPU_FIRST_CHUNK_RESERVE` - 为静态变量 `per-cpu` 保留空间的大小；
* `dyn_size` - 动态分配的最少空闲字节；
* `atom_size` - 所有的分配都是这个的整数倍，并以此对齐；
* `pcpu_cpu_distance` - 决定 cpus 距离的回调函数；
* `pcpu_fc_alloc` - 分配 `percpu` 页的函数；
* `pcpu_fc_free` - 释放 `percpu` 页的函数。

在调用 `pcpu_embed_first_chunk` 前我们计算好所有的参数：

```C
const size_t dyn_size = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE - PERCPU_FIRST_CHUNK_RESERVE;
size_t atom_size;
#ifdef CONFIG_X86_64
		atom_size = PMD_SIZE;
#else
		atom_size = PAGE_SIZE;
#endif
```

如果第一个块分配器是 `PCPU_FC_PAGE`，我们用 `pcpu_page_first_chunk` 而不是 `pcpu_embed_first_chunk`。 `per-cpu` 区域准备好以后，我们用 `setup_percpu_segment` 函数设置 `per-cpu` 的偏移和段（只针对 `x86` 系统），并将前面的数据从数组移到 `per-cpu` 变量（`x86_cpu_to_apicid`, `irq_stack_ptr` 等等）。当内核完成初始化进程后，我们就有了N个 `.data..percpu` 段，其中 N 是 CPU 个数，bootstrap 进程使用的段将会包含用 `DEFINE_PER_CPU` 宏创建的未初始化的变量。

内核提供了操作 per-cpu 变量的API：

* get_cpu_var(var)
* put_cpu_var(var)

让我们来看看 `get_cpu_var` 的实现：

```C
#define get_cpu_var(var)     \
(*({                         \
         preempt_disable();  \
         this_cpu_ptr(&var); \
}))
```

Linux 内核是抢占式的，获取 per-cpu 变量需要我们知道内核运行在哪个处理器上。因此访问 per-cpu 变量时，当前代码不能被抢占，不能移到其它的 CPU。如我们所见，这就是为什么首先调用 `preempt_disable` 函数然后调用 `this_cpu_ptr` 宏，像这样：

```C
#define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
```

以及

```C
#define raw_cpu_ptr(ptr)        per_cpu_ptr(ptr, 0)
```

`per_cpu_ptr` 返回一个指向给定 CPU（第 2 个参数） per-cpu 变量的指针。当我们创建了一个 per-cpu 变量并对其进行了修改时，我们必须调用 `put_cpu_var` 宏通过函数 `preempt_enable` 使能抢占。因此典型的 per-cpu 变量的使用如下：

```C
get_cpu_var(var);
...
//用这个 'var' 做些啥
...
put_cpu_var(var);
```

让我们来看下这个 `per_cpu_ptr` 宏：

```C
#define per_cpu_ptr(ptr, cpu)                             \
({                                                        \
        __verify_pcpu_ptr(ptr);                           \
         SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu)));  \
})
```

就像我们上面写的，这个宏返回了一个给定 cpu 的 per-cpu 变量。首先它调用了 `__verify_pcpu_ptr`：

```C
#define __verify_pcpu_ptr(ptr)
do {
	const void __percpu *__vpp_verify = (typeof((ptr) + 0))NULL;
	(void)__vpp_verify;
} while (0)
```

该宏声明了 `ptr` 类型的 `const void __percpu *`。

之后，我们可以看到带两个参数的 `SHIFT_PERCPU_PTR` 宏的调用。第一个参数是我们的指针，第二个参数是传给 `per_cpu_offset` 宏的CPU数：

```C
#define per_cpu_offset(x) (__per_cpu_offset[x])
```

该宏将 `x` 扩展为 `__per_cpu_offset` 数组：

```C
extern unsigned long __per_cpu_offset[NR_CPUS];
```

其中 `NR_CPUS` 是 CPU 的数目。`__per_cpu_offset` 数组以 CPU 变量拷贝之间的距离填充。例如，所有 per-cpu 变量是 `X` 字节大小，所以我们通过 `__per_cpu_offset[Y]` 就可以访问 `X*Y`。让我们来看下 `SHIFT_PERCPU_PTR` 的实现：

```C
#define SHIFT_PERCPU_PTR(__p, __offset)                                 \
         RELOC_HIDE((typeof(*(__p)) __kernel __force *)(__p), (__offset))
```

`RELOC_HIDE` 只是取得偏移量 `(typeof(ptr)) (__ptr + (off))`，并返回一个指向该变量的指针。

就这些了！当然这不是全部的 API，只是一个大概。开头是比较艰难，但是理解 per-cpu 变量你只需理解 [include/linux/percpu-defs.h](https://github.com/torvalds/linux/blob/master/include/linux/percpu-defs.h) 的奥秘。

让我们再看下获得 per-cpu 变量指针的算法：

* 内核在初始化流程中创建多个 `.data..percpu` 段（一个 per-cpu 变量一个）；
* 所有 `DEFINE_PER_CPU` 宏创建的变量都将重新分配到首个扇区或者 CPU0；
* `__per_cpu_offset` 数组以 (`BOOT_PERCPU_OFFSET`) 和 `.data..percpu` 扇区之间的距离填充；
* 当 `per_cpu_ptr` 被调用时，例如取一个 per-cpu 变量的第三个 CPU 的指针，将访问 `__per_cpu_offset` 数组，该数组的索引指向了所需 CPU。

就这么多了。


================================================
FILE: Concepts/linux-cpu-2.md
================================================
CPU masks
================================================================================

介绍
--------------------------------------------------------------------------------

`Cpumasks` 是Linux内核提供的保存系统CPU信息的特殊方法。包含 `Cpumasks` 操作 API 相关的源码和头文件：

* [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/master/include/linux/cpumask.h)
* [lib/cpumask.c](https://github.com/torvalds/linux/blob/master/lib/cpumask.c)
* [kernel/cpu.c](https://github.com/torvalds/linux/blob/master/kernel/cpu.c)

正如 [include/linux/cpumask.h](https://github.com/torvalds/linux/blob/master/include/linux/cpumask.h) 注释：Cpumasks 提供了代表系统中 CPU 集合的位图，一位放置一个 CPU 序号。我们已经在 [Kernel entry point](/Initialization/linux-initialization-4.md) 部分，函数 `boot_cpu_init` 中看到了一点 cpumask。这个函数将第一个启动的 cpu 上线、激活等等……

```C
set_cpu_online(cpu, true);
set_cpu_active(cpu, true);
set_cpu_present(cpu, true);
set_cpu_possible(cpu, true);
```

`set_cpu_possible` 是一个在系统启动时任意时刻都可插入的 cpu ID 集合。`cpu_present` 代表了当前插入的 CPUs。`cpu_online` 是 `cpu_present` 的子集，表示可调度的 CPUs。这些掩码依赖于 `CONFIG_HOTPLUG_CPU` 配置选项，以及 `possible == present` 和 `active == online` 选项是否被禁用。这些函数的实现很相似，检测第二个参数，如果为 `true`，就调用 `cpumask_set_cpu` ，否则调用 `cpumask_clear_cpu`。

有两种方法创建 `cpumask`。第一种是用 `cpumask_t`。定义如下：

```C
typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
```

它封装了 `cpumask` 结构，其包含了一个位掩码 `bits` 字段。`DECLARE_BITMAP` 宏有两个参数：

* bitmap name;
* number of bits.

并以给定名称创建了一个 `unsigned long` 数组。它的实现非常简单：

```C
#define DECLARE_BITMAP(name,bits) \
        unsigned long name[BITS_TO_LONGS(bits)]
```

其中 `BITS_TO_LONGS`：

```C
#define BITS_TO_LONGS(nr)       DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
```

因为我们专注于 `x86_64` 架构，`unsigned long` 是8字节大小，因此我们的数组仅包含一个元素：

```
(((8) + (8) - 1) / (8)) = 1
```

`NR_CPUS` 宏表示的是系统中 CPU 的数目，且依赖于在 [include/linux/threads.h](https://github.com/torvalds/linux/blob/master/include/linux/threads.h) 中定义的 `CONFIG_NR_CPUS` 宏，看起来像这样：

```C
#ifndef CONFIG_NR_CPUS
        #define CONFIG_NR_CPUS  1
#endif

#define NR_CPUS         CONFIG_NR_CPUS
```

第二种定义 cpumask 的方法是直接使用宏 `DECLARE_BITMAP` 和 `to_cpumask` 宏，后者将给定的位图转化为 `struct cpumask *`：

```C
#define to_cpumask(bitmap)                                              \
        ((struct cpumask *)(1 ? (bitmap)                                \
                            : (void *)sizeof(__check_is_bitmap(bitmap))))
```

可以看到这里的三目运算符每次总是 `true`。`__check_is_bitmap` 内联函数定义为：

```C
static inline int __check_is_bitmap(const unsigned long *bitmap)
{
        return 1;
}
```

每次都是返回 `1`。我们需要它只是因为：编译时检测一个给定的 `bitmap` 是一个位图，换句话说，它检测一个 `bitmap` 是否有 `unsigned long *` 类型。因此我们传递 `cpu_possible_bits` 给宏 `to_cpumask` ，将 `unsigned long` 数组转换为 `struct cpumask *`。

cpumask API
--------------------------------------------------------------------------------

因为我们可以用其中一个方法来定义 cpumask，Linux 内核提供了 API 来处理 cpumask。我们来研究下其中一个函数，例如 `set_cpu_online`，这个函数有两个参数：

* CPU 数目;
* CPU 状态;

这个函数的实现如下所示：

```C
void set_cpu_online(unsigned int cpu, bool online)
{
	if (online) {
		cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
		cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
	} else {
		cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
	}
}
```

该函数首先检测第二个 `state` 参数并调用依赖它的 `cpumask_set_cpu` 或 `cpumask_clear_cpu`。这里我们可以看到在中 `cpumask_set_cpu` 的第二个参数转换为 `struct cpumask *`。在我们的例子中是位图 `cpu_online_bits`，定义如下：

```C
static DECLARE_BITMAP(cpu_online_bits, CONFIG_NR_CPUS) __read_mostly;
```

函数 `cpumask_set_cpu` 仅调用了一次 `set_bit` 函数：

```C
static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
{
        set_bit(cpumask_check(cpu), cpumask_bits(dstp));
}
```

`set_bit` 函数也有两个参数，设置了一个给定位（第一个参数）的内存（第二个参数或 `cpu_online_bits` 位图）。这儿我们可以看到在调用 `set_bit` 之前，它的两个参数会传递给

* cpumask_check;
* cpumask_bits.

让我们细看下这两个宏。第一个 `cpumask_check` 在我们的例子里没做任何事，只是返回了给的参数。第二个 `cpumask_bits` 只是返回了传入 `struct cpumask *` 结构的 `bits` 域。

```C
#define cpumask_bits(maskp) ((maskp)->bits)
```

现在让我们看下 `set_bit` 的实现：

```C
 static __always_inline void
 set_bit(long nr, volatile unsigned long *addr)
 {
         if (IS_IMMEDIATE(nr)) {
                asm volatile(LOCK_PREFIX "orb %1,%0"
                        : CONST_MASK_ADDR(nr, addr)
                        : "iq" ((u8)CONST_MASK(nr))
                        : "memory");
        } else {
                asm volatile(LOCK_PREFIX "bts %1,%0"
                        : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
        }
 }
```

这个函数看着吓人，但它没有看起来那么难。首先传参 `nr` 或者说位数给 `IS_IMMEDIATE` 宏，该宏调用了 GCC 内联函数 `__builtin_constant_p`：

```C
#define IS_IMMEDIATE(nr)    (__builtin_constant_p(nr))
```

`__builtin_constant_p` 检查给定参数是否编译时恒定变量。因为我们的 `cpu` 不是编译时恒定变量，将会执行 `else` 分支：

```C
asm volatile(LOCK_PREFIX "bts %1,%0" : BITOP_ADDR(addr) : "Ir" (nr) : "memory");
```

让我们试着一步一步来理解它如何工作的：

`LOCK_PREFIX` 是个 x86 `lock` 指令。这个指令告诉 CPU 当指令执行时占据系统总线。这允许 CPU 同步内存访问，防止多核（或多设备 - 比如 DMA 控制器）并发访问同一个内存cell。

`BITOP_ADDR` 转换给定参数至 `(*(volatile long *)` 并且加了 `+m` 约束。`+` 意味着这个操作数对于指令是可读写的。`m` 显示这是一个内存操作数。`BITOP_ADDR` 定义如下：

```C
#define BITOP_ADDR(x) "+m" (*(volatile long *) (x))
```

接下来是 `memory`。它告诉编译器汇编代码执行内存读或写到某些项，而不是那些输入或输出操作数（例如，访问指向输出参数的内存）。

`Ir` - 寄存器操作数。

`bts` 指令设置一个位字符串的给定位，存储给定位的值到 `CF` 标志位。所以我们传递 cpu 号，我们的例子中为 0，给 `set_bit` 并且执行后，其设置了在 `cpu_online_bits` cpumask 中的 0 位。这意味着第一个 cpu 此时上线了。

当然，除了 `set_cpu_*` API 外，cpumask 提供了其它 cpumasks 操作的 API。让我们简短看下。

附加的 cpumask API
--------------------------------------------------------------------------------

cpumaks 提供了一系列宏来得到不同状态 CPUs 序号。例如：

```C
#define num_online_cpus()	cpumask_weight(cpu_online_mask)
```

这个宏返回了 `online` CPUs 数量。它读取 `cpu_online_mask` 位图并调用了 `cpumask_weight` 函数。`cpumask_weight` 函数使用两个参数调用了一次 `bitmap_weight` 函数：

* cpumask bitmap;
* `nr_cpumask_bits` - 在我们的例子中就是 `NR_CPUS`。

```C
static inline unsigned int cpumask_weight(const struct cpumask *srcp)
{
	return bitmap_weight(cpumask_bits(srcp), nr_cpumask_bits);
}
```

并计算给定位图的位数。除了 `num_online_cpus`，cpumask还提供了所有 CPU 状态的宏：

* num_possible_cpus;
* num_active_cpus;
* cpu_online;
* cpu_possible.

等等。

除了 Linux 内核提供的下述操作 `cpumask` 的 API：

* `for_each_cpu` - 遍历一个mask的所有 cpu;
* `for_each_cpu_not` - 遍历所有补集的 cpu;
* `cpumask_clear_cpu` - 清除一个 cpumask 的 cpu;
* `cpumask_test_cpu` - 测试一个 mask 中的 cpu;
* `cpumask_setall` - 设置 mask 的所有 cpu;
* `cpumask_size` - 返回分配 'struct cpumask' 字节数大小;

还有很多。

链接
--------------------------------------------------------------------------------

* [cpumask documentation](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)


================================================
FILE: Concepts/linux-cpu-3.md
================================================
initcall 机制
================================================================================

介绍
--------------------------------------------------------------------------------


就像你从标题所理解的，这部分将涉及 Linux 内核中有趣且重要的概念，称之为 `initcall`。在 Linux 内核中，我们可以看到类似这样的定义：

```C
early_param("debug", debug_kernel);
```

或者

```C
arch_initcall(init_pit_clocksource);
```

在我们分析这个机制在内核中是如何实现的之前，我们必须了解这个机制是什么，以及在 Linux 内核中是如何使用它的。像这样的定义表示一个 [回调函数](https://en.wikipedia.org/wiki/Callback_%28computer_programming%29) ，它们会在 Linux 内核启动中或启动后调用。实际上 `initcall` 机制的要点是确定内置模块和子系统初始化的正确顺序。举个例子，我们来看看下面的函数：

```C
static int __init nmi_warning_debugfs(void)
{
    debugfs_create_u64("nmi_longest_ns", 0644,
                       arch_debugfs_dir, &nmi_longest_ns);
    return 0;
}
```

这个函数出自源码文件 [arch/x86/kernel/nmi.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/nmi.c)。我们可以看到，这个函数只是在 `arch_debugfs_dir` 目录中创建 `nmi_longest_ns` [debugfs](https://en.wikipedia.org/wiki/Debugfs) 文件。实际上，只有在 `arch_debugfs_dir` 创建后，才会创建这个 `debugfs` 文件。这个目录是在 Linux 内核特定架构的初始化期间创建的。实际上，该目录将在源码文件 [arch/x86/kernel/kdebugfs.c](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/kdebugfs.c) 的 `arch_kdebugfs_init` 函数中创建。注意 `arch_kdebugfs_init` 函数也被标记为 `initcall`。

```C
arch_initcall(arch_kdebugfs_init);
```

Linux 内核在调用 `fs` 相关的 `initcalls` 之前调用所有特定架构的 `initcalls`。因此，只有在 `arch_kdebugfs_dir` 目录创建以后才会创建我们的 `nmi_longest_ns`。实际上，Linux 内核提供了八个级别的主 `initcalls`：

* `early`;
* `core`;
* `postcore`;
* `arch`;
* `susys`;
* `fs`;
* `device`;
* `late`.

它们的所有名称是由数组 `initcall_level_names` 来描述的，该数组定义在源码文件 [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) 中：

```C
static char *initcall_level_names[] __initdata = {
	"early",
	"core",
	"postcore",
	"arch",
	"subsys",
	"fs",
	"device",
	"late",
};
```

所有用这些标识符标记为 `initcall` 的函数将会以相同的顺序被调用，或者说，`early initcalls` 会首先被调用，其次是 `core initcalls`，以此类推。现在，我们对 `initcall` 机制了解点了，所以我们可以开始潜入 Linux 内核源码，来看看这个机制是如何实现的。

initcall 机制在 Linux 内核中的实现
--------------------------------------------------------------------------------

Linux 内核提供了一组来自头文件 [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h) 的宏，来标记给定的函数为 `initcall`。所有这些宏都相当简单：

```C
#define early_initcall(fn)		__define_initcall(fn, early)
#define core_initcall(fn)		__define_initcall(fn, 1)
#define postcore_initcall(fn)		__define_initcall(fn, 2)
#define arch_initcall(fn)		__define_initcall(fn, 3)
#define subsys_initcall(fn)		__define_initcall(fn, 4)
#define fs_initcall(fn)			__define_initcall(fn, 5)
#define device_initcall(fn)		__define_initcall(fn, 6)
#define late_initcall(fn)		__define_initcall(fn, 7)
```

我们可以看到，这些宏只是从同一个头文件的 `__define_initcall` 宏的调用扩展而来。此外，`__define_initcall` 宏有两个参数：

* `fn` - 在调用某个级别 `initcalls` 时调用的回调函数；
* `id` - 识别 `initcall` 的标识符，用来防止两个相同的 `initcalls` 指向同一个处理函数时出现错误。

`__define_initcall` 宏的实现如下所示：

```C
#define __define_initcall(fn, id) \
	static initcall_t __initcall_##fn##id __used \
	__attribute__((__section__(".initcall" #id ".init"))) = fn; \
	LTO_REFERENCE_INITCALL(__initcall_##fn##id)
```

要了解 `__define_initcall` 宏，首先让我们来看下 `initcall_t` 类型。这个类型定义在同一个 [头文件]() 中，它表示一个返回 [整形](https://en.wikipedia.org/wiki/Integer)指针的函数指针，这将是 `initcall` 的结果：

```C
typedef int (*initcall_t)(void);
```

现在让我们回到 `_-define_initcall` 宏。[##](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html) 提供了连接两个符号的能力。在我们的例子中，`__define_initcall` 宏的第一行产生了 `.initcall id .init` [ELF 部分](http://www.skyfree.org/linux/references/ELF_Format.pdf) 给定函数的定义，并标记以下 [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) 属性： `__initcall_function_name_id` 和 `__used`。如果我们查看表示内核链接脚本数据的 [include/asm-generic/vmlinux.lds.h](https://github.com/torvalds/linux/blob/master/include/asm-generic/vmlinux.lds.h) 头文件，我们会看到所有的 `initcalls` 部分都将放在 `.data` 段：

```C
#define INIT_CALLS					\
		VMLINUX_SYMBOL(__initcall_start) = .;	\
		*(.initcallearly.init)					\
		INIT_CALLS_LEVEL(0)					    \
		INIT_CALLS_LEVEL(1)					    \
		INIT_CALLS_LEVEL(2)					    \
		INIT_CALLS_LEVEL(3)					    \
		INIT_CALLS_LEVEL(4)					    \
		INIT_CALLS_LEVEL(5)					    \
		INIT_CALLS_LEVEL(rootfs)				\
		INIT_CALLS_LEVEL(6)					    \
		INIT_CALLS_LEVEL(7)					    \
		VMLINUX_SYMBOL(__initcall_end) = .;

#define INIT_DATA_SECTION(initsetup_align)	\
	.init.data : AT(ADDR(.init.data) - LOAD_OFFSET) {	   \
        ...                                                \
        INIT_CALLS						                   \
        ...                                                \
	}

```

第二个属性 - `__used`，定义在 [include/linux/compiler-gcc.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler-gcc.h) 头文件中，它扩展了以下 `gcc` 定义：

```C
#define __used   __attribute__((__used__))
```

它防止 `定义了变量但未使用` 的告警。宏 `__define_initcall` 最后一行是：

```C
LTO_REFERENCE_INITCALL(__initcall_##fn##id)
```

这取决于 `CONFIG_LTO` 内核配置选项，只为编译器提供[链接时间优化](https://gcc.gnu.org/wiki/LinkTimeOptimization)存根：

```
#ifdef CONFIG_LTO
#define LTO_REFERENCE_INITCALL(x) \
        static __used __exit void *reference_##x(void)  \
        {                                               \
                return &x;                              \
        }
#else
#define LTO_REFERENCE_INITCALL(x)
#endif
```

为了防止当模块中的变量没有引用时而产生的任何问题，它被移到了程序末尾。这就是关于 `__define_initcall` 宏的全部了。所以，所有的 `*_initcall` 宏将会在Linux内核编译时扩展，所有的 `initcalls` 会放置在它们的段内，并可以通过 `.data` 段来获取，Linux 内核在初始化过程中就知道在哪儿去找到 `initcall` 并调用它。

既然 Linux 内核可以调用 `initcalls`，我们就来看下 Linux 内核是如何做的。这个过程从 [init/main.c](https://github.com/torvalds/linux/blob/master/init/main.c) 头文件的 `do_basic_setup` 函数开始：

```C
static void __init do_basic_setup(void)
{
    ...
    ...
    ...
   	do_initcalls();
    ...
    ...
    ...
}
```

该函数在 Linux 内核初始化过程中调用，调用时机是主要的初始化步骤，比如内存管理器相关的初始化、`CPU` 子系统等完成之后。`do_initcalls` 函数只是遍历 `initcall` 级别数组，并调用每个级别的 `do_initcall_level` 函数：

```C
static void __init do_initcalls(void)
{
	int level;

	for (level = 0; level < ARRAY_SIZE(initcall_levels) - 1; level++)
		do_initcall_level(level);
}
```

`initcall_levels` 数组在同一个源码[文件](https://github.com/torvalds/linux/blob/master/init/main.c)中定义，包含了定义在 `__define_initcall` 宏中的那些段的指针：

```C
static initcall_t *initcall_levels[] __initdata = {
	__initcall0_start,
	__initcall1_start,
	__initcall2_start,
	__initcall3_start,
	__initcall4_start,
	__initcall5_start,
	__initcall6_start,
	__initcall7_start,
	__initcall_end,
};
```

如果你有兴趣，你可以在 Linux 内核编译后生成的链接器脚本 `arch/x86/kernel/vmlinux.lds` 中找到这些段：

```
.init.data : AT(ADDR(.init.data) - 0xffffffff80000000) {
    ...
    ...
    ...
    ...
    __initcall_start = .;
    *(.initcallearly.init)
    __initcall0_start = .;
    *(.initcall0.init)
    *(.initcall0s.init)
    __initcall1_start = .;
    ...
    ...
}
```

如果你对这些不熟，可以在本书的某些[部分](/Misc/linux-misc-3.md)了解更多关于[链接器](https://en.wikipedia.org/wiki/Linker_%28computing%29)的信息。

正如我们刚看到的，`do_initcall_level` 函数有一个参数 - `initcall` 的级别，做了以下两件事：首先这个函数拷贝了 `initcall_command_line`，这是通常内核包含了各个模块参数的[命令行](https://www.kernel.org/doc/Documentation/kernel-parameters.txt)的副本，并用 [kernel/params.c](https://github.com/torvalds/linux/blob/master/kernel/params.c)源码文件的 `parse_args` 函数解析它，然后调用各个级别的 `do_on_initcall` 函数：

```C
for (fn = initcall_levels[level]; fn < initcall_levels[level+1]; fn++)
		do_one_initcall(*fn);
```

`do_on_initcall` 为我们做了主要的工作。我们可以看到，这个函数有一个参数表示 `initcall` 回调函数，并调用给定的回调函数：

```C
int __init_or_module do_one_initcall(initcall_t fn)
{
	int count = preempt_count();
	int ret;
	char msgbuf[64];

	if (initcall_blacklisted(fn))
		return -EPERM;

	if (initcall_debug)
		ret = do_one_initcall_debug(fn);
	else
		ret = fn();

	msgbuf[0] = 0;

	if (preempt_count() != count) {
		sprintf(msgbuf, "preemption imbalance ");
		preempt_count_set(count);
	}
	if (irqs_disabled()) {
		strlcat(msgbuf, "disabled interrupts ", sizeof(msgbuf));
		local_irq_enable();
	}
	WARN(msgbuf[0], "initcall %pF returned with %s\n", fn, msgbuf);

	return ret;
}
```

让我们来试着理解 `do_on_initcall` 函数做了什么。首先我们增加 [preemption](https://en.wikipedia.org/wiki/Preemption_%28computing%29) 计数，以便我们稍后进行检查，确保它不是不平衡的。这步以后，我们可以看到 `initcall_backlist` 函数的调用，这个函数遍历包含了 `initcalls` 黑名单的 `blacklisted_initcalls` 链表，如果 `initcall` 在黑名单里就释放它：

```C
list_for_each_entry(entry, &blacklisted_initcalls, next) {
	if (!strcmp(fn_name, entry->buf)) {
		pr_debug("initcall %s blacklisted\n", fn_name);
		kfree(fn_name);
		return true;
	}
}
```

黑名单的 `initcalls` 保存在 `blacklisted_initcalls` 链表中，这个链表是在早期 Linux 内核初始化时由 Linux 内核命令行来填充的。

处理完进入黑名单的 `initcalls`，接下来的代码直接调用 `initcall`：

```C
if (initcall_debug)
	ret = do_one_initcall_debug(fn);
else
	ret = fn();
```

取决于 `initcall_debug` 变量的值，`do_one_initcall_debug` 函数将调用 `initcall`，或直接调用 `fn()`。`initcall_debug` 变量定义在[同一个源码文件](https://github.com/torvalds/linux/blob/master/init/main.c)：

```C
bool initcall_debug;
```

该变量提供了向内核[日志缓冲区](https://en.wikipedia.org/wiki/Dmesg)打印一些信息的能力。可以通过 `initcall_debug` 参数从内核命令行中设置这个变量的值。从Linux内核命令行[文档](https://www.kernel.org/doc/Documentation/kernel-parameters.txt)可以看到：

```
initcall_debug	[KNL] Trace initcalls as they are executed.  Useful
                      for working out where the kernel is dying during
                      startup.
```

确实如此。如果我们看下 `do_one_initcall_debug` 函数的实现，我们会看到它与 `do_one_initcall` 函数做了一样的事，也就是说，`do_one_initcall_debug` 函数调用了给定的 `initcall`，并打印了一些和 `initcall` 相关的信息（比如当前任务的 [pid](https://en.wikipedia.org/wiki/Process_identifier)、`initcall` 的持续时间等）：

```C
static int __init_or_module do_one_initcall_debug(initcall_t fn)
{
	ktime_t calltime, delta, rettime;
	unsigned long long duration;
	int ret;

	printk(KERN_DEBUG "calling  %pF @ %i\n", fn, task_pid_nr(current));
	calltime = ktime_get();
	ret = fn();
	rettime = ktime_get();
	delta = ktime_sub(rettime, calltime);
	duration = (unsigned long long) ktime_to_ns(delta) >> 10;
	printk(KERN_DEBUG "initcall %pF returned %d after %lld usecs\n",
		 fn, ret, duration);

	return ret;
}
```

由于 `initcall` 被 `do_one_initcall` 或 `do_one_initcall_debug` 调用，我们可以看到在 `do_one_initcall` 函数末尾做了两次检查。第一个检查在initcall执行内部 `__preempt_count_add` 和 `__preempt_count_sub` 可能的执行次数，如果这个值和之前的可抢占计数不相等，我们就把 `preemption imbalance` 字符串添加到消息缓冲区，并设置正确的可抢占计数：

```C
if (preempt_count() != count) {
	sprintf(msgbuf, "preemption imbalance ");
	preempt_count_set(count);
}
```

稍后这个错误字符串就会被打印出来。最后检查本地 [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29) 的状态，如果它们被禁用了，我们就将 `disabled interrupts` 字符串添加到我们的消息缓冲区，并为当前处理器使能 `IRQs`，以防出现 `IRQs` 被 `initcall` 禁用了但不再使能的情况出现：

```C
if (irqs_disabled()) {
	strlcat(msgbuf, "disabled interrupts ", sizeof(msgbuf));
	local_irq_enable();
}
```

这就是全部了。通过这种方式，Linux 内核以正确的顺序完成了很多子系统的初始化。现在我们知道 Linux 内核的 `initcall` 机制是怎么回事了。在这部分中，我们介绍了 `initcall` 机制的主要部分，但遗留了一些重要的概念。让我们来简单看下这些概念。

首先，我们错过了一个级别的 `initcalls`，就是 `rootfs initcalls`。和我们在本部分看到的很多宏类似，你可以在 [include/linux/init.h](https://github.com/torvalds/linux/blob/master/include/linux/init.h) 头文件中找到 `rootfs_initcall` 的定义：

```C
#define rootfs_initcall(fn)		__define_initcall(fn, rootfs)
```

从这个宏的名字我们可以理解到，它的主要目的是保存和 [rootfs](https://en.wikipedia.org/wiki/Initramfs) 相关的回调。除此之外，只有在与设备相关的东西没被初始化时，在文件系统级别初始化以后再初始化一些其它东西时才有用。例如，发生在源码文件 [init/initramfs.c](https://github.com/torvalds/linux/blob/master/init/initramfs.c) 中 `populate_rootfs` 函数里的解压  [initramfs](https://en.wikipedia.org/wiki/Initramfs)：

```C
rootfs_initcall(populate_rootfs);
```

在这里，我们可以看到熟悉的输出：

```
[    0.199960] Unpacking initramfs...
```

除了 `rootfs_initcall` 级别，还有其它的 `console_initcall`、 `security_initcall` 和其他辅助的  `initcall` 级别。我们遗漏的最后一件事，是 `*_initcall_sync` 级别的集合。在这部分我们看到的几乎每个 `*_initcall` 宏，都有 `_sync` 前缀的宏伴随：

```C
#define core_initcall_sync(fn)		__define_initcall(fn, 1s)
#define postcore_initcall_sync(fn)	__define_initcall(fn, 2s)
#define arch_initcall_sync(fn)		__define_initcall(fn, 3s)
#define subsys_initcall_sync(fn)	__define_initcall(fn, 4s)
#define fs_initcall_sync(fn)		__define_initcall(fn, 5s)
#define device_initcall_sync(fn)	__define_initcall(fn, 6s)
#define late_initcall_sync(fn)		__define_initcall(fn, 7s)
```

这些附加级别的主要目的是，等待所有某个级别的与模块相关的初始化例程完成。

这就是全部了。

结论
--------------------------------------------------------------------------------

在这部分中，我们看到了 Linux 内核的一项重要机制，即在初始化期间允许调用依赖于 Linux 内核当前状态的函数。

如果你有问题或建议，可随时在 twitter [0xAX](https://twitter.com/0xAX) 上联系我，给我发 [email](mailto:anotherworldofworld@gmail.com)，或者创建 [issue](https://github.com/0xAX/linux-insides/issues/new)。

**请注意英语不是我的母语，对此带来的不便，我很抱歉。如果你发现了任何错误，都可以给我发 PR 到[linux-insides](https://github.com/0xAX/linux-insides)。**.

链接
--------------------------------------------------------------------------------

* [callback](https://en.wikipedia.org/wiki/Callback_%28computer_programming%29)
* [debugfs](https://en.wikipedia.org/wiki/Debugfs)
* [integer type](https://en.wikipedia.org/wiki/Integer)
* [symbols concatenation](https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html)
* [GCC](https://en.wikipedia.org/wiki/GNU_Compiler_Collection)
* [Link time optimization](https://gcc.gnu.org/wiki/LinkTimeOptimization)
* [Introduction to linkers](/Misc/linux-misc-3.md)
* [Linux kernel command line](https://www.kernel.org/doc/Documentation/kernel-parameters.txt)
* [Process identifier](https://en.wikipedia.org/wiki/Process_identifier)
* [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [rootfs](https://en.wikipedia.org/wiki/Initramfs)
* [previous part](/Misc/linux-misc-2.md)


================================================
FILE: Concepts/linux-cpu-4.md
================================================
Notification Chains in Linux Kernel
================================================================================

Introduction
--------------------------------------------------------------------------------

The Linux kernel is huge piece of [C](https://en.wikipedia.org/wiki/C_%28programming_language%29) code which consists from many different subsystems. Each subsystem has its own purpose which is independent of other subsystems. But often one subsystem wants to know something from other subsystem(s). There is special mechanism in the Linux kernel which allows to solve this problem partly. The name of this mechanism is - `notification chains` and its main purpose to provide a way for different subsystems to subscribe on asynchronous events from other subsystems. Note that this mechanism is only for communication inside kernel, but there are other mechanisms for communication between kernel and userspace.

Before we consider `notification chains` [API](https://en.wikipedia.org/wiki/Application_programming_interface) and implementation of this API, let's look at `Notification chains` mechanism from theoretical side as we did it in other parts of this book. Everything which is related to `notification chains` mechanism is located in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file and [kernel/notifier.c](https://github.com/torvalds/linux/blob/master/kernel/notifier.c) source code file. So let's open them and start to dive.

Notification Chains related data structures
--------------------------------------------------------------------------------

Let's start to consider `notification chains` mechanism from related data structures. As I wrote above, main data structures should be located in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file, so the Linux kernel provides generic API which does not depend on certain architecture. In general, the `notification chains` mechanism represents a list (that's why it's named `chains`) of [callback](https://en.wikipedia.org/wiki/Callback_%28computer_programming%29) functions which are will be executed when an event will be occurred.

All of these callback functions are represented as `notifier_fn_t` type in the Linux kernel:

```C
typedef	int (*notifier_fn_t)(struct notifier_block *nb, unsigned long action, void *data);
```

So we may see that it takes three following arguments:

* `nb` - is linked list of function pointers (will see it now);
* `action` - is type of an event. A notification chain may support multiple events, so we need this parameter to distinguish an event from other events;
* `data` - is storage for private information. Actually it allows to provide additional data information about an event.

Additionally we may see that `notifier_fn_t` returns an integer value. This integer value maybe one of:

* `NOTIFY_DONE` - subscriber does not interested in notification;
* `NOTIFY_OK` - notification was processed correctly;
* `NOTIFY_BAD` - something went wrong;
* `NOTIFY_STOP` - notification is done, but no further callbacks should be called for this event.

All of these results defined as macros in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file:

```C
#define NOTIFY_DONE		0x0000
#define NOTIFY_OK		0x0001
#define NOTIFY_BAD		(NOTIFY_STOP_MASK|0x0002)
#define NOTIFY_STOP		(NOTIFY_OK|NOTIFY_STOP_MASK)
```

Where `NOTIFY_STOP_MASK` represented by the:

```C
#define NOTIFY_STOP_MASK	0x8000
```

macro and means that callbacks will not be called during next notifications.

Each part of the Linux kernel which wants to be notified on a certain event will should provide own `notifier_fn_t` callback function. Main role of the `notification chains` mechanism is to call certain callbacks when an asynchronous event occurred.

The main building block of the `notification chains` mechanism is the `notifier_block` structure:

```C
struct notifier_block {
	notifier_fn_t notifier_call;
	struct notifier_block __rcu *next;
	int priority;
};
```

which is defined in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) file. This struct contains pointer to callback function - `notifier_call`, link to the next notification callback and `priority` of a callback function as functions with higher priority are executed first.

The Linux kernel provides notification chains of four following types:

* Blocking notifier chains;
* SRCU notifier chains;
* Atomic notifier chains;
* Raw notifier chains.

Let's consider all of these types of notification chains by order:

In the first case for the `blocking notifier chains`, callbacks will be called/executed in process context. This means that the calls in a notification chain may be blocked.

The second `SRCU notifier chains` represent alternative form of `blocking notifier chains`. In the first case, blocking notifier chains uses `rw_semaphore` synchronization primitive to protect chain links. `SRCU` notifier chains run in process context too, but uses special form of [RCU](https://en.wikipedia.org/wiki/Read-copy-update) mechanism which is permissible to block in an read-side critical section.

In the third case for the `atomic notifier chains` runs in interrupt or atomic context and protected by [spinlock](/SyncPrim/linux-sync-1.md) synchronization primitive. The last `raw notifier chains` provides special type of notifier chains without any locking restrictions on callbacks. This means that protection rests on the shoulders of caller side. It is very useful when we want to protect our chain with very specific locking mechanism.

If we will look at the implementation of the `notifier_block` structure, we will see that it contains pointer to the `next` element from a notification chain list, but we have no head. Actually a head of such list is in separate structure depends on type of a notification chain. For example for the `blocking notifier chains`:

```C
struct blocking_notifier_head {
	struct rw_semaphore rwsem;
	struct notifier_block __rcu *head;
};
```

or for `atomic notification chains`:

```C
struct atomic_notifier_head {
	spinlock_t lock;
	struct notifier_block __rcu *head;
};
```

Now as we know a little about `notification chains` mechanism let's consider implementation of its API.

Notification Chains
--------------------------------------------------------------------------------

Usually there are two sides in a publish/subscriber mechanisms. One side who wants to get notifications and other side(s) who generates these notifications. We will consider notification chains mechanism from both sides. We will consider `blocking notification chains` in this part, because of other types of notification chains are similar to it and differ mostly in protection mechanisms.

Before a notification producer is able to produce notification, first of all it should initialize head of a notification chain. For example let's consider notification chains related to kernel [loadable modules](https://en.wikipedia.org/wiki/Loadable_kernel_module). If we will look in the [kernel/module.c](https://github.com/torvalds/linux/blob/master/kernel/module.c) source code file, we will see following definition:

```C
static BLOCKING_NOTIFIER_HEAD(module_notify_list);
```

which defines head for loadable modules blocking notifier chain. The `BLOCKING_NOTIFIER_HEAD` macro is defined in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file and expands to the following code:

```C
#define BLOCKING_INIT_NOTIFIER_HEAD(name) do {	\
		init_rwsem(&(name)->rwsem);	                            \
		(name)->head = NULL;		                            \
	} while (0)
```

So we may see that it takes name of a name of a head of a blocking notifier chain and initializes read/write [semaphore](/SyncPrim/linux-sync-3.md) and set head to `NULL`. Besides the `BLOCKING_INIT_NOTIFIER_HEAD` macro, the Linux kernel additionally provides `ATOMIC_INIT_NOTIFIER_HEAD`, `RAW_INIT_NOTIFIER_HEAD` macros and `srcu_init_notifier` function for initialization atomic and other types of notification chains.

After initialization of a head of a notification chain, a subsystem which wants to receive notification from the given notification chain should register with certain function which depends on the type of notification. If you will look in the [include/linux/notifier.h](https://github.com/torvalds/linux/blob/master/include/linux/notifier.h) header file, you will see following four function for this:

```C
extern int atomic_notifier_chain_register(struct atomic_notifier_head *nh,
		struct notifier_block *nb);

extern int blocking_notifier_chain_register(struct blocking_notifier_head *nh,
		struct notifier_block *nb);

extern int raw_notifier_chain_register(struct raw_notifier_head *nh,
		struct notifier_block *nb);

extern int srcu_notifier_chain_register(struct srcu_notifier_head *nh,
		struct notifier_block *nb);
```

As I already wrote above, we will cover only blocking notification chains in the part, so let's consider implementation of the `blocking_notifier_chain_register` function. Implementation of this function is located in the [kernel/notifier.c](https://github.com/torvalds/linux/blob/master/kernel/notifier.c) source code file and as we may see the `blocking_notifier_chain_register` takes two parameters:

* `nh` - head of a notification chain;
* `nb` - notification descriptor.

Now let's look at the implementation of the `blocking_notifier_chain_register` function:

```C
int raw_notifier_chain_register(struct raw_notifier_head *nh,
		struct notifier_block *n)
{
	return notifier_chain_register(&nh->head, n);
}
```

As we may see it just returns result of the `notifier_chain_register` function from the same source code file and as we may understand this function does all job for us. Definition of the `notifier_chain_register` function looks:

```C
int blocking_notifier_chain_register(struct blocking_notifier_head *nh,
		struct notifier_block *n)
{
	int ret;

	if (unlikely(system_state == SYSTEM_BOOTING))
		return notifier_chain_register(&nh->head, n);

	down_write(&nh->rwsem);
	ret = notifier_chain_register(&nh->head, n);
	up_write(&nh->rwsem);
	return ret;
}
```

As we may see implementation of the `blocking_notifier_chain_register` is pretty simple. First of all there is check which check current system state and if a system in rebooting state we just call the `notifier_chain_register`. In other way we do the same call of the `notifier_chain_register` but as you may see this call is protected with read/write semaphores. Now let's look at the implementation of the `notifier_chain_register` function:

```C
static int notifier_chain_register(struct notifier_block **nl,
		struct notifier_block *n)
{
	while ((*nl) != NULL) {
		if (n->priority > (*nl)->priority)
			break;
		nl = &((*nl)->next);
	}
	n->next = *nl;
	rcu_assign_pointer(*nl, n);
	return 0;
}
```

This function just inserts new `notifier_block` (given by a subsystem which wants to get notifications) to the notification chain list. Besides subscribing on an event, subscriber may unsubscribe from a certain events with the set of `unsubscribe` functions:

```C
extern int atomic_notifier_chain_unregister(struct atomic_notifier_head *nh,
		struct notifier_block *nb);

extern int blocking_notifier_chain_unregister(struct blocking_notifier_head *nh,
		struct notifier_block *nb);

extern int raw_notifier_chain_unregister(struct raw_notifier_head *nh,
		struct notifier_block *nb);

extern int srcu_notifier_chain_unregister(struct srcu_notifier_head *nh,
		struct notifier_block *nb);
```

When a producer of notifications wants to notify subscribers about an event, the `*.notifier_call_chain` function will be called. As you already may guess each type of notification chains provides own function to produce notification:

```C
extern int atomic_notifier_call_chain(struct atomic_notifier_head *nh,
		unsigned long val, void *v);

extern int blocking_notifier_call_chain(struct blocking_notifier_head *nh,
		unsigned long val, void *v);

extern int raw_notifier_call_chain(struct raw_notifier_head *nh,
		unsigned long val, void *v);

extern int srcu_notifier_call_chain(struct srcu_notifier_head *nh,
		unsigned long val, void *v);
```

Let's consider implementation of the `blocking_notifier_call_chain` function. This function is defined in the [kernel/notifier.c](https://github.com/torvalds/linux/blob/master/kernel/notifier.c) source code file:

```C
int blocking_notifier_call_chain(struct blocking_notifier_head *nh,
		unsigned long val, void *v)
{
	return __blocking_notifier_call_chain(nh, val, v, -1, NULL);
}
```

and as we may see it just returns result of the `__blocking_notifier_call_chain` function. As we may see, the `blocking_notifer_call_chain` takes three parameters:

* `nh` - head of notification chain list;
* `val` - type of a notification;
* `v` -  input parameter which may be used by handlers.

But the `__blocking_notifier_call_chain` function takes five parameters:

```C
int __blocking_notifier_call_chain(struct blocking_notifier_head *nh,
				   unsigned long val, void *v,
				   int nr_to_call, int *nr_calls)
{
    ...
    ...
    ...
}
```

Where `nr_to_call` and `nr_calls` are number of notifier functions to be called and number of sent notifications. As you may guess the main goal of the `__blocking_notifer_call_chain` function and other functions for other notification types is to call callback function when an event occurs. Implementation of the `__blocking_notifier_call_chain` is pretty simple, it just calls the `notifier_call_chain` function from the same source code file protected with read/write semaphore:

```C
int __blocking_notifier_call_chain(struct blocking_notifier_head *nh,
				   unsigned long val, void *v,
				   int nr_to_call, int *nr_calls)
{
	int ret = NOTIFY_DONE;

	if (rcu_access_pointer(nh->head)) {
		down_read(&nh->rwsem);
		ret = notifier_call_chain(&nh->head, val, v, nr_to_call,
					nr_calls);
		up_read(&nh->rwsem);
	}
	return ret;
}
```

and returns its result. In this case all job is done by the `notifier_call_chain` function. Main purpose of this function is to inform registered notifiers about an asynchronous event:

```C
static int notifier_call_chain(struct notifier_block **nl,
			       unsigned long val, void *v,
			       int nr_to_call, int *nr_calls)
{
    ...
    ...
    ...
    ret = nb->notifier_call(nb, val, v);
    ...
    ...
    ...
    return ret;
}
```

That's all. In general all looks pretty simple.

Now let's consider on a simple example related to [loadable modules](https://en.wikipedia.org/wiki/Loadable_kernel_module). If we will look in the [kernel/module.c](https://github.com/torvalds/linux/blob/master/kernel/module.c). As we already saw in this part, there is:

```C
static BLOCKING_NOTIFIER_HEAD(module_notify_list);
```

definition of the `module_notify_list` in the [kernel/module.c](https://github.com/torvalds/linux/blob/master/kernel/module.c) source code file. This definition determines head of list of blocking notifier chains related to kernel modules. There are at least three following events:

* MODULE_STATE_LIVE
* MODULE_STATE_COMING
* MODULE_STATE_GOING

in which maybe interested some subsystems of the Linux kernel. For example tracing of kernel modules states. Instead of direct call of the `atomic_notifier_chain_register`, `blocking_notifier_chain_register` and etc., most notification chains come with a set of wrappers used to register to them. Registration on these modules events is going with the help of such wrapper:

```C
int register_module_notifier(struct notifier_block *nb)
{
	return blocking_notifier_chain_register(&module_notify_list, nb);
}
```

If we will look in the [kernel/tracepoint.c](https://github.com/torvalds/linux/blob/master/kernel/tracepoint.c) source code file, we will see such registration during initialization of [tracepoints](https://www.kernel.org/doc/Documentation/trace/tracepoints.txt):

```C
static __init int init_tracepoints(void)
{
	int ret;

	ret = register_module_notifier(&tracepoint_module_nb);
	if (ret)
		pr_warn("Failed to register tracepoint module enter notifier\n");

	return ret;
}
```

Where `tracepoint_module_nb` provides callback function:

```C
static struct notifier_block tracepoint_module_nb = {
	.notifier_call = tracepoint_module_notify,
	.priority = 0,
};
```

When one of the `MODULE_STATE_LIVE`, `MODULE_STATE_COMING` or `MODULE_STATE_GOING` events occurred. For example the `MODULE_STATE_LIVE` the `MODULE_STATE_COMING` notifications will be sent during execution of the [init_module](https://man7.org/linux/man-pages/man2/init_module.2.html) [system call](/SysCall/linux-syscall-1.md). Or for example `MODULE_STATE_GOING` will be sent during execution of the [delete_module](https://man7.org/linux/man-pages/man2/delete_module.2.html) `system call`:

```C
SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
		unsigned int, flags)
{
    ...
    ...
    ...
    blocking_notifier_call_chain(&module_notify_list,
				     MODULE_STATE_GOING, mod);
    ...
    ...
    ...
}
```

Thus when one of these system call will be called from userspace, the Linux kernel will send certain notification depending on a system call and the `tracepoint_module_notify` callback function will be called.

That's all.

Links
--------------------------------------------------------------------------------

* [C programming language](https://en.wikipedia.org/wiki/C_%28programming_language%29)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [callback](https://en.wikipedia.org/wiki/Callback_%28computer_programming%29)
* [RCU](https://en.wikipedia.org/wiki/Read-copy-update)
* [spinlock](/SyncPrim/linux-sync-1.md)
* [loadable modules](https://en.wikipedia.org/wiki/Loadable_kernel_module)
* [semaphore](/SyncPrim/linux-sync-3.md)
* [tracepoints](https://www.kernel.org/doc/Documentation/trace/tracepoints.txt)
* [system call](/SysCall/linux-syscall-1.md)
* [init_module system call](https://man7.org/linux/man-pages/man2/init_module.2.html)
* [delete_module](https://man7.org/linux/man-pages/man2/delete_module.2.html)
* [previous part](/Concepts/linux-cpu-3.md)


================================================
FILE: DataStructures/README.md
================================================
Linux内核中的数据结构
========================================================================

Linux内核对很多数据结构提供不同的实现方法，比如，双向链表，B+树，具有优先级的堆等等。

这部分考虑这些数据结构和算法。

  * [双向链表](linux-datastructures-1.md)
  * [基数树](linux-datastructures-2.md)
  * [位数组](linux-datastructures-3.md)


================================================
FILE: DataStructures/linux-datastructures-1.md
================================================
Linux 内核里的数据结构——双向链表
================================================================================

双向链表
--------------------------------------------------------------------------------

Linux 内核自己实现了双向链表，可以在 [include/linux/list.h](https://github.com/torvalds/linux/blob/master/include/linux/list.h) 找到定义。我们将会从双向链表数据结构开始`内核的数据结构`。为什么？因为它在内核里使用的很广泛，你只需要在 [free-electrons.com](https://elixir.bootlin.com/linux/latest/A/ident/list_head) 检索一下就知道了。

首先让我们看一下在 [include/linux/types.h](https://github.com/torvalds/linux/blob/master/include/linux/types.h) 里的主结构体：

```C
struct list_head {
	struct list_head *next, *prev;
};
```

你可能注意到这和你以前见过的双向链表的实现方法是不同的。举个例子来说，在 [glib](https://www.gnu.org/software/libc/)  库里是这样实现的：

```C
struct GList {
  gpointer data;
  GList *next;
  GList *prev;
};
```

通常来说一个链表会包含一个指向某个项目的指针。但是内核的实现并没有这样做。所以问题来了：`链表在哪里保存数据呢？`。实际上内核里实现的链表实际上是`侵入式链表`。侵入式链表并不在节点内保存数据-节点仅仅包含指向前后节点的指针，然后把数据是附加到链表的。这就使得这个数据结构是通用的，使用起来就不需要考虑节点数据的类型了。

比如：

```C
struct nmi_desc {
    spinlock_t lock;
    struct list_head head;
};
```

让我们看几个例子来理解一下在内核里是如何使用 `list_head` 的。如上所述，在内核里有实在很多不同的地方用到了链表。我们以杂项字符驱动为例来说明双向链表的使用。在 [drivers/char/misc.c](https://github.com/torvalds/linux/blob/master/drivers/char/misc.c) 的杂项字符驱动API 被用来编写处理小型硬件和虚拟设备的小驱动。这些驱动共享相同的主设备号：

```C
#define MISC_MAJOR              10
```

但是都有各自不同的次设备号。比如：

```
ls -l /dev |  grep 10
crw-------   1 root root     10, 235 Mar 21 12:01 autofs
drwxr-xr-x  10 root root         200 Mar 21 12:01 cpu
crw-------   1 root root     10,  62 Mar 21 12:01 cpu_dma_latency
crw-------   1 root root     10, 203 Mar 21 12:01 cuse
drwxr-xr-x   2 root root         100 Mar 21 12:01 dri
crw-rw-rw-   1 root root     10, 229 Mar 21 12:01 fuse
crw-------   1 root root     10, 228 Mar 21 12:01 hpet
crw-------   1 root root     10, 183 Mar 21 12:01 hwrng
crw-rw----+  1 root kvm      10, 232 Mar 21 12:01 kvm
crw-rw----   1 root disk     10, 237 Mar 21 12:01 loop-control
crw-------   1 root root     10, 227 Mar 21 12:01 mcelog
crw-------   1 root root     10,  59 Mar 21 12:01 memory_bandwidth
crw-------   1 root root     10,  61 Mar 21 12:01 network_latency
crw-------   1 root root     10,  60 Mar 21 12:01 network_throughput
crw-r-----   1 root kmem     10, 144 Mar 21 12:01 nvram
brw-rw----   1 root disk      1,  10 Mar 21 12:01 ram10
crw--w----   1 root tty       4,  10 Mar 21 12:01 tty10
crw-rw----   1 root dialout   4,  74 Mar 21 12:01 ttyS10
crw-------   1 root root     10,  63 Mar 21 12:01 vga_arbiter
crw-------   1 root root     10, 137 Mar 21 12:01 vhci
```

现在让我们看看它是如何使用链表的。首先看一下结构体 `miscdevice` ：

```C
struct miscdevice
{
      int minor;
      const char *name;
      const struct file_operations *fops;
      struct list_head list;
      struct device *parent;
      struct device *this_device;
      const char *nodename;
      mode_t mode;
};
```

我们可以看到结构体的第四个变量 `list` 是所有注册过的设备的链表。在源代码文件的开始可以看到这个链表的定义：

```C
static LIST_HEAD(misc_list);
```

它扩展开来实际上就是定义了一个 `list_head` 类型的变量：

```C
#define LIST_HEAD(name) \
	struct list_head name = LIST_HEAD_INIT(name)
```

然后使用宏 `LIST_HEAD_INIT` 进行初始化，这会使用变量 `name` 的地址来填充结构体的 `prev` 和 `next` 两个变量。

```C
#define LIST_HEAD_INIT(name) { &(name), &(name) }
```

现在来看看注册杂项设备的函数 `misc_register` 。它在开始就用 `INIT_LIST_HEAD` 初始化了`miscdevice->list`。

```C
INIT_LIST_HEAD(&misc->list);
```

作用和宏 `LIST_HEAD_INIT`一样。

```C
static inline void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}
```

下一步在函数 `device_create` 创建了设备后我们就用下面的语句将设备添加到设备链表：

```
list_add(&misc->list, &misc_list);
```

内核文件 `list.h` 提供了向链表添加新项的接口函数。我们来看看它的实现：


```C
static inline void list_add(struct list_head *new, struct list_head *head)
{
	__list_add(new, head, head->next);
}
```

实际上就是使用3个指定的参数来调用了内部函数 `__list_add`：

* new  - 新项。
* head - 新项将会被添加到`head` 之后.
* head->next - `head` 之后的项。

`__list_add`的实现非常简单：

```C
static inline void __list_add(struct list_head *new,
			      struct list_head *prev,
			      struct list_head *next)
{
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
}
```

我们会在 `prev` 和 `next` 之间添加一个新项。所以我们用宏 `LIST_HEAD_INIT` 定义的 `misc` 链表会包含指向 `miscdevice->list` 的向前指针和向后指针。

这里仍有一个问题：如何得到列表的内容呢？这里有一个特殊的宏：

```C
#define list_entry(ptr, type, member) \
	container_of(ptr, type, member)
```

使用了三个参数：

* ptr - 指向链表头的指针；
* type - 结构体类型;
* member - 在结构体内类型为 `list_head` 的变量的名字；

比如说：

```C
const struct miscdevice *p = list_entry(v, struct miscdevice, list)
```

然后我们就可以使用 `p->minor` 或者 `p->name`来访问 `miscdevice`。让我们来看看 `list_entry` 的实现：
 
```C
#define list_entry(ptr, type, member) \
	container_of(ptr, type, member)
```

如我们所见，它仅仅使用相同的参数调用了宏 `container_of`。初看这个宏挺奇怪的：

```C
#define container_of(ptr, type, member) ({                      \
    const typeof( ((type *)0)->member ) *__mptr = (ptr);    \
    (type *)( (char *)__mptr - offsetof(type,member) );})
```

首先你可以注意到花括号内包含两个表达式。编译器会执行花括号内的全部语句，然后返回最后的表达式的值。

举个例子来说：

```
#include <stdio.h>

int main() {
	int i = 0;
	printf("i = %d\n", ({++i; ++i;}));
	return 0;
}
```

最终会打印 `2`

下一点就是 `typeof`,它也很简单。就如你从名字所理解的，它仅仅返回了给定变量的类型。当我第一次看到宏 `container_of` 的实现时，让我觉得最奇怪的就是 `container_of` 中的 0 。实际上这个指针巧妙的计算了从结构体特定变量的偏移，这里的 `0` 刚好就是位宽里的零偏移。让我们看一个简单的例子：

```C
#include <stdio.h>

struct s {
        int field1;
        char field2;
	char field3;
};

int main() {
	printf("%p\n", &((struct s*)0)->field3);
	return 0;
}
```

结果显示 `0x5`。

下一个宏 `offsetof` 会计算从结构体的某个变量的相对于结构体起始地址的偏移。它的实现和上面类似：

```C
#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
```

现在我们来总结一下宏`container_of`。只需要知道结构体的类型（`type`），及里面类型为 `list_head` 的变量的名字（`member`）和地址（`ptr`)，就可以获得该结构体的起始地址。在宏定义的第一行，声明了一个`__mptr`指针，并将参数`ptr`赋值给了它，现在，它们共同指向了结构体的`list_head`的成员变量。确切来说我们其实并不需要这一行，但是它可以辅助进行类型检查。第一行保证了特定的结构体（参数 `type`）包含成员变量 `member`。

> 译注：若传入的`ptr`参数并不是`struct list_head *`类型，编译器会报`imcompatible pointer types`的warning；同时，`((type *)0)->member`还能让编译器检查`type`是否的确有`member`这个成员，因此加上这一行可以大大提高代码的鲁棒性。

第二行代码会用宏 `offsetof` 计算成员变量相对于结构体起始地址的偏移，然后从结构体的地址减去这个偏移，最后就得到了结构体的起始地址。

 `list_add` 和 `list_entry` 当然不是 `<linux/list.h>` 提供的唯一函数。双向链表的实现还提供了如下API：

* list_add
* list_add_tail
* list_del
* list_replace
* list_move
* list_is_last
* list_empty
* list_cut_position
* list_splice
* list_for_each
* list_for_each_entry

等等很多其它 API。


================================================
FILE: DataStructures/linux-datastructures-2.md
================================================
Linux内核中的数据结构
================================================================================

基数树
--------------------------------------------------------------------------------
正如你所知道的 Linux 内核通过许多不同库以及函数提供各种数据结构以及算法实现。
这个部分我们将介绍其中一个数据结构 [Radix tree](http://en.wikipedia.org/wiki/Radix_tree)。Linux 内核中有两个文件与 `radix tree` 的实现和API相关：

* [include/linux/radix-tree.h](https://github.com/torvalds/linux/blob/master/include/linux/radix-tree.h)
* [lib/radix-tree.c](https://github.com/torvalds/linux/blob/master/lib/radix-tree.c)

首先说明一下什么是 `radix tree` 。Radix tree 是一种 `压缩 trie`，其中 [trie](http://en.wikipedia.org/wiki/Trie) 是一种通过保存关联数组（associative array）来提供 `关键字-值（key-value）` 存储与查找的数据结构。通常关键字是字符串，不过也可以是其他数据类型。 

trie 结构的节点与 `n-tree` 不同，其节点中并不存储关键字，取而代之的是存储单个字符标签。关键字查找时，通过从树的根开始遍历关键字相关的所有字符标签节点，直至到达最终的叶子节点。下面是个例子：


```
               +-----------+
               |           |
               |    " "    |
               |           |
        +------+-----------+------+
        |                         |
        |                         |
   +----v------+            +-----v-----+
   |           |            |           |
   |    g      |            |     c     |
   |           |            |           |
   +-----------+            +-----------+
        |                         |
        |                         |
   +----v------+            +-----v-----+
   |           |            |           |
   |    o      |            |     a     |
   |           |            |           |
   +-----------+            +-----------+
                                  |
                                  |
                            +-----v-----+
                            |           |
                            |     t     |
                            |           |
                            +-----------+
```

这个例子中，我们可以看到 `trie` 所存储的关键字信息 `go` 与 `cat`，压缩 trie 或 `radix tree` 与 `trie` 所不同的是，所有只存在单个孩子的中间节点将被压缩。

Linux 内核中的 Radix 树将值映射为整型关键字，Radix 的数据结构定义在 [include/linux/radix-tree.h](https://github.com/torvalds/linux/blob/master/include/linux/radix-tree.h) 文件中 :

```C
struct radix_tree_root {
         unsigned int            height;
         gfp_t                   gfp_mask;
         struct radix_tree_node  __rcu *rnode;
};
```

上面这个是 radix 树的 root 节点的结构体，它包括三个成员：

* `height`   - 从叶节点向上计算出的树高度。
* `gfp_mask` - 内存分配标识。
* `rnode`    - 子节点指针。

这里我们先讨论的结构体成员是 `gfp_mask` : 

Linux 底层的内存申请接口需要提供一类标识（flag） - `gfp_mask` ，用于描述内存申请的行为。这个以 `GFP_` 前缀开头的内存申请控制标识主要包括，`GFP_NOIO` 禁止所有IO操作但允许睡眠等待内存，`__GFP_HIGHMEM` 允许申请内核的高端内存，`GFP_ATOMIC` 高优先级申请内存且操作不允许被睡眠。


接下来说的结构体成员是`rnode`：

```C
struct radix_tree_node {
        unsigned int    path;
        unsigned int    count;
        union {
                struct {
                        struct radix_tree_node *parent;
                        void *private_data;
                };
                struct rcu_head rcu_head;
        };
        /* For tree user */
        struct list_head private_list;
        void __rcu      *slots[RADIX_TREE_MAP_SIZE];
        unsigned long   tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
};
```

这个结构体中包括这几个内容，节点与父节点的偏移以及到树底端的高度，子节点的个数，节点的存储数据域，具体描述如下：

* `path` - 本节点与父节点的偏移以及到树底端的高度。
* `count` - 子节点的个数。
* `parent` - 父节点的指针。
* `private_data` - 存储数据内容缓冲区。
* `rcu_head` - 用于节点释放的RCU链表。
* `private_list` - 存储数据。

结构体 `radix_tree_node` 的最后两个成员 `tags` 与 `slots` 是非常重要且需要特别注意的。每个 Radix 树节点都可以包括一个指向存储数据指针的 slots 集合，空闲 slots 的指针指向 NULL。 Linux 内核的 Radix 树结构体中还包含用于记录节点存储状态的标签 `tags` 成员，标签通过位设置指示 Radix 树的数据存储状态。

至此，我们了解到 radix 树的结构，接下来看一下 radix 树所提供的 API。


Linux 内核基数树 API
---------------------------------------------------------------------------------

我们从数据结构的初始化开始看，radix 树支持两种方式初始化。

第一个是使用宏 `RADIX_TREE` ：

```C
RADIX_TREE(name, gfp_mask);
````

正如你看到，只需要提供 `name` 参数，就能够使用 `RADIX_TREE` 宏完成 radix 的定义以及初始化，`RADIX_TREE` 宏的实现非常简单：

```C
#define RADIX_TREE(name, mask) \
         struct radix_tree_root name = RADIX_TREE_INIT(mask)

#define RADIX_TREE_INIT(mask)   { \
        .height = 0,              \
        .gfp_mask = (mask),       \
        .rnode = NULL,            \
}
```

`RADIX_TREE` 宏首先使用 `name` 定义了一个 `radix_tree_root` 实例并用 `RADIX_TREE_INIT` 宏带参数 `mask` 进行初始化。宏 `RADIX_TREE_INIT` 将 `radix_tree_root` 初始化为默认属性并将 gfp_mask 初始化为入参 `mask` 。
第二种方式是手工定义 `radix_tree_root` 变量，之后再使用 `mask` 调用 `INIT_RADIX_TREE` 宏对变量进行初始化。
```C
struct radix_tree_root my_radix_tree;
INIT_RADIX_TREE(my_tree, gfp_mask_for_my_radix_tree);
```

`INIT_RADIX_TREE` 宏定义：

```C
#define INIT_RADIX_TREE(root, mask)  \
do {                                 \
        (root)->height = 0;          \
        (root)->gfp_mask = (mask);   \
        (root)->rnode = NULL;        \
} while (0)
```
宏 `INIT_RADIX_TREE` 所初始化的属性与 `RADIX_TREE_INIT` 一致


接下来是 radix 树的节点插入以及删除，这两个函数：

* `radix_tree_insert`;
* `radix_tree_delete`.

第一个函数 `radix_tree_insert` 需要三个入参：

* radix 树 root 节点结构
* 索引关键字
* 需要插入存储的数据

第二个函数 `radix_tree_delete` 除了不需要存储数据参数外，其他与 `radix_tree_insert` 一致。

radix 树的查找实现有以下几个函数：The search in a radix tree implemented in two ways:

* `radix_tree_lookup`;
* `radix_tree_gang_lookup`;
* `radix_tree_lookup_slot`.

第一个函数 `radix_tree_lookup` 需要两个参数：

* radix  树 root 节点结构
* 索引关键字

这个函数通过给定的关键字查找 radix 树，并返关键字所对应的结点。

第二个函数 `radix_tree_gang_lookup` 具有以下特征：

```C
unsigned int radix_tree_gang_lookup(struct radix_tree_root *root,
                                    void **results,
                                    unsigned long first_index,
                                    unsigned int max_items);
```

函数返回查找到记录的条目数，并根据关键字进行排序，返回的总结点数不超过入参 `max_items` 的大小。

最后一个函数 `radix_tree_lookup_slot` 返回结点 slot 中所存储的数据。


链接
---------------------------------------------------------------------------------

* [Radix tree](http://en.wikipedia.org/wiki/Radix_tree)
* [Trie](http://en.wikipedia.org/wiki/Trie)




================================================
FILE: DataStructures/linux-datastructures-3.md
================================================
Linux 内核里的数据结构——位数组
================================================================================

Linux 内核中的位数组和位操作
--------------------------------------------------------------------------------

除了不同的基于[链式](https://en.wikipedia.org/wiki/Linked_data_structure)和[树](https://en.wikipedia.org/wiki/Tree_%28data_structure%29)的数据结构以外，Linux 内核也为[位数组](https://en.wikipedia.org/wiki/Bit_array)（或称为位图（bitmap））提供了 [API](https://en.wikipedia.org/wiki/Application_programming_interface)。位数组在 Linux 内核里被广泛使用，并且在以下的源代码文件中包含了与这样的结构搭配使用的通用 `API`：

* [lib/bitmap.c](https://github.com/torvalds/linux/blob/master/lib/bitmap.c)
* [include/linux/bitmap.h](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h)

除了这两个文件之外，还有体系结构特定的头文件，它们为特定的体系结构提供优化的位操作。我们将探讨 [x86_64](https://en.wikipedia.org/wiki/X86-64) 体系结构，因此在我们的例子里，它会是

* [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h)

头文件。正如我上面所写的，`位图`在 Linux 内核中被广泛地使用。例如，`位数组`常常用于保存一组在线/离线处理器，以便系统支持[热插拔](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)的 CPU（你可以在 [cpumasks](/Concepts/linux-cpu-2.md) 部分阅读更多相关知识 ），一个位数组（bit array）可以在 Linux 内核初始化等期间保存一组已分配的[中断处理](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)。

因此，本部分的主要目的是了解位数组（bit array）是如何在 Linux 内核中实现的。让我们现在开始吧。

位数组声明
================================================================================

在我们开始查看`位图`操作的 `API` 之前，我们必须知道如何在 Linux 内核中声明它。有两种声明位数组的通用方法。第一种简单的声明一个位数组的方法是，定义一个 `unsigned long` 的数组，例如：

```C
unsigned long my_bitmap[8]
```

第二种方法，是使用 `DECLARE_BITMAP` 宏，它定义于 [include/linux/types.h](https://github.com/torvalds/linux/blob/master/include/linux/types.h) 头文件：

```C
#define DECLARE_BITMAP(name,bits) \
    unsigned long name[BITS_TO_LONGS(bits)]
```

我们可以看到 `DECLARE_BITMAP` 宏使用两个参数：

* `name` - 位图名称;
* `bits` - 位图中位数;

并且只是使用 `BITS_TO_LONGS(bits)` 元素展开 `unsigned long` 数组的定义。 `BITS_TO_LONGS` 宏将一个给定的位数转换为 `long` 的个数，换言之，就是计算 `bits` 中有多少个 `8` 字节元素：

```C
#define BITS_PER_BYTE           8
#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
#define BITS_TO_LONGS(nr)       DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))
```

因此，例如 `DECLARE_BITMAP(my_bitmap, 64)` 将产生：

```python
>>> (((64) + (64) - 1) / (64))
1
```

与：

```C
unsigned long my_bitmap[1];
```

在能够声明一个位数组之后，我们便可以使用它了。

体系结构特定的位操作
================================================================================

我们已经看了上面提及的一对源文件和头文件，它们提供了位数组操作的 [API](https://en.wikipedia.org/wiki/Application_programming_interface)。其中重要且广泛使用的位数组 API 是体系结构特定的且位于已提及的头文件中 [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h)。

首先让我们查看两个最重要的函数：

* `set_bit`;
* `clear_bit`.

我认为没有必要解释这些函数的作用。从它们的名字来看，这已经很清楚了。让我们直接查看它们的实现。如果你浏览 [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) 头文件，你将会注意到这些函数中的每一个都有[原子性](https://en.wikipedia.org/wiki/Linearizability)和非原子性两种变体。在我们开始深入这些函数的实现之前，首先，我们必须了解一些有关原子（atomic）操作的知识。

简而言之，原子操作保证两个或以上的操作不会并发地执行同一数据。`x86` 体系结构提供了一系列原子指令，例如， [xchg](https://x86.hust.openatom.club/html/file_module_x86_id_328.html)、[cmpxchg](https://x86.hust.openatom.club/html/file_module_x86_id_41.html) 等指令。除了原子指令，一些非原子指令可以在 [lock](https://x86.hust.openatom.club/html/file_module_x86_id_159.html) 指令的帮助下具有原子性。现在你已经对原子操作有了足够的了解，我们可以接着探讨 `set_bit` 和 `clear_bit` 函数的实现。

我们先考虑函数的非原子性（non-atomic）变体。非原子性的 `set_bit` 和 `clear_bit` 的名字以双下划线开始。正如我们所知道的，所有这些函数都定义于 [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) 头文件，并且第一个函数就是 `__set_bit`:

```C
static inline void __set_bit(long nr, volatile unsigned long *addr)
{
	asm volatile("bts %1,%0" : ADDR : "Ir" (nr) : "memory");
}
```

正如我们所看到的，它使用了两个参数：

* `nr` - 位数组中的位号（LCTT 译注：从 0开始）
* `addr` - 我们需要置位的位数组地址

注意，`addr` 参数使用 `volatile` 关键字定义，以告诉编译器给定地址指向的变量可能会被修改。 `__set_bit` 的实现相当简单。正如我们所看到的，它仅包含一行[内联汇编代码](https://en.wikipedia.org/wiki/Inline_assembler)。在我们的例子中，我们使用 [bts](https://x86.hust.openatom.club/html/file_module_x86_id_25.html) 指令，从位数组中选出一个第一操作数（我们的例子中的 `nr`）所指定的位，存储选出的位的值到 [CF](https://en.wikipedia.org/wiki/FLAGS_register) 标志寄存器并设置该位（LCTT 译注：即 `nr` 指定的位置为 1）。

注意，我们了解了 `nr` 的用法，但这里还有一个参数 `addr` 呢！你或许已经猜到秘密就在 `ADDR`。 `ADDR` 是一个定义在同一个头文件中的宏，它展开为一个包含给定地址和 `+m` 约束的字符串：

```C
#define ADDR				BITOP_ADDR(addr)
#define BITOP_ADDR(x) "+m" (*(volatile long *) (x))
```

除了 `+m` 之外，在 `__set_bit` 函数中我们可以看到其他约束。让我们查看并试着理解它们所表示的意义：

* `+m` - 表示内存操作数，这里的 `+` 表明给定的操作数为输入输出操作数；
* `I` - 表示整型常量；
* `r` - 表示寄存器操作数

除了这些约束之外，我们也能看到 `memory` 关键字，其告诉编译器这段代码会修改内存中的变量。到此为止，现在我们看看相同的原子性（atomic）变体函数。它看起来比非原子性（non-atomic）变体更加复杂：

```C
static __always_inline void
set_bit(long nr, volatile unsigned long *addr)
{
	if (IS_IMMEDIATE(nr)) {
		asm volatile(LOCK_PREFIX "orb %1,%0"
			: CONST_MASK_ADDR(nr, addr)
			: "iq" ((u8)CONST_MASK(nr))
			: "memory");
	} else {
		asm volatile(LOCK_PREFIX "bts %1,%0"
			: BITOP_ADDR(addr) : "Ir" (nr) : "memory");
	}
}
```

（LCTT 译注：BITOP_ADDR 的定义为：`#define BITOP_ADDR(x) "=m" (*(volatile long *) (x))`，ORB 为字节按位或。）

首先注意，这个函数使用了与 `__set_bit` 相同的参数集合，但额外地使用了 `__always_inline` 属性标记。 `__always_inline` 是一个定义于 [include/linux/compiler-gcc.h](https://github.com/torvalds/linux/blob/master/include/linux/compiler-gcc.h) 的宏，并且只是展开为 `always_inline` 属性：

```C
#define __always_inline inline __attribute__((always_inline))
```

其意味着这个函数总是内联的，以减少 Linux 内核映像的大小。现在让我们试着了解下 `set_bit` 函数的实现。首先我们在 `set_bit` 函数的开头检查给定的位的数量。`IS_IMMEDIATE` 宏定义于相同的[头文件](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h)，并展开为 [gcc](https://en.wikipedia.org/wiki/GNU_Compiler_Collection) 内置函数的调用：

```C
#define IS_IMMEDIATE(nr)		(__builtin_constant_p(nr))
```

如果给定的参数是编译期已知的常量，`__builtin_constant_p` 内置函数则返回 `1`，其他情况返回 `0`。假若给定的位数是编译期已知的常量，我们便无须使用效率低下的 `bts` 指令去设置位。我们可以只需在给定地址指向的字节上执行 [按位或](https://en.wikipedia.org/wiki/Bitwise_operation#OR) 操作，其字节包含给定的位，掩码位数表示高位为 `1`，其他位为 0 的掩码。在其他情况下，如果给定的位号不是编译期已知常量，我们便做和 `__set_bit` 函数一样的事。`CONST_MASK_ADDR` 宏：

```C
#define CONST_MASK_ADDR(nr, addr)	BITOP_ADDR((void *)(addr) + ((nr)>>3))
```

展开为带有到包含给定位的字节偏移的给定地址，例如，我们拥有地址 `0x1000` 和位号 `0x9`。因为 `0x9` 代表 `一个字节 + 一位`，所以我们的地址是 `addr + 1`:

```python
>>> hex(0x1000 + (0x9 >> 3))
'0x1001'
```

`CONST_MASK` 宏将我们给定的位号表示为字节，位号对应位为高位 `1`，其他位为 `0`：

```C
#define CONST_MASK(nr)			(1 << ((nr) & 7))
```

```python
>>> bin(1 << (0x9 & 7))
'0b10'
```

最后，我们应用 `按位或` 运算到这些变量上面，因此，假如我们的地址是 `0x4097` ，并且我们需要置位号为 `9` 的位为 1：

```python
>>> bin(0x4097)
'0b100000010010111'
>>> bin((0x4097 >> 0x9) | (1 << (0x9 & 7)))
'0b100010'
```

`第 9 位` 将会被置位。（LCTT 译注：这里的 9 是从 0 开始计数的，比如0010，按照作者的意思，其中的 1 是第 1 位）

注意，所有这些操作使用 `LOCK_PREFIX` 标记，其展开为 [lock](https://x86.hust.openatom.club/html/file_module_x86_id_159.html) 指令，保证该操作的原子性。

正如我们所知，除了 `set_bit` 和 `__set_bit` 操作之外，Linux 内核还提供了两个功能相反的函数，在原子性和非原子性的上下文中清位。它们是 `clear_bit` 和 `__clear_bit`。这两个函数都定义于同一个[头文件](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) 并且使用相同的参数集合。不仅参数相似，一般而言，这些函数与  `set_bit` 和 `__set_bit` 也非常相似。让我们查看非原子性 `__clear_bit` 的实现吧： 

```C
static inline void __clear_bit(long nr, volatile unsigned long *addr)
{
	asm volatile("btr %1,%0" : ADDR : "Ir" (nr));
}
```

没错，正如我们所见，`__clear_bit` 使用相同的参数集合，并包含极其相似的内联汇编代码块。它只是使用 [btr](https://x86.hust.openatom.club/html/file_module_x86_id_24.html) 指令替换了 `bts`。正如我们从函数名所理解的一样，通过给定地址，它清除了给定的位。`btr` 指令表现得像 `bts`（LCTT 译注：原文这里为 btr，可能为笔误，修正为 bts）。该指令选出第一操作数所指定的位，存储它的值到 `CF` 标志寄存器，并且清除第二操作数指定的位数组中的对应位。

`__clear_bit` 的原子性变体为 `clear_bit`：

```C
static __always_inline void
clear_bit(long nr, volatile unsigned long *addr)
{
	if (IS_IMMEDIATE(nr)) {
		asm volatile(LOCK_PREFIX "andb %1,%0"
			: CONST_MASK_ADDR(nr, addr)
			: "iq" ((u8)~CONST_MASK(nr)));
	} else {
		asm volatile(LOCK_PREFIX "btr %1,%0"
			: BITOP_ADDR(addr)
			: "Ir" (nr));
	}
}
```

并且正如我们所看到的，它与 `set_bit` 非常相似，只有两处不同。第一处差异为 `clear_bit` 使用 `btr` 指令来清位，而 `set_bit` 使用 `bts` 指令来置位。第二处差异为 `clear_bit` 使用否定的位掩码和 `按位与` 在给定的字节上置位，而 `set_bit` 使用 `按位或` 指令。

到此为止，我们可以在任意位数组置位和清位了，我们将看看位掩码上的其他操作。

在 Linux 内核中对位数组最广泛使用的操作是设置和清除位，但是除了这两个操作外，位数组上其他操作也是非常有用的。Linux 内核里另一种广泛使用的操作是知晓位数组中一个给定的位是否被置位。我们能够通过 `test_bit` 宏的帮助实现这一功能。这个宏定义于 [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) 头文件，并根据位号分别展开为 `constant_test_bit` 或 `variable_test_bit` 调用。

```C
#define test_bit(nr, addr)			\
	(__builtin_constant_p((nr))                 \
	 ? constant_test_bit((nr), (addr))	        \
	 : variable_test_bit((nr), (addr)))
```

因此，如果 `nr` 是编译期已知常量，`test_bit` 将展开为 `constant_test_bit` 函数的调用，而其他情况则为 `variable_test_bit`。现在让我们看看这些函数的实现，让我们从 `variable_test_bit` 开始看起：  

```C
static inline int variable_test_bit(long nr, volatile const unsigned long *addr)
{
	int oldbit;

	asm volatile("bt %2,%1\n\t"
		     "sbb %0,%0"
		     : "=r" (oldbit)
		     : "m" (*(unsigned long *)addr), "Ir" (nr));

	return oldbit;
}
```

`variable_test_bit` 函数使用了与 `set_bit` 及其他函数使用的相似的参数集合。我们也可以看到执行 [bt](https://x86.hust.openatom.club/html/file_module_x86_id_22.html) 和 [sbb](https://x86.hust.openatom.club/html/file_module_x86_id_286.html) 指令的内联汇编代码。`bt` （或称 `bit test`）指令从第二操作数指定的位数组选出第一操作数指定的一个指定位，并且将该位的值存进标志寄存器的 [CF](https://en.wikipedia.org/wiki/FLAGS_register) 位。第二个指令 `sbb` 从第二操作数中减去第一操作数，再减去 `CF` 的值。因此，这里将一个从给定位数组中的给定位号的值写进标志寄存器的 `CF` 位，并且执行 `sbb` 指令计算： `00000000 - CF`，并将结果写进 `oldbit` 变量。

`constant_test_bit` 函数做了和我们在 `set_bit` 所看到的一样的事：

```C
static __always_inline int constant_test_bit(long nr, const volatile unsigned long *addr)
{
	return ((1UL << (nr & (BITS_PER_LONG-1))) &
		(addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
}
```

它生成了一个位号对应位为高位 `1`，而其他位为 `0` 的字节（正如我们在 `CONST_MASK` 所看到的），并将 [按位与](https://en.wikipedia.org/wiki/Bitwise_operation#AND) 应用于包含给定位号的字节。

下一个被广泛使用的位数组相关操作是改变一个位数组中的位。为此，Linux 内核提供了两个辅助函数：

* `__change_bit`;
* `change_bit`.

你可能已经猜测到，就拿 `set_bit` 和 `__set_bit` 例子说，这两个变体分别是原子和非原子版本。首先，让我们看看 `__change_bit` 函数的实现：

```C
static inline void __change_bit(long nr, volatile unsigned long *addr)
{
    asm volatile("btc %1,%0" : ADDR : "Ir" (nr));
}
```

相当简单，不是吗？ `__change_bit` 的实现和 `__set_bit` 一样，只是我们使用 [btc](https://x86.hust.openatom.club/html/file_module_x86_id_23.html) 替换 `bts` 指令而已。 该指令从一个给定位数组中选出一个给定位，将该为位的值存进 `CF` 并使用求反操作改变它的值，因此值为 `1` 的位将变为 `0`，反之亦然：

```python
>>> int(not 1)
0
>>> int(not 0)
1
```

`__change_bit` 的原子版本为 `change_bit` 函数：

```C
static inline void change_bit(long nr, volatile unsigned long *addr)
{
	if (IS_IMMEDIATE(nr)) {
		asm volatile(LOCK_PREFIX "xorb %1,%0"
			: CONST_MASK_ADDR(nr, addr)
			: "iq" ((u8)CONST_MASK(nr)));
	} else {
		asm volatile(LOCK_PREFIX "btc %1,%0"
			: BITOP_ADDR(addr)
			: "Ir" (nr));
	}
}
```

它和 `set_bit` 函数很相似，但也存在两点不同。第一处差异为 `xor` 操作而不是 `or`。第二处差异为 `btc`（ LCTT 译注：原文为 `bts`，为作者笔误） 而不是 `bts`。

目前，我们了解了最重要的体系特定的位数组操作，是时候看看一般的位图 API 了。

通用位操作
================================================================================

除了 [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) 中体系特定的 API 外，Linux 内核提供了操作位数组的通用 API。正如我们本部分开头所了解的一样，我们可以在 [include/linux/bitmap.h](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) 头文件和 [lib/bitmap.c](https://github.com/torvalds/linux/blob/master/lib/bitmap.c)  源文件中找到它。但在查看这些源文件之前，我们先看看 [include/linux/bitops.h](https://github.com/torvalds/linux/blob/master/include/linux/bitops.h) 头文件，其提供了一系列有用的宏，让我们看看它们当中一部分。

首先我们看看以下 4 个 宏：

* `for_each_set_bit`
* `for_each_set_bit_from`
* `for_each_clear_bit`
* `for_each_clear_bit_from`

所有这些宏都提供了遍历位数组中某些位集合的迭代器。第一个宏迭代那些被置位的位。第二个宏也是一样，但它是从某一个确定的位开始。最后两个宏做的一样，但是迭代那些被清位的位。让我们看看 `for_each_set_bit` 宏：

```C
#define for_each_set_bit(bit, addr, size) \
	for ((bit) = find_first_bit((addr), (size));		\
	     (bit) < (size);					\
	     (bit) = find_next_bit((addr), (size), (bit) + 1))
```

正如我们所看到的，它使用了三个参数，并展开为一个循环，该循环从作为 `find_first_bit` 函数返回结果的第一个置位开始，到小于给定大小的最后一个置位为止。

除了这四个宏， [arch/x86/include/asm/bitops.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/bitops.h) 也提供了 `64-bit` 或 `32-bit` 变量循环的 API 等等。

下一个 [头文件](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) 提供了操作位数组的 API。例如，它提供了以下两个函数： 

* `bitmap_zero`;
* `bitmap_fill`.

它们分别可以清除一个位数组和用 `1` 填充位数组。让我们看看 `bitmap_zero` 函数的实现：

```C
static inline void bitmap_zero(unsigned long *dst, unsigned int nbits)
{
	if (small_const_nbits(nbits))
		*dst = 0UL;
	else {
		unsigned int len = BITS_TO_LONGS(nbits) * sizeof(unsigned long);
		memset(dst, 0, len);
	}
}
```

首先我们可以看到对 `nbits` 的检查。 `small_const_nbits` 是一个定义在同一个[头文件](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) 的宏：

```C
#define small_const_nbits(nbits) \
	(__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG)
```

正如我们可以看到的，它检查 `nbits` 是否为编译期已知常量，并且其值不超过 `BITS_PER_LONG` 或 `64`。如果位数目没有超过一个 `long` 变量的位数，我们可以仅仅设置为 0。在其他情况，我们需要计算有多少个需要填充位数组的 `long` 变量并且使用 [memset](https://man7.org/linux/man-pages/man3/memset.3.html) 进行填充。

`bitmap_fill` 函数的实现和 `biramp_zero` 函数很相似，除了我们需要在给定的位数组中填写 `0xff` 或 `0b11111111`：

```C
static inline void bitmap_fill(unsigned long *dst, unsigned int nbits)
{
	unsigned int nlongs = BITS_TO_LONGS(nbits);
	if (!small_const_nbits(nbits)) {
		unsigned int len = (nlongs - 1) * sizeof(unsigned long);
		memset(dst, 0xff,  len);
	}
	dst[nlongs - 1] = BITMAP_LAST_WORD_MASK(nbits);
}
```

除了 `bitmap_fill` 和 `bitmap_zero`，[include/linux/bitmap.h](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) 头文件也提供了和 `bitmap_zero` 很相似的 `bitmap_copy`，只是仅仅使用 [memcpy](https://man7.org/linux/man-pages/man3/memcpy.3.html) 而不是  [memset](https://man7.org/linux/man-pages/man3/memset.3.html) 这点差异而已。它也提供了位数组的按位操作，像 `bitmap_and`, `bitmap_or`, `bitamp_xor`等等。我们不会探讨这些函数的实现了，因为如果你理解了本部分的所有内容，这些函数的实现是很容易理解的。无论如何，如果你对这些函数是如何实现的感兴趣，你可以打开并研究 [include/linux/bitmap.h](https://github.com/torvalds/linux/blob/master/include/linux/bitmap.h) 头文件。

本部分到此为止。

注： 本文由 [LCTT](https://github.com/LCTT/TranslateProject) 原创翻译，[Linux中国](https://linux.cn/) 荣誉推出

链接
================================================================================

* [bitmap](https://en.wikipedia.org/wiki/Bit_array)
* [linked data structures](https://en.wikipedia.org/wiki/Linked_data_structure)
* [tree data structures](https://en.wikipedia.org/wiki/Tree_%28data_structure%29) 
* [hot-plug](https://www.kernel.org/doc/Documentation/cpu-hotplug.txt)
* [cpumasks](/Concepts/linux-cpu-2.md)
* [IRQs](https://en.wikipedia.org/wiki/Interrupt_request_%28PC_architecture%29)
* [API](https://en.wikipedia.org/wiki/Application_programming_interface)
* [atomic operations](https://en.wikipedia.org/wiki/Linearizability)
* [xchg instruction](https://x86.hust.openatom.club/html/file_module_x86_id_328.html)
* [cmpxchg instruction](https://x86.hust.openatom.club/html/file_module_x86_id_41.html)
* [lock instruction](https://x86.hust.openatom.club/html/file_module_x86_id_159.html)
* [bts instruction](https://x86.hust.openatom.club/html/file_module_x86_id_25.html)
* [btr instruction](https://x86.hust.openatom.club/html/file_module_x86_id_24.html)
* [bt instruction](https://x86.hust.openatom.club/html/file_module_x86_id_22.html)
* [sbb instruction](https://x86.hust.openatom.club/html/file_module_x86_id_286.html)
* [btc instruction](https://x86.hust.openatom.club/html/file_module_x86_id_23.html)
* [man memcpy](https://man7.org/linux/man-pages/man3/memcpy.3.html) 
* [man memset](https://man7.org/linux/man-pages/man3/memset.3.html)
* [CF](https://en.wikipedia.org/

Download .txt

gitextract_e1uf8jnz/

├── .github/
│   └── ISSUE_TEMPLATE/
│       └── 1_translation_request.yaml
├── Booting/
│   ├── README.md
│   ├── linux-bootstrap-1.md
│   ├── linux-bootstrap-2.md
│   ├── linux-bootstrap-3.md
│   ├── linux-bootstrap-4.md
│   ├── linux-bootstrap-5.md
│   └── linux-bootstrap-6.md
├── CONTRIBUTING.md
├── CONTRIBUTORS.md
├── Cgroups/
│   ├── README.md
│   └── linux-cgroups-1.md
├── Concepts/
│   ├── README.md
│   ├── linux-cpu-1.md
│   ├── linux-cpu-2.md
│   ├── linux-cpu-3.md
│   └── linux-cpu-4.md
├── DataStructures/
│   ├── README.md
│   ├── linux-datastructures-1.md
│   ├── linux-datastructures-2.md
│   └── linux-datastructures-3.md
├── Initialization/
│   ├── README.md
│   ├── linux-initialization-1.md
│   ├── linux-initialization-10.md
│   ├── linux-initialization-2.md
│   ├── linux-initialization-3.md
│   ├── linux-initialization-4.md
│   ├── linux-initialization-5.md
│   ├── linux-initialization-6.md
│   ├── linux-initialization-7.md
│   ├── linux-initialization-8.md
│   └── linux-initialization-9.md
├── Interrupts/
│   ├── README.md
│   ├── linux-interrupts-1.md
│   ├── linux-interrupts-10.md
│   ├── linux-interrupts-2.md
│   ├── linux-interrupts-3.md
│   ├── linux-interrupts-4.md
│   ├── linux-interrupts-5.md
│   ├── linux-interrupts-6.md
│   ├── linux-interrupts-7.md
│   ├── linux-interrupts-8.md
│   └── linux-interrupts-9.md
├── KernelStructures/
│   ├── README.md
│   └── linux-kernelstructure-1.md
├── LICENSE
├── LINKS.md
├── MM/
│   ├── README.md
│   ├── linux-mm-1.md
│   ├── linux-mm-2.md
│   └── linux-mm-3.md
├── Misc/
│   ├── README.md
│   ├── linux-misc-1.md
│   ├── linux-misc-2.md
│   ├── linux-misc-3.md
│   └── linux-misc-4.md
├── README.md
├── SUMMARY.md
├── Scripts/
│   └── validate_markdown_links.py
├── SyncPrim/
│   ├── README.md
│   ├── linux-sync-1.md
│   ├── linux-sync-2.md
│   ├── linux-sync-3.md
│   ├── linux-sync-4.md
│   ├── linux-sync-5.md
│   └── linux-sync-6.md
├── SysCall/
│   ├── README.md
│   ├── linux-syscall-1.md
│   ├── linux-syscall-2.md
│   ├── linux-syscall-3.md
│   ├── linux-syscall-4.md
│   ├── linux-syscall-5.md
│   └── linux-syscall-6.md
├── TRANSLATION_NOTES.md
├── TRANSLATION_STATUS.md
├── Theory/
│   ├── README.md
│   ├── linux-theory-1.md
│   ├── linux-theory-2.md
│   └── linux-theory-3.md
└── Timers/
    ├── README.md
    ├── linux-timers-1.md
    ├── linux-timers-2.md
    ├── linux-timers-3.md
    ├── linux-timers-4.md
    ├── linux-timers-5.md
    ├── linux-timers-6.md
    └── linux-timers-7.md

Download .txt

SYMBOL INDEX (2 symbols across 1 files)

FILE: Scripts/validate_markdown_links.py
  function check_live_url (line 16) | def check_live_url(url):
  function main (line 34) | def main(path):

Download .json

Condensed preview — 87 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (1,292K chars).

[
  {
    "path": ".github/ISSUE_TEMPLATE/1_translation_request.yaml",
    "chars": 829,
    "preview": "name: 请求翻译/Request Translation\ndescription: 请求翻译一篇文章/Request \ntitle: 请求翻译\nlabels: [request, translation]\nbody:\n  - type:"
  },
  {
    "path": "Booting/README.md",
    "chars": 468,
    "preview": "# 内核引导过程\n\n本章介绍了Linux内核引导过程。此处你将在这看到一些描述内核加载过程的整个周期的文章：\n\n* [从引导程序到内核](linux-bootstrap-1.md) - 介绍了从启动计算机到内核执行第一条指令之前的所有阶段;"
  },
  {
    "path": "Booting/linux-bootstrap-1.md",
    "chars": 16810,
    "preview": "内核引导过程. 第一部分.\n================================================================================\n\n从引导加载程序内核\n--------------"
  },
  {
    "path": "Booting/linux-bootstrap-2.md",
    "chars": 23057,
    "preview": "# 在内核安装代码的第一步\r\n\r\n内核启动的第一步  \r\n--------------------------------------------------------------------------------\r\n\r\n在[上一节中]"
  },
  {
    "path": "Booting/linux-bootstrap-3.md",
    "chars": 18952,
    "preview": "内核启动过程，第三部分\n================================================================================\n\n显示模式初始化和进入保护模式\n-----------"
  },
  {
    "path": "Booting/linux-bootstrap-4.md",
    "chars": 20639,
    "preview": "内核引导过程. Part 4.\n================================================================================\n\n切换到64位模式\n-------------"
  },
  {
    "path": "Booting/linux-bootstrap-5.md",
    "chars": 11883,
    "preview": "内核引导过程. Part 5.\n================================================================================\n\n内核解压\n-----------------"
  },
  {
    "path": "Booting/linux-bootstrap-6.md",
    "chars": 13843,
    "preview": "内核引导过程. Part 6.\n================================================================================\n\n简介\n-------------------"
  },
  {
    "path": "CONTRIBUTING.md",
    "chars": 815,
    "preview": "贡献\n================================================================================\n\n如果你想要给 [linux-insides-zh](https://g"
  },
  {
    "path": "CONTRIBUTORS.md",
    "chars": 1084,
    "preview": "## 翻译人员 (排名不分先后)\n\n[@xinqiu](https://github.com/xinqiu)\n\n[@lijiangsheng1](https://github.com/lijiangsheng1)\n\n[@littleneko"
  },
  {
    "path": "Cgroups/README.md",
    "chars": 60,
    "preview": "# 控制组\n\n这个章节描述了 Linux 内核中的控制组机制。\n\n* [简介](linux-cgroups-1.md)\n"
  },
  {
    "path": "Cgroups/linux-cgroups-1.md",
    "chars": 16094,
    "preview": "控制组\n================================================================================\n\n简介\n-------------------------------"
  },
  {
    "path": "Concepts/README.md",
    "chars": 158,
    "preview": "# Linux 内核概念\n\n本章描述内核中使用到的各种各样的概念。\n\n* [每个 CPU 的变量](linux-cpu-1.md)\n* [CPU 掩码](linux-cpu-2.md)\n* [initcall 机制](linux-cpu-3"
  },
  {
    "path": "Concepts/linux-cpu-1.md",
    "chars": 6697,
    "preview": "Per-cpu 变量\n================================================================================\n\nPer-cpu 变量是一项内核特性。从它的名字你就可以"
  },
  {
    "path": "Concepts/linux-cpu-2.md",
    "chars": 6494,
    "preview": "CPU masks\n================================================================================\n\n介绍\n-------------------------"
  },
  {
    "path": "Concepts/linux-cpu-3.md",
    "chars": 13271,
    "preview": "initcall 机制\n================================================================================\n\n介绍\n-----------------------"
  },
  {
    "path": "Concepts/linux-cpu-4.md",
    "chars": 18401,
    "preview": "Notification Chains in Linux Kernel\n================================================================================\n\nIn"
  },
  {
    "path": "DataStructures/README.md",
    "chars": 265,
    "preview": "Linux内核中的数据结构\n========================================================================\n\nLinux内核对很多数据结构提供不同的实现方法，比如，双向链表，"
  },
  {
    "path": "DataStructures/linux-datastructures-1.md",
    "chars": 6214,
    "preview": "Linux 内核里的数据结构——双向链表\n================================================================================\n\n双向链表\n------------"
  },
  {
    "path": "DataStructures/linux-datastructures-2.md",
    "chars": 5785,
    "preview": "Linux内核中的数据结构\n================================================================================\n\n基数树\n--------------------"
  },
  {
    "path": "DataStructures/linux-datastructures-3.md",
    "chars": 15527,
    "preview": "Linux 内核里的数据结构——位数组\n================================================================================\n\nLinux 内核中的位数组和位操作\n"
  },
  {
    "path": "Initialization/README.md",
    "chars": 773,
    "preview": "#内核初始化流程\n\n读者在这章可以了解到整个内核初始化的完整周期，从内核解压之后的第一步到内核自身运行的第一个进程。\n\n*注意* 这里不是所有内核初始化步骤的介绍。这里只有通用的内核内容，不会涉及到中断控制、 ACPI 、以及其它部分。此处"
  },
  {
    "path": "Initialization/linux-initialization-1.md",
    "chars": 21150,
    "preview": "内核初始化 第一部分\n================================================================================\n\n踏入内核代码的第一步（TODO: Need proof"
  },
  {
    "path": "Initialization/linux-initialization-10.md",
    "chars": 30744,
    "preview": "Kernel initialization. Part 10.\n================================================================================\n\nEnd of"
  },
  {
    "path": "Initialization/linux-initialization-2.md",
    "chars": 19191,
    "preview": "内核初始化 第二部分\n================================================================================\n\n初期中断和异常处理\n-----------------"
  },
  {
    "path": "Initialization/linux-initialization-3.md",
    "chars": 12419,
    "preview": "内核初始化 第三部分\n================================================================================\n\n进入内核入口点之前最后的准备工作\n----------"
  },
  {
    "path": "Initialization/linux-initialization-4.md",
    "chars": 17024,
    "preview": "内核初始化. Part 4.\n================================================================================\n\nKernel entry point\n===="
  },
  {
    "path": "Initialization/linux-initialization-5.md",
    "chars": 21663,
    "preview": "内核初始化 第五部分\r\n===========================================================\r\n\r\n与系统架构有关的初始化后续分析\r\n============================"
  },
  {
    "path": "Initialization/linux-initialization-6.md",
    "chars": 21609,
    "preview": "内核初始化 第六部分\n===========================================================\n\n仍旧是与系统架构有关的初始化\n================================="
  },
  {
    "path": "Initialization/linux-initialization-7.md",
    "chars": 21067,
    "preview": "内核初始化 第七部分\n================================================================================\n\n架构相关初始化尾声：最后一程\n============"
  },
  {
    "path": "Initialization/linux-initialization-8.md",
    "chars": 22173,
    "preview": "Kernel initialization. Part 8.\n================================================================================\n\n调度器初始化\n"
  },
  {
    "path": "Initialization/linux-initialization-9.md",
    "chars": 29760,
    "preview": "Kernel initialization. Part 9.\n================================================================================\n\nRCU ini"
  },
  {
    "path": "Interrupts/README.md",
    "chars": 739,
    "preview": "# 中断和中断处理\n\n在 linux 内核中你会发现很多关于中断和异常处理的话题\n\n* [中断和中断处理第一部分](linux-interrupts-1.md) - 描述中断处理主题\n* [深入 Linux 内核中的中断](linux-in"
  },
  {
    "path": "Interrupts/linux-interrupts-1.md",
    "chars": 19683,
    "preview": "中断和中断处理 Part 1.\n================================================================================\n\nIntroduction\n---------"
  },
  {
    "path": "Interrupts/linux-interrupts-10.md",
    "chars": 16428,
    "preview": "中断和中断处理(十)\n=====================\n终结篇\n-------------------------\n本文是 Linux 内核[中断和中断处理](/Interrupts/)的第十节。在[上一节](/Interrupt"
  },
  {
    "path": "Interrupts/linux-interrupts-2.md",
    "chars": 23670,
    "preview": "中断和中断处理 Part 2.\n================================================================================\n\n深入Linux内核中的中断和异常处理\n---"
  },
  {
    "path": "Interrupts/linux-interrupts-3.md",
    "chars": 15471,
    "preview": "中断和中断处理 Part 3.\n================================================================================\n\n异常处理\n-----------------"
  },
  {
    "path": "Interrupts/linux-interrupts-4.md",
    "chars": 27068,
    "preview": "Interrupts and Interrupt Handling. Part 4.\n============================================================================="
  },
  {
    "path": "Interrupts/linux-interrupts-5.md",
    "chars": 25700,
    "preview": "Interrupts and Interrupt Handling. Part 5.\n============================================================================="
  },
  {
    "path": "Interrupts/linux-interrupts-6.md",
    "chars": 25175,
    "preview": "Interrupts and Interrupt Handling. Part 6.\n============================================================================="
  },
  {
    "path": "Interrupts/linux-interrupts-7.md",
    "chars": 26739,
    "preview": "Interrupts and Interrupt Handling. Part 7.\n============================================================================="
  },
  {
    "path": "Interrupts/linux-interrupts-8.md",
    "chars": 27849,
    "preview": "Interrupts and Interrupt Handling. Part 8.\n============================================================================="
  },
  {
    "path": "Interrupts/linux-interrupts-9.md",
    "chars": 18044,
    "preview": "中断和中断处理(九)\n================================================================================\n\n延后中断(软中断，Tasklets 和工作队列)介绍\n"
  },
  {
    "path": "KernelStructures/README.md",
    "chars": 362,
    "preview": "# Linux 内核内部`系统`数据结构\n\n这不是 `linux-insides-zh` 中的一般章节。正如你从题目中理解到的，它主要描述 Linux 内核中的内部`系统`数据结构。比如说，中断描述符表 (`Interrupt Descri"
  },
  {
    "path": "KernelStructures/linux-kernelstructure-1.md",
    "chars": 6903,
    "preview": " 中断描述符 （IDT）\n================================================================================\n\n三个常见的中断和异常来源：\n\n* 异常 - syn"
  },
  {
    "path": "LICENSE",
    "chars": 20845,
    "preview": "Attribution-NonCommercial-ShareAlike 4.0 International\n\n================================================================"
  },
  {
    "path": "LINKS.md",
    "chars": 1573,
    "preview": "有帮助的链接\n========================\n\nLinux 启动\n------------------------\n\n* [Linux/x86 boot protocol](https://www.kernel.org/d"
  },
  {
    "path": "MM/README.md",
    "chars": 240,
    "preview": "# Linux 内核内存管理\n\n本章描述 Linux 内核中的内存管理。在本章中你会看到一系列描述 Linux 内核内存管理框架的不同部分的帖子。\n\n* [内存块](linux-mm-1.md) - 描述早期的 `memblock` 分配器"
  },
  {
    "path": "MM/linux-mm-1.md",
    "chars": 13650,
    "preview": "内核内存管理. 第一部分.\n================================================================================\n\n简介\n---------------------"
  },
  {
    "path": "MM/linux-mm-2.md",
    "chars": 18299,
    "preview": "内核内存管理. 第二部分.\n================================================================================\n\n固定映射地址和输入输出重映射\n---------"
  },
  {
    "path": "MM/linux-mm-3.md",
    "chars": 14142,
    "preview": "Linux内核内存管理 第三节\n================================================================================\n\n内核中 kmemcheck 介绍\n-----"
  },
  {
    "path": "Misc/README.md",
    "chars": 37,
    "preview": "# 杂项\n\n这个章节包含不直接涉及到内核源码的部分以及各个子系统的实现。\n"
  },
  {
    "path": "Misc/linux-misc-1.md",
    "chars": 18341,
    "preview": "Linux 内核开发\n================================================================================\n\n简介\n------------------------"
  },
  {
    "path": "Misc/linux-misc-2.md",
    "chars": 23857,
    "preview": "你知道 Linux 内核是如何构建的吗？\r\n================================================================================\r\n\r\n### 介绍\r\n\r\n我不会告"
  },
  {
    "path": "Misc/linux-misc-3.md",
    "chars": 18677,
    "preview": "介绍\n---------------\n\n在写 [linux-insides](/) 一书的过程中，我收到了很多邮件询问关于[链接器](https://zh.wikipedia.org/wiki/%E9%93%BE%E6%8E%A5%E5%9"
  },
  {
    "path": "Misc/linux-misc-4.md",
    "chars": 18752,
    "preview": "用户空间的程序启动过程\n================================================================================\n\n简介\n-----------------------"
  },
  {
    "path": "README.md",
    "chars": 1371,
    "preview": "# Linux 内核揭秘\n\n一系列关于 Linux 内核和其内在机理的帖子。\n\n**目的很简单** - 分享我对 Linux 内核机理的一些浅见，帮助读者理解 Linux 内核机理和其他底层内容。从 [这里](https://github."
  },
  {
    "path": "SUMMARY.md",
    "chars": 3694,
    "preview": "# Summary\n\n* [简介](README.md)\n* [引导](Booting/README.md)\n    * [从引导加载程序内核](Booting/linux-bootstrap-1.md)\n    * [在内核安装代码的第一"
  },
  {
    "path": "Scripts/validate_markdown_links.py",
    "chars": 1564,
    "preview": "#!/usr/bin/env python3\n\nfrom socket import timeout\n\nimport os\nimport sys\nimport codecs\nimport re\n\nimport markdown\n\nfrom "
  },
  {
    "path": "SyncPrim/README.md",
    "chars": 486,
    "preview": "# Linux 内核中的同步原语\n\n这个章节描述内核中所有的同步原语。\n\n* [自旋锁简介](linux-sync-1.md) - 这个章节的第一部分描述 Linux 内核中自旋锁机制的实现；\n* [队列自旋锁](linux-sync-2."
  },
  {
    "path": "SyncPrim/linux-sync-1.md",
    "chars": 16723,
    "preview": "Linux 内核中的同步原语. 第一部分.\n================================================================================\n\nIntroduction\n---"
  },
  {
    "path": "SyncPrim/linux-sync-2.md",
    "chars": 18407,
    "preview": "Linux 内核的同步原语. 第二部分.\n================================================================================\n\n队列自旋锁\n-----------"
  },
  {
    "path": "SyncPrim/linux-sync-3.md",
    "chars": 13619,
    "preview": "\n内核同步原语. 第三部分.\n================================================================================\n\n信号量\n-------------------"
  },
  {
    "path": "SyncPrim/linux-sync-4.md",
    "chars": 18892,
    "preview": "内核同步原语. 第四部分.\n================================================================================\n\n引言\n---------------------"
  },
  {
    "path": "SyncPrim/linux-sync-5.md",
    "chars": 32275,
    "preview": "Synchronization primitives in the Linux kernel. Part 5.\n================================================================"
  },
  {
    "path": "SyncPrim/linux-sync-6.md",
    "chars": 19756,
    "preview": "Synchronization primitives in the Linux kernel. Part 6.\n================================================================"
  },
  {
    "path": "SysCall/README.md",
    "chars": 409,
    "preview": "# 系统调用\n\n本章描述 Linux 内核中的系统调用概念。\n\n* [系统调用概念简介](linux-syscall-1.md) - 介绍 Linux 内核中的系统调用概念\n* [Linux 内核如何处理系统调用](linux-syscal"
  },
  {
    "path": "SysCall/linux-syscall-1.md",
    "chars": 16709,
    "preview": "Linux 内核系统调用 第一节 \n================================================================================\n\n简介\n-----------------"
  },
  {
    "path": "SysCall/linux-syscall-2.md",
    "chars": 15728,
    "preview": "Linux 系统内核调用 第二节\n================================================================================\n\nLinux 内核如何处理系统调用\n----"
  },
  {
    "path": "SysCall/linux-syscall-3.md",
    "chars": 17405,
    "preview": "Linux 内核系统调用 第三节\n================================================================================\n\nvsyscalls 和 vDSO\n----"
  },
  {
    "path": "SysCall/linux-syscall-4.md",
    "chars": 25168,
    "preview": "System calls in the Linux kernel. Part 4.\n=============================================================================="
  },
  {
    "path": "SysCall/linux-syscall-5.md",
    "chars": 16140,
    "preview": "`open` 系统调用实现\n--------------------------------------------------------------------------------\n\n导论\n---------------------"
  },
  {
    "path": "SysCall/linux-syscall-6.md",
    "chars": 9993,
    "preview": "Limits on resources in Linux\n================================================================================\n\nEach proc"
  },
  {
    "path": "TRANSLATION_NOTES.md",
    "chars": 1416,
    "preview": "# 翻译约定\r\n\r\n## 基本要求\r\n\r\n* 准确表述原文的意思；\r\n* 中文应该意思清晰且符合中文表达习惯；\r\n* 原文如果表达不清晰，中文应该意译，并且应根据上下文和注释进行推断并填补相应的信息；\r\n* 情况 `3` 不能太多；\r\n* "
  },
  {
    "path": "TRANSLATION_STATUS.md",
    "chars": 10330,
    "preview": "## 翻译\n\n### 翻译进度\n\n|章节|译者|翻译进度|\n| ------------- |:-------------:| -----:|\n| 1. [Booting](Booting)||已完成|\n|├ [1.0](Booting/R"
  },
  {
    "path": "Theory/README.md",
    "chars": 130,
    "preview": "# 理论\n\n这一章描述各种理论性概念和那些不直接涉及实践，但是知道了会很有用的概念。\n\n* [分页](linux-theory-1.md)\n* [Elf64 格式](linux-theory-2.md)\n* [內联汇编](linux-the"
  },
  {
    "path": "Theory/linux-theory-1.md",
    "chars": 10524,
    "preview": "分页\n================================================================================\n\n简介\n--------------------------------"
  },
  {
    "path": "Theory/linux-theory-2.md",
    "chars": 6447,
    "preview": "ELF文件格式\n================================================================================\n\nELF (Executable and Linkable F"
  },
  {
    "path": "Theory/linux-theory-3.md",
    "chars": 22823,
    "preview": "Inline assembly\n================================================================================\n\nIntroduction\n---------"
  },
  {
    "path": "Timers/README.md",
    "chars": 536,
    "preview": "# 定时器和时钟管理\n\n本章介绍 Linux 内核中定时器和时钟管理相关的观念。\n\n* [简介](linux-timers-1.md) - 简单介绍 Linux 内核中的定时器。\n* [时钟源框架简介](linux-timers-2.md)"
  },
  {
    "path": "Timers/linux-timers-1.md",
    "chars": 25906,
    "preview": "Timers and time management in the Linux kernel. Part 1.\n================================================================"
  },
  {
    "path": "Timers/linux-timers-2.md",
    "chars": 30685,
    "preview": "Timers and time management in the Linux kernel. Part 2.\n================================================================"
  },
  {
    "path": "Timers/linux-timers-3.md",
    "chars": 32588,
    "preview": "Timers and time management in the Linux kernel. Part 3.\n================================================================"
  },
  {
    "path": "Timers/linux-timers-4.md",
    "chars": 20354,
    "preview": "Timers and time management in the Linux kernel. Part 4.\n================================================================"
  },
  {
    "path": "Timers/linux-timers-5.md",
    "chars": 25859,
    "preview": "Timers and time management in the Linux kernel. Part 5.\n================================================================"
  },
  {
    "path": "Timers/linux-timers-6.md",
    "chars": 14797,
    "preview": "Linux内核中的定时器和时间管理 第6部分\n================================================================================\n\nx86_64相关的时钟资源\n-"
  },
  {
    "path": "Timers/linux-timers-7.md",
    "chars": 21138,
    "preview": "Timers and time management in the Linux kernel. Part 7.\n================================================================"
  }
]

About this extraction

This page contains the full source code of the hust-open-atom-club/linux-insides-zh GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 87 files (1.2 MB), approximately 421.2k tokens, and a symbol index with 2 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Extract another repo