Repository: 0xAX/linux-insides
Branch: master
Commit: 50d9f02c4694
Files: 105
Total size: 1.6 MB

Directory structure:
gitextract_i4p4qj27/

├── .github/
│   ├── FUNDING.yml
│   ├── ISSUE_TEMPLATE/
│   │   ├── content-issue.yml
│   │   └── question.yml
│   ├── dependabot.yaml
│   ├── pull-request-template.md
│   └── workflows/
│       ├── check-code-snippets.yaml
│       ├── check-links.yaml
│       ├── generate-e-books.yaml
│       └── release-e-books.yaml
├── .gitignore
├── Booting/
│   ├── README.md
│   ├── linux-bootstrap-1.md
│   ├── linux-bootstrap-2.md
│   ├── linux-bootstrap-3.md
│   ├── linux-bootstrap-4.md
│   ├── linux-bootstrap-5.md
│   └── linux-bootstrap-6.md
├── CODEOWNERS
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── Cgroups/
│   ├── README.md
│   └── linux-cgroups-1.md
├── Concepts/
│   ├── README.md
│   ├── linux-cpu-1.md
│   ├── linux-cpu-2.md
│   ├── linux-cpu-3.md
│   └── linux-cpu-4.md
├── DataStructures/
│   ├── README.md
│   ├── linux-datastructures-1.md
│   ├── linux-datastructures-2.md
│   └── linux-datastructures-3.md
├── Dockerfile
├── Initialization/
│   ├── README.md
│   ├── linux-initialization-1.md
│   ├── linux-initialization-10.md
│   ├── linux-initialization-2.md
│   ├── linux-initialization-3.md
│   ├── linux-initialization-4.md
│   ├── linux-initialization-5.md
│   ├── linux-initialization-6.md
│   ├── linux-initialization-7.md
│   ├── linux-initialization-8.md
│   └── linux-initialization-9.md
├── Interrupts/
│   ├── README.md
│   ├── linux-interrupts-1.md
│   ├── linux-interrupts-10.md
│   ├── linux-interrupts-2.md
│   ├── linux-interrupts-3.md
│   ├── linux-interrupts-4.md
│   ├── linux-interrupts-5.md
│   ├── linux-interrupts-6.md
│   ├── linux-interrupts-7.md
│   ├── linux-interrupts-8.md
│   └── linux-interrupts-9.md
├── KernelStructures/
│   ├── .gitkeep
│   ├── README.md
│   └── linux-kernelstructure-1.md
├── LICENSE
├── LINKS.md
├── MM/
│   ├── README.md
│   ├── linux-mm-1.md
│   ├── linux-mm-2.md
│   └── linux-mm-3.md
├── Makefile
├── Misc/
│   ├── README.md
│   ├── linux-misc-1.md
│   ├── linux-misc-2.md
│   ├── linux-misc-3.md
│   └── linux-misc-4.md
├── README.md
├── SUMMARY.md
├── Scripts/
│   ├── README.md
│   ├── get_all_links.py
│   └── latex.sh
├── SyncPrim/
│   ├── README.md
│   ├── linux-sync-1.md
│   ├── linux-sync-2.md
│   ├── linux-sync-3.md
│   ├── linux-sync-4.md
│   ├── linux-sync-5.md
│   └── linux-sync-6.md
├── SysCall/
│   ├── README.md
│   ├── linux-syscall-1.md
│   ├── linux-syscall-2.md
│   ├── linux-syscall-3.md
│   ├── linux-syscall-4.md
│   ├── linux-syscall-5.md
│   └── linux-syscall-6.md
├── Theory/
│   ├── README.md
│   ├── linux-theory-1.md
│   ├── linux-theory-2.md
│   └── linux-theory-3.md
├── Timers/
│   ├── README.md
│   ├── linux-timers-1.md
│   ├── linux-timers-2.md
│   ├── linux-timers-3.md
│   ├── linux-timers-4.md
│   ├── linux-timers-5.md
│   ├── linux-timers-6.md
│   └── linux-timers-7.md
├── book-A5.json
├── book.json
├── contributors.md
├── lychee.toml
└── scripts/
    └── check_code_snippets.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/FUNDING.yml
================================================
# These are supported funding model platforms

patreon: 0xAX


================================================
FILE: .github/ISSUE_TEMPLATE/content-issue.yml
================================================
name: 📖 Content issue
description: Report an issue with the content
body:
  - type: markdown
    attributes:
      value: |
        Use this form to report an issue with the content.

        When contributing, make sure to follow Contributing guidelines and Code of Conduct.
        Thank you for your contribution!

  - type: checkboxes
    attributes:
      label: Existing issues
      description: Is there an existing issue for this? Search open and closed issues to avoid duplicates.
      options:
        - label: I have searched the existing issues.
          required: true

  - type: input
    attributes:
      label: Affected document
      description: Name or paste a link to the document that contains an issue.
    validations:
      required: true

  - type: textarea
    attributes:
      label: Issue description
      description: Explain what is unclear or confusing in the given document.
    validations:
      required: true

  - type: textarea
    attributes:
      label: Attachments
      description: Include screenshots or links if applicable.
    validations:
      required: false


================================================
FILE: .github/ISSUE_TEMPLATE/question.yml
================================================
name: ❓ Questions and discussions
description: Ask a question or start a discussion with other community members.
body:
  - type: markdown
    attributes:
      value: |
        Use this form to ask a question or start a discussion with other community members.

        When contributing, make sure to follow Contributing guidelines and Code of Conduct.
        Thank you for your contribution!

  - type: checkboxes
    attributes:
      label: Existing issues
      description: Is there an existing issue for this? Search open and closed issues to avoid duplicates.
      options:
        - label: I have searched the existing issues.
          required: true

  - type: textarea
    attributes:
      label: Question
      description: Ask a question you would like to discuss with the community.
    validations:
      required: false

  - type: textarea
    attributes:
      label: Discussion
      description: Start a discussion topic.
    validations:
      required: false

  - type: textarea
    attributes:
      label: Attachments
      description: Include screenshots, links, or example output if applicable.
    validations:
      required: false


================================================
FILE: .github/dependabot.yaml
================================================
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "daily"


================================================
FILE: .github/pull-request-template.md
================================================
<!-- Thank you for your contribution. When contributing to the project, remember to:
- Read the Contribution guide.
- Follow the Code of Conduct.
-->

**Description**

<!-- In this section, provide a description of your changes. The context and justification let others understand your motivation and the purpose of the pull request. Follow the description with a list that summarises the most relevant changes included in the pull request. -->

Changes proposed in this pull request:

- ...
- ...
- ...

**Related issues**

<!-- Link the related issue here, if applicable. -->


================================================
FILE: .github/workflows/check-code-snippets.yaml
================================================
name: check code snippets

on:
  workflow_dispatch:
  push:
    branches:
      - main
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  check-code-snippets:
    name: check-code-snippets
    runs-on:
      - ubuntu-22.04
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6
      - name: Setup python
        uses: actions/setup-python@v6
        with:
          python-version: '3.13'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install requests  
      - name: Validate code snippets
        run: |
          python ./scripts/check_code_snippets.py .


================================================
FILE: .github/workflows/check-links.yaml
================================================
name: check links

on:
  workflow_dispatch:
  push:
    branches:
      - main
      - master
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  check-links:
    name: check-links
    runs-on:
      - ubuntu-22.04
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Check links with lychee
        uses: lycheeverse/lychee-action@v2
        with:
          # Check README.md and all files in Booting directory
          args: |
            --verbose
            --no-progress
            --max-retries 3
            --timeout 20
            README.md
            'Booting/*.md'
          fail: true
        env:
          GITHUB_TOKEN: ${{secrets.GITHUB_TOKEN}}


================================================
FILE: .github/workflows/generate-e-books.yaml
================================================
name: Generate e-books

on:
  workflow_dispatch: {}

jobs:
  build-for-pr:
    # For every PR, build the same artifacts and make them accessible from the PR.
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest

    permissions:
      contents: read
      pull-requests: write

    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Export all supported book formats from the Docker container
        run: |
          make run
          make export

      - name: Copy generated files to host system
        run: |
          make cp
          mkdir -p artifacts/
          mv "Linux Inside - 0xAX.epub" \
             "Linux Inside - 0xAX.mobi" \
             "Linux Inside - 0xAX.pdf" \
             "Linux Inside - 0xAX (A5).pdf" \
             artifacts/

      - name: Upload PR artifacts
        uses: actions/upload-artifact@v7
        with:
          name: ebooks-${{ github.sha }}
          path: artifacts/*
          if-no-files-found: error
          # Change the retention period here if necessary.
          retention-days: 7

      - name: Add a comment with a link to the generated artifacts.
        # For forked PRs the token is read-only; skip commenting to avoid failures.
        if: ${{ github.event.pull_request.head.repo.full_name == github.event.pull_request.base.repo.full_name }}
        uses: actions/github-script@v8
        env:
          RUN_URL: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
        with:
          script: |
            const body = [
              `E-books generated for this pull request available at: ${process.env.RUN_URL}`
            ].join('\n');
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body
            });


================================================
FILE: .github/workflows/release-e-books.yaml
================================================
name: Release e-books

on:
  push:
    tags:
      - 'v*.*' # Create a release only when a new tag matching v*.* is pushed.
    # To also create a release for each push to the main branch, uncomment the following 2 lines:
    # branches:
    #   - master
  workflow_dispatch: {}  # For manual runs.

jobs:
  release-ebooks:
    runs-on: ubuntu-latest

    permissions:
      contents: write

    steps:
      - name: Checkout repository
        uses: actions/checkout@v6

      - name: Export all supported book formats from the Docker container
        run: |
          make run
          make export

      - name: Copy generated files to host system
        run: |
          make cp
          mkdir -p artifacts/
          mv "Linux Inside - 0xAX.epub" \
             "Linux Inside - 0xAX.mobi" \
             "Linux Inside - 0xAX.pdf" \
             "Linux Inside - 0xAX (A5).pdf" \
             artifacts/
          cp LICENSE artifacts/

      - name: Prepare release metadata
        # Use tag name when running on a tag, otherwise fall back to the short commit hash.
        id: meta
        env:
          GITHUB_REF_TYPE: ${{ github.ref_type }}
          GITHUB_REF_NAME: ${{ github.ref_name }}
        run: |
          DATE_UTC="$(date -u '+%m/%d/%Y %H:%M')"
          if [ "${GITHUB_REF_TYPE}" = "tag" ] && [ -n "${GITHUB_REF_NAME}" ]; then
            LABEL="${GITHUB_REF_NAME}"
          else
            LABEL="$(git rev-parse --short HEAD)"
          fi
          echo "release_name=${DATE_UTC} (${LABEL})" >> "$GITHUB_OUTPUT"
          echo "tag_name=${LABEL}" >> "$GITHUB_OUTPUT"

      - name: Create GitHub release
        uses: softprops/action-gh-release@v2
        with:
          files: artifacts/*
          name: ${{ steps.meta.outputs.release_name }}
          tag_name: ${{ steps.meta.outputs.tag_name }}
          target_commitish: ${{ github.sha }}
          generate_release_notes: true
          fail_on_unmatched_files: true


================================================
FILE: .gitignore
================================================
*.tex
build


================================================
FILE: Booting/README.md
================================================
# Kernel Boot Process

Welcome to the boot journey of the Linux kernel. This chapter walks through the complete boot path step by step, from the moment you power on your computer to the moment the decompressed Linux kernel is loaded into memory and ready to execute its first instruction.

## How to read

This chapter assumes you are comfortable with basic computer architecture and have a light familiarity with the `C` programming language and x86_64 assembly syntax. You do not need to be a kernel expert, but being able to read short code snippets and recognize hardware terms will help.

Each part of this chapter focuses on one boot phase. Read the parts in order the first time, then revisit individual steps as references when you want to map a specific symbol or register setup to its place in the sequence. It is quite useful to have the Linux kernel source code on your local computer to follow the details. You can obtain the source code using the following command:

```bash
git clone git@github.com:torvalds/linux.git
```

## Notation used

While reading this and other chapters, you may encounter the following notation:

- `CS`, `DS`, `SS`, `CR0`, `CR3`, `CR4`, `EFER` - refer to x86 segment and control registers
- `0x...` - denotes hexadecimal values
- `entry_*` and `startup_*` - are common prefixes for early boot symbols
- `setup code` refers to the early part of the Linux kernel that prepares the machine before the kernel itself is loaded into memory
- `decompressor` refers to the part of the `setup code` that inflates the compressed kernel image into memory

## What you will learn

- The way a processor reaches the kernel entry point from firmware and the bootloader
- Different modes of x86_64 processors
- What the early setup code does before the kernel itself is loaded into memory and starts its work

## Reading order

1. [From the bootloader to kernel](linux-bootstrap-1.md) - from power-on to the first instruction in the kernel
2. [First steps in the kernel setup code](linux-bootstrap-2.md) - early setup, heap init, parameter discovery (EDD, IST, and more)
3. [Video mode initialization and transition to protected mode](linux-bootstrap-3.md) - video mode setup and the move to protected mode
4. [Transition to 64-bit mode](linux-bootstrap-4.md) - preparation and the jump into long mode
5. [Kernel Decompression](linux-bootstrap-5.md) - pre-decompression setup and the decompressor itself
6. [Kernel load address randomization](linux-bootstrap-6.md) - how KASLR picks a load address

## Kernel version

This chapter corresponds to `Linux kernel v6.19`.


================================================
FILE: Booting/linux-bootstrap-1.md
================================================
# Kernel Booting Process — Part 1

If you’ve read my earlier [posts](https://github.com/0xAX/asm) about [assembly language](https://en.wikipedia.org/wiki/Assembly_language) for Linux x86_64, you may have noticed that I became interested in low-level programming. I’ve written a set of articles on assembly programming for [x86_64](https://en.wikipedia.org/wiki/X86-64) Linux and, in parallel, began exploring the Linux kernel source code. I’ve always been fascinated by what happens under the hood: how programs execute on a CPU, how they’re laid out in memory, how the kernel schedules processes and manages resources, how the network stack operates at a low level, and many other details. This series is a way of sharing my journey.

> [!NOTE]
> This is not official Linux kernel documentation, it is a learning project. I’m not a professional Linux kernel developer, and I don’t write kernel code as part of my daily job. Learning how the Linux kernel works is just my hobby. If you find anything unclear, spot an error, or have questions or suggestions, feel free to reach out - you always can ping me on X [0xAX](https://twitter.com/0xAX), send me an [email](mailto:anotherworldofworld@gmail.com) or open a new [issue](https://github.com/0xAX/linux-insides/issues/new). Your feedback is always welcome and appreciated.

The main goal of this series is to provide a guide to the Linux kernel for readers who want to begin learning how it works. We will explore not only what the kernel does, but also try to understand how and why it does it. Although these notes are meant to be understandable for anyone who is interested in the Linux kernel, some prior knowledge is highly recommended. If you want to experiment with the kernel code, it is best to have a [Linux distribution](https://en.wikipedia.org/wiki/Linux_distribution) installed. Besides that, these pages contain a lot of [C](https://en.wikipedia.org/wiki/C_(programming_language)) and [assembly](https://en.wikipedia.org/wiki/Assembly_language) code, so a good understanding of these programming languages is required.

> [!IMPORTANT]
> I started writing this series when the latest version of the kernel was `3.18`. A lot has changed since then, and I am in the process of updating the content to reflect modern kernels where possible — now focusing on v6.16+. I’ll continue revising the posts as the kernel evolves.

That’s enough introduction — let’s dive into the Linux kernel!

## The Magic Power Button - What happens next?

Although this is a series of posts about the Linux kernel, we will not jump straight into kernel code. First, let’s step back and look at what happens before the kernel even comes into play. Everything starts with turning on the computer, and we will start from this point as well.

When you press the "magic" power button on your laptop or desktop computer, the [motherboard](https://en.wikipedia.org/wiki/Motherboard) sends a signal to the [power supply](https://en.wikipedia.org/wiki/Power_supply). In response, the power supply delivers the proper amount of electricity to other components of the computer. Once the motherboard receives the [power good signal](https://en.wikipedia.org/wiki/Power_good_signal), it triggers the CPU to start. The CPU then performs a reset: it clears any leftover data in its registers and loads predefined values into each of them, preparing for the very first instructions of the boot process.

Each **x86_64** processor begins execution in a special mode called [real mode](https://en.wikipedia.org/wiki/Real_mode). This mode exists for historical reasons - to be compatible with the earliest processors. Real mode is supported on all x86-compatible processors — from the original [8086](https://en.wikipedia.org/wiki/Intel_8086) to today’s modern 64-bit CPUs.

The **8086** was a 16-bit microprocessor. Basically it means that its general-purpose registers and instruction pointer were `16` bits wide. However, the chip was designed with a `20-bit` physical memory address bus — the set of electrical lines used to select memory locations. With `20` address lines, the CPU can form addresses from `0x00000` to `0xFFFFF`, giving access to exactly `1 MB` of physical memory or `2^20` bytes.

Because the registers on **8086** processors were only `16` bits wide, the largest value they could hold was `0xFFFF` (`65_535`). This means that, using just a single 16-bit value, the CPU could directly address only 64 KB of memory at a time. This leads us to the question - how can a processor with 16-bit registers access 20-bit addresses? The answer is [memory segmentation](https://en.wikipedia.org/wiki/Memory_segmentation).

With this scheme the **8086** can use the entire 1 MB space provided by the 20-bit address bus. Memory is accessed through segments of `65_536` bytes each. Instead of using just one value to identify a memory location, the CPU uses two:

1. Segment selector — identifies the starting point (base address) of a 64 KB segment. Represented by the value of the `cs` (code-segment) register.
2. Offset — specifies how far into that segment the target address is. Represented by the value of the `ip` register.

In real mode, the base address for a given segment selector is calculated as:

```
Base Address = Segment Selector << 4
```

To compute the final physical memory address, the CPU adds the base address to the offset:

```
Physical Address = Base Address + Offset
```

For example, if the value of the `cs:ip` is `0x2000:0x0010`, then the corresponding physical address will be:

```python
>>> hex((0x2000 << 4) + 0x0010)
'0x20010'
```

If we take the largest possible values for the segment selector and the offset - `0xFFFF:0xFFFF`, the resulting address will be:

```python
>>> hex((0xffff << 4) + 0xffff)
'0x10ffef'
```

This gives us the address `0x10FFEF`, which is `65_520` bytes past the last address of the first megabyte (`0xFFFFF`). Since the original **8086** CPU could access only the first 1 MB of memory, any address above `0xFFFFF` wrapped around to the beginning of the address space. On modern **386+** CPUs the physical bus is wider even in real mode, but the address computation is still based on the `segment:offset` pair.
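The wrap-around can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not emulator code: the function name `real_mode_address` and its `address_lines` parameter are made up for this example, and the mask simply models a bus with that many address lines.

```python
def real_mode_address(segment, offset, address_lines=20):
    """Translate a real-mode segment:offset pair into a physical address.

    `address_lines` is a hypothetical knob for this sketch: it models how
    many address lines the bus has, so any higher bits are dropped.
    """
    linear = (segment << 4) + offset            # base address + offset
    return linear & ((1 << address_lines) - 1)  # keep only the wired bits


# 8086-style 20-bit bus: the largest pair wraps around to low memory.
print(hex(real_mode_address(0xFFFF, 0xFFFF)))
# A wider bus (32 lines assumed here) keeps the address above 1 MB.
print(hex(real_mode_address(0xFFFF, 0xFFFF, address_lines=32)))
```

With 20 address lines the first call prints `0xffef`, while the second one prints `0x10ffef`, the same value as in the calculation above.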

Now that we understand the basics of real mode and its memory addressing limitations, let’s return to the state after a hardware reset.

## First code executed after reset

The system has just been powered on, the reset signal has been released, and the processor is waking up to execute its first instructions. The [80386](https://en.wikipedia.org/wiki/Intel_80386) and later CPUs set the following [register](https://en.wikipedia.org/wiki/X86#x86_registers) values after a hardware reset:

| Register           | Value        | Meaning                                                                        |
| ------------------ | ------------ | ------------------------------------------------------------------------------ |
| `ip`               | `0xFFF0`     | Instruction pointer; execution starts here within the current code segment     |
| `cs` (selector)    | `0xF000`     | Visible code segment selector value after reset                                |
| `cs` (base)        | `0xFFFF0000` | Hidden descriptor base address loaded into `cs` during reset                   |

In real mode, the base address is normally formed by shifting the 16-bit segment selector value 4 bits left to produce a 20-bit base address. However, after a hardware reset the first instruction is located at a special address. The segment selector in the `cs` register is loaded with `0xF000`, but the hidden base address is loaded with `0xFFFF0000`. Instead of using the usual formula, the processor uses this hidden value as the base address of the first instruction. Given the base address and the offset (from the `ip` register), the starting address will be:

```python
>>> hex(0xffff0000 + 0xfff0)
'0xfffffff0'
```

We got `0xFFFFFFF0`, which is 16 bytes below 4GB. This is the very first address where the CPU starts execution after reset. This address has a special name - [reset vector](https://en.wikipedia.org/wiki/Reset_vector). It is the memory location at which the CPU expects to find the first instruction to execute after reset. Usually it contains a [jump](https://en.wikipedia.org/wiki/JMP_%28x86_instruction%29) (`jmp`) instruction which points to the [BIOS](https://en.wikipedia.org/wiki/BIOS) or [UEFI](https://en.wikipedia.org/wiki/UEFI) entry point. For example, if we take a look at the [source code](https://github.com/coreboot/coreboot/blob/main/src/cpu/x86/entry16.S) of [coreboot](https://www.coreboot.org/), we will see it there:

<!-- https://raw.githubusercontent.com/coreboot/coreboot/refs/heads/main/src/cpu/x86/entry16.S#L155-L159 -->
```assembly
  /* This is the first instruction the CPU runs when coming out of reset. */
.section ".reset", "ax", %progbits
.globl _start
_start:
	jmp		_start16bit
```

To prove that this code is located at the `0xFFFFFFF0` address, we may take a look at the [linker script](https://github.com/coreboot/coreboot/blob/master/src/arch/x86/bootblock.ld):

<!-- https://raw.githubusercontent.com/coreboot/coreboot/refs/heads/master/src/arch/x86/bootblock.ld#L72-L78 -->
```linker-script
	. = 0xfffffff0;
	_X86_RESET_VECTOR = .;
	.reset . : {
		*(.reset);
		. = _X86_RESET_VECTOR_FILLING;
		BYTE(0);
	}
```

The address `0xFFFFFFF0` is much larger than `0xFFFFF` (1MB). How can the CPU access this address in real mode? The answer is simple. Most likely you have something more modern than an **8086** CPU with its 20-bit address bus. Modern processors still start in real mode, but with a 32-bit or wider address bus.

When the CPU wakes up, it reads the jump at the `0xFFFFFFF0` address, jumps into the firmware, and the long chain of the boot process begins. This is the very first step on the way to boot the Linux kernel.

## From Power-On to Bootloader

We stopped at the point where the CPU jumps from the reset vector to the firmware. On a legacy PC, that means the BIOS. On modern computers it is UEFI. In the next chapters we will look at the boot process on a legacy PC using the BIOS, and later at UEFI.

The first job of BIOS is to bring the system into a working state. It runs a series of hardware checks and initializations — memory tests, peripheral setup, chipset configuration — all part of the [POST](https://en.wikipedia.org/wiki/Power-on_self-test) routine. Once everything is checked, the next step is to find an operating system to boot. The BIOS doesn’t pick just a random disk. It follows a boot order, a list stored in its configuration.

When the BIOS tries to boot from a hard drive, it looks for a [boot sector](https://en.wikipedia.org/wiki/Boot_sector). On hard drives partitioned with an [MBR partition layout](https://en.wikipedia.org/wiki/Master_boot_record), the boot code is stored in the first `446` bytes of the first sector, where each sector is `512` bytes. The final two bytes of the first sector must be `0x55` and `0xAA`. These last two bytes tell the BIOS something like "yes - this device is bootable". Once the BIOS finds a valid boot sector, it copies it to the fixed memory location `0x7C00`, jumps there, and starts executing it.
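The check that the BIOS performs on the first sector can be sketched in Python. This is a minimal illustration under stated assumptions: the helper names `is_bootable` and `split_mbr` are invented for this example, and the 64-byte partition table between the boot code and the signature follows the classic MBR layout rather than anything shown in this chapter.

```python
SECTOR_SIZE = 512
BOOT_CODE_SIZE = 446          # bootstrap code area of a classic MBR
BOOT_SIGNATURE = b"\x55\xaa"  # last two bytes of a bootable sector


def is_bootable(sector):
    """Return True if `sector` is 512 bytes long and ends with 0x55 0xAA."""
    return len(sector) == SECTOR_SIZE and sector[-2:] == BOOT_SIGNATURE


def split_mbr(sector):
    """Split a classic MBR sector into (boot code, partition table, signature)."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("an MBR sector must be exactly 512 bytes")
    boot_code = sector[:BOOT_CODE_SIZE]
    partition_table = sector[BOOT_CODE_SIZE:SECTOR_SIZE - 2]
    signature = sector[SECTOR_SIZE - 2:]
    return boot_code, partition_table, signature


# A synthetic sector: all zeros except for the magic signature at the end.
sector = bytes(SECTOR_SIZE - 2) + BOOT_SIGNATURE
boot_code, partition_table, signature = split_mbr(sector)
print(is_bootable(sector), len(boot_code), len(partition_table), signature.hex())
```

To inspect a real disk image instead of the synthetic sector, read its first 512 bytes with `open(path, "rb").read(512)` and pass them to the same helpers.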

In general, real mode's memory map is as follows:

| Address Range         | Description                          |
|-----------------------|--------------------------------------|
| 0x00000000–0x000003FF | Real Mode Interrupt Vector Table     |
| 0x00000400–0x000004FF | BIOS Data Area                       |
| 0x00000500–0x00007BFF | Unused                               |
| 0x00007C00–0x00007DFF | Bootloader                           |
| 0x00007E00–0x0009FFFF | Unused                               |
| 0x000A0000–0x000BFFFF | Video RAM (VRAM)                     |
| 0x000B0000–0x000B7FFF | Monochrome Video Memory              |
| 0x000B8000–0x000BFFFF | Color Video Memory                   |
| 0x000C0000–0x000C7FFF | Video ROM BIOS                       |
| 0x000C8000–0x000EFFFF | BIOS Shadow Area                     |
| 0x000F0000–0x000FFFFF | System BIOS                          |

We can do a simple experiment and create a very primitive boot code:

```assembly
;;
;; Note: this example is written using NASM assembler
;;
[BITS 16]

boot:
    ;; Symbol to print
    mov al, '!'
    ;; TTY-style text output
    mov ah, 0x0e
    ;; Display page number
    mov bh, 0x00
    ;; Color
    mov bl, 0x07
    ;; Interrupt call
    int 0x10
    jmp $

times 510-($-$$) db 0

db 0x55
db 0xaa
```

You can build and run this code using the following commands:

```bash
nasm -f bin boot.S && qemu-system-x86_64 boot -nographic
```

This will instruct the [QEMU](https://www.qemu.org/) virtual machine to use the `boot` binary that we just built as a disk image. Since the binary generated by the assembly code above fulfills the requirements of the boot sector (we end it with the magic sequence), QEMU will treat the binary as the master boot record (MBR) of a disk image.

If you did everything correctly, you will see something like this after running the command above:

```
SeaBIOS (version 1.17.0-5.fc42)

iPXE (https://ipxe.org) 00:03.0 CA00 PCI2.10 PnP PMM+06FCAEC0+06F0AEC0 CA00

Booting from Hard Disk...
!
```

Of course, a real-world boot sector has "slightly" more code, because it loads an operating system instead of printing an exclamation mark, but it may be interesting to experiment. In this example, the code is executed in `16-bit` real mode, which is specified by the `[BITS 16]` directive. After starting, it calls the [0x10](https://en.wikipedia.org/wiki/INT_10H) interrupt, which just prints the `!` symbol. The `times` directive pads the binary with zeros up to the `510th` byte. In the end we "hard-code" the last two magic bytes `0x55` and `0xAA`. To exit from the virtual machine, you can press `Ctrl+a x`.
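The padding and the trailing signature can also be reproduced without an assembler. The Python sketch below is only an illustration: `build_boot_sector` is a name invented for this example, and the machine code bytes were assembled by hand, so treat them as an assumption and compare them with the file produced by `nasm -f bin` if in doubt.

```python
# Hand-assembled bytes for the NASM example (an assumption of this sketch).
BOOT_CODE = bytes([
    0xB0, 0x21,  # mov al, '!'
    0xB4, 0x0E,  # mov ah, 0x0e
    0xB7, 0x00,  # mov bh, 0x00
    0xB3, 0x07,  # mov bl, 0x07
    0xCD, 0x10,  # int 0x10
    0xEB, 0xFE,  # jmp $
])


def build_boot_sector(code):
    """Pad `code` with zeros up to byte 510 and append the 0x55 0xAA signature."""
    if len(code) > 510:
        raise ValueError("boot code does not fit into a single sector")
    padding = bytes(510 - len(code))  # same job as `times 510-($-$$) db 0`
    return code + padding + b"\x55\xaa"


sector = build_boot_sector(BOOT_CODE)
print(len(sector), sector[-2:].hex())
```

The script prints `512 55aa`: the image is exactly one sector long and ends with the magic bytes in the order the BIOS expects.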

From this point onwards, the BIOS hands control over to the bootloader.

## The Bootloader Stage

There are a number of different bootloaders that can boot the Linux kernel, such as [GRUB 2](https://www.gnu.org/software/grub/), [syslinux](http://www.syslinux.org/wiki/index.php/The_Syslinux_Project), [systemd-boot](https://github.com/ivandavidov/systemd-boot), and others. The Linux kernel has a [Boot protocol](https://github.com/torvalds/linux/blob/master/Documentation/arch/x86/boot.rst) which specifies the requirements for a bootloader to implement Linux support. In this chapter, we will take a short look at how GRUB 2 loads the kernel.

Continuing from where we left off - the BIOS has now selected a boot device, found its boot sector, loaded it into memory and passed control to the code located there. The GRUB 2 bootloader consists of multiple [stages](https://www.gnu.org/software/grub/manual/grub/grub.html#Images). The first stage of the boot code is in the [boot.S](https://github.com/rhboot/grub2/blob/master/grub-core/boot/i386/pc/boot.S) source code file. Due to the limited amount of space in the first boot sector, this code has a single goal - to load the [core image](https://www.gnu.org/software/grub/manual/grub/html_node/Images.html) into memory and jump to it.

The core image starts with [diskboot.S](https://github.com/rhboot/grub2/blob/master/grub-core/boot/i386/pc/diskboot.S), which is usually stored right after the first sector of the disk. The code from the `diskboot.S` file loads the rest of the core image into memory. The core image contains the code of the loader itself and drivers for reading different filesystems. After the whole core image is loaded into memory, the execution continues from the [grub_main](https://github.com/rhboot/grub2/blob/master/grub-core/kern/main.c) function. This is where GRUB sets up the environment it needs to operate:

- Initializes the console so messages and menus can be displayed.
- Sets the root device, the disk from which GRUB will read modules and configuration files.
- Loads and parses the GRUB configuration file.
- Loads required modules.

Once these tasks are complete, we may see the familiar GRUB menu where we can choose the operating system we want to load. When we select one of the menu entries, GRUB executes the [boot](https://www.gnu.org/software/grub/manual/grub/grub.html#boot) command, which boots the selected operating system. So how does the loader load the Linux kernel? To answer this question, we need to get back to the Linux kernel boot protocol.

As we can read in the [documentation](https://github.com/torvalds/linux/blob/master/Documentation/arch/x86/boot.rst), the bootloader must load the kernel into memory, fill some fields in the kernel setup header, and pass control to the kernel code. The very first part of the kernel code is the so-called kernel setup header and setup code. The kernel setup header is a special structure embedded in the early Linux boot code that provides fields describing how the kernel should be loaded and started. The setup header starts at offset `0x01F1` from the beginning of the kernel image. We may look at the boot [linker script](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld) to confirm the value of this offset:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/setup.ld#L70-70 -->
```linker-script
	. = ASSERT(hdr == 0x1f1, "The setup header has the wrong offset!");
```

The kernel [setup header](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) is split into two parts, and the first part starts with the following fields:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/header.S#L233-L241 -->
```assembly
	.globl	hdr
hdr:
		.byte setup_sects - 1
root_flags:	.word ROOT_RDONLY
syssize:	.long ZO__edata / 16
ram_size:	.word 0			/* Obsolete */
vid_mode:	.word SVGA_MODE
root_dev:	.word 0			/* Default to major/minor 0/0 */
boot_flag:	.word 0xAA55
```

The bootloader may fill in those fields of the setup header that are marked as being of type `write` or `modify` in the Linux boot protocol. The values set by the bootloader are taken from its configuration or calculated during boot. We will not go over full descriptions and explanations of all the fields of the kernel setup header. Instead, we will take a closer look at a particular field when we meet it in the kernel code.
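
The layout of the first part of the header can also be inspected from user space. The sketch below is a hypothetical helper, not part of the kernel or of any bootloader. It reads the fields at offset `0x1f1` using the sizes from the assembly listing (`.byte`, `.word`, `.long`) and assumes a little-endian image. The name `setup_sects` for the first, unlabeled byte is borrowed from the boot protocol documentation.

```python
import struct

HDR_OFFSET = 0x1F1  # offset asserted in the setup.ld linker script

# One byte, then word, long, word, word, word, word: 15 bytes in total,
# so the last field (boot_flag) ends right before offset 0x200.
HDR_FORMAT = "<BHIHHHH"
HDR_FIELDS = (
    "setup_sects", "root_flags", "syssize", "ram_size",
    "vid_mode", "root_dev", "boot_flag",
)


def parse_setup_header_start(image: bytes) -> dict:
    """Decode the first part of the setup header from a kernel image."""
    values = struct.unpack_from(HDR_FORMAT, image, HDR_OFFSET)
    return dict(zip(HDR_FIELDS, values))


# Possible usage, assuming a kernel was built in the current directory:
# with open("arch/x86/boot/bzImage", "rb") as f:
#     print(parse_setup_header_start(f.read(0x200)))
```

For a valid image, the `boot_flag` field should decode to `0xAA55`, the same magic value we have seen in the boot sector.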

According to the Linux kernel boot protocol, memory will be mapped as follows after loading the kernel:

```
              ~                        ~
              |  Protected-mode kernel |
100000        +------------------------+
              |  I/O memory hole       |
0A0000        +------------------------+
              |  Reserved for BIOS     |      Leave as much as possible unused
              ~                        ~
              |  Command line          |      (Can also be below the X+10000 mark)
X+10000       +------------------------+
              |  Stack/heap            |      For use by the kernel real-mode code.
X+08000       +------------------------+
              |  Kernel setup          |      The kernel real-mode code.
              |  Kernel boot sector    |      The kernel legacy boot sector.
X             +------------------------+
              |  Boot loader           |      <- Boot sector entry point 0000:7C00
001000        +------------------------+
              |  Reserved for MBR/BIOS |
000800        +------------------------+
              |  Typically used by MBR |
000600        +------------------------+
              |  BIOS use only         |
000000        +------------------------+

... where the address X is as low as the design of the boot loader permits.
```
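
Since every region in the lower part of this map is expressed relative to `X`, the layout is easy to recompute for any load address. The following sketch only restates the offsets from the diagram; the function name and the dictionary keys are made up for illustration.

```python
def real_mode_layout(x: int) -> dict:
    """Return the addresses of the real-mode regions for a load address X."""
    return {
        "boot_sector": x,                # kernel legacy boot sector
        "setup_code": x + 0x200,         # right after the 512-byte boot sector
        "stack_heap_start": x + 0x8000,  # X+08000 in the diagram
        "stack_heap_end": x + 0x10000,   # X+10000 in the diagram
    }


for name, address in real_mode_layout(0x90000).items():
    print(f"{name:18} {address:#08x}")
```

The only part of the map that does not move with `X` is the protected-mode kernel, which always starts at `0x100000`.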

We can see that when the bootloader transfers control to the kernel, execution starts right after the kernel's boot sector, that is, at the address `X` plus the length of the boot sector. The value of `X` depends on how the kernel is loaded. For example, if I load the kernel directly with [qemu](https://www.qemu.org/), the kernel image starts at `0x10000`:

```bash
hexdump -C /tmp/dump | grep MZ
00010000  4d 5a 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |MZ..............|
```

The Linux kernel image starts with the bytes `4D 5A`, as you can see at the beginning of the kernel setup code:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/header.S#L42-L46 -->
```assembly
	.code16
	.section ".bstext", "ax"
#ifdef CONFIG_EFI_STUB
	# "MZ", MS-DOS header
	.word	IMAGE_DOS_SIGNATURE
```

If you want to get a similar memory dump, follow these steps. First of all, you need to build the kernel. If you do not know how to do it, you can find detailed instructions [here](https://github.com/0xAX/linux-insides/blob/master/Misc/linux-misc-1.md). On the diagram above, we can see that the `Protected-mode` kernel starts at `0x100000`. Knowing this address, we can start the kernel in the qemu virtual machine with the following command:

```bash
sudo qemu-system-x86_64 -kernel ./linux/arch/x86/boot/bzImage \
                        -nographic                            \
                        -append "console=ttyS0 nokaslr"       \
                        -initrd /boot/initramfs-6.17.0-rc1-g8f5ae30d69d7.img -s -S
```

After the virtual machine is started, we can attach the debugger to it, set up a breakpoint on the entry point and get the dump:

```bash
gdb vmlinux
(gdb) target remote :1234
(gdb) hbreak *0x100000
(gdb) c
Continuing.

Breakpoint 1, 0x0000000000100000 in ?? ()
(gdb) dump binary memory /tmp/dump 0x0000 0x20000
```

After this, you should be able to find your dump in `/tmp/dump`.

If we load the Linux kernel using the GRUB 2 bootloader, this `X` address will be `0x90000`. Let's take a look at how to do it and check. First of all, you need to prepare a disk image with the kernel and GRUB 2. To do so, execute the following commands:

```bash
qemu-img create hdd.img 64M
parted hdd.img --script mklabel msdos
parted hdd.img --script mkpart primary ext2 1MiB 100%
parted hdd.img --script set 1 boot on
LO_DEVICE=$(losetup -f)
sudo losetup -P "${LO_DEVICE}" hdd.img
sudo mkfs.ext2 "${LO_DEVICE}"p1
sudo mkdir -p /mnt/tmp
sudo mount "${LO_DEVICE}"p1 /mnt/tmp
sudo mkdir -p /mnt/tmp/boot/grub
sudo grub2-install \
  --target=i386-pc \
  --boot-directory=/mnt/tmp/boot \
  "${LO_DEVICE}"
sudo cp ./arch/x86/boot/bzImage /mnt/tmp/boot/
sudo tee /mnt/tmp/boot/grub/grub.cfg > /dev/null <<EOF
terminal_input serial
terminal_output serial
set timeout=0
set default=0
set debug=linux

menuentry "Linux" {
    linux /boot/bzImage earlyprintk=serial,0x3f8,115200
}
EOF
sudo umount /mnt/tmp
sudo losetup -d "${LO_DEVICE}"
```

Now we can run the qemu virtual machine with our image:

```bash
qemu-system-x86_64 -drive format=raw,file=hdd.img -m 256M -s -S -no-reboot -no-shutdown -vga virtio
```

Connect with the [gdb](https://sourceware.org/gdb/) debugger and set up a breakpoint:

```
$ gdb
(gdb) target remote localhost:1234
Remote debugging using localhost:1234
(gdb) break *0x90200
Breakpoint 1 at 0x90200
(gdb) c
Continuing.
```

If you did everything correctly, you will see the GRUB 2 prompt in the qemu window. Execute the following commands:

```
set pager=1
set debug=all
linux /boot/bzImage
boot
```

During the execution of the `linux` command, you will see the debug line:

```
relocator: min_addr = 0x0, max_addr = 0xffffffff, target = 0x90000
```

That confirms that the kernel image will be loaded at the `0x90000` address. During execution of the `boot` command, the breakpoint should be hit. In the debugger, you can execute the `i r` command and see that we are at `0x9020:0x0000`:

```
rip            0x0                 0x0
cs             0x9020              36896
```

If you keep executing the `si` command in the debugger, you will go step by step through the early kernel setup code. If you exit from the debugger, you will see the continuation of the kernel loading procedure.

In addition, we can confirm this address using the same approach as in the example with QEMU above. We know that according to the Linux kernel boot protocol, the protected mode kernel is loaded at the `0x100000` address. We can set a breakpoint at this address and create a memory dump. To do this, run the QEMU virtual machine using the same command:

```bash
qemu-system-x86_64 -drive format=raw,file=hdd.img -m 256M -s -S -no-reboot -no-shutdown -vga virtio
```

At the next step, attach with gdb to the virtual machine:

```
(gdb) target remote localhost:1234
Remote debugging using localhost:1234
0x000000000000fff0 in ?? ()
(gdb) break *0x100000
Breakpoint 1 at 0x100000
(gdb) c
Continuing.
```

At the beginning, the breakpoint stops us at the GRUB code itself. Because of this, we need to continue in the debugger with the `c` command. Return to the QEMU window now, and execute these commands:

```
set pager=1
set debug=all
linux /boot/bzImage
boot
```

During the boot process, the debugger stops for the second time at the breakpoint we set at the `0x100000` address:

```
Breakpoint 1, 0x0000000000100000 in ?? ()
(gdb) c
Continuing.
```

This time, we are at the entry point of the Linux kernel in protected mode. Execute the next command in the debugger shell to get a memory dump:

```
dump binary memory /tmp/dump 0x0000 0x200000
```

Now we can inspect the memory dump at the `0x90000` address:

```bash
~$ hexdump -C /tmp/dump | grep 00090000
00090000  4d 5a 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |MZ..............|
```

We can see the same `MZ` header with which the Linux kernel setup code starts. In addition, we can inspect the memory at the `0x90200` address to see the second part of the kernel setup header, with its `HdrS` signature:

```bash
~$ hexdump -C /tmp/dump | grep 00090200
00090200  eb 6a 48 64 72 53 0f 02  00 00 00 00 00 10 00 43  |.jHdrS.........C|
```
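
The same two checks can be scripted instead of grepping through `hexdump` output. The helper below is a small sketch under two assumptions: the dump starts at physical address `0`, as in the `dump binary memory` command above, and the kernel was built with `CONFIG_EFI_STUB`, so that the image begins with `MZ`. The function name is hypothetical.

```python
def check_setup_image(dump: bytes, load_addr: int) -> dict:
    """Look for the kernel setup markers in a physical memory dump."""
    entry = load_addr + 0x200  # the setup code starts 512 bytes into the image
    return {
        # "MZ" signature at the very beginning of the image
        "mz": dump[load_addr:load_addr + 2] == b"MZ",
        # 0xEB is the opcode of the short jump located at the entry point
        "short_jmp": dump[entry] == 0xEB,
        # "HdrS" follows the two-byte jump instruction
        "hdrs": dump[entry + 2:entry + 6] == b"HdrS",
    }


# Possible usage with the dump created above:
# with open("/tmp/dump", "rb") as f:
#     print(check_setup_image(f.read(), 0x90000))
```

If all three values are `True`, the image was very likely loaded at the address we passed in.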

## The Beginning of the Kernel Setup Stage

The bootloader has now loaded the Linux kernel and the kernel setup code into memory, filled the header fields, and then jumped to the corresponding memory address. Finally, we are in the kernel 🎉

Technically, the kernel itself hasn't run yet, only the early kernel setup code. First, the kernel setup code must switch from real mode to [protected mode](https://en.wikipedia.org/wiki/Protected_mode). After that, it switches to [long mode](https://en.wikipedia.org/wiki/Long_mode), configures the kernel decompressor, and finally decompresses the kernel and jumps to it. Execution of the kernel setup code starts from [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) at the `_start` symbol:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/header.S#L246-L256 -->
```assembly
_start:
		# Explicitly enter this as bytes, or the assembler
		# tries to generate a 3-byte jump here, which causes
		# everything else to push off to the wrong offset.
		.byte	0xeb		# short (2-byte) jump
		.byte	start_of_setup-1f
1:

	# Part 2 of the header, from the old setup.S

		.ascii	"HdrS"		# header signature
```

The very first instruction we encounter here is [jmp](https://en.wikipedia.org/wiki/JMP_(x86_instruction)) specified by the `0xEB` opcode. The second byte defines the offset to jump to. As described in the [Intel® 64 and IA-32 Architectures Software Developer Manuals](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html):

> The target operand specifies either an absolute offset (that is an offset from the base of the code segment) or a relative offset (a signed displacement relative to the current value of the instruction pointer in the EIP register).

If you’ve never met the `Nf` syntax before, `1f` means the next label `1` that will appear in the code. Immediately after those two bytes, we can see the label `1` located right before the beginning of the second part of the kernel setup header.
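
To see where such a jump lands, we can decode it by hand. In the memory dump shown earlier, the two bytes at offset `0x200` of the image were `eb 6a`. The sketch below computes the target of a short jump from its offset and its two bytes. The helper name is hypothetical, and the resulting value is specific to that particular build of the kernel.

```python
def short_jmp_target(insn_offset: int, insn: bytes) -> int:
    """Return the target offset of a two-byte short jmp (opcode 0xEB)."""
    if insn[0] != 0xEB:
        raise ValueError("not a short jmp instruction")
    # The second byte is a signed 8-bit displacement...
    displacement = int.from_bytes(insn[1:2], "little", signed=True)
    # ...relative to the address of the next instruction.
    return insn_offset + 2 + displacement


target = short_jmp_target(0x200, bytes([0xEB, 0x6A]))
print(hex(target))  # 0x26c
```

In other words, for that build the `start_of_setup` label is located at offset `0x26c` from the beginning of the image.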

After the second part of the kernel setup header, we can see the `.entrytext` section, which starts with the `start_of_setup` label. This is exactly the place where the execution will be continued:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/header.S#L544-L547 -->
```assembly
# End of setup header #####################################################

	.section ".entrytext", "ax"
start_of_setup:
```

But from which point are we jumping? After the kernel setup code receives control from the bootloader, the first `jmp` instruction is located at offset `0x200` from the start of the loaded kernel image. This is mentioned in the Linux kernel boot protocol:

> The kernel is started by jumping to the kernel entry point, which is located at *segment* offset 0x20 from the start of the real mode kernel.

This applies also to the GRUB 2 bootloader. We can see in its [source code](https://github.com/rhboot/grub2/blob/master/grub-core/loader/i386/pc/linux.c):

```C
segment = grub_linux_real_target >> 4;
state.gs = state.fs = state.es = state.ds = state.ss = segment;
state.sp = GRUB_LINUX_SETUP_STACK;
state.cs = segment + 0x20;
state.ip = 0;
```

Here, `grub_linux_real_target` is the physical address where the kernel setup code will be loaded. As we saw in the [previous section](#the-magic-power-button---what-happens-next), this address was `0x90000`. Shifting it right by four divides it by `16`, converting a physical address into a segment value - that’s how real mode memory segmentation works.

Then, GRUB sets the code segment specified by the `CS` register to `segment + 0x20` before starting execution. Why `0x20`? Let's remember that in real mode, physical addresses are computed as:

```
Physical = (cs << 4) + ip
```

With `segment = 0x9000`, setting `cs = 0x9000 + 0x20 = 0x9020` and `ip = 0` gives us:

```
Physical = (0x9020 << 4) + 0 = 0x90200
```

This means execution starts at physical address `0x90200`, which is exactly `512` bytes from where the setup code was loaded. In other words, this is the offset of the `jmp` instruction within the image.
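
The whole calculation fits into a few lines of Python. This is only a model of the arithmetic from the GRUB snippet above, not GRUB code. The function names are invented for this example, and the stack pointer value `0x9000` is the `GRUB_LINUX_SETUP_STACK` constant from the GRUB sources.

```python
def real_mode_physical(segment: int, offset: int) -> int:
    """Translate a real-mode segment:offset pair into a physical address."""
    return (segment << 4) + offset


def grub_entry_registers(real_target: int) -> dict:
    """Model the register state GRUB prepares before jumping to the kernel."""
    segment = real_target >> 4  # physical address -> segment value
    return {
        "ds": segment, "es": segment, "ss": segment,
        "fs": segment, "gs": segment,
        "sp": 0x9000,            # GRUB_LINUX_SETUP_STACK
        "cs": segment + 0x20,    # 0x20 paragraphs = 0x200 bytes
        "ip": 0,
    }


regs = grub_entry_registers(0x90000)
entry = real_mode_physical(regs["cs"], regs["ip"])
print(hex(regs["cs"]), hex(entry))  # 0x9020 0x90200
```

Adding `0x20` to the segment is the same as adding `0x200` to the physical address, because each segment unit stands for 16 bytes.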

After the jump to the `start_of_setup` label, the kernel setup code enters the very first phase of its real work:

- Unifying the segment registers
- Establishing a valid stack
- Clearing the `.bss` section
- Transitioning into C code

In the next sections, we’ll walk through each of these steps in detail.

### Aligning the segment registers

Reading the Linux kernel boot protocol for `x86_64`, we can see:

> At entry, ds = es = ss should point to the start of the real-mode kernel code...

This is the first operation we can see after the `start_of_setup` label. First, the kernel setup code ensures that the `ds` and `es` segment registers point to the same address. Next, it clears the [direction flag](https://en.wikipedia.org/wiki/Direction_flag) using the `cld` instruction:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/header.S#L546-L551 -->
```assembly
	.section ".entrytext", "ax"
start_of_setup:
# Force %es = %ds
	movw	%ds, %ax
	movw	%ax, %es
	cld
```

We need to do both of these things to clear the [bss](https://en.wikipedia.org/wiki/.bss) section properly a bit later. From this point, we are sure that both the `ds` and `es` segment registers hold the same value, `0x9000`.

### Stack Setup

We need to prepare the environment for C code. The next step is to set up the stack. Let's take a look at the next lines of the code:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/header.S#L553-L561 -->
```assembly
# Apparently some ancient versions of LILO invoked the kernel with %ss != %ds,
# which happened to work by accident for the old code.  Recalculate the stack
# pointer if %ss is invalid.  Otherwise leave it alone, LOADLIN sets up the
# stack behind its own code, so we can't blindly put it directly past the heap.

	movw	%ss, %dx
	cmpw	%ax, %dx	# %ds == %ss?
	movw	%sp, %dx
	je	2f		# -> assume %sp is reasonably set
```

Here we compare the values of the `ss` and `ds` registers to check that they are equal. If they are not, `ss` has to be fixed.

According to the comment in this code, only old versions of the [LILO](https://en.wikipedia.org/wiki/LILO_(bootloader)) bootloader may set these registers to different values. So we will skip the "edge cases" and consider only the case when the value of the `ss` register is equal to `ds`. Since the values of these registers are equal, we jump to the `2` label:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/header.S#L572-L578 -->
```assembly
2:	# Now %dx should point to the end of our stack space
	andw	$~3, %dx	# dword align (might as well...)
	jnz	3f
	movw	$0xfffc, %dx	# Make sure we're not zero
3:	movw	%ax, %ss
	movzwl	%dx, %esp	# Clear upper half of %esp
	sti			# Now we should have a working stack
```

At this point, the `dx` register stores the stack pointer value, which should point to the top of the stack. In our case, the value of the stack pointer is `0x9000`. The GRUB 2 bootloader sets it while loading the Linux kernel image. The value is defined by:

<!-- https://raw.githubusercontent.com/rhboot/grub2/refs/heads/master/include/grub/i386/linux.h#L34-L34 -->
```C
#define GRUB_LINUX_SETUP_STACK		0x9000
```

In the next step, the `andw` instruction aligns the value in `dx` down to a four-byte boundary by clearing its two lowest bits. If the result is not zero, we jump to the label `3`. If the result is zero, we set it to `0xFFFC`. The reason is that the stack pointer must not be zero, because the stack grows down as values are pushed onto it. The `0xFFFC` value is the highest 4-byte aligned address below `0x10000`. After that, `ss` is loaded from `ax`, `esp` is set to the zero-extended value of `dx`, and interrupts are enabled with `sti`.
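
A minimal Python model of the instructions between the labels `2` and `3` may help to see what happens to different input values. It follows the assembly listing directly: `andw $~3, %dx` clears the two lowest bits, and `jnz 3f` skips the fallback whenever the result is not zero. The function name is made up for this example.

```python
def fixup_stack_pointer(sp: int) -> int:
    """Model of `andw $~3, %dx; jnz 3f; movw $0xfffc, %dx`."""
    dx = sp & ~3 & 0xFFFF  # align down to 4 bytes within 16 bits
    if dx == 0:            # the jnz is not taken only for a zero result
        dx = 0xFFFC        # highest dword-aligned offset in the segment
    return dx


for sp in (0x9000, 0x9003, 0x0002, 0x0000):
    print(f"{sp:#06x} -> {fixup_stack_pointer(sp):#06x}")
```

With the stack pointer set by GRUB, the value stays `0x9000`. The fallback to `0xFFFC` is used only when the aligned value turns out to be zero.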

From this point, we have a correct stack that starts at `0x9000:0x9000` and grows down:

![early-stack](./images/early-stack.svg)

### BSS Setup

Before the kernel's setup code can switch to C code, two final tasks must be done:

- Verify the "magic" signature
- Clear the `.bss` section

The first is the signature checking:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/header.S#L588-L589 -->
```assembly
	cmpl	$0x5a5aaa55, setup_sig
	jne	setup_bad
```

This simply compares the [setup_sig](https://github.com/torvalds/linux/blob/master/arch/x86/boot/setup.ld) constant value placed by the linker with the magic number `0x5A5AAA55`. If they are not equal, the setup code reports a fatal error and stops execution. The main goal of this check is to ensure we are actually running a valid Linux kernel setup binary, loaded into the proper place by the bootloader.
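
The same comparison can be sketched in Python. Note that `cmpl` compares a 32-bit little-endian value, so in memory the signature is stored as the bytes `55 aa 5a 5a`. The `sig_offset` parameter is a placeholder: the real position of `setup_sig` is decided by the linker script and is not something this example tries to guess.

```python
import struct

SETUP_SIG = 0x5A5AAA55


def has_setup_signature(image: bytes, sig_offset: int) -> bool:
    """Check for the setup signature at a caller-supplied offset."""
    # "<I" reads one unsigned 32-bit little-endian value, like `cmpl` does.
    (value,) = struct.unpack_from("<I", image, sig_offset)
    return value == SETUP_SIG
```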

With the magic number confirmed, and knowing our segment registers and stack are already in the proper state, the only initialization left is to clear the `.bss` section. This section of memory is used to store statically allocated, uninitialized data. Let's take a look at the initialization of this memory area:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/header.S#L592-L597 -->
```assembly
	movw	$__bss_start, %di
	movw	$_end+3, %cx
	xorl	%eax, %eax
	subw	%di, %cx
	shrw	$2, %cx
	rep stosl
```

The main goal of this code is to clear, or in other words, to fill with zeros the memory area between `__bss_start` and `_end`. To fill this memory area with zeros, the `rep stos` instruction is used. This instruction puts the value of the `eax` register into the destination pointed to by `es:di`. That is why we unified the values of the `ds` and `es` registers at the beginning of the kernel setup code. The `rep` prefix specifies the repetition of the `stos` instruction based on the value of the `cx` register.

To clear this memory area, we first set its borders, from [__bss_start](https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/setup.ld) to `_end + 3`. We add `3` bytes to the `_end` address because we are going to write zeros in double words, meaning four bytes at a time. Adding three bytes ensures that when we later divide by four, any remainder at the end of the memory area is still covered. After we set up the borders of the memory area and fill `eax` with zero using the `xor` instruction, `rep stosl` does its job.
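
The arithmetic behind the `+3` trick is easier to follow with concrete numbers. The sketch below models the count calculation and the zero fill on a plain `bytearray`. The function names are hypothetical, and the sample offsets `0x3f00` and `0x5280` are only illustrative, since the real values depend on the kernel build.

```python
def bss_dword_count(bss_start: int, end: int) -> int:
    """Model of `movw $_end+3, %cx; subw %di, %cx; shrw $2, %cx`."""
    # 16-bit subtraction, then division by four rounded up thanks to the +3.
    return ((end + 3 - bss_start) & 0xFFFF) >> 2


def clear_bss(memory: bytearray, bss_start: int, end: int) -> None:
    """Model of `rep stosl` with %eax = 0: write `count` zero dwords."""
    count = bss_dword_count(bss_start, end)
    memory[bss_start:bss_start + 4 * count] = bytes(4 * count)


print(hex(bss_dword_count(0x3F00, 0x5280)))  # 0x4e0 double words
print(bss_dword_count(0x3F00, 0x3F01))       # 1, a single byte still needs one dword
```

Without the extra `3`, a region whose size is not a multiple of four would lose its last few bytes to the integer division.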

The effect of this code is that zeros are written to the entire memory area from `__bss_start` to `_end`. To know their exact addresses, we can inspect the `setup.elf` file with the [readelf](https://en.wikipedia.org/wiki/Readelf) utility:

```bash
$ readelf -a arch/x86/boot/setup.elf  | grep bss
  [12] .bss              NOBITS          00003f00 004efc 001380 00  WA  0   0 32
   00     .bstext .header .entrytext .inittext .initdata .text .text32 .rodata .videocards .data .signature .bss
   145: 00005280     0 NOTYPE  GLOBAL DEFAULT   12 __bss_end
   169: 00003f00     0 NOTYPE  GLOBAL DEFAULT   12 __bss_start
```

These are offsets inside the setup segment. Since in our case the kernel image is loaded at physical address `0x90000`, the symbols translate to:

- `__bss_start` - `0x90000 + 0x3f00 = 0x93F00`
- `__bss_end` - `0x90000 + 0x5280 = 0x95280`

The following diagram illustrates how the setup image, `.bss`, and the stack region are laid out in memory:

![bss](./images/early-bss.svg)

> [!IMPORTANT]
> The addresses of the `__bss_start` and `__bss_end` may differ on your machine and depend on the Linux kernel version.

We can confirm it by running an experiment. Add a simple change to the [arch/x86/boot/main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c) source code file, build the kernel with our change, and run the Linux kernel in the qemu virtual machine as we did before in this part. The change is:

```diff
modified   arch/x86/boot/main.c
@@ -11,6 +11,7 @@
  * Main module for the real-mode kernel code
  */
 #include <linux/build_bug.h>
+#include <asm/sections.h>

 #include "boot.h"
 #include "string.h"
@@ -173,6 +174,8 @@ void main(void)
	query_edd();
 #endif

+        printf("BSS start: %p. BSS end: %p\n", __bss_start, _end);
+
	/* Set the video mode */
	set_video();
```

If you did everything correctly, you will see an output similar to:

```
BSS start: 00003F00. BSS end: 00005280
```

### Jump to C code

At this point, we have set up the [stack](#stack-setup) and cleared the [.bss](#bss-setup) section. The last step in assembly is to call into C code:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/header.S#L600-L600 -->
```assembly
	calll	main
```

The `main()` function is located in the [arch/x86/boot/main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c) source code file.

We will see what happens there in the next chapter.

## Conclusion

This is the end of the first part about Linux kernel insides. If you have questions or suggestions, feel free to ping me on X - [0xAX](https://twitter.com/0xAX), drop me an [email](mailto:anotherworldofworld@gmail.com), or just create an [issue](https://github.com/0xAX/linux-insides/issues/new). In the next part, we will see the first C code that executes in the Linux kernel setup, the implementation of memory routines such as `memset` and `memcpy`, `earlyprintk`, early console implementation and initialization, and much more.

## Links

Here is a list of links that you may find useful while reading this chapter:

- [Intel 80386 programmer's reference manual 1986](http://css.csail.mit.edu/6.858/2014/readings/i386.pdf)
- [Minimal Boot Loader for Intel® Architecture](https://www.cs.cmu.edu/~410/doc/minimal_boot.pdf)
- [Minimal Boot Loader in Assembler with comments](https://github.com/Stefan20162016/linux-insides-code/blob/master/bootloader.asm)
- [8086](https://en.wikipedia.org/wiki/Intel_8086)
- [80386](https://en.wikipedia.org/wiki/Intel_80386)
- [Reset vector](https://en.wikipedia.org/wiki/Reset_vector)
- [Real mode](https://en.wikipedia.org/wiki/Real_mode)
- [Linux kernel boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.rst)
- [Ralf Brown's Interrupt List](http://www.ctyme.com/intr/int.htm)
- [Power supply](https://en.wikipedia.org/wiki/Power_supply)
- [Power good signal](https://en.wikipedia.org/wiki/Power_good_signal)


================================================
FILE: Booting/linux-bootstrap-2.md
================================================
# Kernel booting process - Part 2

We have already started our journey into the Linux kernel in the previous [part](./linux-bootstrap-1.md), where we walked through the very early stages of the booting process and the first assembly instructions of the Linux kernel code. Among other things, this code was responsible for preparing the environment for the [C](https://en.wikipedia.org/wiki/C_(programming_language)) programming language. At the end of the chapter, we reached a symbolic milestone - the very first call of a C function. This function has a classical name - `main` - and is defined in the [arch/x86/boot/main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c) source code file.

From here on, we will still see some assembly code on our way, but it will be more and more rare 🤓 Now it is time for more "high-level" logic!

From the previous part, we know that the kernel setup code is still running in [real mode](https://en.wikipedia.org/wiki/Real_mode). Its primary task is to move the processor first into [protected mode](https://en.wikipedia.org/wiki/Protected_mode), and then into [long mode](https://en.wikipedia.org/wiki/Long_mode). Almost all of the C code we will see in the next chapters exists for this purpose - to prepare and complete these transitions.

In this part, we’ll keep digging through the kernel’s setup code and cover:

- What protected mode is on x86 processors
- Setup of early [heap](https://en.wikipedia.org/wiki/Memory_management#HEAP) and console
- Detection of available memory
- Validation of a CPU 
- Initialization of a keyboard 

Time to explore these steps in detail!

## Protected mode

The Linux kernel for x86_64 operates in a special mode called [long mode](http://en.wikipedia.org/wiki/Long_mode). One of the main goals of the kernel setup code is to switch to this mode. But before we can move to this mode, the kernel must switch the CPU into [protected mode](https://en.wikipedia.org/wiki/Protected_mode).

What is [protected mode](https://en.wikipedia.org/wiki/Protected_mode)? From the previous chapter, we already know that the CPU currently operates in [real mode](https://en.wikipedia.org/wiki/Real_mode). For us, this mostly means memory segmentation. As a short reminder, to access a memory location, a combination of two CPU [registers](https://en.wikipedia.org/wiki/Processor_register) is used:

- A segment register - `cs`, `ds`, `ss`, or `es` - which holds the segment selector.
- A general purpose register which specifies offset within the segment.

The main motivation for switching from real mode is its memory addressing limitation. As we saw in the previous part, real mode can address only 2<sup>20</sup> bytes. This is just 1 MB of RAM. Obviously, modern software, including an operating system kernel, needs more. To break these constraints, the new processor mode was introduced - `protected mode`.

Protected mode was introduced to the x86 architecture in 1982 with the [80286](http://en.wikipedia.org/wiki/Intel_80286) and remained the primary operating mode of Intel processors until the introduction of x86_64 and long mode. This mode brought many changes and improvements, but one of the most crucial was memory management. The 20-bit address bus was replaced with a 24-bit address bus on the 80286 and then with a 32-bit address bus on the 80386. The latter allowed access to 4 gigabytes of memory, in comparison to the 1 megabyte in real mode.

Memory management in protected mode is divided into two, mostly independent mechanisms:

- `Segmentation`
- `Paging`

For now, our attention stays on segmentation. We’ll return to paging later, once we enter 64-bit long mode.

### Memory segmentation in protected mode

In protected mode, memory segmentation is completely redesigned. Fixed 64 KB real mode segments are gone. Instead, each segment is now defined by a special data structure called a `Segment Descriptor`, which specifies the properties of a memory segment. The segment descriptors are stored in a special structure called the `Global Descriptor Table` or `GDT`. Whenever the CPU needs to find an actual physical memory address, it consults this table. The GDT itself is just a block of memory. Its address is stored in a special CPU register called `gdtr`. This is a 48-bit register that consists of two parts:

- The size of the Global Descriptor Table
- The address of the Global Descriptor Table

Later, we will see exactly how the Linux kernel builds and loads its GDT. For now, it’s enough to know that the CPU provides a dedicated instruction to load the table’s address into the GDTR register:

```assembly
lgdt gdt
```
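
The operand of `lgdt` is a small in-memory structure with the same two parts as the register itself. The following sketch packs such a 6-byte value in Python. It relies on one detail taken from the Intel manuals rather than from the text above: the 16-bit part holds the table limit, that is, the size of the table in bytes minus one. The function names and the sample values are made up for illustration.

```python
import struct


def pack_gdtr(base: int, size: int) -> bytes:
    """Build the 6-byte (48-bit) value: 16-bit limit, then 32-bit base."""
    limit = size - 1  # offset of the last valid byte of the table
    return struct.pack("<HI", limit, base)


def unpack_gdtr(raw: bytes) -> tuple:
    """Split a 6-byte value back into (base, size)."""
    limit, base = struct.unpack("<HI", raw)
    return base, limit + 1


# A hypothetical table with three 8-byte descriptors located at 0x1000.
raw = pack_gdtr(base=0x1000, size=3 * 8)
print(raw.hex())  # 170000100000
```

The little-endian layout is why the limit bytes `17 00` come first and the base bytes `00 10 00 00` follow.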

As mentioned above, the GDT contains `segment descriptors` which describe memory segments. Now let's see what segment descriptors look like. Each descriptor is 64 bits in size. The general layout of a descriptor is:

![segment-descriptor](./images/segment-descriptor.svg)

Do not worry! I know it may look a little bit intimidating at first glance, especially in comparison to the relatively simple addressing in real mode, but we will go through it in detail. We will start from the bottom, from right to left.

The first field is `LIMIT 15:0`. It represents the lowest 16 bits of the segment limit. The second part is located at bits `51:48`. This field provides information about the size of a segment. Since the limit field is 20 bits wide, it may seem that the maximum size of a memory segment is 1 MB, but this is not the case. The maximum size of a segment also depends on the `G` bit (bit 55):

- If `G=0` - the value of the `LIMIT` field is interpreted in bytes.
- If `G=1` - the value of the `LIMIT` field is interpreted in 4 KB units called pages.

Based on this, we can easily calculate that the max size of a segment is 4 GB.
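
This calculation can be written down as a small sketch. It assumes the usual interpretation for ordinary (expand-up) segments, where the limit is the offset of the last valid byte, so the size is the limit plus one. With `G=1` the 20-bit value is shifted left by 12 bits and the low 12 bits are filled with ones. The function name is hypothetical.

```python
def segment_size(limit: int, g: int) -> int:
    """Return the size in bytes described by a 20-bit LIMIT and the G bit."""
    if not 0 <= limit <= 0xFFFFF:
        raise ValueError("LIMIT is a 20-bit field")
    if g:
        effective_limit = (limit << 12) | 0xFFF  # 4 KB granularity
    else:
        effective_limit = limit                   # byte granularity
    return effective_limit + 1


print(segment_size(0xFFFFF, 0) // 1024 ** 2, "MB")  # 1 MB
print(segment_size(0xFFFFF, 1) // 1024 ** 3, "GB")  # 4 GB
```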

The next field is `BASE`. We can see that it is split into three parts. The first part occupies bits from `16` to `31`, the second part occupies bits from `32` to `39`, and the last third part occupies bits from `56` to `63`. The main goal of this field is to store the base address of a segment.

The remaining fields in a segment descriptor represent flags that control different aspects of a segment, such as the type of memory. Let's take a look at the description of these flags:

- `Type` - describes the type of a memory segment.
- `S` - distinguishes system segments from code and data segments.
- `DPL` - provides information about the privilege level of a segment. It can be a value from `0` to `3`, where `0` is the level with the highest privileges.
- `P` - tells the CPU whether a segment is present in memory.
- `AVL` - available and reserved bits. It is ignored by the Linux kernel.
- `L` - indicates whether a code segment contains 64-bit code.
- `D / B` - has a different meaning depending on the type of a segment.
  - For a code segment: Controls the default operand and address size. If the bit is clear, it is a 16-bit code segment. Otherwise it is a 32-bit code segment.
  - For a stack segment, or in other words a data segment pointed to by the `ss` register: Controls the default stack pointer size. If the bit is clear, it is a 16-bit stack segment and stack operations use the `sp` register. Otherwise it is a 32-bit stack segment and stack operations use the `esp` register.
  - For an expand-down data segment: Specifies the upper bound of the segment. If the bit is clear, the upper bound is `0xFFFF` or 64 KB. Otherwise, it is `0xFFFFFFFF` or 4 GB.
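
To tie the fields together, here is a minimal decoder for a 64-bit descriptor value. The positions of `LIMIT`, `BASE`, and `G` come from the description above. The positions of the remaining flags (`Type` at bits 40-43, `S` at 44, `DPL` at 45-46, `P` at 47, `AVL` at 52, `L` at 53, `D/B` at 54) follow the standard layout from the Intel manuals. The function name and the sample value, a flat 4 GB 32-bit code segment, are chosen only for illustration.

```python
def decode_descriptor(desc: int) -> dict:
    """Split a 64-bit segment descriptor into its fields."""
    limit = (desc & 0xFFFF) | (((desc >> 48) & 0xF) << 16)
    base = (
        ((desc >> 16) & 0xFFFF)            # BASE 15:0
        | (((desc >> 32) & 0xFF) << 16)    # BASE 23:16
        | (((desc >> 56) & 0xFF) << 24)    # BASE 31:24
    )
    return {
        "limit": limit,
        "base": base,
        "type": (desc >> 40) & 0xF,
        "s": (desc >> 44) & 1,
        "dpl": (desc >> 45) & 3,
        "p": (desc >> 47) & 1,
        "avl": (desc >> 52) & 1,
        "l": (desc >> 53) & 1,
        "db": (desc >> 54) & 1,
        "g": (desc >> 55) & 1,
    }


sample = decode_descriptor(0x00CF9B000000FFFF)
for name, value in sample.items():
    print(f"{name:5} = {value:#x}")
```

For the sample value, the decoder reports a base of `0`, a limit of `0xfffff` with `G=1`, a present segment with `DPL=0`, and the `D/B` bit set.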

If the `S` flag of a segment descriptor is set, the descriptor describes either a code or a data segment, otherwise it is a system segment. If the highest order bit of the `Type` flags is clear - this descriptor describes a data segment, otherwise a code segment. Rest of the three bits of a data segment descriptor interpreted as:

- `Accessed` - indicates whether a segment has been accessed since the last time the kernel cleared this bit.
- `Write-Enable` - determines whether a segment is writable or read-only.
- `Expansion-Direction` - determines whether the segment expands down, toward lower addresses, or up, toward higher addresses.

For a code segment, these three bits are interpreted as:

- `Accessed` - indicates whether a segment has been accessed since the last time the kernel cleared this bit.
- `Read-Enable` - determines whether a segment is execute-only or execute-read.
- `Conforming` - determines how privilege level changes are handled when transferring execution to that segment.

The tables below list all possible combinations of these flags for data and code segments.

A data segment `Type` field:

| E (Expand-Down) | W (Writable) | A (Accessed) | Description                       |
| --------------- | ------------ | ------------ | --------------------------------- |
| 0               | 0            | 0            | Read-Only                         |
| 0               | 0            | 1            | Read-Only, accessed               |
| 0               | 1            | 0            | Read/Write                        |
| 0               | 1            | 1            | Read/Write, accessed              |
| 1               | 0            | 0            | Read-Only, expand-down            |
| 1               | 0            | 1            | Read-Only, expand-down, accessed  |
| 1               | 1            | 0            | Read/Write, expand-down           |
| 1               | 1            | 1            | Read/Write, expand-down, accessed |

A code segment `Type` field:

| C (Conforming) | R (Readable) | A (Accessed) | Description                        |
| -------------- | ------------ | ------------ | ---------------------------------- |
| 0              | 0            | 0            | Execute-Only                       |
| 0              | 0            | 1            | Execute-Only, accessed             |
| 0              | 1            | 0            | Execute/Read                       |
| 0              | 1            | 1            | Execute/Read, accessed             |
| 1              | 0            | 0            | Execute-Only, conforming           |
| 1              | 0            | 1            | Execute-Only, conforming, accessed |
| 1              | 1            | 0            | Execute/Read, conforming           |
| 1              | 1            | 1            | Execute/Read, conforming, accessed |
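
The two tables can also be expressed as a small decoding function. This is just an illustrative sketch: `describe_type` is a hypothetical helper, not a kernel function. It assumes that the descriptor has the `S` flag set, so the 4-bit `Type` value describes a code or a data segment.

```C
#include <assert.h>
#include <string.h>

/*
 * Describe the 4-bit Type field of a code or data segment descriptor.
 * Bit 3 selects code (1) or data (0); bits 2..0 are C/R/A for code
 * segments and E/W/A for data segments.
 * The result lives in a static buffer, so the function is not reentrant.
 */
static inline const char *describe_type(unsigned type)
{
	static char buf[48];
	int code = (type & 0x8) != 0;

	assert(type <= 0xF);

	if (code)
		strcpy(buf, (type & 0x2) ? "Execute/Read" : "Execute-Only");
	else
		strcpy(buf, (type & 0x2) ? "Read/Write" : "Read-Only");

	if (type & 0x4)
		strcat(buf, code ? ", conforming" : ", expand-down");
	if (type & 0x1)
		strcat(buf, ", accessed");

	return buf;
}
```

Calling `describe_type(0xA)` yields `Execute/Read`, which matches the third row of the code segment table.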

So far, we've looked at how a segment descriptor defines the properties of a memory segment: its base, limit, type, and different flags. But how does the CPU actually refer to one of these descriptors during execution? Just like in real mode, it uses segment registers. In protected mode, however, a segment register is handled differently: instead of a segment address, it contains a segment selector. Each segment descriptor has an associated segment selector, which is a 16-bit structure:

![segment-selector](./images/segment-selector.svg)

The meaning of the fields is:

- `Index` - the entry number of the descriptor in the descriptor table.
- `TI` - indicates where to search for the descriptor
  - If the value of the bit is `0`, a descriptor will be searched in the Global Descriptor Table.
  - If the value of this bit is `1`, a descriptor will be searched in the Local Descriptor Table.
- `RPL` - the privilege level requested by the selector.
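
As a quick illustration of these fields, here is a minimal sketch that splits a 16-bit selector into its parts and builds one back. The helper names are hypothetical. The layout assumed here is the standard one: `RPL` in bits 0-1, `TI` in bit 2, and `Index` in bits 3-15.

```C
#include <assert.h>
#include <stdint.h>

/* Index: entry number in the descriptor table (bits 3-15). */
static inline unsigned selector_index(uint16_t sel) { return sel >> 3; }

/* TI: 0 selects the GDT, 1 selects the LDT (bit 2). */
static inline unsigned selector_ti(uint16_t sel) { return (sel >> 2) & 1; }

/* RPL: requested privilege level (bits 0-1). */
static inline unsigned selector_rpl(uint16_t sel) { return sel & 3; }

/* Build a selector from its three fields. */
static inline uint16_t make_selector(unsigned index, unsigned ti, unsigned rpl)
{
	assert(index < 8192 && ti <= 1 && rpl <= 3);
	return (uint16_t)((index << 3) | (ti << 2) | rpl);
}
```

For example, the selector `0x10` decodes to `Index = 2`, `TI = 0`, and `RPL = 0`, which means the third entry of the Global Descriptor Table requested with the highest privilege level.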

When a program running in protected mode references memory, the CPU needs to calculate the proper physical address. The following steps are needed to get a physical address in protected mode:

1. A segment selector is loaded into one of the segment registers.
2. The CPU tries to find the associated segment descriptor in the Global Descriptor Table (or in the Local Descriptor Table if the `TI` bit is set) based on the `Index` value from the segment selector. If the descriptor is found, it is loaded into a special hidden part of this segment register.
3. The physical address is the base address from the segment descriptor plus the offset taken from the instruction pointer or from the memory location referenced by the executed instruction.
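
The three steps above can be sketched with a toy model. Everything here is made up for illustration: the `toy_gdt_base` table holds only base addresses instead of full descriptors, limit and privilege checks are skipped, and paging is assumed to be disabled, so the computed address is also the physical one.

```C
#include <assert.h>
#include <stdint.h>

#define TOY_GDT_ENTRIES 4

/* Toy descriptor table: only the base address of each segment is kept. */
static const uint32_t toy_gdt_base[TOY_GDT_ENTRIES] = {
	0x00000000,	/* entry 0: null descriptor */
	0x00000000,	/* entry 1: flat segment starting at 0 */
	0x00100000,	/* entry 2: segment starting at 1 MB */
	0x00200000,	/* entry 3: segment starting at 2 MB */
};

/* Steps 2 and 3: look up the descriptor by Index, then add the offset. */
static inline uint32_t toy_translate(uint16_t selector, uint32_t offset)
{
	unsigned index = selector >> 3;

	assert(index < TOY_GDT_ENTRIES);
	assert(((selector >> 2) & 1) == 0);	/* this sketch handles TI = 0 only */

	return toy_gdt_base[index] + offset;
}
```

With this table, `toy_translate(0x10, 0x1234)` selects entry 2 and returns `0x101234`.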

In the next part, we will see the transition into protected mode. But before the kernel can be switched to protected mode, we need to do some more preparations.

Let's continue from the point where we have stopped in the previous chapter.

## Back to the Kernel: Entering main.c

As we already mentioned at the beginning of this chapter, one of the kernel's first main goals is to switch the processor into protected mode. But before this can happen, the kernel needs to do some preparations.

If we look at the very beginning of the `main` function from the [arch/x86/boot/main.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c), the very first thing we will see is a call to the `init_default_io_ops` function.

This function is defined in the [arch/x86/boot/io.h](https://github.com/torvalds/linux/blob/master/arch/x86/boot/io.h) and looks like:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/io.h#L26-L31 -->
```C
static inline void init_default_io_ops(void)
{
	pio_ops.f_inb  = __inb;
	pio_ops.f_outb = __outb;
	pio_ops.f_outw = __outw;
}
```

This function initializes function pointers for:

- reading a byte from an I/O port
- writing a byte to an I/O port
- writing a word (16-bit) to an I/O port

These callbacks will be used to write data to the serial console, which will be initialized in one of the next steps. All the operations will be executed with the help of the `inb`, `outb`, and `outw` macros, which are defined in the same file:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/io.h#L37-L39 -->
```C
#define inb  pio_ops.f_inb
#define outb pio_ops.f_outb
#define outw pio_ops.f_outw
```
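
To see why this indirection is convenient, here is a minimal user-space sketch of the same pattern. It is not the kernel's code: real port I/O needs privileged instructions, so the callbacks below write to a simulated port array, and all names (`fake_ports`, `io_ops`, `sketch_inb`, `sketch_outb`) are hypothetical.

```C
#include <assert.h>
#include <stdint.h>

/* Simulated I/O port space: one byte per port instead of real hardware. */
static uint8_t fake_ports[65536];

static uint8_t fake_inb(uint16_t port)
{
	return fake_ports[port];
}

static void fake_outb(uint8_t value, uint16_t port)
{
	fake_ports[port] = value;
}

/* Table of callbacks, in the spirit of the kernel's pio_ops. */
struct io_ops_sketch {
	uint8_t (*f_inb)(uint16_t port);
	void    (*f_outb)(uint8_t value, uint16_t port);
};

static struct io_ops_sketch io_ops;

/* Install the callbacks; another backend could be plugged in here. */
static inline void init_fake_io_ops(void)
{
	io_ops.f_inb  = fake_inb;
	io_ops.f_outb = fake_outb;
}

/* Callers use short names that expand to calls through the pointers. */
#define sketch_inb  io_ops.f_inb
#define sketch_outb io_ops.f_outb

/* Write a byte to a port and read it back through the callbacks. */
static inline uint8_t port_roundtrip(uint8_t value, uint16_t port)
{
	assert(io_ops.f_inb && io_ops.f_outb);	/* init must run first */
	sketch_outb(value, port);
	return sketch_inb(port);
}
```

Because callers go through the function pointers, the code that uses `sketch_inb` and `sketch_outb` does not need to change when a different implementation is installed.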

The `__inb`, `__outb`, and `__outw` themselves are inline functions from the [arch/x86/include/asm/shared/io.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/shared/io.h):

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/include/asm/shared/io.h#L7-L24 -->
```C
#define BUILDIO(bwl, bw, type)						\
static __always_inline void __out##bwl(type value, u16 port)		\
{									\
	asm volatile("out" #bwl " %" #bw "0, %w1"			\
		     : : "a"(value), "Nd"(port));			\
}									\
									\
static __always_inline type __in##bwl(u16 port)				\
{									\
	type value;							\
	asm volatile("in" #bwl " %w1, %" #bw "0"			\
		     : "=a"(value) : "Nd"(port));			\
	return value;							\
}

BUILDIO(b, b, u8)
BUILDIO(w, w, u16)
BUILDIO(l,  , u32)
```

All of these functions use `in` and `out` assembly instructions which send the given value to the given port or read the value from the given port. If the syntax is not familiar to you, you can read the chapter about [inline assembly](https://github.com/0xAX/linux-insides/blob/master/Theory/linux-theory-3.md).

After the callbacks for writing to a serial port are initialized, the next step is to copy the kernel setup header filled in by the bootloader into the corresponding field of the C `boot_params` structure. This makes the fields of the kernel setup header more easily accessible. The copying is handled by the `copy_boot_params` function with the help of `memcpy`:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/main.c#L39-L39 -->
```C
	memcpy(&boot_params.hdr, &hdr, sizeof(hdr));
```

Do not confuse this `memcpy` with the function from the C standard library - [memcpy](https://man7.org/linux/man-pages/man3/memcpy.3.html). While the kernel is in the early initialization phase, there is no way to load any library. For this reason, an operating system kernel provides its own implementation of such functions. The kernel's `memcpy` is defined in [copy.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/copy.S). If you have already started to miss assembly code, it is high time to bring some back:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/copy.S#L18-L32 -->
```assembly
SYM_FUNC_START_NOALIGN(memcpy)
	pushw	%si
	pushw	%di
	movw	%ax, %di
	movw	%dx, %si
	pushw	%cx
	shrw	$2, %cx
	rep movsl
	popw	%cx
	andw	$3, %cx
	rep movsb
	popw	%di
	popw	%si
	retl
SYM_FUNC_END(memcpy)
```

First of all, we can see that `memcpy` and the other routines defined there start and end with two macros: `SYM_FUNC_START_NOALIGN` and `SYM_FUNC_END`. The `SYM_FUNC_START_NOALIGN` macro just declares the given symbol name as [.globl](https://sourceware.org/binutils/docs/as.html#Global) to make it visible to other functions. The `SYM_FUNC_END` macro does not emit any instructions; it only marks the end of the function for the assembler.

Although this function is written in assembly language, the implementation of `memcpy` is relatively simple. First, it pushes the values of the `si` and `di` registers onto the stack to preserve them, because they will change during the `memcpy` execution. Next comes the handling of the function's parameters. The parameters of this function are passed through the `ax`, `dx`, and `cx` registers, because the kernel setup code is built with the `-mregparm=3` option. So:

- `ax` will contain the address of `boot_params.hdr`
- `dx` will contain the address of `hdr`
- `cx` will contain the size of `hdr` in bytes

The `rep movsl` instruction copies data from the memory pointed to by the `si` register to the memory pointed to by the `di` register, 4 bytes per iteration. For this reason, the size of the setup header is first divided by 4 using the `shrw` instruction. After this step, `rep movsb` copies the remaining bytes (the size modulo 4) one at a time.
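
The same two-phase copy can be written in C. This is only a rough equivalent of the assembly routine for illustration, not how the kernel implements it; the name `memcpy_sketch` is made up.

```C
#include <assert.h>
#include <stddef.h>

/* Copy `len` bytes: first in 4-byte chunks, then the remaining 0-3 bytes. */
static inline void *memcpy_sketch(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t chunks = len >> 2;	/* like: shrw $2, %cx */
	size_t tail = len & 3;		/* like: andw $3, %cx */

	assert(dst != NULL && src != NULL);

	while (chunks--) {		/* like: rep movsl, 4 bytes per step */
		d[0] = s[0];
		d[1] = s[1];
		d[2] = s[2];
		d[3] = s[3];
		d += 4;
		s += 4;
	}

	while (tail--)			/* like: rep movsb, 1 byte per step */
		*d++ = *s++;

	return dst;
}
```

For a length of 11 bytes, the first loop runs twice and copies 8 bytes, and the second loop copies the remaining 3 bytes.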

From this point, the setup header is copied into a proper place and we can move forward.

### Console initialization

As soon as the kernel setup header is copied into the `boot_params.hdr`, the next step is to initialize the serial console by calling the `console_init` function. Very soon we will be able to print something from within the kernel code!

The `console_init` function is defined in [arch/x86/boot/early_serial_console.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/early_serial_console.c). As the very first step, it tries to find the `earlyprintk` option in the kernel's command line. If the search is successful, it parses the port address and [baud rate](https://en.wikipedia.org/wiki/Baud) and initializes the serial port.

> [!NOTE]
> If you want to know what other options you can pass in the kernel command line, you can find more information in the [The kernel's command-line parameters](https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/kernel-parameters.rst) document.

Let's take a look at these two steps in detail.

The possible values of the `earlyprintk` command line option are:

- `serial,0x3f8,115200`
- `serial,ttyS0,115200`
- `ttyS0,115200`

These parameters define the name of a serial port, the port number, and the [baud](https://en.wikipedia.org/wiki/Baud) rate.

The pointer to the kernel command line is stored in the kernel setup header that was copied in the previous section. The kernel setup code accesses it using `boot_params.hdr.cmd_line_ptr`. The `parse_earlyprintk` function tries to find the `earlyprintk` option in the kernel command line, parse it, and initialize the serial console with the given parameters. If the `earlyprintk` option is given and contains valid values, the initialization of the serial console takes place in the `early_serial_init` function. There is nothing specific to the Linux kernel in the initialization of a serial console, so we will skip this part. If you want to dive deeper, you can find more information [here](https://wiki.osdev.org/Serial_Ports#Port_Addresses) and learn [arch/x86/boot/early_serial_console.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/early_serial_console.c) step by step.
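
To get a feeling for what such parsing involves, here is a simplified sketch that handles the three forms of the option listed above. It is not the kernel's `parse_earlyprintk`: the function name, the structure, and the constants are hypothetical, and the port addresses `0x3f8` and `0x2f8` for `ttyS0` and `ttyS1` as well as the fallback baud rate are assumptions of this sketch.

```C
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumptions of this sketch: conventional COM1/COM2 bases and a fallback baud rate. */
#define SKETCH_COM1         0x3f8
#define SKETCH_COM2         0x2f8
#define SKETCH_DEFAULT_BAUD 9600

struct serial_cfg {
	unsigned port;	/* I/O port base, 0 if the value was not recognized */
	unsigned baud;
};

/* Parse the value of the option, for example "serial,0x3f8,115200". */
static inline struct serial_cfg parse_earlyprintk_sketch(const char *arg)
{
	struct serial_cfg cfg = { 0, SKETCH_DEFAULT_BAUD };
	char *end;

	assert(arg != NULL);

	if (strncmp(arg, "serial,", 7) == 0)
		arg += 7;				/* optional "serial," prefix */

	if (strncmp(arg, "0x", 2) == 0) {		/* explicit port address */
		cfg.port = (unsigned)strtoul(arg, &end, 16);
		arg = end;
	} else if (strncmp(arg, "ttyS", 4) == 0) {	/* port name */
		cfg.port = (arg[4] == '1') ? SKETCH_COM2 : SKETCH_COM1;
		arg += 4;
		while (*arg >= '0' && *arg <= '9')
			arg++;
	} else {
		return cfg;				/* not a serial console */
	}

	if (*arg == ',') {				/* optional baud rate */
		unsigned long baud = strtoul(arg + 1, &end, 0);

		if (end != arg + 1 && baud != 0)
			cfg.baud = (unsigned)baud;
	}

	return cfg;
}
```

With these assumptions, both `serial,0x3f8,115200` and `ttyS0,115200` produce port `0x3f8` and a baud rate of `115200`.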

After the serial port initialization we can see the first output:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/main.c#L142-L143 -->
```C
	if (cmdline_find_option_bool("debug"))
		puts("early console in setup code\n");
```

When the serial console is enabled, the `puts` function relies on the `inb` and `outb` callbacks that we have seen above during the initialization of the I/O callbacks.

From this point we can print messages from the kernel setup code 🎉. Time to move to the next step.

### Heap initialization

We have seen the initialization of the `stack` and `bss` memory areas in the previous chapter. The next step is to initialize the [heap](https://en.wikipedia.org/wiki/Memory_management#HEAP) memory area. The heap initialization takes place in the `init_heap` function:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/main.c#L118-L131 -->
```C
static void init_heap(void)
{
	char *stack_end;

	if (boot_params.hdr.loadflags & CAN_USE_HEAP) {
		stack_end = (char *) (current_stack_pointer - STACK_SIZE);
		heap_end = (char *) ((size_t)boot_params.hdr.heap_end_ptr + 0x200);
		if (heap_end > stack_end)
			heap_end = stack_end;
	} else {
		/* Boot protocol 2.00 only, no heap available */
		puts("WARNING: Ancient bootloader, some functionality may be limited!\n");
	}
}
```

First of all, `init_heap` checks the `CAN_USE_HEAP` flag from the kernel setup header. We can find information about this flag in the kernel boot protocol:

>   Bit 7 (write): CAN_USE_HEAP
>
>	Set this bit to 1 to indicate that the value entered in the
>	heap_end_ptr is valid.  If this field is clear, some setup code
>	functionality will be disabled.

If this bit is not set, we'll see the warning message. Otherwise, the heap memory area is initialized. The beginning of the heap is defined by the `HEAP` pointer, which points to the end of the kernel setup image:

```C
char *HEAP = _end;
```

Now we need to initialize the size of the heap. There is another small hint in the Linux kernel boot protocol:

> ============	==================
> Field name:	heap_end_ptr
> Type:		write (obligatory)
> Offset/size:	0x224/2
> Protocol:	2.01+
> ============	==================
>
>  Set this field to the offset (from the beginning of the real-mode
>  code) of the end of the setup stack/heap, minus 0x0200.

The GRUB bootloader sets this value to:

```C
#define GRUB_LINUX_HEAP_END_OFFSET	(0x9000 - 0x200)
```

Based on these values, the end of the heap pointed to by `heap_end` will be at offset `0x9000` from the beginning of the kernel setup code. To avoid an overlap of the heap and the stack, there is an additional check: it sets the end of the heap to the end of the stack if the former is greater than the latter. As a result, the heap memory area is located above the `bss` area and extends up to the stack. So, the memory map will look like:

![early-heap](./images/early-heap.svg)
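
The calculation of the heap end can be reproduced with plain numbers. The sketch below mirrors the logic of `init_heap`, but takes the stack pointer and the stack size as ordinary parameters. The function name and the sample values used afterwards are assumptions made for illustration.

```C
#include <assert.h>
#include <stdint.h>

/*
 * Compute the end of the heap:
 *   heap_end_ptr - value of the header field set by the bootloader
 *   sp           - current stack pointer
 *   stack_size   - amount of memory reserved for the stack below sp
 */
static inline uint32_t heap_end_sketch(uint16_t heap_end_ptr, uint32_t sp,
				       uint32_t stack_size)
{
	uint32_t stack_end, heap_end;

	assert(sp >= stack_size);		/* avoid wrapping around zero */

	stack_end = sp - stack_size;
	heap_end = (uint32_t)heap_end_ptr + 0x200;

	if (heap_end > stack_end)		/* the heap must not reach the stack */
		heap_end = stack_end;

	return heap_end;
}
```

With `heap_end_ptr` equal to `0x9000 - 0x200`, a stack pointer of `0xfffc`, and a stack size of `1024` bytes, the result is `0x9000`. If the stack pointer were `0x9000` instead, the result would be clamped to `0x8c00`.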

Now the heap is initialized, although we will only see it used in the next chapters.

### CPU validation

The next step is the validation of the CPU on which the kernel is running. The kernel has to do this to make sure that all the required functionality will work correctly on the given CPU.

The `validate_cpu` function from [arch/x86/boot/cpu.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/cpu.c) validates the CPU. This function calls [`check_cpu`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/cpucheck.c), which checks the CPU model and its flags using the [cpuid](https://en.wikipedia.org/wiki/CPUID) instruction. It checks CPU flags such as the presence of [long mode](http://en.wikipedia.org/wiki/Long_mode), checks the processor's vendor, and makes vendor-specific preparations, such as turning on extensions like [SSE+SSE2](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data):

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/cpu.c#L60-L73 -->
```C
int validate_cpu(void)
{
	u32 *err_flags;
	int cpu_level, req_level;

	check_cpu(&cpu_level, &req_level, &err_flags);

	if (cpu_level < req_level) {
		printf("This kernel requires an %s CPU, ",
		       cpu_name(req_level));
		printf("but only detected an %s CPU.\n",
		       cpu_name(cpu_level));
		return -1;
	}
```

If the level of the CPU is lower than the required level specified by the `CONFIG_X86_MINIMUM_CPU_FAMILY` kernel configuration option, the function returns an error and the kernel setup process is aborted.
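
The two kinds of checks mentioned above, a feature flag test and a level comparison, can be sketched as follows. This is not the kernel's `check_cpu`. The helper names are hypothetical, the `cpuid` result is passed in as a plain number instead of being read from the CPU, and the only hardware fact relied on is that long mode support is reported in bit 29 of `edx` for `cpuid` leaf `0x80000001`.

```C
#include <assert.h>
#include <stdint.h>

/* Long mode support: bit 29 of EDX returned by CPUID leaf 0x80000001. */
#define SKETCH_LONG_MODE_BIT (1u << 29)

/* Test the long mode flag in an already obtained EDX value. */
static inline int edx_has_long_mode(uint32_t ext_edx)
{
	return (ext_edx & SKETCH_LONG_MODE_BIT) != 0;
}

/* Same comparison as in validate_cpu: 0 on success, -1 if the CPU is too old. */
static inline int check_levels_sketch(int cpu_level, int req_level)
{
	assert(cpu_level >= 0 && req_level >= 0);
	return (cpu_level < req_level) ? -1 : 0;
}
```

For instance, `check_levels_sketch(6, 64)` returns `-1`; the level numbers here are only illustrative values.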

### Memory detection

Once the kernel is sure that the CPU it is running on is suitable, the next stage is to detect the available memory in the system. This task is handled by the `detect_memory` function, which queries the system firmware to obtain a map of physical memory regions. To do this, the kernel uses the special BIOS service `0xE820`, but it can fall back to legacy BIOS services like `0xE801` or `0x88`. In this chapter, we will see only the implementation of the `0xE820` interface.

The `detect_memory` function is defined in the [arch/x86/boot/memory.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/memory.c) and, as just mentioned, tries to get the information about available memory:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/memory.c#L116-L123 -->
```C
void detect_memory(void)
{
	detect_memory_e820();

	detect_memory_e801();

	detect_memory_88();
}
```

Let's look at the crucial part of the implementation of the `detect_memory_e820` function. First of all, the `detect_memory_e820` function initializes the `biosregs` structure with the special values related to the `0xE820` BIOS interface:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/memory.c#L25-L29 -->
```C
	initregs(&ireg);
	ireg.ax  = 0xe820;
	ireg.cx  = sizeof(buf);
	ireg.edx = SMAP;
	ireg.di  = (size_t)&buf;
```

- `ax` register contains the number of the BIOS service
- `cx` register contains the size of the buffer which will contain the data about available memory
- `di` register contains the address of the buffer which will contain memory data
- `edx` register contains the `SMAP` magic number

After registers are filled with the needed values, the kernel can ask the `0xE820` BIOS interface about the available memory. To do so, the kernel invokes `0x15` [BIOS interrupt](https://en.wikipedia.org/wiki/BIOS_interrupt_call), which returns information about one memory region. The kernel repeats this operation in a loop until it collects information about all available memory regions into the array of `boot_e820_entry` structures. This structure contains information about:

- beginning address of the memory region
- size of the memory region
- type of the memory region

The structure is defined in [arch/x86/include/uapi/asm/setup_data.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/uapi/asm/setup_data.h):

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/include/uapi/asm/setup_data.h#L45-L49 -->
```C
struct boot_e820_entry {
	__u64 addr;
	__u64 size;
	__u32 type;
} __attribute__((packed));
```
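
The "ask for one region, repeat until done" loop can be modeled without a real BIOS. In the sketch below, `fake_e820_call` stands in for one invocation of the `0x15` interrupt: it returns one entry and a continuation value, where `0` means that there are no more entries. The firmware table, all names, and the use of a non-zero return value to model a failed call are assumptions of this sketch.

```C
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct e820_entry_sketch {
	uint64_t addr;
	uint64_t size;
	uint32_t type;
};

/* Made-up firmware table standing in for what a BIOS would report. */
static const struct e820_entry_sketch fake_firmware_map[] = {
	{ 0x0000000000000000ULL, 0x000000000009fc00ULL, 1 },
	{ 0x000000000009fc00ULL, 0x0000000000000400ULL, 2 },
	{ 0x0000000000100000ULL, 0x000000003fee0000ULL, 1 },
};
#define FAKE_MAP_LEN (sizeof(fake_firmware_map) / sizeof(fake_firmware_map[0]))

/* Destination table, playing the role of the array in boot_params. */
static struct e820_entry_sketch sketch_table[8];

/* One simulated call: fill *out and update the continuation value *cont. */
static int fake_e820_call(uint32_t *cont, struct e820_entry_sketch *out)
{
	if (*cont >= FAKE_MAP_LEN)
		return -1;			/* models a failed BIOS call */

	*out = fake_firmware_map[*cont];
	*cont = (*cont + 1 < FAKE_MAP_LEN) ? *cont + 1 : 0;
	return 0;
}

/* Collect up to `max` entries into `table`; returns how many were stored. */
static inline size_t collect_e820_sketch(struct e820_entry_sketch *table, size_t max)
{
	uint32_t cont = 0;
	size_t count = 0;

	assert(table != NULL && max > 0);

	do {
		if (fake_e820_call(&cont, &table[count]) != 0)
			break;
		count++;
	} while (cont != 0 && count < max);

	return count;
}
```

The loop stops either when the simulated firmware reports that there are no more entries or when the destination table is full.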

After the information is collected, the kernel prints a message about the available memory regions. You can find it in the [dmesg](https://en.wikipedia.org/wiki/Dmesg) output:

```
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003ffdffff] usable
[    0.000000] BIOS-e820: [mem 0x000000003ffe0000-0x000000003fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
```
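
Each line of this output corresponds to one entry with a start address, a size, and a type. As a small worked example, the sketch below stores the regions from the output above in an array and sums up the usable ones. The structure and function names are hypothetical, and the type values `1` for usable and `2` for reserved regions are the conventional E820 values.

```C
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REGION_USABLE   1
#define REGION_RESERVED 2

struct region_sketch {
	uint64_t addr;
	uint64_t size;
	uint32_t type;
};

/* The regions from the dmesg output above; size = last address - first + 1. */
static const struct region_sketch sample_map[] = {
	{ 0x00000000ULL, 0x0009fc00ULL, REGION_USABLE },
	{ 0x0009fc00ULL, 0x00000400ULL, REGION_RESERVED },
	{ 0x000f0000ULL, 0x00010000ULL, REGION_RESERVED },
	{ 0x00100000ULL, 0x3fee0000ULL, REGION_USABLE },
	{ 0x3ffe0000ULL, 0x00020000ULL, REGION_RESERVED },
	{ 0xfffc0000ULL, 0x00040000ULL, REGION_RESERVED },
};
#define SAMPLE_MAP_LEN (sizeof(sample_map) / sizeof(sample_map[0]))

/* Address of the last byte of a region, as printed in the log. */
static inline uint64_t region_last_byte(const struct region_sketch *r)
{
	assert(r != NULL && r->size > 0);
	return r->addr + r->size - 1;
}

/* Sum of the sizes of the first `n` regions that are marked as usable. */
static inline uint64_t usable_bytes(const struct region_sketch *map, size_t n)
{
	uint64_t total = 0;
	size_t i;

	assert(map != NULL);

	for (i = 0; i < n; i++)
		if (map[i].type == REGION_USABLE)
			total += map[i].size;

	return total;
}
```

For this map, the two usable regions add up to `0x3ff7fc00` bytes, which is slightly less than 1 GB.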

### Keyboard initialization

Once memory detection is complete, the kernel proceeds with initializing the keyboard using the `keyboard_init` function:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/main.c#L64-L76 -->
```C
static void keyboard_init(void)
{
	struct biosregs ireg, oreg;

	initregs(&ireg);

	ireg.ah = 0x02;		/* Get keyboard status */
	intcall(0x16, &ireg, &oreg);
	boot_params.kbd_status = oreg.al;

	ireg.ax = 0x0305;	/* Set keyboard repeat rate */
	intcall(0x16, &ireg, NULL);
}
```

This function performs two tasks using [BIOS interrupt](https://en.wikipedia.org/wiki/BIOS_interrupt_call) `0x16`:

1. Gets the state of the keyboard, which contains information about the state of certain modifier keys, for example whether Caps Lock is active.
2. Sets the keyboard repeat rate, which determines how long a key must be held down before it begins repeating.

After the BIOS interrupts are executed, the keyboard should be initialized. If you are wondering why we need a working keyboard at such an early stage, the answer is that it can be used during the selection of the video mode. We will see more details in the [next chapter](linux-bootstrap-3.md).
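
The keyboard status byte saved in `boot_params.kbd_status` is a set of bit flags. The sketch below shows how such a byte could be decoded. The enum and the function are hypothetical names; the bit assignments follow the conventional layout of the flags returned in `al` by BIOS interrupt `0x16` with `ah = 0x02`.

```C
#include <assert.h>
#include <stdint.h>

/* Bits of the status byte returned in AL by INT 0x16, AH = 0x02. */
enum kbd_flag_sketch {
	KBD_RIGHT_SHIFT = 1 << 0,
	KBD_LEFT_SHIFT  = 1 << 1,
	KBD_CTRL        = 1 << 2,
	KBD_ALT         = 1 << 3,
	KBD_SCROLL_LOCK = 1 << 4,
	KBD_NUM_LOCK    = 1 << 5,
	KBD_CAPS_LOCK   = 1 << 6,
	KBD_INSERT      = 1 << 7,
};

/* Return 1 if the given single flag is set in the status byte, 0 otherwise. */
static inline int kbd_flag_set(uint8_t status, enum kbd_flag_sketch flag)
{
	assert(flag != 0 && (flag & (flag - 1)) == 0);	/* exactly one bit */
	return (status & flag) != 0;
}
```

For example, a status byte of `0x40` means that Caps Lock is active and all other flags are clear.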

### Gathering system information

After we went through the most essential hardware interfaces like the CPU, I/O, memory map, and keyboard, the next couple of steps query the BIOS for additional information about the system. The information that the kernel is going to gather is not strictly required for entering protected mode, but it provides useful details that later parts of the kernel may rely on.

The following information is going to be collected:

- Information about [Intel SpeedStep](http://en.wikipedia.org/wiki/SpeedStep)
- Information about [Advanced Power Management](http://en.wikipedia.org/wiki/Advanced_Power_Management)
- Information about [Enhanced Disk Drive](https://en.wikipedia.org/wiki/INT_13H)

At this point we will not dive into the details of each of these queries, but we will get back to them in the next parts when we use this information. For now, let's just take a short look at these functions:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/main.c#L163-L174 -->
```C
	/* Query Intel SpeedStep (IST) information */
	query_ist();

	/* Query APM information */
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
	query_apm_bios();
#endif

	/* Query EDD information */
#if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE)
	query_edd();
#endif
```

The first one gets information about [Intel SpeedStep](http://en.wikipedia.org/wiki/SpeedStep). This information is obtained by calling the `0x15` BIOS interrupt, and the result is stored in the `boot_params` structure. The returned information describes the support of Intel SpeedStep and the settings around it. If it is supported, this information will be passed later by the kernel to the power management subsystems.

The next one gets information about [Advanced Power Management](http://en.wikipedia.org/wiki/Advanced_Power_Management). The logic of this function is pretty similar to the one described above. It uses the same `0x15` BIOS interrupt to obtain the information and stores it in the `boot_params` structure. The returned information describes the support of `APM`, which was the power management subsystem before [ACPI](https://en.wikipedia.org/wiki/ACPI) became the standard.

The last function gets information about the `Enhanced Disk Drive` from the BIOS. This time, the `0x13` BIOS interrupt is used to obtain the information. The returned information describes the disks and their characteristics like geometry and mapping information.

## Conclusion

This is the end of the second part about Linux kernel insides. If you have questions or suggestions, feel free to ping me on X - [0xAX](https://twitter.com/0xAX), drop me an [email](mailto:anotherworldofworld@gmail.com), or just create an [issue](https://github.com/0xAX/linux-insides/issues/new). In the next part, we will continue to deal with the preparations before transitioning into protected mode and the transition itself.

## Links

Here is the list of the links that you may find useful during reading of this chapter:

- [Protected mode](http://en.wikipedia.org/wiki/Protected_mode)
- [Long mode](http://en.wikipedia.org/wiki/Long_mode)
- [The kernel's command-line parameters](https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/kernel-parameters.rst)
- [Linux serial console](https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/serial-console.rst)
- [BIOS interrupt](https://en.wikipedia.org/wiki/BIOS_interrupt_call)
- [Intel SpeedStep](http://en.wikipedia.org/wiki/SpeedStep)
- [APM](https://en.wikipedia.org/wiki/Advanced_Power_Management)
- [EDD](https://en.wikipedia.org/wiki/Enhanced_Disk_Drive)
- [Previous part](linux-bootstrap-1.md)


================================================
FILE: Booting/linux-bootstrap-3.md
================================================
# Kernel booting process. Part 3

In the previous [part](./linux-bootstrap-2.md), we saw the first pieces of C code that run in the Linux kernel. One of the main goals of this stage is to switch into [protected mode](https://en.wikipedia.org/wiki/Protected_mode), but before this, we saw some early setup code that executes early initialization procedures, such as:

- Setup of console to be able to print messages from the kernel's setup code
- Validation of CPU
- Detection of available memory
- Initialization of keyboard
- Platform information

In this part, we continue to explore the next steps before transitioning to protected mode.

## Video mode setup

Previously, we stopped right at the point where the kernel setup code was about to initialize the video mode. 

The setup code is located in [arch/x86/boot/video.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/video.c) and implemented by the `set_video` function. Now let's take a look at the implementation of the `set_video` function:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/video.c#L317-L343 -->
```C
void set_video(void)
{
	u16 mode = boot_params.hdr.vid_mode;

	RESET_HEAP();

	store_mode_params();
	save_screen();
	probe_cards(0);

	for (;;) {
		if (mode == ASK_VGA)
			mode = mode_menu();

		if (!set_mode(mode))
			break;

		printf("Undefined video mode number: %x\n", mode);
		mode = ASK_VGA;
	}
	boot_params.hdr.vid_mode = mode;
	vesa_store_edid();
	store_mode_params();

	if (do_restore)
		restore_screen();
}
```

In the next section, let's try to understand what a video mode is and how this function initializes it.

### Video modes

A video mode is a predefined configuration of a screen that tells the video hardware information about:

- resolution
- color depth
- text or graphic mode

The next goal of the kernel is to collect this information and initialize a suitable video mode. This allows the kernel to use a special API to print messages on the screen.

The implementation of the `set_video` function starts by getting the video mode from the `boot_params.hdr` structure:

```C
u16 mode = boot_params.hdr.vid_mode;
```

> [!NOTE] 
> Instead of the good old standard C data types like `int`, `short`, and `unsigned short`, the Linux kernel provides its own data types for numeric values. Here is a table that will help you remember their sizes in bytes (the size of `long` is given for 64-bit builds):
>
> | Type         | char | short | int | long | u8 | u16 | u32 | u64 |
> |--------------|------|-------|-----|------|----|-----|-----|-----|
> | Size (bytes) |  1   |   2   |  4  |   8  |  1 |  2  |  4  |  8  |

The initial value of the video mode can be filled in by the bootloader. This header field is defined in the Linux kernel boot protocol:

```
Offset	Proto	Name		Meaning
/Size
01FA/2	ALL	    vid_mode	Video mode control
```

Information about potential values for this field can also be found in the Linux kernel boot protocol document:

```
vga=<mode>
	<mode> here is either an integer (in C notation, either
	decimal, octal, or hexadecimal) or one of the strings
	"normal" (meaning 0xFFFF), "ext" (meaning 0xFFFE) or "ask"
	(meaning 0xFFFD). This value should be entered into the
	vid_mode field, as it is used by the kernel before the command
	line is parsed.
```

This tells us that we can add the `vga` option to the kernel's command line. As mentioned in the description above, this option can have different values. For example, it can be the integer `0xFFFD` or the string `ask`. If you pass `ask` to `vga`, you will see a menu with the possible video modes. We can test it using the [QEMU](https://www.qemu.org/) virtual machine as we did in the previous chapters:

```bash
sudo qemu-system-x86_64 -kernel ./linux/arch/x86/boot/bzImage                \
                        -nographic                                           \
                        -append "console=ttyS0 nokaslr vga=ask"              \
                        -initrd /boot/initramfs-6.17.0-rc3-g1b237f190eb3.img 
```

If you did everything correctly, after the kernel is loaded it will ask you to press `ENTER`. After pressing it, you should see something like this:

```
Booting from ROM...
Probing EDD (edd=off to disable)... ok
Press <ENTER> to see video modes available, <SPACE> to continue, or wait 30 sec
Mode: Resolution:  Type: Mode: Resolution:  Type: Mode: Resolution:  Type: 
0 F00   80x25      VGA   1 F01   80x50      VGA   2 F02   80x43      VGA   
3 F03   80x28      VGA   4 F05   80x30      VGA   5 F06   80x34      VGA   
6 F07   80x60      VGA   7 340  320x200x32  VESA  8 341  640x400x32  VESA  
9 342  640x480x32  VESA  a 343  800x600x32  VESA  b 344 1024x768x32  VESA  
c 345 1280x1024x32 VESA  d 347 1600x1200x32 VESA  e 34C 1152x864x32  VESA  
f 377 1280x768x32  VESA  g 37A 1280x800x32  VESA  h 37D 1280x960x32  VESA  
i 380 1440x900x32  VESA  j 383 1400x1050x32 VESA  k 386 1680x1050x32 VESA  
l 389 1920x1200x32 VESA  m 38C 2560x1600x32 VESA  n 38F 1280x720x32  VESA  
o 392 1920x1080x32 VESA  p 300  640x400x8   VESA  q 301  640x480x8   VESA  
r 303  800x600x8   VESA  s 305 1024x768x8   VESA  t 307 1280x1024x8  VESA  
u 30D  320x200x15  VESA  v 30E  320x200x16  VESA  w 30F  320x200x24  VESA  
x 310  640x480x15  VESA  y 311  640x480x16  VESA  z 312  640x480x24  VESA  
  313  800x600x15  VESA    314  800x600x16  VESA    315  800x600x24  VESA  
  316 1024x768x15  VESA    317 1024x768x16  VESA    318 1024x768x24  VESA  
  319 1280x1024x15 VESA    31A 1280x1024x16 VESA    31B 1280x1024x24 VESA  
  31C 1600x1200x8  VESA    31D 1600x1200x15 VESA    31E 1600x1200x16 VESA  
  31F 1600x1200x24 VESA    346  320x200x8   VESA    348 1152x864x8   VESA  
  349 1152x864x15  VESA    34A 1152x864x16  VESA    34B 1152x864x24  VESA  
  375 1280x768x16  VESA    376 1280x768x24  VESA    378 1280x800x16  VESA  
  379 1280x800x24  VESA    37B 1280x960x16  VESA    37C 1280x960x24  VESA  
  37E 1440x900x16  VESA    37F 1440x900x24  VESA    381 1400x1050x16 VESA  
  382 1400x1050x24 VESA    384 1680x1050x16 VESA    385 1680x1050x24 VESA  
  387 1920x1200x16 VESA    388 1920x1200x24 VESA    38A 2560x1600x16 VESA  
  38B 2560x1600x24 VESA    38D 1280x720x16  VESA    38E 1280x720x24  VESA  
  390 1920x1080x16 VESA    391 1920x1080x24 VESA    393 1600x900x16  VESA  
  394 1600x900x24  VESA    395 1600x900x32  VESA    396 2560x1440x16 VESA  
  397 2560x1440x24 VESA    398 2560x1440x32 VESA    399 3840x2160x16 VESA  
  200   40x25      VESA    201   40x25      VESA    202   80x25      VESA  
  203   80x25      VESA    207   80x25      VESA    213  320x200x8   VESA  
Enter a video mode or "scan" to scan for additional modes: 
```
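
Returning to the values of the `vga` option described in the boot protocol excerpt above, the mapping from the option string to a mode number can be sketched in a few lines. This is an illustration only, not the code a bootloader or the kernel actually uses, and the function and macro names are made up.

```C
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Special values from the boot protocol description of vga=<mode>. */
#define SKETCH_VID_NORMAL 0xFFFF
#define SKETCH_VID_EXT    0xFFFE
#define SKETCH_VID_ASK    0xFFFD

/* Convert the value of the vga= option into a 16-bit mode number. */
static inline uint16_t parse_vga_option_sketch(const char *value)
{
	assert(value != NULL);

	if (strcmp(value, "normal") == 0)
		return SKETCH_VID_NORMAL;
	if (strcmp(value, "ext") == 0)
		return SKETCH_VID_EXT;
	if (strcmp(value, "ask") == 0)
		return SKETCH_VID_ASK;

	/* Base 0 accepts decimal, octal (0...), and hexadecimal (0x...) input. */
	return (uint16_t)strtoul(value, NULL, 0);
}
```

With this sketch, `vga=ask` maps to `0xFFFD`, while `vga=0x317` and `vga=791` both map to the same mode number `0x317`.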

### Early heap API

Before proceeding further to investigate what the `set_video` function does, it will be useful to take a look at the API for managing the kernel's early heap.

After getting the video mode set by the bootloader, `set_video` resets the heap with the `RESET_HEAP` macro. This macro is defined in [arch/x86/boot/boot.h](https://github.com/torvalds/linux/blob/master/arch/x86/boot/boot.h):

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/boot.h#L174-L174 -->
```C
#define RESET_HEAP() ((void *)( HEAP = _end ))
```

If you have read [part 2](./linux-bootstrap-2.md#kernel-booting-process-part-2), you should remember the initialization of the heap memory area. This memory area starts right after the end of [BSS](https://en.wikipedia.org/wiki/.bss) and lasts till the stack.

The kernel setup code provides a couple of utility macros and functions for managing the early heap. Let's take a look at some of them, especially at those relevant for this chapter.

The `RESET_HEAP` macro resets the heap by setting the `HEAP` variable to `_end`, which represents the end of the early setup kernel's image, including the early code, data, and BSS memory areas. By doing this, we set the heap pointer to the very beginning of the heap.

The next useful macro is:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/boot.h#L184-L185 -->
```C
#define GET_HEAP(type, n) \
	((type *)__get_heap(sizeof(type),__alignof__(type),(n)))
```

The goal of this macro is to allocate memory on the early heap. This macro calls the `__get_heap` function from the same header file with the following parameters:

- Size of the data type to allocate on the heap
- Alignment of the allocated memory area
- Number of items of that type to allocate

The implementation of `__get_heap` is:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/boot.h#L175-L183 -->
```C
static inline char *__get_heap(size_t s, size_t a, size_t n)
{
	char *tmp;

	HEAP = (char *)(((size_t)HEAP+(a-1)) & ~(a-1));
	tmp = HEAP;
	HEAP += s*n;
	return tmp;
}
```

Let's try to understand how the `__get_heap` function works. First of all, the `HEAP` pointer is rounded up to the next [aligned](https://en.wikipedia.org/wiki/Data_structure_alignment) address. The alignment is the one required by the data type for which we want to allocate memory (the `__alignof__(type)` value passed by `GET_HEAP`), not its size. After we have the aligned address, we remember it as the result and move the `HEAP` pointer forward by the requested size, `s*n` bytes.
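The rounding expression is easier to follow with concrete numbers. Here is a minimal userspace sketch of the same arithmetic; the `demo_align_up` name is made up for this illustration and does not exist in the kernel:

```C
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Round addr up to the next multiple of align.
 * The trick only works when align is a non-zero power of two,
 * which is always the case for values produced by __alignof__.
 */
static uintptr_t demo_align_up(uintptr_t addr, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + (align - 1)) & ~((uintptr_t)align - 1);
}
```

For example, `demo_align_up(0x1001, 4)` gives `0x1004`, while an already aligned address such as `0x1004` is returned unchanged.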

The last early heap helper that we will look at is the `heap_free` function, which checks whether the given amount of memory is available on the heap:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/boot.h#L187-L190 -->
```C
static inline bool heap_free(size_t n)
{
	return (int)(heap_end-HEAP) >= (int)n;
}
```

As you can see, the implementation of this function is pretty trivial. It subtracts the current value of the heap pointer from `heap_end`, the address that represents the end of the heap memory area, and compares the result with `n`. The function returns `true` if at least `n` bytes are still available, or `false` otherwise.
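To see how the three helpers fit together, here is a small userspace sketch of the same bump-allocator idea. It is only an illustration: a static array plays the role of the memory between `_end` and `heap_end`, and all `demo_*` names are hypothetical.

```C
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the region between _end and heap_end. */
static char demo_arena[256];
static char *demo_heap = demo_arena;
static char *const demo_heap_end = demo_arena + sizeof(demo_arena);

/* Counterpart of RESET_HEAP(): forget every earlier allocation. */
static void demo_reset_heap(void)
{
	demo_heap = demo_arena;
}

/* Counterpart of heap_free(): is there room for n more bytes? */
static bool demo_heap_free(size_t n)
{
	return (size_t)(demo_heap_end - demo_heap) >= n;
}

/* Counterpart of __get_heap(): align the pointer, then bump it by s * n. */
static char *demo_get_heap(size_t s, size_t a, size_t n)
{
	char *tmp;

	assert(a != 0 && (a & (a - 1)) == 0);
	demo_heap = (char *)(((uintptr_t)demo_heap + (a - 1)) & ~((uintptr_t)a - 1));
	tmp = demo_heap;
	demo_heap += s * n;
	return tmp;
}
```

Note that, like the original, the allocation function itself does not check for overflow. The caller is expected to ask `demo_heap_free` first, and there is no way to release a single allocation: the only way to free memory is to reset the whole heap.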

### Return to the setup of the video mode

Since the kernel initialized the heap and the heap pointer is in the right place, we can move directly to video mode initialization.

The first step of the video mode initialization is the `store_mode_params` function, which stores the parameters of the current video mode in `boot_params.screen_info`. This structure is defined in the [include/uapi/linux/screen_info.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/screen_info.h) header file and provides basic information about the screen and video mode:

- The current position of the cursor
- The BIOS video mode
- The number of text rows and columns

The `store_mode_params` function asks the BIOS services about this information and stores it in this structure for later usage.

The next step is saving the current contents of the screen to the heap by calling the `save_screen` function. This function collects all the data that we got in the previous functions (like rows and columns) and stores it in the `saved_screen` structure, which is defined as:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/video.c#L233-L237 -->
```C
static struct saved_screen {
	int x, y;
	int curx, cury;
	u16 *data;
} saved;
```

After the contents of the screen are saved, the next step is to collect the video modes currently available in the system. This job is done by the `probe_cards` function defined in [arch/x86/boot/video-mode.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/video-mode.c). It goes over all `video_cards` and collects the information about them:

```C
for (card = video_cards; card < video_cards_end; card++) {
  /* collecting the number of video modes */
}
```

The `video_cards` is an array defined as:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/video.h#L81-L82 -->
```C
#define __videocard struct card_info __section(".videocards") __attribute__((used))
extern struct card_info video_cards[], video_cards_end[];
```

The `__videocard` macro makes it possible to define structures that describe video cards, and the linker puts all of them into the `video_cards` array. An example of such a structure can be found in [arch/x86/boot/video-vga.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/video-vga.c):

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/video-vga.c#L282-L286 -->
```C
static __videocard video_vga = {
	.card_name	= "VGA",
	.probe		= vga_probe,
	.set_mode	= vga_set_mode,
};
```

After the `probe_cards` function is executed, we have a set of structures in our `video_cards` array, along with the known number of video modes they support. At the next step, the kernel setup code prints a menu with available video modes if the `vga=ask` option was passed on the kernel command line (the bootloader translates this option into the `vid_mode` field of the setup header), and sets up the video mode with all the parameters that we collected in the previous steps.

The video mode is set by the `set_mode` function, which is defined in [video-mode.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/video-mode.c). This function expects one parameter: the video mode identifier. This identifier is either set by the bootloader or chosen from the video modes menu. The `set_mode` function goes over all available video cards defined in the `video_cards` array, and if the given mode belongs to a card, the `card->set_mode()` callback of that card is called to set up the video mode.
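The dispatch described above can be sketched in plain C. This is a simplified illustration and not the kernel's implementation: the structures are reduced to a few fields, a regular array stands in for the linker-collected `.videocards` section, modes are matched by identifier only, and the mode identifiers are example values. All `demo_*` names are hypothetical.

```C
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical versions of struct mode_info and struct card_info. */
struct demo_mode {
	uint16_t mode;		/* mode identifier */
	uint16_t x, y;		/* resolution in columns and rows */
};

struct demo_card {
	const char *card_name;
	int (*set_mode)(const struct demo_mode *mode);
	const struct demo_mode *modes;
	size_t nmodes;
};

/* Records the last mode that a driver callback was asked to set. */
static uint16_t demo_last_mode;

static int demo_vga_set_mode(const struct demo_mode *mode)
{
	assert(mode != NULL);
	demo_last_mode = mode->mode;
	return 0;
}

/* Example mode table; the identifiers are illustrative values. */
static const struct demo_mode demo_vga_modes[] = {
	{ 0x0f00, 80, 25 },
	{ 0x0f01, 80, 50 },
};

/* A plain array stands in for the linker-collected .videocards section. */
static const struct demo_card demo_cards[] = {
	{ "VGA", demo_vga_set_mode, demo_vga_modes, 2 },
};

/* Find the card that owns the mode and delegate to its callback. */
static int demo_set_mode(uint16_t mode)
{
	const struct demo_card *card;
	const struct demo_card *end =
		demo_cards + sizeof(demo_cards) / sizeof(demo_cards[0]);
	size_t i;

	for (card = demo_cards; card < end; card++)
		for (i = 0; i < card->nmodes; i++)
			if (card->modes[i].mode == mode)
				return card->set_mode(&card->modes[i]);

	return -1;	/* no card knows this mode */
}
```

The benefit of this design is that the generic code never needs to know which drivers exist. Adding a new card only means adding one more structure with its own callbacks.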

Let's take a look at the example of setting up the [VGA](https://en.wikipedia.org/wiki/Video_Graphics_Array) video mode:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/video-vga.c#L191-L224 -->
```C
static int vga_set_mode(struct mode_info *mode)
{
	/* Set the basic mode */
	vga_set_basic_mode();

	/* Override a possibly broken BIOS */
	force_x = mode->x;
	force_y = mode->y;

	switch (mode->mode) {
	case VIDEO_80x25:
		break;
	case VIDEO_8POINT:
		vga_set_8font();
		break;
	case VIDEO_80x43:
		vga_set_80x43();
		break;
	case VIDEO_80x28:
		vga_set_14font();
		break;
	case VIDEO_80x30:
		vga_set_80x30();
		break;
	case VIDEO_80x34:
		vga_set_80x34();
		break;
	case VIDEO_80x60:
		vga_set_80x60();
		break;
	}

	return 0;
}
```

The `vga_set_mode` function is responsible for configuring the VGA display to a specific text mode, based on the settings that we collected in the previous steps. The `vga_set_basic_mode` function resets the VGA hardware into a standard text mode. The `switch` statement that follows applies the adjustments needed for the selected video mode. Most of these functions have very similar implementations based on the `0x10` BIOS interrupt.

After this step, the video mode is configured and we save all the information about it again for later use. Having done this, the video mode setup is complete, and now we can take a look at the last preparations before the switch into protected mode.

## Last preparation before transition into protected mode

Returning to the [`main`](https://github.com/torvalds/linux/blob/master/arch/x86/boot/main.c) function of the early kernel setup code, we finally can see:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/main.c#L179-L180 -->
```C
	/* Do the last things and invoke protected mode */
	go_to_protected_mode();
```

As the comment says: `Do the last things and invoke protected mode`, so let's see what these last things are and switch into protected mode.

The `go_to_protected_mode` function is defined in [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c). It contains routines that make the final preparations before we jump into protected mode, so let's look at it and try to understand what it does and how it works.

The very first function that we can see in `go_to_protected_mode` is the `realmode_switch_hook` function. This function invokes the real mode switch hook if it is present, or disables [NMI](http://en.wikipedia.org/wiki/Non-maskable_interrupt) otherwise. The hooks are used if the bootloader runs in a hostile environment. You can read more about hooks in the [boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt) (see **ADVANCED BOOT LOADER HOOKS**). Interrupts must be disabled before switching to protected mode because otherwise the CPU could receive an interrupt when there is no valid interrupt table or handlers. Once the kernel sets up the protected-mode interrupt infrastructure, interrupts are enabled again.

We will consider only a standard use case, when the bootloader does not provide any hooks. In this case, we just disable non-maskable interrupts:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/pm.c#L28-L30 -->
```C
		asm volatile("cli");
		outb(0x80, 0x70); /* Disable NMI */
		io_delay();
```

An interrupt is a signal to the CPU that is emitted by hardware or software. After getting such a signal, the CPU suspends the current instruction sequence, saves its state, and transfers control to the interrupt handler. After the interrupt handler has finished its work, it transfers control back to the interrupted code. Non-maskable interrupts (NMI) are interrupts that are always processed, regardless of the state of the interrupt flag. They cannot be ignored and are typically used to signal non-recoverable hardware errors. We will not dive into the details of interrupts now, but we will discuss them in the next parts.

At the first line, there is an [inline assembly](../Theory/linux-theory-3.md) statement with the `cli` instruction, which clears the [interrupt flag](https://en.wikipedia.org/wiki/Interrupt_flag). After this, external interrupts are disabled. The next line disables NMI (non-maskable interrupt).

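Let's get back to the code. In the second line, we write the byte `0x80` to the port `0x70`. This port is the index register of the CMOS/RTC chip, and its highest bit controls whether NMIs are delivered to the CPU. After that, a call to the `io_delay` function occurs. `io_delay` causes a little delay and looks like this: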

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/boot.h#L39-L43 -->
```C
static inline void io_delay(void)
{
	const u16 DELAY_PORT = 0x80;
	outb(0, DELAY_PORT);
}
```

Writing any byte to port `0x80` introduces a short delay, commonly estimated at about 1 microsecond. This delay gives the change to the NMI mask time to take effect. After this delay, both maskable and non-maskable interrupts are disabled.
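Why does the `outb(0x80, 0x70)` call disable NMI? On PC-compatible hardware, the low seven bits of the byte written to port `0x70` select a CMOS register, and the highest bit masks NMI. The following sketch only builds such a byte and does not touch any hardware; the `demo_cmos_index` name is made up for this example:

```C
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Compose the byte for the CMOS index port 0x70:
 * bits 6:0 select a CMOS register, bit 7 set means "NMI disabled".
 */
static uint8_t demo_cmos_index(uint8_t reg, bool nmi_disabled)
{
	assert(reg < 0x80);
	return (uint8_t)(reg | (nmi_disabled ? 0x80 : 0x00));
}
```

With register `0` selected and NMI disabled, the result is `0x80`, which is exactly the value used by the kernel setup code.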

The next step is the `enable_a20` function, which enables the [A20 line](http://en.wikipedia.org/wiki/A20_line). Enabling this line allows the kernel to have access to more than 1 megabyte of memory.

The `enable_a20` function is defined in [arch/x86/boot/a20.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/a20.c). It tries to enable the `A20` gate using several approaches. The first one is the `a20_test_short` function, which checks if `A20` is already enabled using the `a20_test` function:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/a20.c#L54-L74 -->
```C
static int a20_test(int loops)
{
	int ok = 0;
	int saved, ctr;

	set_fs(0x0000);
	set_gs(0xffff);

	saved = ctr = rdfs32(A20_TEST_ADDR);

	while (loops--) {
		wrfs32(++ctr, A20_TEST_ADDR);
		io_delay();	/* Serialize and make delay constant */
		ok = rdgs32(A20_TEST_ADDR+0x10) ^ ctr;
		if (ok)
			break;
	}

	wrfs32(saved, A20_TEST_ADDR);
	return ok;
}
```

To verify whether the `A20` line is already enabled or not, the kernel performs a simple memory test. It begins by setting the `FS` register to `0x0000` and the `GS` register to `0xffff`. By doing this, an access to `FS:0x200` (`A20_TEST_ADDR`) points to the very beginning of memory, at physical address `0x200`, while an access to `GS:0x210` (`A20_TEST_ADDR+0x10`) refers to the address `0x100200`, just past the one-megabyte boundary. If the `A20` line is disabled, the latter wraps around and points to the same physical address `0x200`. The test writes a changing counter through `FS` and reads it back through `GS`: if the two values differ, the addresses are distinct and `A20` is enabled.
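The wrap-around is plain address arithmetic, so it can be checked with a few lines of C. This is a sketch that models a disabled `A20` line as address bit 20 being forced to zero; `demo_linear_addr` is a hypothetical helper, not a kernel function:

```C
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Real-mode address translation: linear = segment * 16 + offset. */
static uint32_t demo_linear_addr(uint16_t seg, uint16_t off, bool a20_enabled)
{
	uint32_t addr = ((uint32_t)seg << 4) + off;

	/* 0xffff:0xffff is the highest address reachable in real mode. */
	assert(addr <= 0x10ffef);

	/* With the A20 gate closed, address bit 20 is forced to zero. */
	if (!a20_enabled)
		addr &= ~(UINT32_C(1) << 20);
	return addr;
}
```

With `A20_TEST_ADDR` equal to `0x200`, `demo_linear_addr(0xffff, 0x210, true)` gives `0x100200`, while the same access with the gate disabled gives `0x200`, which is the address reached through `FS:0x200`.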

If the `A20` gate is disabled, the kernel tries to enable it using several methods, which you can find in the `enable_a20` function. For example, it can be done with a call to the `0x15` BIOS interrupt with the `AX` register set to `0x2401`. If all attempts fail, `enable_a20` returns an error, and the caller prints an error message and calls the `die` function, which stops the kernel setup process.

After the `A20` gate is successfully enabled, the `reset_coprocessor` function is called:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/pm.c#L48-L54 -->
```C
static void reset_coprocessor(void)
{
	outb(0, 0xf0);
	io_delay();
	outb(0, 0xf1);
	io_delay();
}
```

This function resets the [math coprocessor](https://en.wikipedia.org/wiki/Floating-point_unit) to ensure it is in a clean state before switching to protected mode. The reset is performed by writing `0` to port `0xF0`, followed by writing `0` to port `0xF1`.

The next step is the `mask_all_interrupts` function:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/pm.c#L37-L43 -->
```C
static void mask_all_interrupts(void)
{
	outb(0xff, 0xa1);	/* Mask all interrupts on the secondary PIC */
	io_delay();
	outb(0xfb, 0x21);	/* Mask all but cascade on the primary PIC */
	io_delay();
}
```

This function masks, or in other words forbids, interrupts on the primary and secondary [PICs](https://en.wikipedia.org/wiki/Programmable_interrupt_controller). Only the cascade line, through which the secondary PIC is connected to the primary one, stays unmasked on the primary PIC. This is done for safety: nothing coming from the `PIC` can interrupt the CPU while the kernel is making the transition into protected mode.
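The two magic numbers are bit masks in which every set bit disables one interrupt line. A short sketch shows where `0xfb` comes from, assuming the usual PC wiring where the secondary PIC is cascaded through line 2 of the primary one; the helper name is hypothetical:

```C
#include <assert.h>
#include <stdint.h>

/* Build an 8259 mask byte: a set bit masks (disables) that IRQ line. */
static uint8_t demo_pic_mask_all_but(unsigned int irq_line)
{
	assert(irq_line < 8);
	return (uint8_t)(0xff & ~(1u << irq_line));
}
```

Here `demo_pic_mask_all_but(2)` evaluates to `0xfb`, the value written to port `0x21`, while the value `0xff` written to port `0xa1` masks all eight lines of the secondary controller.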

All the operations before this point were executed to make the transition to protected mode safe. The next operations prepare the transition itself. Let's take a look at them.

## Entering Protected Mode

At this point, we are very close to seeing the Linux kernel switch into protected mode.

Only two last steps remain:

- Setting up the Interrupt Descriptor Table
- Setting up the Global Descriptor Table

And that’s all! Once these two structures are configured, the Linux kernel can make the jump into protected mode.

### Set up the Interrupt Descriptor Table

Before the CPU can safely enter protected mode, it needs to know where to find the handlers that are triggered in the case of [interrupts and exceptions](https://en.wikipedia.org/wiki/Interrupt). In real mode, the CPU relies on the [Interrupt Vector Table](https://en.wikipedia.org/wiki/Interrupt_vector_table). In protected mode, this mechanism changes to the Interrupt Descriptor Table.

The Interrupt Descriptor Table is a special structure located in memory that contains descriptors. These descriptors tell the CPU where it can find handlers for interrupts and exceptions. We will see the full description of the Interrupt Descriptor Table and its entries later. For now, since we have disabled all interrupts in the previous steps, an empty table is enough. Let's take a look at the function that sets up such an empty Interrupt Descriptor Table:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/pm.c#L94-L98 -->
```C
static void setup_idt(void)
{
	static const struct gdt_ptr null_idt = {0, 0};
	asm volatile("lidtl %0" : : "m" (null_idt));
}
```

As we can see, it just loads a null IDT descriptor, with both the base address and the limit set to zero, using the `lidtl` instruction. The `null_idt` has type `gdt_ptr`, which is a structure defined in the same [arch/x86/boot/pm.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pm.c) file:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/pm.c#L60-L63 -->
```C
struct gdt_ptr {
	u16 len;
	u32 ptr;
} __attribute__((packed));
```

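This structure describes the location of a descriptor table: `len` holds the limit of the table (its size in bytes minus one) and `ptr` holds its 32-bit linear base address. Despite the name, the same layout is used for both the Interrupt Descriptor Table and the Global Descriptor Table.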
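The `packed` attribute matters here because the `lidt` and `lgdt` instructions expect exactly six bytes in memory: a 16-bit limit immediately followed by a 32-bit base. A small sketch with hypothetical `demo_*` names, relying on the GCC/Clang attribute syntax, illustrates the layout:

```C
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Same layout as gdt_ptr: a 16-bit limit followed by a 32-bit base.
 * Without the packed attribute, a compiler would typically insert
 * two bytes of padding after len to align ptr.
 */
struct demo_desc_ptr {
	uint16_t len;
	uint32_t ptr;
} __attribute__((packed));

/* The limit stored in len is the table size in bytes minus one. */
static uint16_t demo_table_limit(size_t table_bytes)
{
	assert(table_bytes > 0 && table_bytes <= 65536);
	return (uint16_t)(table_bytes - 1);
}
```

With this definition, `sizeof(struct demo_desc_ptr)` is `6` and the `ptr` field starts at offset `2`. A table of five 8-byte descriptors, for example, gets the limit `39`.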

### Set up Global Descriptor Table

Next, we set up the Global Descriptor Table. As you may remember, memory access in real mode is based on `segment:offset` addressing. Protected mode introduces a different model based on the `Global Descriptor Table`. If you forgot the details about the Global Descriptor Table structure, you can find more information in the [previous chapter](./linux-bootstrap-2.md#protected-mode).

Instead of fixed segment bases and limits, the CPU now looks for memory regions defined by descriptors located in the Global Descriptor Table. The goal of the kernel is to set up these descriptors.

The whole job is done by the `setup_gdt` function, which is defined in the same source code file. Let's take a look at the definition of this function:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/pm.c#L65-L89 -->
```C
static void setup_gdt(void)
{
	/* There are machines which are known to not boot with the GDT
	   being 8-byte unaligned.  Intel recommends 16 byte alignment. */
	static const u64 boot_gdt[] __attribute__((aligned(16))) = {
		/* CS: code, read/execute, 4 GB, base 0 */
		[GDT_ENTRY_BOOT_CS] = GDT_ENTRY(DESC_CODE32, 0, 0xfffff),
		/* DS: data, read/write, 4 GB, base 0 */
		[GDT_ENTRY_BOOT_DS] = GDT_ENTRY(DESC_DATA32, 0, 0xfffff),
		/* TSS: 32-bit tss, 104 bytes, base 4096 */
		/* We only have a TSS here to keep Intel VT happy;
		   we don't actually use it for anything. */
		[GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(DESC_TSS32, 4096, 103),
	};
	/* Xen HVM incorrectly stores a pointer to the gdt_ptr, instead
	   of the gdt_ptr contents.  Thus, make it static so it will
	   stay in memory, at least long enough that we switch to the
	   proper kernel GDT. */
	static struct gdt_ptr gdt;

	gdt.len = sizeof(boot_gdt)-1;
	gdt.ptr = (u32)&boot_gdt + (ds() << 4);

	asm volatile("lgdtl %0" : : "m" (gdt));
}
```

The initial memory descriptors are specified by the items of the `boot_gdt` array. The `setup_gdt` function fills in a `gdt_ptr` structure with the size and the linear address of this array and loads it using the `lgdtl` instruction. Let's take a closer look at the definition of the memory descriptors.

Initially, three memory descriptors are specified:

- Code segment
- Data segment
- Task state segment

We will skip the description of the task state segment for now, as it was added there (according to the comment) to make [Intel VT](https://en.wikipedia.org/wiki/X86_virtualization#Intel_virtualization_(VT-x)) happy.

The other two segments correspond to the memory regions used by the kernel code and data sections. Both memory descriptors are defined using the `GDT_ENTRY` macro. This macro itself is defined in [arch/x86/include/asm/segment.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/segment.h) and expects three arguments:

- `flags`
- `base`
- `limit`

Let's take a look at the definition of the code memory segment:

```C
[GDT_ENTRY_BOOT_CS] = GDT_ENTRY(DESC_CODE32, 0, 0xfffff),
```

The base address of this memory segment is defined as `0` and the limit as `0xFFFFF`. The `DESC_CODE32` value describes the flags of this segment. If we take a look at the flags, we can see that the granularity (bit `G`) of this segment is set to 4 KB units. This means that the segment covers addresses `0x00000000–0xFFFFFFFF`, which is the entire 4 GB linear address space. The same base address and limit are defined for the data segment. This is because the Linux kernel uses the so-called [flat memory model](https://en.wikipedia.org/wiki/Flat_memory_model).

Besides the granularity bit, `DESC_CODE32` specifies other flags. Among them, you can find that this is a 32-bit segment that is readable, executable, and present in memory. The descriptor privilege level is set to `0`, the most privileged level, which is what the kernel needs.

Looking at the documentation of the Global Descriptor Table and its entries, you can check all the initial segments by yourself. It is not so hard.
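If you want to check the descriptors without decoding bits by hand, the packing can be reproduced in a few lines. The sketch below mirrors the way flags, base, and limit are spread over the 8-byte descriptor format. The function names are hypothetical, and the flag values `0xc09b` (code) and `0xc093` (data) used in the example are assumptions about what `DESC_CODE32` and `DESC_DATA32` expand to, so verify them against the kernel headers of your version.

```C
#include <assert.h>
#include <stdint.h>

/* Pack flags, base and limit into the 8-byte segment descriptor layout. */
static uint64_t demo_gdt_entry(uint64_t flags, uint64_t base, uint64_t limit)
{
	assert(limit <= 0xfffff);	/* the limit field is only 20 bits wide */
	return ((base  & 0xff000000ULL) << 32) |	/* base[31:24]  -> bits 63:56 */
	       ((flags & 0x0000f0ffULL) << 40) |	/* access byte and G/DB/L/AVL */
	       ((limit & 0x000f0000ULL) << 32) |	/* limit[19:16] -> bits 51:48 */
	       ((base  & 0x00ffffffULL) << 16) |	/* base[23:0]   -> bits 39:16 */
	        (limit & 0x0000ffffULL);		/* limit[15:0]  -> bits 15:0  */
}

/* Number of bytes covered by a segment with the given limit and G bit. */
static uint64_t demo_segment_size(uint32_t limit, int granularity_4k)
{
	return ((uint64_t)limit + 1) << (granularity_4k ? 12 : 0);
}
```

Under these assumptions, `demo_gdt_entry(0xc09b, 0, 0xfffff)` produces `0x00cf9b000000ffff`, and `demo_segment_size(0xfffff, 1)` gives `0x100000000` bytes, which is the full 4 GB address space.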

## Transition into protected mode

Finally, we are standing right before it: interrupts are disabled, and the Interrupt Descriptor Table and Global Descriptor Table are initialized. Now the kernel can execute a jump into protected mode! But despite the good news, we need to return to the assembly again 😅

The transition to protected mode is defined in [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S). Let's take a look at it:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/pmjump.S#L24-L45 -->
```assembly
SYM_FUNC_START_NOALIGN(protected_mode_jump)
	movl	%edx, %esi		# Pointer to boot_params table

	xorl	%ebx, %ebx
	movw	%cs, %bx
	shll	$4, %ebx
	addl	%ebx, 2f
	jmp	1f			# Short jump to serialize on 386/486
1:

	movw	$__BOOT_DS, %cx
	movw	$__BOOT_TSS, %di

	movl	%cr0, %edx
	orb	$X86_CR0_PE, %dl	# Protected mode
	movl	%edx, %cr0

	# Transition to 32-bit mode
	.byte	0x66, 0xea		# ljmpl opcode
2:	.long	.Lin_pm32		# offset
	.word	__BOOT_CS		# segment
SYM_FUNC_END(protected_mode_jump)
```

First of all, we preserve the address of the `boot_params` structure in the `esi` register since we continue to use parameters that the kernel got during boot in later stages.

After this, we compute the physical base address of the current code segment (the value of `cs` shifted left by four bits) and store it in the `ebx` register. Then we add it to the value stored at the memory location marked by the label `2`, so that the far jump to the first protected mode code contains the proper 32-bit offset.

The next jump to the label `1` may look quite unexpected. Why does the kernel even need this jump? Right now, the CPU works in real mode. While it is executing the current instruction, it may have already fetched several subsequent instruction bytes into its internal prefetch queue. At this moment, all prefetched instructions were fetched under the assumption that the processor is still operating in real mode. If we were to continue executing instructions that were prefetched before the jump to the protected mode, the processor could continue decoding and executing them without fully synchronizing its internal state with the new mode. The jump instruction prevents this.

In the next steps, we save the segment selectors of the data segment and the task state segment in the general-purpose registers `cx` and `di` and set the `PE` bit in the [control register](https://en.wikipedia.org/wiki/Control_register) `cr0`. From this point, protected mode is turned on, and we just need to jump into it to set the proper value of the code segment:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/pmjump.S#L41-L44 -->
```assembly
	# Transition to 32-bit mode
	.byte	0x66, 0xea		# ljmpl opcode
2:	.long	.Lin_pm32		# offset
	.word	__BOOT_CS		# segment
```

The kernel is in protected mode now 🥳🥳🥳

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/pmjump.S#L47-L49 -->
```assembly
	.code32
	.section ".text32","ax"
SYM_FUNC_START_LOCAL_NOALIGN(.Lin_pm32)
```

Let's look at the first steps taken in protected mode. First of all, we load the data segment registers with the data segment selector that we preserved in the `cx` register at the previous step:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/pmjump.S#L50-L55 -->
```assembly
	# Set up data segments for flat 32-bit mode
	movl	%ecx, %ds
	movl	%ecx, %es
	movl	%ecx, %fs
	movl	%ecx, %gs
	movl	%ecx, %ss
```
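The values loaded into the segment registers are no longer real-mode segment addresses but selectors, which are essentially indexes into the Global Descriptor Table. A selector can be composed as in the following sketch. The function name is hypothetical, and the example assumes that the boot code, data, and task state descriptors occupy GDT slots 2, 3, and 4, so check `GDT_ENTRY_BOOT_CS` and its neighbours in your kernel tree.

```C
#include <assert.h>
#include <stdint.h>

/*
 * Segment selector layout:
 * index in bits 15:3, table indicator (0 = GDT, 1 = LDT) in bit 2,
 * requested privilege level in bits 1:0.
 */
static uint16_t demo_selector(unsigned int index, unsigned int use_ldt,
			      unsigned int rpl)
{
	assert(index < 8192 && use_ldt <= 1 && rpl <= 3);
	return (uint16_t)((index << 3) | (use_ldt << 2) | rpl);
}
```

Under that assumption, the selectors for the three boot descriptors are `0x10`, `0x18`, and `0x20`. Since each descriptor is 8 bytes long, a GDT selector with privilege level 0 is simply the byte offset of the descriptor within the table.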

Since we are in protected mode, our segment bases point to zero. Because of this, the stack pointer will point somewhere below the kernel code, so we need to adjust it to at least its previous state. Before the jump, we stored the base address of the code segment in the `ebx` register, so now we can use this value to adjust the stack pointer:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/pmjump.S#L58-L58 -->
```assembly
	addl	%ebx, %esp
```

The last step before the jump into the actual 32-bit entry point is to clear the general-purpose registers:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/pmjump.S#L65-L69 -->
```assembly
	xorl	%ecx, %ecx
	xorl	%edx, %edx
	xorl	%ebx, %ebx
	xorl	%ebp, %ebp
	xorl	%edi, %edi
```

Now everything is ready. The kernel is in protected mode, and we can jump to the next code, the address of which was passed in the `eax` register:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/pmjump.S#L74-L74 -->
```assembly
	jmpl	*%eax			# Jump to the 32-bit entrypoint
```

## Conclusion

This is the end of the third part about Linux kernel insides. If you have questions or suggestions, feel free to ping me on X - [0xAX](https://twitter.com/0xAX), drop me an [email](mailto:anotherworldofworld@gmail.com), or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

## Links

Here is a list of links that you may find useful while reading this chapter:

- [QEMU](https://www.qemu.org/)
- [VGA](http://en.wikipedia.org/wiki/Video_Graphics_Array)
- [VESA BIOS Extensions](http://en.wikipedia.org/wiki/VESA_BIOS_Extensions)
- [Data structure alignment](http://en.wikipedia.org/wiki/Data_structure_alignment)
- [Non-maskable interrupt](http://en.wikipedia.org/wiki/Non-maskable_interrupt)
- [A20](http://en.wikipedia.org/wiki/A20_line)
- [Math coprocessor](https://en.wikipedia.org/wiki/Floating-point_unit)
- [PIC](https://en.wikipedia.org/wiki/Programmable_interrupt_controller)
- [Interrupts and exceptions](https://en.wikipedia.org/wiki/Interrupt)
- [Interrupt Vector Table](https://en.wikipedia.org/wiki/Interrupt_vector_table)
- [Protected mode](https://en.wikipedia.org/wiki/Protected_mode)
- [Intel VT](https://en.wikipedia.org/wiki/X86_virtualization#Intel_virtualization_(VT-x))
- [Flat memory model](https://en.wikipedia.org/wiki/Flat_memory_model)
- [Previous part](linux-bootstrap-2.md)


================================================
FILE: Booting/linux-bootstrap-4.md
================================================
# Kernel booting process. Part 4

In the previous [part](./linux-bootstrap-3.md), we saw the transition from [real mode](https://en.wikipedia.org/wiki/Real_mode) into [protected mode](http://en.wikipedia.org/wiki/Protected_mode). At this point, two crucial things have changed:

- The processor now can address up to four gigabytes of memory
- The privilege levels were set for the memory access 

Despite this, the kernel is still in its early setup mode. There are many different things that the early setup code should prepare before we reach the main kernel's entry point. Right now, the processor operates in protected mode. However, protected mode is not the main mode in which `x86_64` processors should operate – it exists only for backward compatibility. The next crucial step is to switch to the native mode for `x86_64` - [long mode](https://en.wikipedia.org/wiki/Long_mode).

The main characteristic of this new mode (as with all the earlier modes) is the way it defines the memory model. In real mode, the memory model was relatively simple, and each memory location was formed based on the base address specified in a segment register, plus some offset. In protected mode, the global and local descriptor tables contain descriptors that describe memory areas. All the memory accesses in long mode are based on the new mechanism called [paging](https://en.wikipedia.org/wiki/Memory_paging). One of the crucial goals of the kernel setup code before it can switch to the long mode is to set up paging.

In this chapter, we will see how the kernel switches to long mode in detail.

> [!NOTE]
> There will be lots of assembly code in this part, so if you are not familiar with that, read another set of my [posts about assembly programming](https://github.com/0xAX/asm).

## The 32-bit kernel entry point location

The last point where we stopped was the [jump](https://en.wikipedia.org/wiki/Branch_(computer_science)#Implementation) instruction to the kernel's entry point in protected mode. This jump was located in the [arch/x86/boot/pmjump.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/pmjump.S) and looks like this:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/pmjump.S#L74-L74 -->
```assembly
	jmpl	*%eax			# Jump to the 32-bit entrypoint
```

The `eax` register contains the address of the `32-bit` entry point. What is this address? To answer this question, we can read the [Linux kernel x86 boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt) document:

> When using bzImage, the protected-mode kernel was relocated to 0x100000

We can make sure that this is the address of the 32-bit entry point of the Linux kernel by using the [GNU GDB](https://sourceware.org/gdb/) debugger and running the Linux kernel in the [QEMU](https://www.qemu.org/) virtual machine. To do this, you can run the following command in one terminal:

```bash
sudo qemu-system-x86_64 -kernel ./linux/arch/x86/boot/bzImage \
                        -nographic                            \
                        -append "console=ttyS0 nokaslr" -s -S \
                        -initrd /boot/initramfs-6.17.0-rc3-g1b237f190eb3.img
```

> [!NOTE]
> You need to pass your own kernel image and [initrd](https://en.wikipedia.org/wiki/Initial_ramdisk) image to the `-kernel` and `-initrd` command line options.

After this, run the GNU GDB debugger in another terminal and enter the following commands:

```
$ gdb
(gdb) target remote :1234
(gdb) hbreak *0x100000
(gdb) c
Continuing.

Breakpoint 1, 0x0000000000100000 in ?? ()
```

As soon as the debugger stops at the [breakpoint](https://en.wikipedia.org/wiki/Breakpoint), we can inspect the registers to make sure that the `eax` register contains `0x100000`, the address of the 32-bit kernel entry point:

```
eax            0x100000	1048576
ecx            0x0	    0
edx            0x0	    0
ebx            0x0	    0
esp            0x1ff5c	0x1ff5c
ebp            0x0	    0x0
esi            0x14470	83056
edi            0x0	    0
eip            0x100000	0x100000
eflags         0x46	    [ PF ZF ]
```

From the previous part, you may remember:

> First of all, we preserve the address of `boot_params` structure in the `esi` register.

So the `esi` register holds the pointer to `boot_params`. Let's inspect it to make sure that this is really the case. For example, we can take a look at the command line string that we passed to the virtual machine:

```
(gdb) x/s ((struct boot_params *)$rsi)->hdr.cmd_line_ptr
0x20000:	"console=ttyS0 nokaslr"
(gdb) ptype struct boot_params
type = struct boot_params {
    struct screen_info screen_info;
    struct apm_bios_info apm_bios_info;
    __u8 _pad2[4];
    __u64 tboot_addr;
    struct ist_info ist_info;
    __u64 acpi_rsdp_addr;
    __u8 _pad3[8];
    __u8 hd0_info[16];
    __u8 hd1_info[16];
    struct sys_desc_table sys_desc_table;
    struct olpc_ofw_header olpc_ofw_header;
    __u32 ext_ramdisk_image;
    __u32 ext_ramdisk_size;
    __u32 ext_cmd_line_ptr;
    __u8 _pad4[112];
    __u32 cc_blob_address;
    struct edid_info edid_info;
    struct efi_info efi_info;
    __u32 alt_mem_k;
    __u32 scratch;
    __u8 e820_entries;
    __u8 eddbuf_entries;
    __u8 edd_mbr_sig_buf_entries;
    __u8 kbd_status;
    __u8 secure_boot;
    __u8 _pad5[2];
    __u8 sentinel;
    __u8 _pad6[1];
    struct setup_header hdr;
    __u8 _pad7[36];
    __u32 edd_mbr_sig_buffer[16];
    struct boot_e820_entry e820_table[128];
    __u8 _pad8[48];
    struct edd_info eddbuf[6];
    __u8 _pad9[276];
}
```

We got it 🎉

Now we know where we are, so let's take a look at the code and continue exploring the Linux kernel.

## First steps in the protected mode

The `32-bit` entry point is defined in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly source code file:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L81-L82 -->
```assembly
	.code32
SYM_FUNC_START(startup_32)
```

First of all, it is worth knowing why the directory is named `compressed`. It's because the kernel is in the [`bzImage`](https://en.wikipedia.org/wiki/Vmlinux#bzImage) file, which is a compressed package that contains the kernel image and kernel setup code. In the previous chapters, we explored the kernel setup code. The next two big steps that the setup code must take before we reach the entry point of the kernel itself are:

- Switch to long mode
- Decompress the kernel image and jump to its entry point

In this part, we will focus only on switching to long mode. The kernel image decompression will be covered in the next chapters. Returning to the current kernel code, you can find the following two files in the [arch/x86/boot/compressed](https://github.com/torvalds/linux/tree/master/arch/x86/boot/compressed) directory:

- [head_32.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_32.S)
- [head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S)

We will focus only on the `head_64.S` file. Yes, the file name contains the `64` suffix, even though the kernel is in 32-bit protected mode at the moment. The explanation is simple. Let's look at [arch/x86/boot/compressed/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/Makefile). We can see the following variable definition there:

```Makefile
vmlinux-objs-y := $(obj)/vmlinux.lds $(obj)/kernel_info.o $(obj)/head_$(BITS).o \
	$(obj)/misc.o $(obj)/string.o $(obj)/cmdline.o $(obj)/error.o \
	$(obj)/piggy.o $(obj)/cpuflags.o
```

The first line lists the `$(obj)/head_$(BITS).o` object file. This means that `make` selects the file to build based on the value of `$(BITS)`. This `make` variable is defined in [arch/x86/Makefile](https://github.com/torvalds/linux/blob/master/arch/x86/Makefile) and its value depends on the kernel's configuration:

```Makefile
ifeq ($(CONFIG_X86_32),y)
        BITS := 32
        ...
        ...
else
        BITS := 64
        ...
        ...
endif
```

Since we are considering the kernel for the `x86_64` architecture, we assume that `CONFIG_X86_64` is set to `y`. As a result, the `head_64.S` file will be used during the kernel build process. Let's start investigating what the kernel does in this file.

### Reload the segments if needed

As we already know, we start in the [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S) assembly source code file. The entry point is defined by the `startup_32` symbol.

At the beginning of the `startup_32`, we can see the `cld` instruction, which clears the `DF` or [direction flag](https://en.wikipedia.org/wiki/Direction_flag) bit in the [flags](https://en.wikipedia.org/wiki/FLAGS_register) register:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L81-L90 -->
```assembly
	.code32
SYM_FUNC_START(startup_32)
	/*
	 * 32bit entry is 0 and it is ABI so immutable!
	 * If we come here directly from a bootloader,
	 * kernel(text+data+bss+brk) ramdisk, zero_page, command line
	 * all need to be under the 4G limit.
	 */
	cld
	cli
```

When the direction flag is clear, string operations such as [stos](https://www.felixcloutier.com/x86/stos:stosb:stosw:stosd:stosq) or [scas](https://www.felixcloutier.com/x86/scas:scasb:scasw:scasd) increment the index registers `esi` or `edi` after each step. We need to clear the direction flag because later we will use string operations for tasks such as clearing space for page tables or copying data.
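
To see why the direction flag matters, here is a minimal Python model of `rep stosl`. It is only a sketch of the instruction's documented behavior; the `rep_stosl` function and the `bytearray` that stands in for memory are made up for this illustration:

```python
def rep_stosl(mem, edi, eax, ecx, df=0):
    """Model `rep stosl`: store the 32-bit value eax at mem[edi] ecx times."""
    step = -4 if df else 4  # DF=0 moves edi forward, DF=1 moves it backward
    while ecx:
        mem[edi:edi + 4] = eax.to_bytes(4, "little")
        edi += step
        ecx -= 1
    return edi


memory = bytearray(b"\xff" * 16)
end = rep_stosl(memory, edi=0, eax=0, ecx=4)  # DF is clear after `cld`
print(memory.hex(), end)
```

With the flag set, the same loop would walk towards lower addresses, which is not what code that clears a buffer from its start expects.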

The next instruction, `cli`, disables interrupts. We have already seen it in the previous chapter. The interrupts are disabled "twice" because some bootloaders load the kernel and start it directly from this point, skipping the setup code that we have seen in the [first chapter](./linux-bootstrap-1.md).

After these two simple instructions, the next step is to calculate the difference between the address the kernel was compiled to run at and the address where it was actually loaded. If we take a look at the linker [script](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/vmlinux.lds.S), we will see the following definition:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/vmlinux.lds.S#L19-L24 -->
```linker-script
SECTIONS
{
	/* Be careful parts of head_64.S assume startup_32 is at
	 * address 0.
	 */
	. = 0;
```

This means that the code in this section is compiled to run at address zero. We can also see this in the output of the `objdump` utility:

```bash
$ objdump -D /home/alex/disk/dev/linux/arch/x86/boot/compressed/vmlinux | less

/home/alex/disk/dev/linux/arch/x86/boot/compressed/vmlinux:     file format elf64-x86-64


Disassembly of section .head.text:

0000000000000000 <startup_32>:
   0:   fc                      cld
   1:   fa                      cli
```

We can see that both the linker script and the `objdump` utility indicate that the address of the `startup_32` function is `0`, but this is not where the kernel was loaded. This is the address that the code was compiled for, also known as the link-time address. Why was it done like that? The answer is simplicity. By telling the linker to set the address of the very first symbol to zero, the address of every subsequent symbol becomes a simple offset from 0. As we already know, the kernel was loaded at the `0x100000` address. The difference between the address where the kernel was loaded and the address for which the kernel was compiled is called the relocation delta. Once the delta is known, the code can reach any variable or function by adding this delta to its compile-time address.

We know both these addresses based on the experiment above, and as a result, we know the value of the delta. Now let's take a look at how the kernel calculates this difference:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L100-L104 -->
```assembly
	leal	(BP_scratch+4)(%esi), %esp
	call	1f
1:	popl	%ebp
	subl	$ rva(1b), %ebp
```

The `call` instruction is used to get the physical address where the kernel is actually loaded. This trick works because after the `call` instruction is executed, the return address is on top of the stack. This return address is exactly the run-time address of the label `1`.

In the code above, the kernel sets up a temporary mini stack where the return address will be stored by the `call` instruction. Right after the call, we pop this address from the stack and save it in the `ebp` register. The last instruction subtracts `rva(1b)`, which is the link-time offset of the label `1` from `startup_32`, from the `ebp` register. The result is the address at which `startup_32` was actually loaded.

The `rva` macro is defined in the same source code file and looks like this:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L79-L79 -->
```assembly
#define rva(X) ((X) - startup_32)
```

Schematically, it can be represented like this:

![startup_32](./images/startup_32.svg)

Starting from this moment, the `ebp` register contains the physical address of the `startup_32` symbol. Next, it will be used to calculate the offset to any other symbols or structures in memory.
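
The arithmetic behind the `call`/`popl`/`subl` sequence can be sketched in a few lines of Python. The offset of the label `1` used below (`0x0b`) is a made-up number for illustration only; the real offset depends on the assembled code:

```python
LINK_STARTUP_32 = 0x0  # link-time address of startup_32 from the linker script


def rva(link_address):
    """Offset of a symbol from startup_32, like the kernel's rva() macro."""
    return link_address - LINK_STARTUP_32


def load_address(return_address, label_link_address):
    """Recover where startup_32 really is from the address popped after `call 1f`."""
    return return_address - rva(label_link_address)


# Hypothetical numbers: label `1` is linked at offset 0x0b and the code
# was loaded at 0x100000, so `popl %ebp` yields 0x10000b.
ebp = load_address(0x10000b, 0x0b)
print(hex(ebp))  # 0x100000
```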

The very first such structure that we need to access is the Global Descriptor Table. To switch to long mode, we need to update the previously loaded Global Descriptor Table with `64-bit` segments:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L106-L109 -->
```assembly
	leal	rva(gdt)(%ebp), %eax
	movl	%eax, 2(%eax)
	lgdt	(%eax)
```

Knowing that the `ebp` register contains the physical address of the beginning of the kernel in protected mode, the first line of the code above calculates the run-time address of the `gdt` structure. The next two lines write this address into the `gdt` structure at offset `2` and load the new Global Descriptor Table with the `lgdt` instruction.
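
The following Python sketch models this self-referencing trick. The descriptor values are copied from the kernel's `gdt` definition, while the helper names and the `0x4000` offset of the table are assumptions made only for this example:

```python
import struct

# Descriptor values copied from the kernel's `gdt` definition.
DESCRIPTORS = [
    0x00cf9a000000ffff,  # __KERNEL32_CS
    0x00af9a000000ffff,  # __KERNEL_CS
    0x00cf92000000ffff,  # __KERNEL_DS
    0x0080890000000000,  # TS descriptor
    0x0000000000000000,  # TS continued
]


def build_gdt():
    """Lay out the table as assembled: limit word, base placeholder, padding."""
    size = 8 + 8 * len(DESCRIPTORS)
    header = struct.pack("<HIH", size - 1, 0, 0)  # gdt_end - gdt - 1
    return bytearray(header + struct.pack("<5Q", *DESCRIPTORS))


def patch_base(gdt, gdt_address):
    """Model `movl %eax, 2(%eax)`: store the table's own address at offset 2."""
    struct.pack_into("<I", gdt, 2, gdt_address)


def read_gdtr(gdt):
    """Return the (limit, base) pair that a 32-bit `lgdt` reads from 6 bytes."""
    return struct.unpack_from("<HI", gdt, 0)


gdt = build_gdt()
patch_base(gdt, 0x100000 + 0x4000)  # hypothetical rva(gdt) of 0x4000
limit, base = read_gdtr(gdt)
print(limit, hex(base))
```

The first eight bytes of the table serve two purposes at once: they are the operand of `lgdt`, and they occupy the slot of the unused null descriptor.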

The new Global Descriptor Table looks like this:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L495-L504 -->
```assembly
SYM_DATA_START_LOCAL(gdt)
	.word	gdt_end - gdt - 1
	.long	0
	.word	0
	.quad	0x00cf9a000000ffff	/* __KERNEL32_CS */
	.quad	0x00af9a000000ffff	/* __KERNEL_CS */
	.quad	0x00cf92000000ffff	/* __KERNEL_DS */
	.quad	0x0080890000000000	/* TS descriptor */
	.quad   0x0000000000000000	/* TS continued */
SYM_DATA_END_LABEL(gdt, SYM_L_LOCAL, gdt_end)
```

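After its first eight bytes, which hold the limit and the base of the table and occupy the slot of the null descriptor, the new Global Descriptor Table contains five entries:

- 32-bit kernel code segment
- 64-bit kernel code segment
- 32-bit kernel data segment
- Task state segment descriptor
- Continuation of the task state segment descriptor, which occupies 16 bytes in long mode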

We already saw how the Global Descriptor Table is loaded in the previous [part](./linux-bootstrap-3.md#set-up-global-descriptor-table). Now we do almost the same, but the new table also contains the `__KERNEL_CS` descriptor with `CS.L = 1` and `CS.D = 0`, which is needed for execution in `64`-bit mode.
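
To check which descriptor really has the `L` bit set, we can decode the raw values by hand. The sketch below follows the standard x86 segment descriptor layout; the `decode_descriptor` helper is a hypothetical name used only here:

```python
def decode_descriptor(raw):
    """Split a 64-bit segment descriptor into the fields discussed above."""
    limit = (raw & 0xffff) | (((raw >> 48) & 0xf) << 16)
    base = ((raw >> 16) & 0xffffff) | (((raw >> 56) & 0xff) << 24)
    return {
        "base": base,
        "limit": limit,
        "code": bool(raw & (1 << 43)),  # executable bit of the type field
        "present": bool(raw & (1 << 47)),
        "L": (raw >> 53) & 1,  # 64-bit code segment
        "D": (raw >> 54) & 1,  # default operand size (1 = 32-bit)
        "G": (raw >> 55) & 1,  # limit counted in 4 KiB units
    }


for name, raw in [
    ("__KERNEL32_CS", 0x00cf9a000000ffff),
    ("__KERNEL_CS", 0x00af9a000000ffff),
    ("__KERNEL_DS", 0x00cf92000000ffff),
]:
    print(name, decode_descriptor(raw))
```

Only `__KERNEL_CS` comes out with `L = 1` and `D = 0`. The other two segments are ordinary flat 32-bit segments.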

After the new Global Descriptor Table is loaded, the next step is to reload the data segment registers and set up the stack:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L111-L119 -->
```assembly
	movl	$__BOOT_DS, %eax
	movl	%eax, %ds
	movl	%eax, %es
	movl	%eax, %fs
	movl	%eax, %gs
	movl	%eax, %ss

	/* Setup a stack and load CS from current GDT */
	leal	rva(boot_stack_end)(%ebp), %esp
```

In the previous step, we loaded a new Global Descriptor Table; however, all the segment registers may still hold selectors from the old table. If those selectors point to invalid entries in the new Global Descriptor Table, the next memory access can cause a [General Protection Fault](https://en.wikipedia.org/wiki/General_protection_fault). Setting them to `__BOOT_DS`, which is a well-known descriptor, avoids this potential fault and allows us to set up the proper stack pointed to by `boot_stack_end`.

The last action after loading the new Global Descriptor Table is to reload the `cs` register:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L121-L125 -->
```assembly
	pushl	$__KERNEL32_CS
	leal	rva(1f)(%ebp), %eax
	pushl	%eax
	lretl
1:
```

Since the `cs` register cannot be loaded with the `mov` instruction, a trick with the `lretl` instruction is used to set `cs` to the correct value. This instruction pops two values from the top of the stack: the first one goes into the `eip` register and the second one goes into the `cs` register. From this moment, we have proper values in the kernel code selector and the instruction pointer.
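
The order of the two pushes matters, because `lretl` pops the instruction pointer first. A tiny Python model makes this explicit. The selector value `0x08` is an assumption derived from the position of `__KERNEL32_CS` in the table (entry 1, eight bytes per entry), and the target address is made up:

```python
KERNEL32_CS = 0x08  # assumed selector: entry 1 of the table, 8 bytes per entry


def far_return(stack):
    """Model `lretl`: pop the instruction pointer first, then the selector."""
    eip = stack.pop()
    cs = stack.pop()
    return cs, eip


stack = []                  # the end of the list is the top of the stack
stack.append(KERNEL32_CS)   # pushl $__KERNEL32_CS
stack.append(0x100040)      # pushl %eax (hypothetical address of label `1`)
cs, eip = far_return(stack)
print(hex(cs), hex(eip))
```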

Just a couple of steps separate us from transitioning into the long mode. As mentioned at the beginning of this chapter, one of the most crucial steps is to set up `paging`. But before that, the kernel needs to do the last preparations, which we will see in the next sections.

## Last steps before paging setup

As we mentioned in the previous section, there are a couple of additional steps before we can set up paging and switch to long mode. These steps are:

- Verification of CPU
- Calculation of the relocation address
- Enabling `PAE` mode

In the next sections we will take a look at these steps.

### CPU verification

Before the kernel can switch to long mode, it checks that it runs on a suitable `x86_64` processor by running this piece of code:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L132-L136 -->
```assembly
	/* Make sure cpu supports long mode. */
	call	verify_cpu
	testl	%eax, %eax
	jnz	.Lno_longmode
```

The `verify_cpu` function is defined in [arch/x86/kernel/verify_cpu.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/verify_cpu.S) and executes the [CPUID](https://en.wikipedia.org/wiki/CPUID) instruction to check the details of the processor on which the kernel is running. In our case, the most crucial checks are for long mode and [SSE](http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions) support. This function returns the result in the `eax` register. Its value is `0` on success and `1` on failure. If long mode is not supported by the current processor, the kernel jumps to the `.Lno_longmode` label, which stops the CPU with the `hlt` instruction:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L478-L483 -->
```assembly
	.code32
SYM_FUNC_START_LOCAL_NOALIGN(.Lno_longmode)
	/* This isn't an x86-64 CPU, so hang intentionally, we cannot continue */
1:
	hlt
	jmp     1b
```

If everything is ok, the kernel continues its work.
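
The long mode part of this check boils down to testing a single bit reported by `CPUID`. The sketch below is not a port of `verify_cpu`, which checks more than this. It only shows the test of the long mode bit (bit `29` of `edx` for the extended leaf `0x80000001`) on register values passed in by the caller:

```python
LM_BIT = 29  # CPUID leaf 0x80000001, register edx: long mode available


def supports_long_mode(max_extended_leaf, edx_80000001):
    """Return True if the given CPUID values report long mode support.

    max_extended_leaf is the eax value returned by CPUID leaf 0x80000000.
    """
    if max_extended_leaf < 0x80000001:
        return False  # the leaf that carries the flag does not exist
    return bool(edx_80000001 & (1 << LM_BIT))


# Hypothetical register values, not read from a real processor.
print(supports_long_mode(0x80000008, 1 << LM_BIT))
```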

### Calculation of the kernel relocation address

The next step is to calculate the address for the kernel decompression. The kernel image mainly consists of two parts:

- Kernel's setup and decompressor code
- Chunk of compressed kernel code

We can see it looking at the [arch/x86/boot/compressed/vmlinux.lds.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/vmlinux.lds.S) linker script:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/vmlinux.lds.S#L19-L39 -->
```linker-script
SECTIONS
{
	/* Be careful parts of head_64.S assume startup_32 is at
	 * address 0.
	 */
	. = 0;
	.head.text : {
		_head = . ;
		HEAD_TEXT
		_ehead = . ;
	}
	.rodata..compressed : {
		*(.rodata..compressed)
	}
	.text :	{
		_text = .; 	/* Text */
		*(.text)
		*(.text.*)
		*(.noinstr.text)
		_etext = . ;
	}
```

There are three sections at the beginning of the linker script above:

- `.head.text` - section where we are now
- `.rodata..compressed` - section with the compressed kernel image
- `.text` - section with the decompressor code

The kernel decompression happens in-place, which is the same place where the compressed kernel is. This means that the parts of the decompressed kernel image will overwrite the parts of the compressed image during the decompression process. It may sound dangerous – if the decompressed part overwrites the decompressor code or the part of the compressed kernel image that is not decompressed yet, this will lead to code or image corruption.

One way to avoid this problem is to allocate a buffer for the decompressed kernel image and copy the compressed image outside of it. But this is not the most effective way in terms of memory consumption, and may not work on devices with not enough memory to hold both kernel images.

The second way to avoid this problem is to allocate a buffer for the decompressed kernel image, but copy the compressed image to the end of this buffer and leave some room at the beginning of this buffer for the parts of the decompressed kernel. Of course, the kernel decompressor must choose the right parameters, so that the pointer to the end of the decompressed part never overtakes the pointer to the part that is still compressed.

Schematically, it can be represented like this:

![kernel-relocation](./images/kernel-relocation.svg)

The buffer for the decompressed kernel starts at the address specified by the `LOAD_PHYSICAL_ADDR` macro, which by default expands to the `0x1000000` address. Since the kernel was loaded below this address (at `0x100000`), the kernel setup code should copy itself, the compressed kernel image, and the decompressor code to this buffer. In addition, to leave some room for safe in-place decompression, it should calculate a special offset from the beginning of this buffer.

We can see this calculation in the following code:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L146-L161 -->
```assembly
#ifdef CONFIG_RELOCATABLE
	movl	%ebp, %ebx
	movl	BP_kernel_alignment(%esi), %eax
	decl	%eax
	addl	%eax, %ebx
	notl	%eax
	andl	%eax, %ebx
	cmpl	$LOAD_PHYSICAL_ADDR, %ebx
	jae	1f
#endif
	movl	$LOAD_PHYSICAL_ADDR, %ebx
1:

	/* Target address to relocate to for decompression */
	addl	BP_init_size(%esi), %ebx
	subl	$ rva(_end), %ebx
```

Although it may look scary, it is not as complex as it seems. Let's take a closer look at it and try to understand what it does.

The `ebp` register contains the physical address where the protected-mode kernel was loaded. We know that this address is `0x100000`. This address is rounded up to the boundary given by the `kernel_alignment` field (two megabytes in our case), and the resulting value is compared with `LOAD_PHYSICAL_ADDR`:

- If this value is equal to or greater than `LOAD_PHYSICAL_ADDR`, we leave it as is.
- Otherwise, we put the value of `LOAD_PHYSICAL_ADDR` (which is `0x1000000`) into the `ebx` register.

At this moment, we have the pointer to the beginning of the buffer where the kernel image is relocated and decompressed in the `ebx` register.
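
The same calculation can be written as a short Python function. It models the `CONFIG_RELOCATABLE` branch of the code above; the function name is made up, and the two-megabyte alignment is simply the value used in our example:

```python
LOAD_PHYSICAL_ADDR = 0x1000000  # default value mentioned above


def decompression_base(load_address, kernel_alignment):
    """Round the load address up to the alignment, with LOAD_PHYSICAL_ADDR as a floor."""
    mask = kernel_alignment - 1               # decl %eax
    aligned = (load_address + mask) & ~mask   # addl / notl / andl
    return max(aligned, LOAD_PHYSICAL_ADDR)   # cmpl / jae / movl


print(hex(decompression_base(0x100000, 0x200000)))  # 0x1000000
```

For our load address, the rounded value `0x200000` is below `LOAD_PHYSICAL_ADDR`, so the default address wins.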

The last two lines are the most interesting. Using them, the kernel calculates the address to which the compressed kernel image and the decompressor are moved for safe in-place decompression. At first, we add `BP_init_size` to the `ebx` register. `BP_init_size` is the larger of two values: the size of the uncompressed kernel image code (from `_text` to `_end`) and the size of the kernel setup code + compressed kernel image + decompressor code. At this moment, the `ebx` register points to the end of the decompression buffer. On the last line of the code, we move this pointer back to the new place of the `startup_32` symbol within the decompression buffer.

As a result, we get something like this:

![kernel-relocation](./images/kernel-relocation-2.svg)

The decompressor code decompresses the compressed kernel image starting from the beginning of the buffer and gradually overwrites the compressed kernel image. As mentioned above, the size of the gap between the beginning of the decompression buffer and `startup_32` must be safe enough not to overwrite still-compressed parts of the image with the decompressed ones. The calculation of this gap highly depends on the compression method the kernel uses and is encoded in `BP_init_size`. Here I will skip all the details about this calculation, but if you are interested, you can find more details in the comment located in the [arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/header.S) file.
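
To get a feeling for the numbers, here is the last part of the calculation as a Python sketch. Both sizes below are invented for the example; the real values come from the `init_size` field of the boot header and from the `_end` symbol of the compressed image:

```python
def relocation_target(buffer_start, init_size, rva_end):
    """Address where startup_32 lands after the copy: buffer end minus image size."""
    buffer_end = buffer_start + init_size   # addl BP_init_size(%esi), %ebx
    return buffer_end - rva_end             # subl $ rva(_end), %ebx


# Hypothetical sizes: a 64 MiB init_size and a 12 MiB compressed image.
target = relocation_target(0x1000000, 64 << 20, 12 << 20)
gap = target - 0x1000000
print(hex(target), hex(gap))
```

The `gap` value is the room left at the beginning of the buffer for the decompressed output before it can reach the relocated image.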

### Enabling PAE mode

The next step before the kernel can switch the processor into the long mode is to set up the so-called [`PAE`](https://en.wikipedia.org/wiki/Physical_Address_Extension) mode:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L167-L170 -->
```assembly
	/* Enable PAE mode */
	movl	%cr4, %eax
	orl	$X86_CR4_PAE, %eax
	movl	%eax, %cr4
```

The kernel does it by setting the `X86_CR4_PAE` bit in the `cr4` [control register](https://en.wikipedia.org/wiki/Control_register). This tells the processor that the page table entries are enlarged from `32` to `64` bits. We will see this process soon.

## Set up paging

At this moment, we have almost finished the preparations needed to switch the processor into 64-bit long mode. The next crucial step is to build [page tables](https://en.wikipedia.org/wiki/Page_table). But before we take a look at the process of page table setup, let's try to briefly understand what it is.

In protected mode, each memory access is interpreted through a segment descriptor stored in the Global Descriptor Table. The situation changes significantly in long mode.

In 64-bit mode, segmentation is disabled. The base and limit fields of most segment descriptors are ignored, and the processor treats the address space as a flat linear range. Of course, code, data, and stack segments still exist, but only formally. The processor still requires valid segment selectors, but they no longer perform address translation in the traditional sense.

Instead, memory translation in long mode relies almost entirely on the mechanism called `paging`.

Each program now operates with addresses that are called `virtual`. When a program references a virtual address, the processor interprets the address as a 64-bit linear address and translates it through the multi-level structure called page tables.

> [!NOTE]
> Modern x86_64 processors support five-level paging, but we will skip it in this post and focus on four-level paging.

Let’s briefly see what happens when the processor needs to translate a virtual address into a physical one.

In four-level paging mode, a virtual address is 64 bits long. However, only the lower `48` bits are actually used for translation to a physical address. These `48` bits are divided into several parts:

![early-page-table.svg](./images/early-page-table.svg)

Each group of `9` bits selects an entry in one level of the page-table hierarchy. Since `9` bits can represent `512` values, each page table contains exactly `512` entries. Each entry of a page table occupies `8` bytes, so a single page table fits into one 4-kilobyte page.

When the processor translates a virtual address, it performs the following steps:

1. It reads the `cr3` control register to obtain the physical address of the top-level page table called `PML4`.
2. It extracts bits `47–39` of the virtual address and uses them as an index of the `PML4` page table.
3. The selected `PML4` entry contains the physical address of the next-level table called `PDPT`.
4. Bits `38–30` are selected to find an entry in the `PDPT`.
5. Bits `29–21` are selected to find an entry in the `PD`.
6. Bits `20–12` select an entry in the `PT`.
7. Bits `11–0` provide the offset inside the resulting physical page.
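
The bit ranges from the steps above translate directly into shifts and masks. A minimal Python sketch (the function name is hypothetical) that splits an address into the four indices and the page offset could look like this:

```python
def split_virtual_address(vaddr):
    """Return the four table indices and the page offset of a 48-bit address."""
    return {
        "pml4": (vaddr >> 39) & 0x1ff,   # bits 47-39
        "pdpt": (vaddr >> 30) & 0x1ff,   # bits 38-30
        "pd": (vaddr >> 21) & 0x1ff,     # bits 29-21
        "pt": (vaddr >> 12) & 0x1ff,     # bits 20-12
        "offset": vaddr & 0xfff,         # bits 11-0
    }


print(split_virtual_address(0x100000))
```

For the address `0x100000`, all indices are zero except the `PT` index, which is `256`.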

In addition to the physical address of the next-level table, each page table entry contains flags. Most of them occupy the lowest `12` bits of the entry, while `NX` is stored in the highest bit. These flags are:

| Flag  | Name                     | Description                                                                                                                                        |
|-------|--------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|
| `P`   | Present                  | Indicates whether the page or page table entry is valid and exists in memory. If cleared, accessing the corresponding address causes a page fault. |
| `RW`  | Read/Write               | Determines whether write operations are permitted. If cleared, the page is read-only; if set, writes are allowed (subject to privilege rules).     |
| `US`  | User/Supervisor          | Controls privilege-level access. If cleared, the page is accessible only in supervisor mode. If set, it may also be accessed from user mode.       |
| `PWT` | Page-Level Write-Through | Controls the caching policy. If set, write-through caching is used; otherwise, write-back caching is typically applied.                            |
| `PCD` | Page Cache Disable       | Disables caching for the referenced page when set. Commonly used for memory-mapped I/O regions.                                                    |
| `A`   | Accessed                 | Set automatically by the processor when the page-table entry is used during address translation. Useful for page replacement decisions.            |
| `D`   | Dirty                    | Set automatically by the processor when a write operation occurs to a mapped page. Indicates that the page has been modified.                      |
| `PS`  | Page Size                | Determines whether the entry maps a large page (e.g., 2 MiB or 1 GiB) instead of pointing to a lower-level page table.                             |
| `NX`  | No-Execute               | Prevents instruction execution from the referenced page when set. Used to enforce executable/non-executable memory protections.                    |
       
You might wonder how an 8-byte entry can contain both the physical address of the next-level page table and flags at the same time. The reason is that each page table is aligned on a four-kilobyte boundary. As a result, the lower 12 bits of its physical address are always zero, so these 12 bits are used to store the flags instead.
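
Packing an address and flags into one entry, and taking them apart again, can be sketched as follows. The bit positions follow the x86_64 page table entry format, and the helper names are assumptions made for this illustration:

```python
FLAG_BITS = {"P": 0, "RW": 1, "US": 2, "PWT": 3, "PCD": 4,
             "A": 5, "D": 6, "PS": 7, "NX": 63}
ADDRESS_MASK = 0x000ffffffffff000  # bits 12-51 hold the physical address


def make_entry(table_address, flags):
    """Combine a 4 KiB aligned physical address with the named flags."""
    assert table_address & 0xfff == 0, "tables are 4 KiB aligned"
    entry = table_address
    for name in flags:
        entry |= 1 << FLAG_BITS[name]
    return entry


def split_entry(entry):
    """Return the physical address and the list of flags set in an entry."""
    flags = [name for name, bit in FLAG_BITS.items() if entry & (1 << bit)]
    return entry & ADDRESS_MASK, flags


entry = make_entry(0x1235000, ["P", "RW", "US"])
print(hex(entry), split_entry(entry))
```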

Now that we know how the processor translates a virtual address to a physical address using paging, it is time to take a look at the structure of page tables.

A page table in x86_64 is a four-kilobyte memory area that contains 512 entries. Each entry occupies `8` bytes. In four-level paging mode with four-kilobyte pages, four such tables participate in the translation of a virtual address:

| Level | Name   | Description                                                                                                                 |
|-------|--------|-----------------------------------------------------------------------------------------------------------------------------|
| 4     | `PML4` | The top-level page table. Each entry points to a Page Directory Pointer Table (`PDPT`).                                     |
| 3     | `PDPT` | The third-level table. Each entry points to a Page Directory (`PD`) or, if the `PS` bit is set, directly maps a 1 GiB page. |
| 2     | `PD`   | The second-level table. Each entry points to a Page Table (`PT`) or, if the `PS` bit is set, directly maps a 2 MiB page.    |
| 1     | `PT`   | The first-level table. Each entry points directly to a 4 KiB physical memory page.                                          |

Each table has the same internal structure. The only difference between them is how their entries are interpreted. As we already know, an entry in a page table is 64 bits wide. It contains two types of information:

- A physical address of either the next-level page table or a physical memory page
- A set of control flags that define access permissions and status information 

If you are interested in this topic, you can find more information about page tables and page table entries structure in the [Intel® 64 and IA-32 Architectures Software Developer Manuals](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html).

Now that we know a little about paging, we can return to the kernel and update our knowledge by looking at the real code. Now we will see how the kernel builds the early page table to switch to long mode. But before we jump directly to the code, we need to remember one important thing. The kernel will be relocated to the address stored in the `ebx` register, as seen above. So the addresses of all structures, including the page tables, must be calculated relative to this address.

The page table structure for boot is defined in the same source code file and looks like this:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L531-L533 -->
```assembly
	.section ".pgtable","aw",@nobits
	.balign 4096
SYM_DATA_LOCAL(pgtable,		.fill BOOT_PGT_SIZE, 1, 0)
```

The kernel needs to fill this structure with the proper page table entries for early 64-bit code. First of all, it fills the whole memory area occupied by the page tables with zeros for safety:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L200-L203 -->
```assembly
	leal	rva(pgtable)(%ebx), %edi
	xorl	%eax, %eax
	movl	$(BOOT_INIT_PGT_SIZE/4), %ecx
	rep	stosl
```

At the beginning, we load the address of the start of the page table area into the `edi` register. After this, the kernel fills the memory area that will be occupied by the page tables with zeros. The boot page table will have the following structure:

- 1 level4 table
- 1 level3 table
- 4 level2 tables that map everything with 2M pages

After the kernel clears the memory region reserved for the page tables, it starts populating it with entries. At the start, it fills the first and single entry of the top-level page table. The following snippet shows this:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L206-L209 -->
```assembly
	leal	rva(pgtable + 0)(%ebx), %edi
	leal	0x1007 (%edi), %eax
	movl	%eax, 0(%edi)
	addl	%edx, 4(%edi)
```

In the code above, the kernel fills the first entry of the top-level page table with the address of the next-level page table, which is located at the `pgtable + 0x1000` address and has `0x7` flags. In our case, the flags `0x7` are:

- Present
- Read/Write
- User

In the next step, the kernel builds four `Page Directory` entries in the `Page Directory Pointer` table with the same `Present+Read/Write/User` flags:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L212-L220 -->
```assembly
	leal	rva(pgtable + 0x1000)(%ebx), %edi
	leal	0x1007(%edi), %eax
	movl	$4, %ecx
1:	movl	%eax, 0x00(%edi)
	addl	%edx, 0x04(%edi)
	addl	$0x00001000, %eax
	addl	$8, %edi
	decl	%ecx
	jnz	1b
```

In the code above, we can see the filling of the first four entries of the 3rd-level page table. The first entry of the 3rd-level page table is located at the offset `0x1000` from the beginning of the top-level page table. The value of the `eax` register is similar to the 4th-level page table entry, with the difference that now it points to the 2nd-level page table. Next, the kernel fills the four entries of the 3rd-level page table in a loop that runs until the value of the `ecx` register reaches zero. As soon as these page table entries are filled, the kernel proceeds to the next-level page table:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L223-L231 -->
```assembly
	leal	rva(pgtable + 0x2000)(%ebx), %edi
	movl	$0x00000183, %eax
	movl	$2048, %ecx
1:	movl	%eax, 0(%edi)
	addl	%edx, 4(%edi)
	addl	$0x00200000, %eax
	addl	$8, %edi
	decl	%ecx
	jnz	1b
```

Here the kernel fills four page directory tables, `2048` entries in total. The first entry is located at the offset `0x2000` from the beginning of the top-level page table. Each entry maps a two-megabyte chunk of memory with the flags `0x183`:

- Present
- Read/Write
- Large Page
- Global

Compared with the entries of the upper-level tables, the `User` flag is not set here, and two additional flags appear. The `Large Page` flag tells the processor to use two-megabyte pages, and the `Global` flag tells it to keep the [TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer) entry across reloads of the `cr3` register.

There is no need to populate the lowest-level page tables yet. Every entry in the 2nd-level page directory has the `Large Page` bit set, which means each entry directly maps a two-megabyte region of physical memory. During the address translation, the page-walk procedure stops at the 2nd-level page table, and the lower `21` bits of the virtual address are used as the offset inside that two-megabyte page.
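
To convince ourselves that these tables really describe an identity mapping of the first four gigabytes, we can rebuild them in Python and walk them the way the processor does. This is only a model under a few assumptions: the helper names and the `pgtable` address in the usage note are made up, and the value that the kernel adds from the `edx` register to the upper half of each entry is assumed to be zero:

```python
import struct

PAGE_SIZE = 0x1000
ADDRESS_MASK = 0x000ffffffffff000
P, RW, US, PS, G = 1 << 0, 1 << 1, 1 << 2, 1 << 7, 1 << 8


def build_boot_page_tables(pgtable):
    """Return six 4 KiB tables filled like the early boot page table.

    `pgtable` is the physical address of the first table. The value the
    kernel adds from `edx` to the upper half of each entry is assumed zero.
    """
    mem = bytearray(6 * PAGE_SIZE)  # already zeroed, like after `rep stosl`

    def put(offset, entry):
        struct.pack_into("<Q", mem, offset, entry)

    # Level 4: a single entry that points to the level 3 table.
    put(0, (pgtable + 0x1000) | P | RW | US)
    # Level 3: four entries, one per level 2 table.
    for i in range(4):
        put(0x1000 + 8 * i, (pgtable + 0x2000 + i * PAGE_SIZE) | P | RW | US)
    # Level 2: 2048 entries, each maps one 2 MiB page (flags 0x183).
    for i in range(2048):
        put(0x2000 + 8 * i, (i * 0x200000) | P | RW | PS | G)
    return mem


def translate(mem, pgtable, vaddr):
    """Walk the tables in `mem`; return the physical address or None."""
    table = pgtable
    for shift in (39, 30, 21):
        index = (vaddr >> shift) & 0x1ff
        entry = struct.unpack_from("<Q", mem, table - pgtable + 8 * index)[0]
        if not entry & P:
            return None  # a real CPU would raise a page fault here
        if entry & PS:
            # Large page: the low 21 bits of the address are the offset.
            return (entry & ADDRESS_MASK & ~0x1fffff) | (vaddr & 0x1fffff)
        table = entry & ADDRESS_MASK
    return None  # 4 KiB pages are not used by these tables
```

With a hypothetical `pgtable` address of `0x1500000`, calling `translate(build_boot_page_tables(0x1500000), 0x1500000, 0x100000)` gives back `0x100000`, while any address at or above four gigabytes is not mapped and yields `None`.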

The page tables are now fully prepared. Before paging can be enabled, the processor must know where the top-level page table resides. As we know, this is done by loading the physical address of the top-level page table into the `cr3` control register:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L234-L235 -->
```assembly
	leal	rva(pgtable)(%ebx), %eax
	movl	%eax, %cr3
```

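From this moment, page tables that cover four gigabytes of memory are ready, and the processor knows where to find them. Paging itself is turned on a bit later, right before the jump to 64-bit code, when the kernel sets the paging bit in the `cr0` control register. The kernel is almost ready for the transition into long mode.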

## The transition into 64-bit mode

Only the last steps remain before the Linux kernel can switch the processor into long mode. The first one is setting the `EFER.LME` flag in the Extended Feature Enable Register, a [model-specific register](http://en.wikipedia.org/wiki/Model-specific_register) with the number `0xC0000080`:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L238-L241 -->
```assembly
	movl	$MSR_EFER, %ecx
	rdmsr
	btsl	$_EFER_LME, %eax
	wrmsr
```

This is the `Long Mode Enable` bit, and it is mandatory to set this bit to enable long mode.
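The read-modify-write sequence above can be modelled in plain C. The `rdmsr` instruction reads the register selected by `ecx` into `edx:eax`, `btsl` sets one bit in `eax`, and `wrmsr` writes the result back. The sketch below only models the bit manipulation; the names are illustrative and no real model-specific register is touched.

```c
#include <stdint.h>

#define MSR_EFER_NUMBER 0xC0000080u /* number of the Extended Feature Enable Register */
#define EFER_LME_BIT    8           /* position of the Long Mode Enable bit */

/* Model of `btsl $_EFER_LME, %eax`: return the register value with bit 8 set. */
uint64_t efer_with_long_mode(uint64_t efer)
{
	return efer | (1ULL << EFER_LME_BIT);
}
```

All other bits of the register are preserved, which is why the kernel reads the current value first instead of writing a constant.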

In the next step, we can see the preparation for the jump to the long mode entrypoint. To do this jump, the kernel pushes the kernel code segment selector along with the address of the long mode entrypoint onto the stack:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L264-L266 -->
```assembly
	leal	rva(startup_64)(%ebp), %eax
	pushl	$__KERNEL_CS
	pushl	%eax
```

Since the stack now contains the kernel code segment selector and the address of the entrypoint, the kernel executes the last instruction in protected mode:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L273-L273 -->
```assembly
	lret
```
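The order of the two `pushl` instructions matters because a far return pops the instruction pointer first and the code segment selector second. Below is a toy C model of this behaviour. The types and function names are hypothetical, and the selector value used in the example is only a sample.

```c
#include <stddef.h>
#include <stdint.h>

struct far_target {
	uint32_t offset;   /* value loaded into the instruction pointer */
	uint16_t selector; /* value loaded into the cs register */
};

/* A tiny last-in, first-out stack of 32-bit slots, as filled by pushl. */
struct stack32 {
	uint32_t slots[8];
	size_t top; /* index of the next free slot */
};

void push32(struct stack32 *s, uint32_t value)
{
	s->slots[s->top++] = value;
}

uint32_t pop32(struct stack32 *s)
{
	return s->slots[--s->top];
}

/* Model of a 32-bit far return: pop the offset first, then the selector. */
struct far_target far_return(struct stack32 *s)
{
	struct far_target target;

	target.offset = pop32(s);
	target.selector = (uint16_t)pop32(s);
	return target;
}

/* Replays the sequence from the text: push selector, push entry, far return. */
struct far_target simulate_jump(uint16_t selector, uint32_t entry)
{
	struct stack32 stack = { .top = 0 };

	push32(&stack, selector);
	push32(&stack, entry);
	return far_return(&stack);
}
```

Calling `simulate_jump(0x10, 0x100200)` yields the offset `0x100200` and the selector `0x10`, mirroring how the selector pushed first ends up in `cs`.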

The CPU pops the address of `startup_64`, the long mode entrypoint, from the stack and jumps there:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L276-L278 -->
```assembly
	.code64
	.org 0x200
SYM_CODE_START(startup_64)
```

The Linux kernel is now in 64-bit mode! 🎉

## Conclusion

This is the end of the fourth part about Linux kernel insides. If you have questions or suggestions, feel free to ping me on X - [0xAX](https://twitter.com/0xAX), drop me an [email](mailto:anotherworldofworld@gmail.com), or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

## Links

Here is the list of the links that you may find useful while reading this chapter:

- [Real mode](https://en.wikipedia.org/wiki/Real_mode)
- [Protected mode](http://en.wikipedia.org/wiki/Protected_mode)
- [Long mode](https://en.wikipedia.org/wiki/Long_mode)
- [Linux kernel x86 boot protocol](https://www.kernel.org/doc/Documentation/x86/boot.txt)
- [Intel® 64 and IA-32 Architectures Software Developer Manuals](https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html)
- [Paging](http://en.wikipedia.org/wiki/Paging)
- [Virtual addresses](https://en.wikipedia.org/wiki/Virtual_address_space)
- [Physical addresses](https://en.wikipedia.org/wiki/Physical_address)
- [Model specific registers](http://en.wikipedia.org/wiki/Model-specific_register)
- [Control registers](https://en.wikipedia.org/wiki/Control_register)
- [Previous part](linux-bootstrap-3.md)


================================================
FILE: Booting/linux-bootstrap-5.md
================================================
# Kernel booting process. Part 5

In the previous [part](./linux-bootstrap-4.md), we saw the transition from the [protected mode](https://en.wikipedia.org/wiki/Protected_mode) into [long mode](https://en.wikipedia.org/wiki/Long_mode), but what we have in memory is not yet the kernel image ready to run. We are still in the kernel setup code, which should decompress the kernel and pass control to it. The next step before we see the Linux kernel entrypoint is kernel decompression.

## First steps in the long mode

The point where we stopped in the previous chapter is the [lret](https://www.felixcloutier.com/x86/ret) instruction, which performed a "jump" to the `64-bit` entry point located in [arch/x86/boot/compressed/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/head_64.S):

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L276-L278 -->
```assembly
	.code64
	.org 0x200
SYM_CODE_START(startup_64)
```

This is the first 64-bit code that we see. Before decompression, the kernel must complete a few final steps. These steps are:

- Disabling the interrupts
- Unification of the segment registers
- Calculation of the kernel relocation address
- Reload of the Global Descriptor Table
- Load of the Interrupt Descriptor Table

We will see all of these steps in the next sections.

### Disabling the interrupts

The `64-bit` entrypoint starts with the same two instructions as the `32-bit` one:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L290-L291 -->
```assembly
	cld
	cli
```

As we already know from the previous part, the first instruction clears the [direction flag](https://en.wikipedia.org/wiki/Direction_flag) bit in the [flags](https://en.wikipedia.org/wiki/FLAGS_register) register, and the second instruction disables [interrupts](https://en.wikipedia.org/wiki/Interrupt).

Just as the bootloader can load the Linux kernel at the `32-bit` entrypoint instead of the [16-bit entry point](linux-bootstrap-1.md#the-beginning-of-the-kernel-setup-stage), it can also switch the processor into `64-bit` long mode by itself and start the kernel from the `64-bit` entry point.

The kernel executes these two instructions in case the bootloader didn't perform them before transferring control to the kernel. Clearing the `direction flag` ensures that memory copying operations proceed in the correct direction, and disabling interrupts prevents them from disrupting the kernel decompression process.

### Unification of the segment registers

After these two instructions are executed, the next step is to unify segment registers:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L294-L299 -->
```assembly
	xorl	%eax, %eax
	movl	%eax, %ds
	movl	%eax, %es
	movl	%eax, %ss
	movl	%eax, %fs
	movl	%eax, %gs
```

Segmentation is effectively disabled in long mode (the processor treats the bases of `ds`, `es`, and `ss` as zero), so the kernel simply resets these segment registers to zero.

### Calculation of the kernel relocation address

The next step is to compute the difference between the location the kernel was compiled to be loaded at and the location where it is actually loaded:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L315-L331 -->
```assembly
#ifdef CONFIG_RELOCATABLE
	leaq	startup_32(%rip) /* - $startup_32 */, %rbp
	movl	BP_kernel_alignment(%rsi), %eax
	decl	%eax
	addq	%rax, %rbp
	notq	%rax
	andq	%rax, %rbp
	cmpq	$LOAD_PHYSICAL_ADDR, %rbp
	jae	1f
#endif
	movq	$LOAD_PHYSICAL_ADDR, %rbp
1:

	/* Target address to relocate to for decompression */
	movl	BP_init_size(%rsi), %ebx
	subl	$ rva(_end), %ebx
	addq	%rbp, %rbx
```

This operation is very similar to what we have seen already in the [Calculation of the kernel relocation address](./linux-bootstrap-4.md#calculation-of-the-kernel-relocation-address) section of the previous chapter.

> [!TIP]
> It is highly recommended to read carefully [Calculation of the kernel relocation address](./linux-bootstrap-4.md#calculation-of-the-kernel-relocation-address) before trying to understand this code.

This piece of code is almost a 1:1 copy of what we have seen in protected mode. If you understood it back then, you shouldn't have any problems understanding it now. The main purpose of this code is to set up the `rbp` and `rbx` registers: `rbp` holds the base address where the kernel will be decompressed, and `rbx` holds the address where the kernel image with the decompressor code should be relocated for safe decompression.

The only difference from the protected mode code is that now the kernel can use `rip`-based addressing to get the address of `startup_32`. So it does not need to do magic tricks with the `call` and `popl` instructions like in protected mode. All the rest is the same as what we have already seen in the previous chapter, and it is repeated here for one reason: if the bootloader starts the kernel from the `64-bit` entry point, the protected mode code is skipped.
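The arithmetic performed by the assembly above is easier to follow in C. The sketch below models the `CONFIG_RELOCATABLE` case. The function names are hypothetical, the alignment is assumed to be a power of two, and `end_rva` stands for the value of `rva(_end)`.

```c
#include <stdint.h>

/*
 * Round the runtime load address up to the kernel alignment and never go
 * below the compile-time load address. The result is what ends up in rbp.
 */
uint64_t decompress_base(uint64_t load_addr, uint32_t kernel_alignment,
			 uint64_t load_physical_addr)
{
	uint64_t mask = (uint64_t)kernel_alignment - 1; /* decl %eax */
	uint64_t base = (load_addr + mask) & ~mask;     /* addq, notq, andq */

	/* cmpq $LOAD_PHYSICAL_ADDR, %rbp followed by jae 1f */
	return base >= load_physical_addr ? base : load_physical_addr;
}

/*
 * Address to relocate to for decompression. The result is what ends up in
 * rbx: the subtraction is done on 32-bit values, then the base is added.
 */
uint64_t relocation_target(uint64_t base, uint32_t init_size, uint32_t end_rva)
{
	return base + (uint32_t)(init_size - end_rva);
}
```

With sample values, a kernel loaded at `0x2100000` with the alignment `0x200000` gets the base `0x2200000`, while a kernel loaded at `0x100000` falls back to the `0x1000000` default.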

After these addresses are obtained, the kernel sets up the stack for the decompressor code:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L334-L334 -->
```assembly
	leaq	rva(boot_stack_end)(%rbx), %rsp
```

### Reload of the Global Descriptor Table

The next step is to set up a new Global Descriptor Table. Yes, one more time 😊 There are at least two reasons to do this:

1. The bootloader can load the Linux kernel starting from the `64-bit` entrypoint, and the kernel needs to set up its own Global Descriptor Table in case the one from the bootloader is not suitable.
2. The kernel might be configured with support for the [5-level](https://en.wikipedia.org/wiki/Intel_5-level_paging) paging, and in this case, the kernel needs to jump to `32-bit` mode again to set it safely.

The "new" Global Descriptor Table has the same entries but is pointed to by the `gdt64` symbol:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L489-L493 -->
```assembly
	.data
SYM_DATA_START_LOCAL(gdt64)
	.word	gdt_end - gdt - 1
	.quad   gdt - gdt64
SYM_DATA_END(gdt64)
```

The only difference is that in `64-bit` mode the `lgdt` instruction takes a `10`-byte operand (a 2-byte limit and an 8-byte base address), while in `32-bit` mode the operand is `6` bytes. To load the new Global Descriptor Table, the kernel writes its address to the `GDTR` register using the `lgdt` instruction:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L357-L368 -->
```assembly
	/* Make sure we have GDT with 32-bit code segment */
	leaq	gdt64(%rip), %rax
	addq	%rax, 2(%rax)
	lgdt	(%rax)

	/* Reload CS so IRET returns to a CS actually in the GDT */
	pushq	$__KERNEL_CS
	leaq	.Lon_kernel_cs(%rip), %rax
	pushq	%rax
	lretq

.Lon_kernel_cs:
```
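The layout of the `lgdt` operand and the `addq %rax, 2(%rax)` fix-up can be sketched in C. The structure and function names are hypothetical, and the `packed` attribute is a GCC/Clang extension used here to get the exact byte layout.

```c
#include <stdint.h>

/* Operand of lgdt in 64-bit mode: a 2-byte limit followed by an 8-byte base. */
struct gdt_ptr64 {
	uint16_t limit;
	uint64_t base;
} __attribute__((packed));

/* The same operand in 32-bit mode: a 2-byte limit followed by a 4-byte base. */
struct gdt_ptr32 {
	uint16_t limit;
	uint32_t base;
} __attribute__((packed));

/*
 * Model of `addq %rax, 2(%rax)`: the base field initially holds the offset
 * of the table relative to the descriptor (gdt - gdt64). Adding the runtime
 * address of the descriptor turns it into the runtime address of the table.
 */
void gdt_ptr_fixup(struct gdt_ptr64 *ptr, uint64_t runtime_addr)
{
	ptr->base += runtime_addr;
}

/* Helper for experiments: the base after the fix-up. */
uint64_t fixed_up_base(uint64_t link_time_offset, uint64_t runtime_addr)
{
	struct gdt_ptr64 ptr = { .limit = 0x27, .base = link_time_offset };

	gdt_ptr_fixup(&ptr, runtime_addr);
	return ptr.base;
}
```

This also shows why the fix-up is needed at all: the kernel does not know at build time where it will be loaded, so only the relative offset can be stored in the image.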

### Load of the Interrupt Descriptor Table

After the new Global Descriptor Table is loaded, the next step is to load the new `Interrupt Descriptor Table`:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L369-L376 -->
```assembly
	/*
	 * RSI holds a pointer to a boot_params structure provided by the
	 * loader, and this needs to be preserved across C function calls. So
	 * move it into a callee saved register.
	 */
	movq	%rsi, %r15

	call	load_stage1_idt
```

The `load_stage1_idt` function is defined in [arch/x86/boot/compressed/idt_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/idt_64.c) and uses the `lidt` instruction to load the address of the new `Interrupt Descriptor Table`. For now, the `Interrupt Descriptor Table` has `NULL` entries to avoid handling the interrupts. As you may remember, the interrupts are disabled at this moment anyway. The valid interrupt handlers will be loaded after kernel relocation.

The next steps after this are closely related to the setup of `5-level` paging, if it is configured using the `CONFIG_PGTABLE_LEVELS=5` kernel configuration option. This feature extends the virtual address space beyond the traditional 4-level paging scheme, but it is still relatively uncommon in practice and not essential for understanding the mainline boot flow. As mentioned in the [previous chapter](./linux-bootstrap-4.md), for clarity and focus, we’ll set it aside and continue with the standard 4-level paging case.

### Kernel relocation

Since the calculation of the base address for the kernel relocation is done, the kernel setup code can copy the compressed kernel image and the decompressor code to the memory area pointed by this address:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L419-L425 -->
```assembly
	leaq	(_bss-8)(%rip), %rsi
	leaq	rva(_bss-8)(%rbx), %rdi
	movl	$(_bss - startup_32), %ecx
	shrl	$3, %ecx
	std
	rep	movsq
	cld
```

The set of assembly instructions above copies the compressed kernel image and decompressor code to the memory area, which starts at the address pointed to by the `rbx` register. The code above copies the memory contents starting from `_bss-8` down to the `startup_32` symbol, which includes:

- `32-bit` kernel setup code
- compressed kernel image 
- decompressor code

Because of the `std` instruction, the copying is performed in the backward order, from higher memory addresses to the lower.
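A likely reason for copying backward is that the destination lies above the source and the two regions may overlap, so a forward copy could overwrite data before it is read. The following C sketch models `std; rep movsq` under this assumption; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy `count` 8-byte words from the last one to the first, like std + rep movsq. */
void copy_qwords_backward(uint64_t *dst, const uint64_t *src, size_t count)
{
	while (count > 0) {
		count--;
		dst[count] = src[count];
	}
}

/*
 * Helper for experiments: move four words two slots up inside one buffer,
 * so that source and destination overlap, and return the word at `index`.
 */
uint64_t overlap_demo(size_t index)
{
	uint64_t buf[6] = { 1, 2, 3, 4, 0, 0 };

	copy_qwords_backward(&buf[2], &buf[0], 4);
	return buf[index];
}
```

After the copy, the buffer holds `1, 2, 1, 2, 3, 4`: the four original words arrive intact at their new place even though the regions overlap.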

After the copying is performed, the kernel needs to reload the previously loaded `Global Descriptor Table` in case it was overwritten or corrupted during the copy procedure:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L432-L435 -->
```assembly
	leaq	rva(gdt64)(%rbx), %rax
	leaq	rva(gdt)(%rbx), %rdx
	movq	%rdx, 2(%rax)
	lgdt	(%rax)
```

And finally, the kernel jumps to the relocated code:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L440-L441 -->
```assembly
	leaq	rva(.Lrelocated)(%rbx), %rax
	jmp	*%rax
```

## The last actions before the kernel decompression

In the previous section, we saw the kernel relocation. The very first task after this jump is to clear the `.bss` section. This step is needed because the `.bss` section holds all uninitialized global and static variables. By definition, they must be initialized with zeros in `C` code. By clearing it, the kernel ensures that all the following code, including the decompressor, begins with a proper `.bss` memory area without any possible garbage in it.

The following code does that:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L450-L455 -->
```assembly
	xorl	%eax, %eax
	leaq    _bss(%rip), %rdi
	leaq    _ebss(%rip), %rcx
	subq	%rdi, %rcx
	shrq	$3, %rcx
	rep	stosq
```

The assembly code above should be pretty easy to understand if you read the previous parts. It clears the value of the `eax` register and uses its value to fill the memory region of the `.bss` section between the `_bss` and `_ebss` symbols.

In the next step, the kernel fills the new `Interrupt Descriptor Table` with the call:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L457-L457 -->
```assembly
	call	load_stage2_idt
```

This function is defined in [arch/x86/boot/compressed/idt_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/idt_64.c) and looks like this:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/idt_64.c#L59-L78 -->
```C
void load_stage2_idt(void)
{
	boot_idt_desc.address = (unsigned long)boot_idt;

	set_idt_entry(X86_TRAP_PF, boot_page_fault);
	set_idt_entry(X86_TRAP_NMI, boot_nmi_trap);

#ifdef CONFIG_AMD_MEM_ENCRYPT
	/*
	 * Clear the second stage #VC handler in case guest types
	 * needing #VC have not been detected.
	 */
	if (sev_status & BIT(1))
		set_idt_entry(X86_TRAP_VC, boot_stage2_vc);
	else
		set_idt_entry(X86_TRAP_VC, NULL);
#endif

	load_boot_idt(&boot_idt_desc);
}
```

We can skip the part of the code wrapped with `CONFIG_AMD_MEM_ENCRYPT` as it is not of main interest for us right now, but let's try to understand the rest of the function's body. It is similar to the first stage of the `Interrupt Descriptor Table`. It loads the entries of this table using the `lidt` instruction, which we have already seen before. The only difference is that it sets up two interrupt handlers:

- `PF` - Page fault interrupt handler
- `NMI` - Non-maskable interrupt handler

The first interrupt handler is set because the `initialize_identity_maps` function (which we will see very soon) may trigger a page fault exception. This exception can be triggered, for example, when [Address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization) is enabled and the randomly chosen physical and virtual addresses are not covered by any entry in the page tables.

The second interrupt handler is needed to prevent a triple fault if a non-maskable interrupt arrives during kernel decompression. So at least a dummy NMI handler is needed.
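To give an idea of what installing a handler means, here is a sketch of a 64-bit IDT gate descriptor and a function that fills it. This is not the kernel's `set_idt_entry`: the names are hypothetical, the field layout follows the x86-64 gate descriptor format, and the gate type chosen here (trap gate) as well as the selector value are assumptions. The `packed` attribute is a GCC/Clang extension.

```c
#include <stdint.h>

/* 64-bit IDT gate descriptor: 16 bytes. */
struct idt_gate64 {
	uint16_t offset_low;  /* handler address, bits 15..0 */
	uint16_t selector;    /* code segment selector */
	uint8_t  ist;         /* interrupt stack table index, 0 means none */
	uint8_t  type_attr;   /* present bit, privilege level, gate type */
	uint16_t offset_mid;  /* handler address, bits 31..16 */
	uint32_t offset_high; /* handler address, bits 63..32 */
	uint32_t reserved;
} __attribute__((packed));

#define VECTOR_NMI 2  /* non-maskable interrupt */
#define VECTOR_PF  14 /* page fault */
#define GATE_PRESENT_TRAP 0x8F /* present, privilege level 0, type 0xF (trap gate) */

/* Store the handler address, split into three parts, in the gate for `vector`. */
void set_gate(struct idt_gate64 *table, unsigned int vector,
	      uint64_t handler, uint16_t selector)
{
	struct idt_gate64 *gate = &table[vector];

	gate->offset_low = handler & 0xffff;
	gate->selector = selector;
	gate->ist = 0;
	gate->type_attr = GATE_PRESENT_TRAP;
	gate->offset_mid = (handler >> 16) & 0xffff;
	gate->offset_high = (uint32_t)(handler >> 32);
	gate->reserved = 0;
}

/* Reassemble the handler address from a gate. */
uint64_t gate_handler(const struct idt_gate64 *gate)
{
	return (uint64_t)gate->offset_low |
	       ((uint64_t)gate->offset_mid << 16) |
	       ((uint64_t)gate->offset_high << 32);
}

/* Helper for experiments: install a handler and read its address back. */
uint64_t gate_roundtrip(unsigned int vector, uint64_t handler)
{
	static struct idt_gate64 table[32];

	set_gate(table, vector, handler, 0x10);
	return gate_handler(&table[vector]);
}
```

An entry that is left zeroed has the present bit cleared, so the processor cannot dispatch through it. That is what makes a missing NMI handler dangerous.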

After the `Interrupt Descriptor Table` is re-loaded, the `initialize_identity_maps` function is called:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L460-L461 -->
```assembly
	movq	%r15, %rdi
	call	initialize_identity_maps
```

This function is defined in [arch/x86/boot/compressed/ident_map_64.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/ident_map_64.c) and clears the memory area for the top-level page table identified by the `top_level_pgt` pointer to initialize a new page table. Yes, the kernel needs to initialize page tables one more time, even though we have seen the initialization and setup of the early page tables in the [previous chapter](./linux-bootstrap-4.md#set-up-paging). The reason for "one more" page table is that if the kernel was loaded at the `64-bit` entrypoint, it uses the page table built by the bootloader. Since the kernel was relocated to a new place, the decompressor code can overwrite these page tables during decompression.

The new page table is built in a very similar way to the [previous page table](./linux-bootstrap-4.md#set-up-paging). Each [virtual address](https://en.wikipedia.org/wiki/Virtual_address_space) directly corresponds to the same [physical address](https://en.wikipedia.org/wiki/Physical_address). That is why it is called the identity mapping.

Now let's take a look at the implementation of this function. It starts by initializing an instance of the `x86_mapping_info` structure called `mapping_info`:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/ident_map_64.c#L119-L122 -->
```C
	mapping_info.alloc_pgt_page = alloc_pgt_page;
	mapping_info.context = &pgt_data;
	mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC | sme_me_mask;
	mapping_info.kernpg_flag = _KERNPG_TABLE;
```

This structure provides information about memory mappings and a callback to allocate space for page table entries. The `context` field is used for tracking the allocated page tables. The `page_flag` and `kernpg_flag` fields define various page attributes (such as `present`, `writable`, or `executable`), which are reflected in their names.

In the next step, the kernel reads the address of the top-level page table from the `cr3` [control register](https://en.wikipedia.org/wiki/Control_register) and compares it with the `_pgtable`. If you read the previous chapter, you remember that `_pgtable` is the page table initialized by the early kernel setup code before switching to long mode. If we came from the `startup_32`, and it is exactly our case, the `cr3` register contains the same address as `_pgtable`. In this case, the kernel reuses and extends this page table:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/ident_map_64.c#L142-L146 -->
```C
	top_level_pgt = read_cr3_pa();
	if (p4d_offset((pgd_t *)top_level_pgt, 0) == (p4d_t *)_pgtable) {
		pgt_data.pgt_buf = _pgtable + BOOT_INIT_PGT_SIZE;
		pgt_data.pgt_buf_size = BOOT_PGT_SIZE - BOOT_INIT_PGT_SIZE;
		memset(pgt_data.pgt_buf, 0, pgt_data.pgt_buf_size);
```

Otherwise, the new page table is built:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/ident_map_64.c#L147-L152 -->
```C
	} else {
		pgt_data.pgt_buf = _pgtable;
		pgt_data.pgt_buf_size = BOOT_PGT_SIZE;
		memset(pgt_data.pgt_buf, 0, pgt_data.pgt_buf_size);
		top_level_pgt = (unsigned long)alloc_pgt_page(&pgt_data);
	}
```
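The `pgt_buf` and `pgt_buf_size` fields suggest a simple allocation scheme: page table pages are handed out one after another from a fixed buffer. The sketch below models such a bump allocator. It is a simplified illustration with hypothetical names, not the kernel's `alloc_pgt_page` implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define PGT_PAGE_SIZE 4096u

/* Hypothetical bookkeeping, loosely modelled on the pgt_data fields above. */
struct pgt_pool {
	unsigned char *buf; /* start of the buffer reserved for page tables */
	size_t size;        /* size of the buffer in bytes */
	size_t offset;      /* how much of the buffer is already handed out */
};

/* Hand out the next 4 KiB page, or NULL if the buffer is exhausted. */
void *pool_alloc_page(struct pgt_pool *pool)
{
	if (pool->offset + PGT_PAGE_SIZE > pool->size)
		return NULL;

	void *page = pool->buf + pool->offset;

	pool->offset += PGT_PAGE_SIZE;
	return page;
}

/* Helper for experiments: how many pages fit into a pool of `pool_size` bytes. */
size_t pool_demo_pages(size_t pool_size)
{
	static unsigned char backing[4 * PGT_PAGE_SIZE];
	struct pgt_pool pool = { backing, 0, 0 };
	size_t pages = 0;

	pool.size = pool_size <= sizeof(backing) ? pool_size : sizeof(backing);
	while (pool_alloc_page(&pool) != NULL)
		pages++;
	return pages;
}
```

There is no way to free a page in this scheme, which is fine for the decompressor: the page tables are needed only until the kernel proper builds its own.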

At this stage, new identity mappings are added to cover the essential regions needed for the kernel to continue the boot process:

- the kernel image itself (from `_head` to `_end`)
- the boot parameters provided by the bootloader
- the kernel command line

All of the actual work is performed by the `kernel_add_identity_map` function defined in the same [file](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/ident_map_64.c):

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/ident_map_64.c#L161-L166 -->
```C
	kernel_add_identity_map((unsigned long)_head, (unsigned long)_end);
	boot_params_ptr = rmode;
	kernel_add_identity_map((unsigned long)boot_params_ptr,
				(unsigned long)(boot_params_ptr + 1));
	cmdline = get_cmd_line_ptr();
	kernel_add_identity_map(cmdline, cmdline + COMMAND_LINE_SIZE);
```

The `kernel_add_identity_map` function walks the page table hierarchy and ensures that there are existing page table entries which provide a 1:1 mapping into the virtual address space. If such entries do not exist, a new entry is allocated with the flags that we have seen during the initialization of the `mapping_info`.
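Since the mapping uses large pages, a byte range has to be widened to two-megabyte boundaries before it can be mapped. The sketch below shows this arithmetic under that assumption; the function names are hypothetical and the code does not touch any real page tables.

```c
#include <stdint.h>

#define LARGE_PAGE 0x200000ULL /* 2 MiB */

/* Round an address down to the start of its 2 MiB page. */
uint64_t round_down_2m(uint64_t addr)
{
	return addr & ~(LARGE_PAGE - 1);
}

/* Round an address up to the next 2 MiB boundary. */
uint64_t round_up_2m(uint64_t addr)
{
	return (addr + LARGE_PAGE - 1) & ~(LARGE_PAGE - 1);
}

/* Number of 2 MiB pages needed to cover the byte range [start, end). */
uint64_t large_pages_for_range(uint64_t start, uint64_t end)
{
	uint64_t first = round_down_2m(start);
	uint64_t last = round_up_2m(end);

	return last > first ? (last - first) / LARGE_PAGE : 0;
}
```

For example, a small structure such as the boot parameters still costs one whole two-megabyte page, and a range that crosses a two-megabyte boundary costs two.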

After all the identity mapping page table entries were initialized, the kernel updates the `cr3` control register with the address of the top page table:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/ident_map_64.c#L183-L183 -->
```C
	write_cr3(top_level_pgt);
```

At this point, all the preparations needed to decompress the kernel image are done. Now the kernel decompressor code is ready to decompress the kernel:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L466-L475 -->
```assembly
	/* pass struct boot_params pointer and output target address */
	movq	%r15, %rdi
	movq	%rbp, %rsi
	call	extract_kernel		/* returns kernel entry point in %rax */

/*
 * Jump to the decompressed kernel.
 */
	movq	%r15, %rsi
	jmp	*%rax
```

After the kernel is decompressed, the last instructions of the decompressor code transfer control to the Linux kernel by jumping to the address of the kernel's entrypoint. The early setup phase is complete, and the Linux kernel starts its job 🎉

In the next section, let's see how the kernel decompression works.

## Kernel decompression

Right now, we are finally at the last point before we see the kernel entrypoint. The last remaining step is only to decompress the kernel and switch control to it.

The kernel decompression is performed by the `extract_kernel` function defined in [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c). This function starts with the video mode and console initialization that we already saw in the previous parts. The kernel needs to do this again because it does not know if the kernel was loaded in the [real mode](https://en.wikipedia.org/wiki/Real_mode) or whether the bootloader used the `32-bit` or `64-bit` boot protocol.

We will skip all these initialization steps as we already saw them in the previous chapters. After the first initialization steps are done, the decompressor code stores the pointers to the start of the free heap memory and to the end of it:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/misc.c#L458-L459 -->
```C
	free_mem_ptr     = heap;	/* Heap */
	free_mem_end_ptr = heap + BOOT_HEAP_SIZE;
```

The main reason to set up the heap borders is that the kernel decompressor code uses the heap intensively during decompression.

After the initialization of the heap, the kernel calls the `choose_random_location` function from [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c). This function chooses the random location in memory to write the decompressed kernel to. This function performs work only if the address randomization is enabled. At this point, we will skip it and move to the next step, as it is not the most crucial point in the kernel decompression. If you are interested in what this function does, you can find more information in the [next chapter](./linux-bootstrap-6.md).

Now let's get back to the `extract_kernel` function. Since we assume that the kernel address randomization is disabled, the address where the kernel image will be decompressed is stored in the `output` parameter without any change. The value from this variable is obtained from the `rbp` register as calculated in the previous steps.

The next action before the kernel is decompressed is to perform the sanity checks:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/misc.c#L496-L512 -->
```C
	if ((unsigned long)output & (MIN_KERNEL_ALIGN - 1))
		error("Destination physical address inappropriately aligned");
	if (virt_addr & (MIN_KERNEL_ALIGN - 1))
		error("Destination virtual address inappropriately aligned");
#ifdef CONFIG_X86_64
	if (heap > 0x3fffffffffffUL)
		error("Destination address too large");
	if (virt_addr + needed_size > KERNEL_IMAGE_SIZE)
		error("Destination virtual address is beyond the kernel mapping area");
#else
	if (heap > ((-__PAGE_OFFSET-(128<<20)-1) & 0x7fffffff))
		error("Destination address too large");
#endif
#ifndef CONFIG_RELOCATABLE
	if (virt_addr != LOAD_PHYSICAL_ADDR)
		error("Destination virtual address changed when not relocatable");
#endif
```
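The first two checks rely on a common trick: for a power-of-two alignment, an address is aligned exactly when its low bits are all zero. Here is a minimal sketch of this test with a hypothetical function name. The two-megabyte alignment in the example is only a sample value.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if `addr` is a multiple of `align`, where `align` is a power of two. */
bool is_aligned_pow2(uint64_t addr, uint64_t align)
{
	return (addr & (align - 1)) == 0;
}
```

With the alignment `0x200000`, the address `0x1000000` passes the test, while `0x1100000` does not because its bit `20` is set.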

After all these checks, we can see the familiar message on the screen of our computers:

```
Decompressing Linux...
```

The kernel setup code starts decompression by calling the `decompress_kernel` function:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/misc.c#L521-L521 -->
```C
	entry_offset = decompress_kernel(output, virt_addr, error);
```

This function performs the following actions:

1. Decompress the kernel
2. Parse kernel ELF binary
3. Handle relocations

The kernel decompression is performed by the helper function `__decompress`. The implementation of this function depends on which compression algorithm was used to compress the kernel and is located in one of the following files:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/misc.c#L63-L89 -->
```C
#ifdef CONFIG_KERNEL_GZIP
#include "../../../../lib/decompress_inflate.c"
#endif

#ifdef CONFIG_KERNEL_BZIP2
#include "../../../../lib/decompress_bunzip2.c"
#endif

#ifdef CONFIG_KERNEL_LZMA
#include "../../../../lib/decompress_unlzma.c"
#endif

#ifdef CONFIG_KERNEL_XZ
#include "../../../../lib/decompress_unxz.c"
#endif

#ifdef CONFIG_KERNEL_LZO
#include "../../../../lib/decompress_unlzo.c"
#endif

#ifdef CONFIG_KERNEL_LZ4
#include "../../../../lib/decompress_unlz4.c"
#endif

#ifdef CONFIG_KERNEL_ZSTD
#include "../../../../lib/decompress_unzstd.c"
#endif
```

I will not describe each implementation here, as this information is about compression algorithms rather than something specific to the Linux kernel.

After the kernel is decompressed, two more functions are called: `parse_elf` and `handle_relocations`. Let's take a short look at them.

The kernel binary, which is called `vmlinux`, is an [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) executable file. As a result, after decompression we have not just a "piece" of code to which we can jump but an ELF file with headers, program segments, debug symbols, and other information. We can easily verify this by inspecting `vmlinux` with the `readelf` utility:

```bash
readelf -l vmlinux

Elf file type is EXEC (Executable file)
Entry point 0x1000000
There are 5 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000200000 0xffffffff81000000 0x0000000001000000
                 0x0000000000893000 0x0000000000893000  R E    200000
  LOAD           0x0000000000a93000 0xffffffff81893000 0x0000000001893000
                 0x000000000016d000 0x000000000016d000  RW     200000
  LOAD           0x0000000000c00000 0x0000000000000000 0x0000000001a00000
                 0x00000000000152d8 0x00000000000152d8  RW     200000
  LOAD           0x0000000000c16000 0xffffffff81a16000 0x0000000001a16000
                 0x0000000000138000 0x000000000029b000  RWE    200000
  ...
  ...
  ...
```

The `parse_elf` function acts as a minimal [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) loader. It reads the ELF program headers of the decompressed kernel image and uses them to determine which segments must be loaded and where each segment should be placed in physical memory.
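The core of such a loader is a loop over the program headers. The following sketch shows the idea with a simplified, hypothetical `struct segment` instead of the real ELF program header type. It assumes that each loadable segment is placed at its physical address relative to the link-time base; the real `parse_elf` function differs in details.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SEG_LOAD 1 /* same numeric value as PT_LOAD in the ELF specification */

/* Simplified stand-in for an ELF program header. */
struct segment {
	uint32_t type;        /* segment type, SEG_LOAD for loadable segments */
	uint64_t file_offset; /* where the segment starts inside the ELF image */
	uint64_t phys_addr;   /* physical address the segment was linked for */
	uint64_t file_size;   /* number of bytes to copy */
};

/* Copy every loadable segment to its place relative to the link-time base. */
void load_segments(const unsigned char *image, const struct segment *segs,
		   size_t nsegs, unsigned char *out, uint64_t link_phys_base)
{
	for (size_t i = 0; i < nsegs; i++) {
		if (segs[i].type != SEG_LOAD)
			continue;
		/* memmove tolerates overlapping source and destination */
		memmove(out + (segs[i].phys_addr - link_phys_base),
			image + segs[i].file_offset, segs[i].file_size);
	}
}

/* Helper for experiments: load a tiny fake image and return one output byte. */
unsigned char load_demo(size_t index)
{
	const unsigned char image[8] = { 'h', 'd', 'r', '!', 'A', 'B', 'C', 'D' };
	const struct segment segs[2] = {
		{ SEG_LOAD, 4, 0x1000, 2 }, /* "AB" goes to output offset 0 */
		{ SEG_LOAD, 6, 0x1004, 2 }, /* "CD" goes to output offset 4 */
	};
	unsigned char out[8] = { 0 };

	load_segments(image, segs, 2, out, 0x1000);
	return out[index];
}
```

Applied to the `readelf` output above, the second segment with the physical address `0x1893000` would land `0x893000` bytes after the start of the output buffer, given the link-time base `0x1000000`.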

At this point, the `parse_elf` function has completed loading the decompressed kernel image into memory. Each `PT_LOAD` segment has been copied from the ELF file into its proper location. The kernel’s code, data, and other segments are now present at the chosen load address. However, it might not be sufficient to make the kernel fully runnable.

The kernel was originally linked assuming a specific base address. If the address space layout randomization is enabled, the kernel can instead be loaded at a different physical and virtual address. As a result, any absolute addresses embedded within the kernel image will still reflect the original link-time address rather than the actual load address. To resolve this, the kernel image includes a relocation table that identifies all locations containing such absolute references. 

The `handle_relocations` function processes this table and adjusts each affected value by applying the relocation delta, which is the difference between the actual load address and the link-time base address. 
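The idea of applying a relocation delta can be shown with a few lines of C. Note that this is a simplified model: the table format used here (a plain array of byte offsets to 64-bit values) and all names are assumptions made for illustration, and the real relocation table of the kernel image is organized differently.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add `delta` to every 64-bit value stored at the listed byte offsets. */
void apply_relocations(unsigned char *image, const uint32_t *offsets,
		       size_t count, uint64_t delta)
{
	for (size_t i = 0; i < count; i++) {
		uint64_t value;

		/* memcpy avoids unaligned access through a cast pointer */
		memcpy(&value, image + offsets[i], sizeof(value));
		value += delta;
		memcpy(image + offsets[i], &value, sizeof(value));
	}
}

/*
 * Helper for experiments: an "image" with one absolute pointer linked for
 * `link_base` is relocated to `load_base`; the fixed pointer is returned.
 */
uint64_t relocation_demo(uint64_t link_base, uint64_t load_base)
{
	unsigned char image[16] = { 0 };
	const uint32_t offsets[1] = { 8 };
	uint64_t pointer = link_base + 0x1234;

	memcpy(image + 8, &pointer, sizeof(pointer));
	apply_relocations(image, offsets, 1, load_base - link_base);
	memcpy(&pointer, image + 8, sizeof(pointer));
	return pointer;
}
```

If the load address equals the link-time address, the delta is zero and every value stays as it was, which corresponds to the case without randomization.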

Once the relocations are applied, the decompressor code jumps to the kernel entrypoint. Its address is stored in the `rax` register, as we already have seen above.

Now we are in the kernel 🎉🎉🎉

The kernel entrypoint is the `startup_64` function from [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S). This is our next stop, but it will be in the next set of chapters - [Kernel initialization process](https://github.com/0xAX/linux-insides/tree/master/Initialization).

## Conclusion

This is the end of the fifth part about Linux kernel insides. If you have questions or suggestions, feel free to ping me on X - [0xAX](https://twitter.com/0xAX), drop me an [email](mailto:anotherworldofworld@gmail.com), or just create an [issue](https://github.com/0xAX/linux-insides/issues/new).

## Links

Here is the list of the links that you can find useful when reading this chapter:

- [Real mode](https://en.wikipedia.org/wiki/Real_mode)
- [Protected mode](http://en.wikipedia.org/wiki/Protected_mode)
- [Long mode](https://en.wikipedia.org/wiki/Long_mode)
- [Flat memory model](https://en.wikipedia.org/wiki/Flat_memory_model)
- [Address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization)
- [ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format)
- [Previous part](linux-bootstrap-4.md)


================================================
FILE: Booting/linux-bootstrap-6.md
================================================
# Kernel booting process. Part 6

In the [previous part](./linux-bootstrap-5.md), we finally left the setup code and reached the Linux kernel itself. We explored the last steps of the early boot process - from the kernel decompression to the hand-off to the Linux kernel entrypoint (the `startup_64` function). You may think this is the end of the set of posts about the Linux kernel booting process, but I'd like to come back one more time to the early setup code and look at one more important part of it - `KASLR` or Kernel Address Space Layout Randomization.

As you may remember from the previous parts, the entry point of the Linux kernel is the `startup_64` function defined in [arch/x86/kernel/head_64.S](https://github.com/torvalds/linux/blob/master/arch/x86/kernel/head_64.S). In normal cases, the kernel is loaded at the fixed, well-known address defined by the value of the `CONFIG_PHYSICAL_START` configuration option. The description and the default value of this option are defined in [arch/x86/Kconfig](https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig):

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/Kconfig#L2021-L2025 -->
```
config PHYSICAL_START
	hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
	default "0x1000000"
	help
	  This gives the physical address where the kernel is loaded.
```

However, modern systems rarely stick to predictable memory layouts for security reasons. Knowing the fixed address where the kernel was loaded makes it easier for attackers to guess the location of kernel structures, which can then be exploited in various ways. To make such attacks harder, the Linux kernel supports the [address space layout randomization](https://en.wikipedia.org/wiki/Address_space_layout_randomization) mechanism.

To enable this mechanism, the `CONFIG_RANDOMIZE_BASE` kernel configuration option should be enabled. If this mechanism is enabled, the kernel will not be decompressed and loaded at the given fixed address. Instead, on each boot the kernel image will be placed at a randomly chosen physical address.

In this part, we will look at how this mechanism works.

## Choose random location for kernel image

Before we start investigating the kernel's code, let's recall where we were and what we have seen.

In the [previous part](linux-bootstrap-5.md), we followed the kernel decompression code and transition to [long mode](https://en.wikipedia.org/wiki/Long_mode). The kernel's decompressor entrypoint is the `extract_kernel` function defined in [arch/x86/boot/compressed/misc.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c). At this point, the kernel image is about to be decompressed into the specific location in memory.

Before the kernel's decompressor actually begins to decompress the kernel image, it needs to decide where that image should be placed in memory. While we were going through the kernel's decompression code in `extract_kernel`, we skipped the following function call:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/misc.c#L490-L493 -->
```C
	choose_random_location((unsigned long)input_data, input_len,
				(unsigned long *)&output,
				needed_size,
				&virt_addr);
```

This function is defined in [arch/x86/boot/compressed/kaslr.c](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/kaslr.c) and does nothing if the `nokaslr` option is passed on the kernel command line:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/kaslr.c#L861-L872 -->
```C
void choose_random_location(unsigned long input,
			    unsigned long input_size,
			    unsigned long *output,
			    unsigned long output_size,
			    unsigned long *virt_addr)
{
	unsigned long random_addr, min_addr;

	if (cmdline_find_option_bool("nokaslr")) {
		warn("KASLR disabled: 'nokaslr' on cmdline.");
		return;
	}
```

Otherwise, it selects a randomized address where the kernel image should be decompressed.

As we can see, this function takes five parameters:

- `input` - beginning address of the compressed kernel image
- `input_size` - size of the compressed kernel image
- `output` - physical address where the kernel should be decompressed
- `output_size` - size of the decompressed kernel image
- `virt_addr` - virtual address where the kernel should be decompressed

The `extract_kernel` function receives the `output` parameter from the code that prepares the decompressor:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/head_64.S#L467-L469 -->
```
	movq	%r15, %rdi
	movq	%rbp, %rsi
	call	extract_kernel		/* returns kernel entry point in %rax */
```

If you have read the previous chapters, you may remember that the starting address where the kernel image should be decompressed was calculated and stored in the `rbp` register.

The source of the values for the `input`, `input_size`, and `output_size` parameters is quite interesting. These values come from a little program called [mkpiggy](https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/mkpiggy.c).

If you've ever tried compiling the Linux kernel yourself, you can find the output generated by this program in the `arch/x86/boot/compressed/piggy.S` assembly file, which contains all the parameters needed for decompression. In my case, this file looks like this:

```assembly
.section ".rodata..compressed","a",@progbits
.globl z_input_len
z_input_len = 14213122
.globl z_output_len
z_output_len = 36564556
.globl input_data, input_data_end
input_data:
.incbin "arch/x86/boot/compressed/vmlinux.bin.lz4"
input_data_end:
.section ".rodata","a",@progbits
.globl input_len
input_len:
	.long 14213122
.globl output_len
output_len:
	.long 36564556
```

At build time, the kernel's `vmlinux` image is compressed into the `vmlinux.bin.{ALGO}` file. The small `mkpiggy` program gets the information about the compressed kernel image and generates this assembly file using the following code:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/mkpiggy.c#L52-L67 -->
```C
	printf(".section \".rodata..compressed\",\"a\",@progbits\n");
	printf(".globl z_input_len\n");
	printf("z_input_len = %lu\n", ilen);
	printf(".globl z_output_len\n");
	printf("z_output_len = %lu\n", (unsigned long)olen);

	printf(".globl input_data, input_data_end\n");
	printf("input_data:\n");
	printf(".incbin \"%s\"\n", argv[1]);
	printf("input_data_end:\n");

	printf(".section \".rodata\",\"a\",@progbits\n");
	printf(".globl input_len\n");
	printf("input_len:\n\t.long %lu\n", ilen);
	printf(".globl output_len\n");
	printf("output_len:\n\t.long %lu\n", (unsigned long)olen);
```

That is where the kernel setup code obtains the values of these parameters.
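To make the origin of `olen` a bit more concrete: as far as I know, the kernel build appends the size of the uncompressed image to the compressed file as a 4-byte little-endian value, and `mkpiggy` reads `olen` from those last four bytes, while `ilen` is simply the size of the file. Below is a minimal sketch of that decoding step. It works on an in-memory buffer instead of a file, and `read_le32_trailer` is a hypothetical helper name for this illustration, not a function from the kernel tree:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode the last four bytes of a buffer as a
 * little-endian 32-bit value. Returns 0 if the buffer is too short.
 */
uint32_t read_le32_trailer(const unsigned char *buf, size_t len)
{
	const unsigned char *p;

	if (buf == NULL || len < 4)
		return 0;

	/* The trailer occupies the final four bytes of the buffer. */
	p = buf + len - 4;

	/* Assemble the value byte by byte, least significant byte first. */
	return (uint32_t)p[0] |
	       ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[3] << 24);
}
```

For a buffer ending with the bytes `4c ee 2d 02`, this helper returns `36564556`, which matches the `output_len` value from the `piggy.S` example above.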

The last parameter of the `choose_random_location` function is the virtual base address for the decompressed kernel image. At this point during early boot it is set to the physical load address:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/misc.c#L409-L409 -->
```C
	unsigned long virt_addr = LOAD_PHYSICAL_ADDR;
```

Why is a virtual address initialized with the value of the physical address? The answer is simple and can be found in the previous chapters. During decompression, the early boot-time page tables are set up as an identity map. In other words, for this early stage, we have each virtual address equal to a physical address.

The value of `LOAD_PHYSICAL_ADDR` is the aligned value of the `CONFIG_PHYSICAL_START` configuration option, which we already saw at the beginning of this chapter:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/include/asm/page_types.h#L32-L32 -->
```C
#define LOAD_PHYSICAL_ADDR	__ALIGN_KERNEL_MASK(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN - 1)
```
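To see what this alignment does in practice, here is a small sketch of the same arithmetic in plain C. It assumes that `__ALIGN_KERNEL_MASK(x, mask)` rounds `x` up with `(x + mask) & ~mask`, and that `CONFIG_PHYSICAL_ALIGN` is a power of two (`0x200000` is used here only as an example value). The function names are made up for this illustration:

```c
#include <stdint.h>

/* Round x up to the next multiple of (mask + 1); mask must be 2^n - 1. */
uint64_t align_kernel_mask(uint64_t x, uint64_t mask)
{
	return (x + mask) & ~mask;
}

/* Illustrative wrapper: align a physical start address to `physical_align` bytes. */
uint64_t load_physical_addr(uint64_t physical_start, uint64_t physical_align)
{
	return align_kernel_mask(physical_start, physical_align - 1);
}
```

With a start address of `0x1000000` and an alignment of `0x200000`, the result stays `0x1000000`, because the default start address is already aligned. A start address of `0x1000001` would be rounded up to `0x1200000`.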

At this point, we have examined all the parameters passed to the `choose_random_location` function. Now it is time to look inside the function. 

As mentioned above, the first thing this function does is check whether KASLR is disabled by the `nokaslr` option on the kernel command line:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/kaslr.c#L869-L872 -->
```C
	if (cmdline_find_option_bool("nokaslr")) {
		warn("KASLR disabled: 'nokaslr' on cmdline.");
		return;
	}
```

If this option is specified in the kernel command line, the function does nothing, and the kernel is decompressed at the fixed address. In this chapter, however, we focus on the case where this option is not provided, as that is the main topic under discussion. If the `nokaslr` option is not present, the function proceeds to find a random location in memory to decompress the kernel.
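The idea behind a boolean command line option check can be sketched in a few lines of C. This is not the kernel's `cmdline_find_option_bool` implementation, only a simplified illustration that assumes options are separated by whitespace and must match as whole words. The name `cmdline_has_word` is hypothetical:

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return true if `option` appears in `cmdline`
 * as a whole whitespace-delimited word.
 */
bool cmdline_has_word(const char *cmdline, const char *option)
{
	size_t optlen;
	const char *p;

	if (cmdline == NULL || option == NULL)
		return false;

	optlen = strlen(option);
	if (optlen == 0)
		return false;

	p = cmdline;
	while (*p != '\0') {
		const char *start;
		size_t len;

		/* Skip whitespace before the next word. */
		while (*p != '\0' && isspace((unsigned char)*p))
			p++;
		start = p;

		/* Advance to the end of the current word. */
		while (*p != '\0' && !isspace((unsigned char)*p))
			p++;
		len = (size_t)(p - start);

		/* Only an exact, full-word match counts. */
		if (len == optlen && memcmp(start, option, optlen) == 0)
			return true;
	}

	return false;
}
```

Matching whole words matters here: a command line containing only `kaslr` or `nokaslrfoo` must not be treated as if `nokaslr` was given.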

The very first step is to record in the boot parameters that KASLR is enabled. This is done by setting a specific flag in the kernel's boot header:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/kaslr.c#L874-L874 -->
```C
	boot_params_ptr->hdr.loadflags |= KASLR_FLAG;
```
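The `|=` operation above sets a single bit and leaves the other load flags untouched. A tiny sketch of that bit manipulation follows. It assumes that the KASLR flag is bit 1 of the one-byte `loadflags` field, which is how I remember it from the boot protocol headers, so treat the constant as an assumption. The `SKETCH_` prefix and the function names are made up for this example:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the boot protocol's KASLR bit in loadflags. */
#define SKETCH_KASLR_FLAG (1u << 1)

/* Return `loadflags` with the KASLR bit set; other bits are preserved. */
uint8_t set_kaslr_flag(uint8_t loadflags)
{
	return (uint8_t)(loadflags | SKETCH_KASLR_FLAG);
}

/* Return true if the KASLR bit is set in `loadflags`. */
bool has_kaslr_flag(uint8_t loadflags)
{
	return (loadflags & SKETCH_KASLR_FLAG) != 0;
}
```

Later stages of the kernel can test this bit in the same way to find out whether the image was placed at a randomized address.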

After marking that KASLR is enabled, the next task is to determine the upper memory limit that the system can use:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/kaslr.c#L876-L879 -->
```C
	if (IS_ENABLED(CONFIG_X86_32))
		mem_limit = KERNEL_IMAGE_SIZE;
	else
		mem_limit = MAXMEM;
```

Since we consider only `x86_64` systems, the memory limit is `MAXMEM`, which is a macro defined in [arch/x86/include/asm/pgtable_64_types.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/pgtable_64_types.h):

```C
#define MAXMEM			(1UL << MAX_PHYSMEM_BITS)
```

where `MAX_PHYSMEM_BITS` depends on whether [5-level paging](https://en.wikipedia.org/wiki/Intel_5-level_paging) is enabled. We will consider only 4-level paging, so in our case `MAXMEM` expands to `1 << 46` bytes.
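To get a feeling for how large this limit is, we can reproduce the calculation in a small sketch. The value `46` for 4-level paging comes from the text above. The value `52` for 5-level paging is what I recall from the kernel headers and is an assumption here, as are the `SKETCH_` names and the helper functions:

```c
#include <stdint.h>

/* 46 bits with 4-level paging (see above); 52 bits assumed for 5-level paging. */
#define SKETCH_PHYSMEM_BITS_4LEVEL 46
#define SKETCH_PHYSMEM_BITS_5LEVEL 52

/* Same arithmetic as the MAXMEM macro: 1 << MAX_PHYSMEM_BITS. */
uint64_t maxmem_bytes(unsigned int max_physmem_bits)
{
	return UINT64_C(1) << max_physmem_bits;
}

/* Convert a byte count to whole tebibytes (1 TiB = 2^40 bytes). */
uint64_t bytes_to_tib(uint64_t bytes)
{
	return bytes >> 40;
}
```

For 4-level paging, `bytes_to_tib(maxmem_bytes(SKETCH_PHYSMEM_BITS_4LEVEL))` gives 64, so the randomization code may consider physical addresses up to 64 TiB.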

With the `mem_limit` value set, the decompressor and the kernel code responsible for address randomization know how far they can safely go when calculating an address for the kernel image. But before a random address for the kernel image can be chosen, the kernel needs to make sure it does not overwrite something important.

### Avoiding reserved memory ranges

The next step in the randomization process is to build a map of forbidden memory regions to prevent the kernel image from overwriting memory areas that are already in use. These may include, for example, the [initial ramdisk](https://en.wikipedia.org/wiki/Initial_ramdisk) or the kernel command line. To gather this information, we use this function:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/kaslr.c#L882-L882 -->
```C
	mem_avoid_init(input, input_size, *output);
```

It collects the forbidden memory regions into the `mem_avoid` array, whose elements have the `mem_vector` type:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/misc.h#L97-L100 -->
```C
struct mem_vector {
	u64 start;
	u64 size;
};
```
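Each `mem_vector` describes a half-open range `[start, start + size)`. The basic question the randomization code has to answer for such ranges is whether two of them intersect. Below is a minimal sketch of such a check, using `uint64_t` in place of the kernel's `u64`. The function name `regions_overlap` is chosen for this illustration, and the sketch ignores arithmetic overflow for simplicity:

```c
#include <stdbool.h>
#include <stdint.h>

/* Same layout as the structure above, with standard C integer types. */
struct mem_vector {
	uint64_t start;
	uint64_t size;
};

/* Return true if the two half-open ranges share at least one byte. */
bool regions_overlap(const struct mem_vector *one, const struct mem_vector *two)
{
	/* `one` ends at or before the start of `two`. */
	if (one->start + one->size <= two->start)
		return false;

	/* `one` starts at or after the end of `two`. */
	if (one->start >= two->start + two->size)
		return false;

	return true;
}
```

Note that two ranges which merely touch, for example `[0x1000, 0x2000)` and `[0x2000, 0x3000)`, are not considered overlapping.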

At the moment, the randomization code tries to avoid the memory regions listed in the `mem_avoid_index` enum:

<!-- https://raw.githubusercontent.com/torvalds/linux/refs/heads/master/arch/x86/boot/compressed/kaslr.c#L86-L94 -->
```C
enum mem_avoid_index {
	MEM_AVOID_ZO_RANGE = 0,
	MEM_AVOID_INITRD,
	MEM_AVOID_CMDLINE,
	MEM_AVOID_BOOTPARAMS,
	MEM_AVOID_MEMMAP_BEGIN,
	MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
	MEM_AVOID_MAX,
};
```
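The enum values serve as indexes into the array of avoided regions, and `MEM_AVOID_MAX` gives the number of slots. The sketch below shows how a candidate range could be checked against such an array. It is only an illustration: `MAX_MEMMAP_REGIONS` is assumed to be 4 here, `hits_avoided_region` is a hypothetical helper, and the array is passed as a parameter instead of being a global as in the kernel:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: room for four memmap= regions. */
#define MAX_MEMMAP_REGIONS 4

struct mem_vector {
	uint64_t start;
	uint64_t size;
};

enum mem_avoid_index {
	MEM_AVOID_ZO_RANGE = 0,
	MEM_AVOID_INITRD,
	MEM_AVOID_CMDLINE,
	MEM_AVOID_BOOTPARAMS,
	MEM_AVOID_MEMMAP_BEGIN,
	MEM_AVOID_MEMMAP_END = MEM_AVOID_MEMMAP_BEGIN + MAX_MEMMAP_REGIONS - 1,
	MEM_AVOID_MAX,
};

/*
 * Hypothetical helper: return true if [start, start + size) intersects
 * any non-empty entry of the `avoid` array.
 */
bool hits_avoided_region(const struct mem_vector *avoid, size_t count,
			 uint64_t start, uint64_t size)
{
	size_t i;

	for (i = 0; i < count; i++) {
		/* Unused slots have a zero size and are skipped. */
		if (avoid[i].size == 0)
			continue;

		/* Two half-open ranges intersect if each starts before the other ends. */
		if (start < avoid[i].start + avoid[i].size &&
		    avoid[i].start < start + size)
			return true;
	}

	return false;
}
```

With the assumed value of `MAX_MEMMAP_REGIONS`, the array has eight slots: four fixed ones and four for regions described by `memmap=` options.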

Let's look at the implementation of the `mem_avoid_init` function. As we know, the main goal of this function is to s

SYMBOL INDEX (6 symbols across 2 files)

FILE: Scripts/get_all_links.py
  function check_live_url (line 24) | def check_live_url(url):
  function main (line 42) | def main(path):

FILE: scripts/check_code_snippets.py
  function __split_url_and_range__ (line 14) | def __split_url_and_range__(url: str) -> Tuple[str, Optional[int], Optio...
  function __fetch_raw__ (line 21) | def __fetch_raw__(source: str) -> str:
  function __handle_md__ (line 25) | def __handle_md__(md: str):
  function __main__ (line 63) | def __main__():

Condensed preview — 105 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (1,731K chars).

[
  {
    "path": ".github/FUNDING.yml",
    "chars": 61,
    "preview": "# These are supported funding model platforms\n\npatreon: 0xAX\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/content-issue.yml",
    "chars": 1114,
    "preview": "name: 📖 Content issue\ndescription: Report an issue with the content\nbody:\n  - type: markdown\n    attributes:\n      value"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/question.yml",
    "chars": 1167,
    "preview": "name: ❓ Questions and discussions\ndescription: Ask a question or start a discussion with other community members.\nbody:\n"
  },
  {
    "path": ".github/dependabot.yaml",
    "chars": 117,
    "preview": "version: 2\nupdates:\n  - package-ecosystem: \"github-actions\"\n    directory: \"/\"\n    schedule:\n      interval: \"daily\"\n"
  },
  {
    "path": ".github/pull-request-template.md",
    "chars": 578,
    "preview": "<!-- Thank you for your contribution. When contributing to the project, remember to:\n- Read the Contribution guide.\n- Fo"
  },
  {
    "path": ".github/workflows/check-code-snippets.yaml",
    "chars": 734,
    "preview": "name: check code snippets\n\non:\n  workflow_dispatch:\n  push:\n    branches:\n      - main\n  pull_request:\n\nconcurrency:\n  g"
  },
  {
    "path": ".github/workflows/check-links.yaml",
    "chars": 791,
    "preview": "name: check links\n\non:\n  workflow_dispatch:\n  push:\n    branches:\n      - main\n      - master\n  pull_request:\n\nconcurren"
  },
  {
    "path": ".github/workflows/generate-e-books.yaml",
    "chars": 1916,
    "preview": "name: Generate e-books\n\non:\n  workflow_dispatch: {}\n\njobs:\n  build-for-pr:\n    # For every PR, build the same artifacts "
  },
  {
    "path": ".github/workflows/release-e-books.yaml",
    "chars": 1958,
    "preview": "name: Release e-books\n\non:\n  push:\n    tags:\n      - 'v*.*' # Create a release only when a new tag matching v*.* is push"
  },
  {
    "path": ".gitignore",
    "chars": 12,
    "preview": "*.tex\nbuild\n"
  },
  {
    "path": "Booting/README.md",
    "chars": 2597,
    "preview": "# Kernel Boot Process\n\nWelcome to the boot journey of the Linux kernel, from power-on to the first instruction of the de"
  },
  {
    "path": "Booting/linux-bootstrap-1.md",
    "chars": 40397,
    "preview": "# Kernel Booting Process — Part 1\n\nIf you’ve read my earlier [posts](https://github.com/0xAX/asm) about [assembly langua"
  },
  {
    "path": "Booting/linux-bootstrap-2.md",
    "chars": 32500,
    "preview": "# Kernel booting process - Part 2\n\nWe have already started our journey into the Linux kernel in the previous [part](./li"
  },
  {
    "path": "Booting/linux-bootstrap-3.md",
    "chars": 33459,
    "preview": "# Kernel booting process. Part 3\n\nIn the previous [part](./linux-bootstrap-2.md), we have seen first pieces of C code th"
  },
  {
    "path": "Booting/linux-bootstrap-4.md",
    "chars": 40492,
    "preview": "# Kernel booting process. Part 4\n\nIn the previous [part](./linux-bootstrap-3.md), we saw the transition from the [real m"
  },
  {
    "path": "Booting/linux-bootstrap-5.md",
    "chars": 27993,
    "preview": "# Kernel booting process. Part 5\n\nIn the previous [part](./linux-bootstrap-4.md), we saw the transition from the [protec"
  },
  {
    "path": "Booting/linux-bootstrap-6.md",
    "chars": 23070,
    "preview": "# Kernel booting process. Part 6\n\nIn the [previous part](./linux-bootstrap-5.md), we finally left the setup code and rea"
  },
  {
    "path": "CODEOWNERS",
    "chars": 81,
    "preview": "# Owner of the repository\n* @0xAX\n\n# Documentation owners\n*.md @0xAX @klaudiagrz\n"
  },
  {
    "path": "CODE_OF_CONDUCT.md",
    "chars": 5224,
    "preview": "# Contributor Covenant Code of Conduct\n\n## Our Pledge\n\nWe as members, contributors, and leaders pledge to make participa"
  },
  {
    "path": "CONTRIBUTING.md",
    "chars": 3227,
    "preview": "# Contributing\n\nThis document outlines the contribution workflow, starting from opening an issue, creating a pull reques"
  },
  {
    "path": "Cgroups/README.md",
    "chars": 120,
    "preview": "# Cgroups\n\nThis chapter describes `control groups` mechanism in the Linux kernel.\n\n* [Introduction](linux-cgroups-1.md)\n"
  },
  {
    "path": "Cgroups/linux-cgroups-1.md",
    "chars": 22055,
    "preview": "Control Groups\n================================================================================\n\nIntroduction\n----------"
  },
  {
    "path": "Concepts/README.md",
    "chars": 252,
    "preview": "# Linux kernel concepts\n\nThis chapter describes various concepts which are used in the Linux kernel.\n\n* [Per-CPU variabl"
  },
  {
    "path": "Concepts/linux-cpu-1.md",
    "chars": 10677,
    "preview": "Per-CPU variables\n================================================================================\n\nPer-CPU variables ar"
  },
  {
    "path": "Concepts/linux-cpu-2.md",
    "chars": 10568,
    "preview": "CPU masks\n================================================================================\n\nIntroduction\n---------------"
  },
  {
    "path": "Concepts/linux-cpu-3.md",
    "chars": 20632,
    "preview": "The initcall mechanism\n================================================================================\n\nIntroduction\n--"
  },
  {
    "path": "Concepts/linux-cpu-4.md",
    "chars": 18693,
    "preview": "Notification Chains in Linux Kernel\n================================================================================\n\nIn"
  },
  {
    "path": "DataStructures/README.md",
    "chars": 453,
    "preview": "Data Structures in the Linux Kernel\n========================================================================\n\nLinux kern"
  },
  {
    "path": "DataStructures/linux-datastructures-1.md",
    "chars": 9401,
    "preview": "Data Structures in the Linux Kernel\n================================================================================\n\nDo"
  },
  {
    "path": "DataStructures/linux-datastructures-2.md",
    "chars": 8567,
    "preview": "Data Structures in the Linux Kernel\n================================================================================\n\nRa"
  },
  {
    "path": "DataStructures/linux-datastructures-3.md",
    "chars": 23704,
    "preview": "Data Structures in the Linux Kernel\n================================================================================\n\nBi"
  },
  {
    "path": "Dockerfile",
    "chars": 187,
    "preview": "FROM kyselejsyrecek/gitbook:3.2.3\nCOPY ./ /srv/gitbook/\nEXPOSE 4000\nWORKDIR /srv/gitbook\nCMD [\"sh\", \"-c\", \"/usr/local/bi"
  },
  {
    "path": "Initialization/README.md",
    "chars": 1824,
    "preview": "# Kernel initialization process\n\nYou will find here a couple of posts which describe the full cycle of kernel initializa"
  },
  {
    "path": "Initialization/linux-initialization-1.md",
    "chars": 33411,
    "preview": "Kernel initialization. Part 1.\n================================================================================\n\nFirst s"
  },
  {
    "path": "Initialization/linux-initialization-10.md",
    "chars": 30741,
    "preview": "Kernel initialization. Part 10.\n================================================================================\n\nEnd of"
  },
  {
    "path": "Initialization/linux-initialization-2.md",
    "chars": 30104,
    "preview": "Kernel initialization. Part 2.\n================================================================================\n\nEarly i"
  },
  {
    "path": "Initialization/linux-initialization-3.md",
    "chars": 18938,
    "preview": "Kernel initialization. Part 3.\n================================================================================\n\nLast pr"
  },
  {
    "path": "Initialization/linux-initialization-4.md",
    "chars": 28072,
    "preview": "Kernel initialization. Part 4.\n================================================================================\n\nKernel "
  },
  {
    "path": "Initialization/linux-initialization-5.md",
    "chars": 32332,
    "preview": "Kernel initialization. Part 5.\n================================================================================\n\nContinu"
  },
  {
    "path": "Initialization/linux-initialization-6.md",
    "chars": 29757,
    "preview": "Kernel initialization. Part 6.\n================================================================================\n\nArchite"
  },
  {
    "path": "Initialization/linux-initialization-7.md",
    "chars": 31245,
    "preview": "Kernel initialization. Part 7.\n================================================================================\n\nThe End"
  },
  {
    "path": "Initialization/linux-initialization-8.md",
    "chars": 34641,
    "preview": "Kernel initialization. Part 8.\n================================================================================\n\nSchedul"
  },
  {
    "path": "Initialization/linux-initialization-9.md",
    "chars": 29760,
    "preview": "Kernel initialization. Part 9.\n================================================================================\n\nRCU ini"
  },
  {
    "path": "Interrupts/README.md",
    "chars": 1555,
    "preview": "# Interrupts and Interrupt Handling\n\nIn the following posts, we will cover interrupts and exceptions handling in the Lin"
  },
  {
    "path": "Interrupts/linux-interrupts-1.md",
    "chars": 31609,
    "preview": "Interrupts and Interrupt Handling. Part 1.\n============================================================================="
  },
  {
    "path": "Interrupts/linux-interrupts-10.md",
    "chars": 26826,
    "preview": "Interrupts and Interrupt Handling. Part 10.\n============================================================================"
  },
  {
    "path": "Interrupts/linux-interrupts-2.md",
    "chars": 32977,
    "preview": "Interrupts and Interrupt Handling. Part 2.\n============================================================================="
  },
  {
    "path": "Interrupts/linux-interrupts-3.md",
    "chars": 24784,
    "preview": "Interrupts and Interrupt Handling. Part 3.\n============================================================================="
  },
  {
    "path": "Interrupts/linux-interrupts-4.md",
    "chars": 27056,
    "preview": "Interrupts and Interrupt Handling. Part 4.\n============================================================================="
  },
  {
    "path": "Interrupts/linux-interrupts-5.md",
    "chars": 25690,
    "preview": "Interrupts and Interrupt Handling. Part 5.\n============================================================================="
  },
  {
    "path": "Interrupts/linux-interrupts-6.md",
    "chars": 25175,
    "preview": "Interrupts and Interrupt Handling. Part 6.\n============================================================================="
  },
  {
    "path": "Interrupts/linux-interrupts-7.md",
    "chars": 26739,
    "preview": "Interrupts and Interrupt Handling. Part 7.\n============================================================================="
  },
  {
    "path": "Interrupts/linux-interrupts-8.md",
    "chars": 27849,
    "preview": "Interrupts and Interrupt Handling. Part 8.\n============================================================================="
  },
  {
    "path": "Interrupts/linux-interrupts-9.md",
    "chars": 30158,
    "preview": "Interrupts and Interrupt Handling. Part 9.\n============================================================================="
  },
  {
    "path": "KernelStructures/.gitkeep",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "KernelStructures/README.md",
    "chars": 517,
    "preview": "# Internal `system` structures of the Linux kernel\n\nThis is not usual chapter of `linux-insides`. As you may understand "
  },
  {
    "path": "KernelStructures/linux-kernelstructure-1.md",
    "chars": 10358,
    "preview": "interrupt-descriptor table (IDT)\n================================================================================\n\nThree"
  },
  {
    "path": "LICENSE",
    "chars": 20840,
    "preview": "Attribution-NonCommercial-ShareAlike 4.0 International\n\n================================================================"
  },
  {
    "path": "LINKS.md",
    "chars": 1730,
    "preview": "Useful links\n========================\n\nLinux boot\n------------------------\n\n* [Linux/x86 boot protocol](https://www.kern"
  },
  {
    "path": "MM/README.md",
    "chars": 452,
    "preview": "# Linux kernel memory management\n\nThis chapter describes memory management in the Linux kernel. You will see here a\ncoup"
  },
  {
    "path": "MM/linux-mm-1.md",
    "chars": 21436,
    "preview": "Linux kernel memory management Part 1.\n================================================================================\n"
  },
  {
    "path": "MM/linux-mm-2.md",
    "chars": 28945,
    "preview": "Linux kernel memory management Part 2.\n================================================================================\n"
  },
  {
    "path": "MM/linux-mm-3.md",
    "chars": 22680,
    "preview": "Linux kernel memory management Part 3.\n================================================================================\n"
  },
  {
    "path": "Makefile",
    "chars": 2984,
    "preview": "### HELP\n\n.PHONY: help\nhelp: ## Print help\n\t@egrep \"(^### |^\\S+:.*##\\s)\" Makefile | sed 's/^###\\s*//' | sed 's/^\\(\\S*\\)\\"
  },
  {
    "path": "Misc/README.md",
    "chars": 143,
    "preview": "# Misc\n\nThis chapter contains parts which are not directly related to the Linux kernel source code and implementation of"
  },
  {
    "path": "Misc/linux-misc-1.md",
    "chars": 29880,
    "preview": "Linux kernel development\n================================================================================\n\nIntroduction\n"
  },
  {
    "path": "Misc/linux-misc-2.md",
    "chars": 33462,
    "preview": "Process of the Linux kernel building\n================================================================================\n\nI"
  },
  {
    "path": "Misc/linux-misc-3.md",
    "chars": 29639,
    "preview": "Introduction\n---------------\n\nDuring the writing of the [linux-insides](https://github.com/0xAX/linux-insides/blob/maste"
  },
  {
    "path": "Misc/linux-misc-4.md",
    "chars": 24972,
    "preview": "Program startup process in userspace\n================================================================================\n\nI"
  },
  {
    "path": "README.md",
    "chars": 3780,
    "preview": "# Linux insides\n\nThis repository contains a book-in-progress about the Linux kernel and its insides.\n\nThe goal of this p"
  },
  {
    "path": "SUMMARY.md",
    "chars": 4918,
    "preview": "### Summary\n\n* [Booting](Booting/README.md)\n    * [From bootloader to kernel](Booting/linux-bootstrap-1.md)\n    * [First"
  },
  {
    "path": "Scripts/README.md",
    "chars": 322,
    "preview": "# Scripts\n\n## Description\n\n`get_all_links.py` : justify one link is live or dead with network connection\n\n`latex.sh` : a"
  },
  {
    "path": "Scripts/get_all_links.py",
    "chars": 1786,
    "preview": "#!/usr/bin/env python\n\nfrom __future__ import print_function\nfrom socket import timeout\n\nimport os\nimport sys\nimport cod"
  },
  {
    "path": "Scripts/latex.sh",
    "chars": 792,
    "preview": "# latex.sh\n# A script for converting Markdown files in each of the subdirectories into a unified PDF typeset in LaTeX. \n"
  },
  {
    "path": "SyncPrim/README.md",
    "chars": 828,
    "preview": "# Synchronization primitives in the Linux kernel.\n\nThis chapter describes synchronization primitives in the Linux kernel"
  },
  {
    "path": "SyncPrim/linux-sync-1.md",
    "chars": 18705,
    "preview": "Synchronization primitives in the Linux kernel. Part 1.\n================================================================"
  },
  {
    "path": "SyncPrim/linux-sync-2.md",
    "chars": 22737,
    "preview": "Synchronization primitives in the Linux kernel. Part 2.\n================================================================"
  },
  {
    "path": "SyncPrim/linux-sync-3.md",
    "chars": 22499,
    "preview": "Synchronization primitives in the Linux kernel. Part 3.\n================================================================"
  },
  {
    "path": "SyncPrim/linux-sync-4.md",
    "chars": 30146,
    "preview": "Synchronization primitives in the Linux kernel. Part 4.\n================================================================"
  },
  {
    "path": "SyncPrim/linux-sync-5.md",
    "chars": 32821,
    "preview": "Synchronization primitives in the Linux kernel. Part 5.\n================================================================"
  },
  {
    "path": "SyncPrim/linux-sync-6.md",
    "chars": 20505,
    "preview": "Synchronization primitives in the Linux kernel. Part 6.\n================================================================"
  },
  {
    "path": "SysCall/README.md",
    "chars": 945,
    "preview": "# System calls\n\nThis chapter describes the `system call` concept in the Linux kernel.\n\n* [Introduction to system call co"
  },
  {
    "path": "SysCall/linux-syscall-1.md",
    "chars": 26644,
    "preview": "System calls in the Linux kernel. Part 1.\n=============================================================================="
  },
  {
    "path": "SysCall/linux-syscall-2.md",
    "chars": 26804,
    "preview": "System calls in the Linux kernel. Part 2.\n=============================================================================="
  },
  {
    "path": "SysCall/linux-syscall-3.md",
    "chars": 22184,
    "preview": "System calls in the Linux kernel. Part 3.\n=============================================================================="
  },
  {
    "path": "SysCall/linux-syscall-4.md",
    "chars": 25167,
    "preview": "System calls in the Linux kernel. Part 4.\n=============================================================================="
  },
  {
    "path": "SysCall/linux-syscall-5.md",
    "chars": 25438,
    "preview": "How does the `open` system call work\n--------------------------------------------------------------------------------\n\nI"
  },
  {
    "path": "SysCall/linux-syscall-6.md",
    "chars": 9987,
    "preview": "Limits on resources in Linux\n================================================================================\n\nEach proc"
  },
  {
    "path": "Theory/README.md",
    "chars": 244,
    "preview": "# Theory\n\nThis chapter describes various theoretical concepts and concepts which are not directly related to practice bu"
  },
  {
    "path": "Theory/linux-theory-1.md",
    "chars": 16991,
    "preview": "Paging\n================================================================================\n\nIntroduction\n------------------"
  },
  {
    "path": "Theory/linux-theory-2.md",
    "chars": 8580,
    "preview": "Executable and Linkable Format\n================================================================================\n\nELF (Ex"
  },
  {
    "path": "Theory/linux-theory-3.md",
    "chars": 22925,
    "preview": "Inline assembly\n================================================================================\n\nIntroduction\n---------"
  },
  {
    "path": "Timers/README.md",
    "chars": 885,
    "preview": "# Timers and time management\n\nThis chapter describes timers and time management related concepts in the Linux kernel.\n\n*"
  },
  {
    "path": "Timers/linux-timers-1.md",
    "chars": 25906,
    "preview": "Timers and time management in the Linux kernel. Part 1.\n================================================================"
  },
  {
    "path": "Timers/linux-timers-2.md",
    "chars": 30685,
    "preview": "Timers and time management in the Linux kernel. Part 2.\n================================================================"
  },
  {
    "path": "Timers/linux-timers-3.md",
    "chars": 32588,
    "preview": "Timers and time management in the Linux kernel. Part 3.\n================================================================"
  },
  {
    "path": "Timers/linux-timers-4.md",
    "chars": 20354,
    "preview": "Timers and time management in the Linux kernel. Part 4.\n================================================================"
  },
  {
    "path": "Timers/linux-timers-5.md",
    "chars": 25859,
    "preview": "Timers and time management in the Linux kernel. Part 5.\n================================================================"
  },
  {
    "path": "Timers/linux-timers-6.md",
    "chars": 22491,
    "preview": "Timers and time management in the Linux kernel. Part 6.\n================================================================"
  },
  {
    "path": "Timers/linux-timers-7.md",
    "chars": 21137,
    "preview": "Timers and time management in the Linux kernel. Part 7.\n================================================================"
  },
  {
    "path": "book-A5.json",
    "chars": 238,
    "preview": "{\n    \"title\": \"Linux Insides\",\n    \"author\" : \"0xAX\",\n    \"pdf\": {\n        \"paperSize\": \"a5\",\n        \"margin\":\n       "
  },
  {
    "path": "book.json",
    "chars": 56,
    "preview": "{\n    \"title\": \"Linux Insides\",\n    \"author\" : \"0xAX\"\n}\n"
  },
  {
    "path": "contributors.md",
    "chars": 6692,
    "preview": "# Contributors\n\nSpecial thanks to all the people who helped to develop this project:\n\n* [Akash Shende](https://github.co"
  },
  {
    "path": "lychee.toml",
    "chars": 671,
    "preview": "# Lychee link checker configuration\n# See https://github.com/lycheeverse/lychee for all options\n\n# Maximum number of ret"
  },
  {
    "path": "scripts/check_code_snippets.py",
    "chars": 2175,
    "preview": "\"\"\"\nA script that takes the lines of the Linux kernel source code from the comments\nin the markdown files that are attac"
  }
]

