Introduction to the Linux Kernel

1. Computer System Review

　　To begin with, let us review the components of a computer system, in particular the operating system.

　　According to CSAPP, an operating system provides three fundamental abstractions: (1) files are abstractions for I/O devices, (2) virtual memory is an abstraction for both main memory and disk, and (3) processes are abstractions for the processor, main memory and I/O devices.

2. Overview of the Linux Kernel

　　The subject here is Linux, a modular, UNIX-like, monolithic kernel. The major tasks of a kernel include process management, memory management, device management, and servicing system calls.

　　The following picture shows the layout of the Linux source tree:

3. Linux Booting Procedure

　　(1) When the computer is turned on, the CPU jumps to address 0xFFFF0, where the Basic Input/Output System (BIOS) resides. The BIOS runs a Power-On Self Test (POST), initializes the hardware, and finally loads the first sector (512 B) of the disk into memory at 0x7C00.

　　(2) The first sector of the hard disk is called the Master Boot Record (MBR). Its first 446 B contain the boot loader, followed by the 64 B partition table and a 2 B boot signature. Once the BIOS hands over control, the boot loader runs and starts the kernel. GNU GRUB is an operating-system-independent boot loader. Stage 1 of the GRUB boot process lies within the MBR, and its only task is to load GRUB stage 1.5, which immediately follows the MBR and contains file system drivers. Stage 1.5 then loads GRUB stage 2, which displays the boot menu to the user. Finally, GRUB loads the user-selected (or default) kernel into memory and passes control to it.

　　(3) Once the kernel image is loaded and decompressed, it performs the required initialization in functions such as setup(), startup_32() and start_kernel(). start_kernel() runs as the swapper process (pid = 0), a kernel thread that creates the init process (pid = 1); init then switches to user mode and becomes the ancestor of all other processes running on Linux. The init process starts various daemons according to the run-level and typically creates instances of "getty", which lead to the users' shell processes. Besides creating more processes, the init process is also responsible for managing system shutdown.
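
　　The MBR layout described in step (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it relies on the classic 512-byte layout (446 B of boot code, four 16-byte partition entries, and the 0x55 0xAA signature), and the helper names parse_mbr and make_demo_mbr are hypothetical. The demo sector is built in memory, so no real disk is read.

```python
import struct

SECTOR_SIZE = 512
BOOT_CODE_SIZE = 446      # boot loader code occupies bytes 0..445
ENTRY_SIZE = 16           # four partition entries follow the boot code
BOOT_SIGNATURE = 0xAA55   # bytes 0x55 0xAA read as a little-endian 16-bit value


def parse_mbr(sector):
    """Return the non-empty partition entries of a 512-byte MBR sector."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("an MBR is exactly 512 bytes")
    (signature,) = struct.unpack_from("<H", sector, 510)
    if signature != BOOT_SIGNATURE:
        raise ValueError("missing 0x55AA boot signature")
    partitions = []
    for i in range(4):
        start = BOOT_CODE_SIZE + i * ENTRY_SIZE
        entry = sector[start:start + ENTRY_SIZE]
        status, ptype = entry[0], entry[4]
        first_lba, sectors = struct.unpack_from("<II", entry, 8)
        if ptype != 0:  # partition type 0 marks an unused slot
            partitions.append({
                "bootable": status == 0x80,
                "type": ptype,
                "first_lba": first_lba,
                "sectors": sectors,
            })
    return partitions


def make_demo_mbr():
    """Build a fake MBR with one bootable Linux (type 0x83) partition."""
    sector = bytearray(SECTOR_SIZE)
    # status, CHS start (ignored), type, CHS end (ignored), first LBA, size
    entry = struct.pack("<B3sB3sII", 0x80, b"\0\0\0", 0x83, b"\0\0\0",
                        2048, 204800)
    sector[446:462] = entry
    sector[510:512] = b"\x55\xaa"
    return bytes(sector)
```

　　Calling parse_mbr(make_demo_mbr()) returns a single entry describing the bootable partition; the first 446 bytes are left as zeros because the sketch does not model the boot code itself.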

4. Linux Process Management

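　　In modern operating systems, a process is the smallest unit of resource management and a thread is the smallest unit of program execution, although Linux does not draw a strict line between processes and threads. A thread has its own PC, registers and stack, but the threads of one process (in Linux terms, the lightweight processes of one thread group) share resources such as the code segment, data segment, address space and open files.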

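　　The main information about a Linux process is stored in its process descriptor, the task_struct structure. All instances of this structure live in dynamic memory and form a doubly linked list headed by init_task, called the process list. Each process has a process identifier (PID) stored in the pid field; by default the largest PID is 32767, which matches the range of a 16-bit signed integer. The kernel maps a PID to the corresponding task_struct through the pidhash table. The process descriptor also has a long integer field, state, which describes the process state and takes one of the following values: (1) TASK_RUNNING (ready or running), (2) TASK_INTERRUPTIBLE (waiting), (3) TASK_UNINTERRUPTIBLE, (4) TASK_STOPPED, and (5) TASK_ZOMBIE (terminated). In addition, the relationships between processes can be represented as a left-child, right-sibling tree through the fields real_parent, parent, children, and sibling.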
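
　　The PID-to-descriptor lookup can be pictured as a chained hash table. The sketch below is a simplified model, not kernel code: the multiplicative constant is recalled from 2.6-era kernel headers and the table size of 2048 buckets is an assumption, while the PidHash class and the dictionaries standing in for task_struct are hypothetical.

```python
PIDHASH_SHIFT = 11                  # assumed: 2**11 = 2048 buckets
GOLDEN_RATIO_PRIME_32 = 0x9E370001  # multiplier as recalled from 2.6-era headers


def pid_hashfn(pid):
    """Multiplicative hash: keep the top PIDHASH_SHIFT bits of a 32-bit product."""
    return ((pid * GOLDEN_RATIO_PRIME_32) & 0xFFFFFFFF) >> (32 - PIDHASH_SHIFT)


class PidHash:
    """Chained hash table mapping a PID to a (stand-in) process descriptor."""

    def __init__(self):
        self.buckets = [[] for _ in range(1 << PIDHASH_SHIFT)]

    def attach(self, task):
        # Colliding PIDs simply share a bucket (a chain).
        self.buckets[pid_hashfn(task["pid"])].append(task)

    def find(self, pid):
        for task in self.buckets[pid_hashfn(pid)]:
            if task["pid"] == pid:
                return task
        return None
```

　　With 32768 possible PIDs and 2048 buckets, collisions are expected, which is why each bucket holds a chain rather than a single descriptor.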

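　　Besides the process list (the tasks field), all runnable processes are organized into a runqueue (the run_list field), and all waiting processes are organized into wait queues (the task_list field of wait_queue_head_t). In kernel code, data structures such as circular doubly linked lists are not implemented with the generics found in object-oriented languages. Instead, the list node (for example list_head) is embedded in the target structure, which achieves the same effect.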
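
　　To make the embedded-node idea concrete, here is a minimal sketch that mimics list_head and the kernel's container_of() idiom with Python's ctypes module. It is an analogy under stated assumptions: Task is a hypothetical, heavily reduced stand-in for task_struct, and the helper functions only imitate the kernel macros of the same name.

```python
import ctypes


class ListHead(ctypes.Structure):
    """Analogue of the kernel's struct list_head (a bare prev/next pair)."""


ListHead._fields_ = [
    ("next", ctypes.POINTER(ListHead)),
    ("prev", ctypes.POINTER(ListHead)),
]


class Task(ctypes.Structure):
    """Hypothetical stand-in for task_struct with an embedded list node."""
    _fields_ = [("pid", ctypes.c_int), ("tasks", ListHead)]


def init_list_head(head):
    # An empty circular list points to itself in both directions.
    head.next = ctypes.pointer(head)
    head.prev = ctypes.pointer(head)


def list_add_tail(new, head):
    # Insert `new` just before `head`, i.e. at the tail of the list.
    last = head.prev.contents
    new.next = ctypes.pointer(head)
    new.prev = ctypes.pointer(last)
    last.next = ctypes.pointer(new)
    head.prev = ctypes.pointer(new)


def container_of(node, struct_type, member):
    # Subtract the member's offset to recover the enclosing structure.
    offset = getattr(struct_type, member).offset
    return struct_type.from_address(ctypes.addressof(node) - offset)


def pids_on_list(head):
    pids = []
    node = head.next.contents
    while ctypes.addressof(node) != ctypes.addressof(head):
        pids.append(container_of(node, Task, "tasks").pid)
        node = node.next.contents
    return pids


def demo_pids():
    head = ListHead()
    init_list_head(head)
    tasks = [Task(pid=p) for p in (1, 2, 3)]  # keep the objects alive
    for t in tasks:
        list_add_tail(t.tasks, head)
    return pids_on_list(head)
```

　　The list code never mentions Task; only container_of() knows where the node sits inside the enclosing structure, which is how one list implementation can serve every structure type.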

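　　Since version 2.6.23, Linux uses the Completely Fair Scheduler (CFS), which organizes runnable processes into a red-black tree keyed by virtual runtime. Each time, it picks the node with the smallest key, runs that process, updates its key, and puts it back into the tree. The red-black tree preserves its invariants through a series of rotations, which keeps it approximately balanced.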
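
　　The pick-smallest, run, update, reinsert loop can be sketched as follows. Note the substitution: Python has no red-black tree in its standard library, so a binary heap (heapq) stands in for it here; both return the smallest key efficiently. The weight scaling by 1024 and the function name simulate_cfs are assumptions made for illustration, and real CFS bookkeeping is considerably more involved.

```python
import heapq

NICE_0_LOAD = 1024  # assumed weight of a default-priority task


def simulate_cfs(weights, slices, slice_len=2):
    """Run `slices` time slices; return how many slices each task received.

    weights maps a task name to its load weight. A heavier task accrues
    virtual runtime more slowly, so it is picked more often.
    """
    queue = [(0, name) for name in sorted(weights)]  # (vruntime, task)
    heapq.heapify(queue)
    ran = {name: 0 for name in weights}
    for _ in range(slices):
        vruntime, name = heapq.heappop(queue)  # task with smallest vruntime
        ran[name] += 1
        vruntime += slice_len * NICE_0_LOAD // weights[name]
        heapq.heappush(queue, (vruntime, name))  # put it back with new key
    return ran
```

　　For example, simulate_cfs({"A": 1024, "B": 2048}, 300) gives task B twice as many slices as task A, which reflects the weight ratio.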

5. Process Address Space

　　Modern operating systems adopt virtual memory mainly to meet the following needs:

　　(1) The ability to run programs larger than the size of physical memory.

　　(2) Code relocation to simplify loading a program for execution.

　　(3) Efficient and safe sharing of memory among multiple programs.

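　　Linux uses both segmentation and paging for address translation. In segmentation, the segmentation unit translates a logical address (16-bit segment + 32-bit offset) into a 32-bit linear address. In paging, the paging unit further translates the 32-bit linear address into a physical address, that is, the address presented on the RAM pins. Segmentation is not always present, for example on RISC architectures, and paging is not always active, for example when an 80x86 runs in real mode. One problem with paging is that page tables can become very large and costly. To address this, Linux uses a multi-level page table with four levels: (1) page global directory, (2) page upper directory, (3) page middle directory, and (4) page table. In general, 1 page = 1 frame = 4 KB. On 32-bit x86 the linear address is split as 10+0+0+10+12, so the upper and middle directories are effectively skipped, while x86_64 uses 9+9+9+9+12 (the top 16 bits are not used for addressing). Paging makes main memory behave like a fully associative cache for the disk, with a write-back policy and write allocation on a miss.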

	32         22        12          0
	  +---------+---------+----------+
	  | PGD idx |  PT idx |  offset  |
	  +---------+---------+----------+

	48         39        30        21       12        0
	  +---------+---------+---------+--------+--------+
	  | PGD idx | PUD idx | PMD idx | PT idx | offset |
	  +---------+---------+---------+--------+--------+
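
　　The two layouts above amount to slicing a linear address into fixed-width bit fields. The following sketch shows that arithmetic; the function name split_linear_address is hypothetical, and the sample addresses are chosen only for illustration.

```python
def split_linear_address(addr, widths):
    """Split addr into bit fields, most significant field first.

    widths lists the field sizes in bits, e.g. (10, 10, 12) for 32-bit x86
    or (9, 9, 9, 9, 12) for x86_64. The last field is the page offset.
    """
    fields = []
    for width in reversed(widths):
        fields.append(addr & ((1 << width) - 1))  # take the low `width` bits
        addr >>= width
    return tuple(reversed(fields))


X86_32 = (10, 10, 12)       # PGD index, PT index, offset
X86_64 = (9, 9, 9, 9, 12)   # PGD, PUD, PMD, PT indices, offset
```

　　For instance, split_linear_address(0xC0101234, X86_32) yields PGD index 768, page table index 257 and offset 0x234. With 12 offset bits, each page covers 2^12 = 4096 bytes, matching the 4 KB page size.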

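　　The Linux kernel is usually placed in low physical memory, between 1 MB and 3 MB. During initialization, the kernel builds a map of physical addresses from the information reported by the BIOS and sets up the kernel page tables. Apart from the regions reserved for hardware such as the BIOS and the regions occupied by the kernel's own code and static data, the remaining RAM is called dynamic memory and is made available to the kernel and to user processes. The kernel allocates dynamic memory with the buddy system algorithm and the slab allocator, and it tracks the use of every frame of physical memory, which is how memory protection is enforced. On 32-bit x86, each process has a 4 GB virtual address space, of which the first 3 GB can be used in user mode (see the figure below, taken from CSAPP). At run time, the process page tables are located through the cr3 register to perform address translation.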
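
　　The core arithmetic of a buddy system is small enough to sketch. The code below is a simplified illustration, not the kernel's allocator: it assumes blocks of 2^order page frames that are aligned to their own size, and the function names are hypothetical.

```python
def order_for(pages):
    """Smallest order such that a block of 2**order frames holds `pages`."""
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


def buddy_of(pfn, order):
    """Frame number of the buddy of the block starting at `pfn`.

    Assumes pfn is aligned to 2**order. The two buddies differ only in
    bit `order`, so flipping that bit finds the partner block.
    """
    return pfn ^ (1 << order)
```

　　A request for 3 frames is rounded up to an order-2 block (4 frames). When the block at frame 8 is freed, its buddy is the block at frame 12; if that one is free too, the pair can merge into an order-3 block starting at frame 8.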

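　　The memory descriptor of a process is the mm_struct structure, referenced by the mm pointer in task_struct. All mm_struct instances form a circular doubly linked list headed by init_mm, similar to the process list. Inside the memory descriptor, the mmap pointer refers to a linked list of vm_area_struct structures, which records every virtual memory region in use in that address space. Each vm_area_struct describes one linear address interval whose bounds are the fields vm_start and vm_end; see my sample code for details. Besides being chained into a list through the vm_next field, the vm_area_struct structures are also linked into a red-black tree through the vm_rb field, which makes it fast to search for a given address.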
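
　　The address search over memory regions can be sketched as below. Note the substitution: a sorted list searched with bisect stands in for the red-black tree, since both give logarithmic lookup. The lookup rule follows my understanding of the kernel's find_vma(), which returns the first region whose vm_end is greater than the address; the AddressSpace class and the sample regions are hypothetical.

```python
import bisect
from collections import namedtuple

# Stand-in for vm_area_struct: the interval is [vm_start, vm_end).
VmArea = namedtuple("VmArea", "vm_start vm_end")


class AddressSpace:
    """Sorted, non-overlapping memory regions of one process (a sketch)."""

    def __init__(self, areas):
        self.areas = sorted(areas)                  # ordered by vm_start
        self._ends = [a.vm_end for a in self.areas]

    def find_vma(self, addr):
        """First region with vm_end > addr, or None if there is none.

        The returned region may start above addr, so callers must still
        check vm_start <= addr to know whether addr is actually mapped.
        """
        i = bisect.bisect_right(self._ends, addr)
        return self.areas[i] if i < len(self.areas) else None
```

　　An address that falls in a hole between two regions therefore returns the next region up, and the caller distinguishes the two cases by comparing against vm_start.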

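　　When a process is created, fork() copies the parent's page tables for the child and allocates a process descriptor and a kernel stack for it; the child then typically calls exec() to load and run its own program in a new address space. Parent and child use a copy-on-write strategy to defer page copying and avoid unnecessary work. When a process terminates, it calls exit() to release almost all of its resources, and finally the parent calls wait4() to collect its status and release the child's process descriptor and kernel stack.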
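
　　The life cycle above can be observed from user space with Python's os module, which wraps the corresponding system calls on UNIX-like systems. This is a minimal sketch that assumes such a system; the function name fork_and_wait is hypothetical, and the exec() step is left out to keep the example short (a real child would typically call os.execv() at that point).

```python
import os


def fork_and_wait():
    """Fork a child, let it modify its copy of the data, then reap it.

    Returns (value seen by the parent afterwards, child's exit status).
    """
    value = [1]
    pid = os.fork()
    if pid == 0:
        # Child: the write triggers copy-on-write, so only the child's
        # private copy of the page changes.
        value[0] = 99
        os._exit(7)  # terminate immediately; the child becomes a zombie
    # Parent: waitpid() collects the status and lets the kernel release
    # the child's remaining resources.
    _, status = os.waitpid(pid, 0)
    return value[0], os.WEXITSTATUS(status)
```

　　The parent still sees the original value after the child has exited, because the two address spaces only share pages until one of them writes.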

6. Exceptions and Interrupts

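　　On the 80x86 architecture in real mode, the Interrupt Vector Table is stored in the lowest 1 KB of RAM. It records the entry addresses of the Interrupt Service Routines (ISRs) for 256 interrupt sources, and each entry takes 4 bytes (CS and IP). After executing each instruction, the CPU automatically checks, in order, for internal interrupts (other than the single-step interrupt), non-maskable interrupts, maskable interrupts, and the single-step interrupt. Once an interrupt is acknowledged, the CPU pushes FLAGS and the address of the next instruction (CS and IP) onto the stack, then looks up the interrupt number in the Interrupt Vector Table and executes the ISR. When the ISR finishes, the CPU pops IP, CS and FLAGS and resumes the program that was running before the interrupt. "Interrupt" is used here in the broad sense: it covers both asynchronous interrupts generated by hardware and synchronous interrupts generated by software (exceptions), such as the page fault (vector 14).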
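
　　The table lookup is simple address arithmetic: vector n lives at offset 4n, with IP stored before CS, and the real-mode physical address is CS * 16 + IP. The sketch below works on a fake 1 KB table built in memory; the function names and the sample vector are hypothetical.

```python
import struct


def isr_entry(ivt, vector):
    """Return (CS, IP, physical address) of the ISR for `vector`.

    ivt is the 1 KB table as bytes. Each 4-byte entry holds IP then CS,
    both little-endian 16-bit values.
    """
    ip, cs = struct.unpack_from("<HH", ivt, 4 * vector)
    physical = ((cs << 4) + ip) & 0xFFFFF  # 20-bit real-mode address
    return cs, ip, physical


def make_demo_ivt():
    """Build an empty table with one made-up entry for vector 0x21."""
    ivt = bytearray(1024)  # 256 vectors * 4 bytes
    struct.pack_into("<HH", ivt, 4 * 0x21, 0x1234, 0xF000)
    return bytes(ivt)
```

　　With the demo table, vector 0x21 resolves to CS = 0xF000 and IP = 0x1234, which is physical address 0xF1234.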

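　　A process running in user mode enters kernel mode in three ways: (1) a system call (int 0x80), (2) any other exception, and (3) a hardware interrupt. The kernel stack of each process shares 8 KB (2 pages) with its thread_info structure, and the location of the stack is recorded in the tss field of task_struct (ss0 + esp0). When an interrupt or exception occurs, the process first performs a context switch from user mode to kernel mode: SS, ESP, FLAGS, CS and EIP are pushed onto the kernel stack, and the registers are updated (SS and ESP for the kernel stack, CS and EIP from the interrupt vector table) before the ISR is called. When the interrupt ends, the user-mode register values are popped from the kernel stack to restore the state before the interrupt, which completes the reverse context switch.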
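
　　Because the kernel stack and thread_info share one aligned 8 KB block, the block's base can be recovered from any stack pointer inside it by masking off the low 13 bits. The sketch below shows that arithmetic, assuming thread_info sits at the lowest address of the block and the stack grows down from the top; the function names and the sample esp value are chosen for illustration.

```python
THREAD_SIZE = 8192  # 2 pages shared by the kernel stack and thread_info


def thread_info_base(esp):
    """Base address of the 8 KB block that contains kernel stack pointer esp."""
    return esp & ~(THREAD_SIZE - 1)  # clear the low 13 bits


def stack_top(esp):
    """Address just past the block, where an empty kernel stack starts."""
    return thread_info_base(esp) + THREAD_SIZE
```

　　For example, a kernel stack pointer of 0x015FBFE0 belongs to the block starting at 0x015FA000, whose stack begins at 0x015FC000 and grows downward toward thread_info.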

References:

　　1. Bovet, D. P., Cesati, M. Understanding the Linux Kernel, 3rd edition (Chinese translation) [M]. Beijing: China Electric Power Press, 2007.