MIT-JOS Series 9: Multitasking (Part 2)

Part B: Copy-on-Write Fork

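Note: following the MIT-JOS lab handout, this article does not draw a sharp distinction between an "environment" and a "process".

Important reminder: every time you finish implementing a system call, remember to add it to the dispatch in syscall() in kern/syscall.c!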

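As mentioned in the previous article, Unix provides the fork() system call as its process-creation primitive; it copies the parent's address space into the child.

xv6 Unix implements fork() by allocating new pages for the child and copying all the data in the parent's pages into them (this is exactly what dumbfork.c in Part A does). Copying the data is the most expensive part of fork().

However, a child very often calls exec() almost immediately after fork(), which replaces the child's memory entirely with a new program. The child then touches only a small part of that memory before calling exec(), so most of the data copied from the parent is wasted.

For this reason, later versions of Unix took advantage of virtual memory hardware to let the parent and child share the memory mapped into their respective address spaces until one of them actually modifies it. This technique is called copy-on-write. To implement it, on fork() the kernel copies the parent's address space mappings to the child instead of copying the contents of the pages, and at the same time marks the shared pages read-only. When either process tries to write to a shared page, it takes a page fault. At that point the kernel realizes that the page was really a "virtual" or "copy-on-write" copy, so it allocates a private, writable page for the faulting process and copies the original page's data into it. Page contents are therefore not copied until something is actually written, which makes fork() followed by exec() much cheaper.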
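
To make the mechanism concrete, here is a minimal user-space model of copy-on-write. It is not JOS code: the names struct page, struct mapping, map_new, cow_share and cow_write are invented for this illustration, a reference count stands in for the kernel's physical-page bookkeeping, and an explicit check inside cow_write plays the role of the page fault. Allocation failures are not handled.

```c
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Hypothetical model of a physical page with a reference count. */
struct page {
	int refs;
	unsigned char data[PAGE_SIZE];
};

/* Hypothetical model of one virtual-page mapping in one address space. */
struct mapping {
	struct page *pg;
	int writable;	/* stands in for PTE_W */
	int cow;	/* stands in for PTE_COW */
};

/* Create a private, writable mapping backed by a zeroed page.
 * Sketch only: a failed allocation is not handled. */
static struct mapping
map_new(void)
{
	struct mapping m = { calloc(1, sizeof(struct page)), 1, 0 };
	m.pg->refs = 1;
	return m;
}

/* "fork": share the parent's page with the child; both sides become
 * read-only and copy-on-write. Returns the child's mapping. */
static struct mapping
cow_share(struct mapping *parent)
{
	parent->writable = 0;
	parent->cow = 1;
	parent->pg->refs++;
	return *parent;
}

/* Write one byte. A write through a COW mapping first makes a private copy. */
static void
cow_write(struct mapping *m, size_t off, unsigned char val)
{
	if (!m->writable && m->cow) {		/* the "page fault" */
		struct page *copy = malloc(sizeof(*copy));
		memcpy(copy->data, m->pg->data, PAGE_SIZE);
		copy->refs = 1;
		m->pg->refs--;			/* drop our share of the old page */
		m->pg = copy;
		m->writable = 1;
		m->cow = 0;
	}
	m->pg->data[off] = val;
}
```

After cow_share() both mappings point at the same page; the first cow_write() through either one gives that side its own copy and leaves the other side's data untouched.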

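In this part of the lab we will implement a better fork() with copy-on-write semantics in the user-level library. Implementing copy-on-write fork in user space keeps the kernel simpler and less error-prone, and it also lets user programs define their own fork().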

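User-level page fault handling

User level needs to know when a page fault occurs on a write-protected page. Copy-on-write is only one of the possible uses of user-level page fault handling.

The kernel handles page faults differently depending on where they occur. For example, the kernel initially allocates only one page for a new process's stack; if the process needs more stack space, it takes a page fault at a stack address that has not been mapped yet. A Unix kernel has to keep track of which region of the user address space a fault occurred in and act accordingly, for example:

A fault in the stack region: allocate and map a new physical page
A fault in the BSS region: allocate and map a new physical page and initialize it to zero
A fault in the text region of a demand-paged executable: read the corresponding page of the binary from disk and map it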

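Registering a page fault handler

To handle its own page faults, a user environment must register a page fault handler entrypoint with the JOS kernel through the sys_env_set_pgfault_upcall() system call. A new member, env_pgfault_upcall, is added to the Env structure to record this environment-specific handler.

The implementation is as follows: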

static int
sys_env_set_pgfault_upcall(envid_t envid, void *func)
{
	// LAB 4: Your code here.
	struct Env *env = NULL;
	if (envid2env(envid, &env, 1) < 0)
		return -E_BAD_ENV;
	
	env->env_pgfault_upcall = func;
	return 0;
}

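The normal and exception stacks of a user environment

During normal execution, a user environment runs on the normal user stack that JOS allocates for it: ESP starts out at USTACKTOP, and pushed data lands in the region from USTACKTOP-PGSIZE to USTACKTOP-1. When a page fault occurs in user mode, the JOS kernel switches from the normal user stack to the user exception stack in order to run the user-level page fault handler, much as the processor switches from the user stack to the kernel stack when an interrupt occurs.

The JOS user exception stack is one page in size, and its top is initially at UXSTACKTOP. While running on the exception stack, the user-level page fault handler can use JOS system calls to allocate a new page or adjust mappings in order to fix whatever caused the fault. When it is done, execution resumes at the instruction that faulted.

Every user environment that wants to support user-level page fault handling must allocate one page of memory for its own exception stack, which it can do with the sys_page_alloc() system call.

Invoking the user page fault handler

We need to change the page fault handling code in kern/trap.c so that it can handle page faults from user mode.

If no page fault handler is registered, JOS simply destroys the user environment when a user-mode page fault occurs. Otherwise, the kernel sets up a trap frame on the user exception stack in the form of a struct UTrapframe (inc/trap.h). This plays the same role as the data pushed onto the kernel stack when an interrupt occurs, and the handler reads it as its argument. The kernel then resumes the user environment so that it runs its page fault handler on the exception stack (how?).

If the environment was already running on the exception stack when the fault occurred, the user's page fault handler has itself faulted. In that case the new frame should be placed just below the current tf->tf_esp rather than at UXSTACKTOP: first push one empty 32-bit word, then the struct UTrapframe.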

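To summarize how a page fault is handled in user mode:

Before any fault occurs, the user environment registers its page fault handler with the kernel and allocates a physical page for its exception stack
A page fault occurs in user mode; it goes through the normal trap path, enters kernel mode, switches to the kernel stack, and reaches trap()
The trap number identifies a page fault, so page_fault_handler() is called
The tf_cs field of the trap frame shows that the fault happened in user mode
Check whether a user page fault handler is registered: if not, destroy the environment
If one is registered, prepare to hand the fault to user mode:
- Check tf_esp. If it lies in [UXSTACKTOP-PGSIZE, UXSTACKTOP), the fault happened inside the user's page fault handler, so treat the current stack pointer as the top of the stack: leave 4 empty bytes, then push the fields of the UTrapframe. Before writing, use user_mem_assert() to check that the stack has not overflowed: as memlayout.h shows, there is a stretch of Empty Memory between USTACKTOP and UXSTACKTOP-PGSIZE that user code may not read or write
- Otherwise, treat UXSTACKTOP as the top of the stack and push the UTrapframe
- Set the user stack pointer tf->tf_esp to the top of the exception stack after the UTrapframe has been pushed
- Set the next user instruction tf->tf_eip to the user page fault entrypoint env_pgfault_upcall
Resume the user environment to run the page fault handler. Resuming loads the user registers from tf, which performs both the stack switch and the jump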
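
The placement rule in the steps above can be sketched as a small stand-alone function. The helper name utf_address is invented for this example; the constants are the values of PGSIZE and UXSTACKTOP in JOS's headers, and UTF_SIZE is the size of struct UTrapframe on 32-bit x86 (13 words of 4 bytes).

```c
#include <stdint.h>

#define PGSIZE      4096u
#define UXSTACKTOP  0xeec00000u
/* sizeof(struct UTrapframe) on 32-bit x86: 13 words of 4 bytes. */
#define UTF_SIZE    52u

/* Hypothetical helper: given the trap-time %esp, return the address at
 * which the kernel should place the new UTrapframe. */
static uint32_t
utf_address(uint32_t tf_esp)
{
	/* Already on the exception stack: this is a nested fault, so leave
	 * one scratch word between the old stack top and the new frame. */
	if (tf_esp >= UXSTACKTOP - PGSIZE && tf_esp < UXSTACKTOP)
		return tf_esp - 4 - UTF_SIZE;

	/* First fault: the frame goes at the very top of the exception stack. */
	return UXSTACKTOP - UTF_SIZE;
}
```

For a fault taken on the normal stack the frame always ends at UXSTACKTOP; for a nested fault each new frame sits 56 bytes (52 + 4) below the previous stack pointer.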

page_fault_handler() in kern/trap.c is modified as follows:

void
page_fault_handler(struct Trapframe *tf)
{
	uint32_t fault_va;

	// Read processor's CR2 register to find the faulting address
	fault_va = rcr2();

	// Handle kernel-mode page faults.

	// LAB 3: Your code here.
	// The low two bits of CS hold the privilege level; 0 means kernel mode.
	if ((tf->tf_cs & 3) == 0)
		panic("kernel page fault at %x.\n", fault_va);
	// We've already handled kernel-mode exceptions, so if we get here,
	// the page fault happened in user mode.

	// Call the environment's page fault upcall, if one exists.  Set up a
	// page fault stack frame on the user exception stack (below
	// UXSTACKTOP), then branch to curenv->env_pgfault_upcall.
	//
	// The page fault upcall might cause another page fault, in which case
	// we branch to the page fault upcall recursively, pushing another
	// page fault stack frame on top of the user exception stack.
	//
	// The trap handler needs one word of scratch space at the top of the
	// trap-time stack in order to return.  In the non-recursive case, we
	// don't have to worry about this because the top of the regular user
	// stack is free.  In the recursive case, this means we have to leave
	// an extra word between the current top of the exception stack and
	// the new stack frame because the exception stack _is_ the trap-time
	// stack.
	//
	// If there's no page fault upcall, the environment didn't allocate a
	// page for its exception stack or can't write to it, or the exception
	// stack overflows, then destroy the environment that caused the fault.
	// Note that the grade script assumes you will first check for the page
	// fault upcall and print the "user fault va" message below if there is
	// none.  The remaining three checks can be combined into a single test.
	//
	// Hints:
	//   user_mem_assert() and env_run() are useful here.
	//   To change what the user environment runs, modify 'curenv->env_tf'
	//   (the 'tf' variable points at 'curenv->env_tf').

	// LAB 4: Your code here.
	if (curenv->env_pgfault_upcall) {
		struct UTrapframe *utrapframe = NULL;
		// Recursive page fault: leave one blank 32-bit word (4 bytes) below tf_esp, then place the UTrapframe
		if (tf->tf_esp >= UXSTACKTOP-PGSIZE && tf->tf_esp <= UXSTACKTOP-1) {
			utrapframe = (struct UTrapframe*)(tf->tf_esp - 4 - sizeof(struct UTrapframe));
		} else {
			utrapframe = (struct UTrapframe*)(UXSTACKTOP - sizeof(struct UTrapframe));
		}
		// Check only the frame itself: in the recursive case the scratch
		// word lies above the frame, and in the non-recursive case a
		// longer range would reach past UXSTACKTOP.
		user_mem_assert(curenv,
				(void*)utrapframe,
				sizeof(struct UTrapframe),
				PTE_U|PTE_W|PTE_P);
		// Fill in the UTrapframe; this is the user-level equivalent of the registers and arguments pushed on a trap
		utrapframe->utf_fault_va = fault_va;
		utrapframe->utf_err = tf->tf_err;
		utrapframe->utf_regs = tf->tf_regs;
		utrapframe->utf_eip = tf->tf_eip;
		utrapframe->utf_eflags = tf->tf_eflags;
		utrapframe->utf_esp = tf->tf_esp;

		// Switch the user stack to the exception stack and resume in the user page fault entrypoint
		tf->tf_esp = (uintptr_t)utrapframe;
		tf->tf_eip = (uintptr_t)curenv->env_pgfault_upcall;

		env_run(curenv);
	}

	// Destroy the environment that caused the fault.
	cprintf("[%08x] user fault va %08x ip %08x\n",
		curenv->env_id, fault_va, tf->tf_eip);
	print_trapframe(tf);
	env_destroy(curenv);
}

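Note that user_mem_assert() is needed even when the fault is not recursive, because the user may never have allocated a page for its exception stack. In that case the environment has no access to the exception stack and must be destroyed.

User-mode page fault entrypoint

Next, we need to write the assembly routine _pgfault_upcall in lib/pfentry.S. It calls the user's page fault handler and afterwards jumps back to the instruction that caused the fault. The entry point of this routine is what gets registered with the kernel through sys_env_set_pgfault_upcall().

_pgfault_upcall is in fact the page fault handler that the user registers with the kernel. Have a look at lib/pgfault.c; the pieces are easy to confuse, so here is how they fit together: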

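_pgfault_upcall is the complete user-level page fault handler. It has two parts:
1. Call the user-defined function that handles the fault
2. After that function returns, switch back to the trap-time stack and resume at the faulting instruction
_pgfault_handler is the core of the user-level handler; it is only responsible for handling the fault itself
The user calls set_pgfault_handler() with a custom page fault handler function, and that argument is stored in _pgfault_handler
set_pgfault_handler() registers _pgfault_upcall with the kernel as the environment's page fault handler
When a user-mode page fault occurs, control first traps into the kernel and then returns to user mode in _pgfault_upcall, which calls _pgfault_handler. After _pgfault_handler returns, _pgfault_upcall restores the registers, switches stacks, and resumes at the faulting instruction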
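
The relationship between the two levels can be modelled in plain C. This is only a sketch of the registration logic, not the JOS sources: set_handler, upcall, registered_upcall, registrations and the two sample handlers are names made up for the example, and the assembly work of restoring registers is left out.

```c
#include <stddef.h>

typedef void (*handler_t)(void *utf);

static handler_t pgfault_handler;	/* models _pgfault_handler */
static handler_t registered_upcall;	/* models env_pgfault_upcall in the kernel */
static int registrations;		/* how often the "kernel" was asked to register */

/* Models _pgfault_upcall: call whatever C handler is currently installed.
 * The real routine then restores the trap-time state. */
static void
upcall(void *utf)
{
	pgfault_handler(utf);
}

/* Models set_pgfault_handler(): register the upcall with the kernel only
 * the first time, but always update the C-level handler. */
static void
set_handler(handler_t h)
{
	if (pgfault_handler == NULL) {
		registered_upcall = upcall;
		registrations++;
	}
	pgfault_handler = h;
}

/* Two sample handlers that just count how often they run. */
static int hits_a, hits_b;
static void handler_a(void *utf) { (void)utf; hits_a++; }
static void handler_b(void *utf) { (void)utf; hits_b++; }
```

Because the kernel only ever knows about the upcall, swapping the C handler later needs no further system call: a second set_handler() changes what the same registered entry point dispatches to.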

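When restoring the registers in pfentry.S and jumping back to the faulting code, keep the following in mind:

We cannot use jmp, because jmp needs a register to hold the target address, and every register must be restored to its value from before the fault
For a similar reason we cannot simply ret from the exception stack, because ret changes esp (ret is equivalent to pop %eip and adds 0x4 to esp), which would leave esp with the wrong value
So we store the eip on the trap-time stack (normally the regular user stack), restore esp to (old esp - 0x4), and then use ret to pop it
After eflags has been restored, no add or sub instruction (or anything else that modifies the flags) may be used

So we first take the old eip from the fault frame, write it at (old esp - 0x4), and subtract 0x4 from the esp saved in the fault frame.

When control comes back to _pgfault_upcall after the handler returns, the fault frame on the exception stack is laid out as follows: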

Field	Address
utf_fault_va	%esp
utf_err	0x4(%esp)
utf_regs	0x8(%esp)
utf_eip	0x28(%esp)
utf_eflags	0x2C(%esp)
utf_esp	0x30(%esp)
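
These offsets follow directly from the structure layout and can be checked at compile time. The sketch below mirrors struct PushRegs and struct UTrapframe from inc/trap.h, with one deliberate change that is an assumption of this example: pointer-sized fields are written as uint32_t so the offsets match 32-bit x86 even when the file is compiled on a 64-bit host.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors struct PushRegs: the eight registers in the order pusha stores them. */
struct PushRegs {
	uint32_t reg_edi, reg_esi, reg_ebp, reg_oesp;
	uint32_t reg_ebx, reg_edx, reg_ecx, reg_eax;
};

/* Mirrors struct UTrapframe, using uint32_t for every field (see lead-in). */
struct UTrapframe {
	uint32_t utf_fault_va;
	uint32_t utf_err;
	struct PushRegs utf_regs;
	uint32_t utf_eip;
	uint32_t utf_eflags;
	uint32_t utf_esp;
};

/* Compile-time checks of the offsets used in lib/pfentry.S. */
_Static_assert(offsetof(struct UTrapframe, utf_regs) == 0x08, "utf_regs");
_Static_assert(offsetof(struct UTrapframe, utf_eip) == 0x28, "utf_eip");
_Static_assert(offsetof(struct UTrapframe, utf_eflags) == 0x2C, "utf_eflags");
_Static_assert(offsetof(struct UTrapframe, utf_esp) == 0x30, "utf_esp");
```

utf_regs occupies 8 registers of 4 bytes each, which is why utf_eip sits 0x20 bytes after the start of utf_regs.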

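The steps for returning to the faulting instruction are:

Load the trap-time eip from 0x28(%esp) into the scratch register %eax
Subtract 0x04 from the value at 0x30(%esp) (giving trap-time esp - 0x04)
Load that value (trap-time esp - 0x04) from 0x30(%esp) into %ebx
Write the trap-time eip at address (trap-time esp - 0x04), that is, into the first free slot just past the top of the trap-time stack
Restore utf_regs, utf_eflags and utf_esp into the registers, in that order (utf_esp now holds trap-time esp - 0x04)
ret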
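
To convince yourself that these steps leave both %esp and %eip at their trap-time values, the sequence can be replayed on a toy machine. Everything here is an illustrative assumption: struct machine, load, store and upcall_return are invented names, memory is a small word array, and the general-purpose registers and flags are not modelled (popal and popfl only move %esp).

```c
#include <stdint.h>

#define MEM_WORDS 256

/* Hypothetical toy machine: a small memory plus %esp and %eip. */
struct machine {
	uint32_t mem[MEM_WORDS];
	uint32_t esp, eip;
};

/* Addresses are byte addresses and must be 4-byte aligned. */
static uint32_t load(struct machine *m, uint32_t addr) { return m->mem[addr / 4]; }
static void store(struct machine *m, uint32_t addr, uint32_t v) { m->mem[addr / 4] = v; }

/* Replay the return path; on entry m->esp points at the UTrapframe. */
static void
upcall_return(struct machine *m)
{
	uint32_t eax, ebx;

	eax = load(m, m->esp + 0x28);				/* movl 0x28(%esp), %eax */
	store(m, m->esp + 0x30, load(m, m->esp + 0x30) - 4);	/* subl $4, 0x30(%esp) */
	ebx = load(m, m->esp + 0x30);				/* movl 0x30(%esp), %ebx */
	store(m, ebx, eax);					/* movl %eax, (%ebx) */

	m->esp += 0x8;		/* addl $0x8, %esp: skip utf_fault_va, utf_err */
	m->esp += 0x20;		/* popal: registers themselves are not modelled */
	m->esp += 0x4;		/* addl $0x4, %esp: skip utf_eip */
	m->esp += 0x4;		/* popfl: flags are not modelled */
	m->esp = load(m, m->esp);	/* popl %esp: now trap-time esp - 4 */

	m->eip = load(m, m->esp);	/* ret pops the saved eip ... */
	m->esp += 4;			/* ... and moves %esp back up by 4 */
}
```

The final ret undoes the earlier subtraction of 4, which is why the saved esp has to be adjusted before it is popped.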

The implementation is as follows:

.text
.globl _pgfault_upcall
_pgfault_upcall:
	// Call the C page fault handler.
	pushl %esp			// function argument: pointer to UTF
	movl _pgfault_handler, %eax
	call *%eax
	addl $4, %esp			// pop function argument
	
	// Now the C page fault handler has returned and you must return
	// to the trap time state.
	// Push trap-time %eip onto the trap-time stack.
	//
	// Explanation:
	//   We must prepare the trap-time stack for our eventual return to
	//   re-execute the instruction that faulted.
	//   Unfortunately, we can't return directly from the exception stack:
	//   We can't call 'jmp', since that requires that we load the address
	//   into a register, and all registers must have their trap-time
	//   values after the return.
	//   We can't call 'ret' from the exception stack either, since if we
	//   did, %esp would have the wrong value.
	//   So instead, we push the trap-time %eip onto the *trap-time* stack!
	//   Below we'll switch to that stack and call 'ret', which will
	//   restore %eip to its pre-fault value.
	//
	//   In the case of a recursive fault on the exception stack,
	//   note that the word we're pushing now will fit in the
	//   blank word that the kernel reserved for us.
	//
	// Throughout the remaining code, think carefully about what
	// registers are available for intermediate calculations.  You
	// may find that you have to rearrange your code in non-obvious
	// ways as registers become unavailable as scratch space.
	//
	// LAB 4: Your code here.
	// Store the trap-time eip on the trap-time stack and adjust the saved trap-time esp
	movl 0x28(%esp), %eax
	subl $4, 0x30(%esp)
	movl 0x30(%esp), %ebx
	movl %eax, (%ebx)

	// Restore the trap-time registers.  After you do this, you
	// can no longer modify any general-purpose registers.
	// LAB 4: Your code here.
	// Skip utf_fault_va and utf_err so that %esp points at utf_regs
	addl $0x8, %esp
	popal

	// Restore eflags from the stack.  After you do this, you can
	// no longer use arithmetic operations or anything else that
	// modifies eflags.
	// LAB 4: Your code here.
	// Skip utf_eip, then restore eflags
	addl $0x4, %esp
	popfl

	// Switch back to the adjusted trap-time stack.
	// LAB 4: Your code here.
	popl %esp

	// Return to re-execute the instruction that faulted.
	// LAB 4: Your code here.
	ret

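Finally, implement set_pgfault_handler() in lib/pgfault.c, which installs the user page fault handler. It does the following:

Allocates a physical page for the user exception stack
Registers the user page fault entrypoint with the kernel
Associates the handler core _pgfault_handler with the user-supplied function

The implementation is as follows: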

void
set_pgfault_handler(void (*handler)(struct UTrapframe *utf))
{
	int r;

	if (_pgfault_handler == 0) {
		// First time through!
		// LAB 4: Your code here.
		int err;
		if ((err = sys_page_alloc(0, (void*)(UXSTACKTOP-PGSIZE), PTE_U|PTE_W|PTE_P)) < 0)
			panic("set_pgfault_handler error: %e", err);

		if ((err = sys_env_set_pgfault_upcall(0, _pgfault_upcall)) < 0)
			panic("set_pgfault_handler error: %e", err);
		// panic("set_pgfault_handler not implemented");
	}

	// Save handler pointer for assembly to call.
	_pgfault_handler = handler;
}

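Implementing copy-on-write fork

~~This lab is very hard; try to stay clear-headed while doing it~~

We now have everything needed to implement copy-on-write fork in user space.

lib/fork.c provides a skeleton for fork(). Like dumbfork(), fork() should

Create a child process
Copy the parent's address space mappings to the child

The basic control flow of fork() is:

The parent calls set_pgfault_handler() to install pgfault() as its page fault handler
The parent calls sys_exofork() to create a child
For every page below UTOP that is writable or copy-on-write, the parent calls duppage(), which maps the page into the child's address space (copying only the mapping) and then remaps it in the parent's own address space, in both cases read-only and marked PTE_COW
- Why read-only: copy-on-write relies on a write triggering a page fault, so that the faulting process gets a private copy instead of modifying the shared page. If the mapping is not read-only no fault occurs, and parent and child end up modifying the same page (the stack, for example), which breaks the program outright ~~(yes, I accidentally left the write permission on and then spent a whole day chasing a stack corruption bug)~~
- The child's mapping must be marked PTE_COW before the parent's. Why? If the parent's mapping were changed first, the parent could write to the page (its own stack, say) before the child's mapping is installed; that write would fault and give the parent a private writable copy, and the mapping then created for the child would share a page the parent can still write
The exception stack is not mapped this way; the child needs a fresh page for its exception stack (the parent allocates it on the child's behalf)

fork() also has to handle pages that are present in the parent but neither writable nor copy-on-write: map them read-only; nobody can modify them, so sharing is safe
The parent registers the page fault entrypoint for the child
The parent marks the child runnable, and the child starts running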
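
The permission rule that duppage() applies to each page in this flow can be isolated into a tiny function. The helper name dup_perm is made up for this sketch; the flag values are those of PTE_P, PTE_W, PTE_U and PTE_COW in the JOS headers.

```c
#include <stdint.h>

#define PTE_P   0x001u	/* present */
#define PTE_W   0x002u	/* writable */
#define PTE_U   0x004u	/* user-accessible */
#define PTE_COW 0x800u	/* copy-on-write marker kept in a software-available bit */

/* Hypothetical helper: the permissions to use when sharing a page whose
 * current page table entry is pte. */
static uint32_t
dup_perm(uint32_t pte)
{
	uint32_t perm = PTE_P | PTE_U;

	/* Writable or already copy-on-write: share read-only and mark COW. */
	if (pte & (PTE_W | PTE_COW))
		perm |= PTE_COW;

	return perm;
}
```

The result never contains PTE_W, which is exactly what guarantees that the first write from either side faults.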

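Whenever the parent or the child tries to write to a copy-on-write page, a page fault occurs and is handled by the user page fault handler as follows:

The kernel delivers the page fault to _pgfault_upcall, which calls the user-defined pgfault() to handle it
pgfault() checks that the fault was a write (FEC_WR in the error code) and that the page is marked PTE_COW; if not, it panics
pgfault() allocates a new page, copies the contents of the faulting page into it, gives it read/write permission, and then changes the mapping so that the new page replaces the old one

Several operations in the user-level lib/fork.c code above need to inspect the current environment's page table (for example, to find out whether the faulting page is marked PTE_COW). When the user environment was initialized and its page directory was set up, the kernel installed e->env_pgdir[PDX(UVPT)]=e->env_cr3|PTE_P|PTE_U, so the environment's own page directory is mapped at UVPT. lib/entry.S exports UVPT as uvpt, and UVPT+(UVPT>>12)*4 as uvpd

uvpt: uvpt[n] is the PTE of the n-th virtual page. For a virtual address la, its PTE (the entry in the page table that covers it) is uvpt[PGNUM(la)]
uvpd: uvpd[n] is the n-th entry of the page directory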
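
The index arithmetic behind uvpt and uvpd can be checked with a few lines of C. The macros follow the definitions of PGNUM and PDX in inc/mmu.h and the value of UVPT in inc/memlayout.h; the two helper functions uvpd_address and pte_address are invented for the example.

```c
#include <stdint.h>

#define PGSHIFT  12		/* log2(PGSIZE) */
#define PDXSHIFT 22		/* offset of the page directory index in an address */
#define UVPT     0xef400000u	/* where the page table is visible to user code */

#define PGNUM(la) (((uint32_t)(la)) >> PGSHIFT)
#define PDX(la)   ((((uint32_t)(la)) >> PDXSHIFT) & 0x3FF)

/* Address that lib/entry.S exports as uvpd. */
static uint32_t
uvpd_address(void)
{
	return UVPT + (UVPT >> 12) * 4;
}

/* Address of the PTE for la as seen through the uvpt window (4 bytes per entry). */
static uint32_t
pte_address(uint32_t la)
{
	return UVPT + PGNUM(la) * 4;
}
```

For la = 0x00800000 (UTEXT), PGNUM(la) is 0x800 and PDX(la) is 2, so its PTE is read from 0xef402000; uvpd_address() evaluates to 0xef7bd000.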

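For the details of why this works, see the article "MIT-JOS Series: Accessing Page Table Entries from User Mode in Detail".

fork() is implemented as follows: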

envid_t
fork(void)
{
	// LAB 4: Your code here.
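	// 1. The parent calls set_pgfault_handler() to install pgfault() as its page fault handler
	int err;
	extern void _pgfault_upcall(void);

	set_pgfault_handler(pgfault);
	// 2. The parent calls sys_exofork() to create the child
	envid_t chld_id = sys_exofork();
	if (chld_id < 0)
		panic("fork error: %e", chld_id);
	if (chld_id == 0) {
		// child: thisenv still refers to the parent, so fix it up
		thisenv = &envs[ENVX(sys_getenvid())];
		return 0;
	}
	// 3. Duplicate the mappings from UTEXT up to the top of the normal stack
	uintptr_t addr;
	for (addr = UTEXT; addr < USTACKTOP; addr += PGSIZE) {
		// Check the page directory entry first: uvpt[] may only be read
		// when the page table holding the entry is itself present.
		// The page must have both PTE_P and PTE_U set.
		if ((uvpd[PDX(addr)] & PTE_P) &&
		    (uvpt[PGNUM(addr)] & (PTE_P | PTE_U)) == (PTE_P | PTE_U)) {
			duppage(chld_id, PGNUM(addr));
		}
	}

	// 4. The parent allocates the child's exception stack and registers its page fault entrypoint
	if ((err=sys_page_alloc(chld_id, (void*)(UXSTACKTOP-PGSIZE), PTE_U|PTE_W|PTE_P)) < 0)
		panic("fork error: %e", err);
	if ((err=sys_env_set_pgfault_upcall(chld_id, _pgfault_upcall)) < 0)
		panic("fork error: %e", err);
	
	// 5. The parent marks the child runnable, and the child starts running
	if ((err=sys_env_set_status(chld_id, ENV_RUNNABLE)) < 0)
		panic("fork error: %e", err);
	
	return chld_id;
	// panic("fork not implemented");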
}

duppage() is implemented as follows:

static int
duppage(envid_t envid, unsigned pn)
{
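	int err, perm = PTE_U | PTE_P;
	uintptr_t va = pn*PGSIZE;
	// LAB 4: Your code here.
	// Writable or already-COW pages are shared read-only and marked PTE_COW;
	// pages that are only readable keep a plain read-only mapping.
	if ((uvpt[pn] & PTE_W) || (uvpt[pn] & PTE_COW))
		perm |= PTE_COW;

	// Map the page into the child first ...
	if ((err=sys_page_map(0, (void*)va, envid, (void*)va, perm)) < 0)
		panic("duppage error: %e", err);

	// ... then remap it in the parent with the same permissions.
	if ((err=sys_page_map(0, (void*)va, 0, (void*)va, perm)) < 0)
		panic("duppage error: %e", err);
	// panic("duppage not implemented");
	return 0;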
}

pgfault() is implemented as follows:

static void
pgfault(struct UTrapframe *utf)
{
	void *addr = (void *) ROUNDDOWN(utf->utf_fault_va, PGSIZE);
	uint32_t err = utf->utf_err;
	int r;

	// Check that the faulting access was (1) a write, and (2) to a
	// copy-on-write page.  If not, panic.
	// Hint:
	//   Use the read-only page table mappings at uvpt
	//   (see <inc/memlayout.h>).

	// LAB 4: Your code here.
	if (!(err & FEC_WR)) {
		panic("pgfault error: not writing a page.");
	}
	if (!(uvpt[PGNUM(utf->utf_fault_va)] & PTE_COW))
		panic("pgfault error: not a COW page.");

	// Allocate a new page, map it at a temporary location (PFTEMP),
	// copy the data from the old page to the new page, then move the new
	// page to the old page's address.
	// Hint:
	//   You should make three system calls.

	// LAB 4: Your code here.
	
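	// Allocate a fresh page at PFTEMP and copy the faulting page into it
	if ((r=sys_page_alloc(0, (void*)PFTEMP, PTE_U|PTE_W|PTE_P)) < 0)
		panic("pgfault error: %e", r);
	memmove((void*)PFTEMP, addr, PGSIZE);

	// Install the private copy at the faulting address, then drop the temporary mapping
	if ((r=sys_page_map(0, (void*)PFTEMP, 0, addr, PTE_U|PTE_W|PTE_P)) < 0)
		panic("pgfault error: %e", r);
	if ((r=sys_page_unmap(0, (void*)PFTEMP)) < 0)
		panic("pgfault error: %e", r);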

	// panic("pgfault not implemented");
}

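Note the first line, void *addr = (void *) ROUNDDOWN(utf->utf_fault_va, PGSIZE);: the faulting address must be rounded down to a 4K boundary. I forgot this at first, and the error I got was

[00001000] user panic in <unknown> at lib/fork.c:31: pgfault error: not writing a page.

My guess is that memmove ran past the end of the faulting page into the next page and took a second page fault there, this time on a read. That would explain why the reported error was not sys_page_map rejecting an unaligned address but pgfault() complaining that the fault was not a write. The misleading message made the problem extremely hard to find, and together with all sorts of other bugs, this fork() took me two full days to debug.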
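
A small check makes the alignment issue visible. This is a stand-alone sketch: round_down does the same job as JOS's ROUNDDOWN macro, and copy_stays_in_page is a hypothetical helper that asks whether a PGSIZE-byte copy starting at a given address stays within a single page.

```c
#include <stdint.h>

#define PGSIZE 4096u

/* Round a down to the nearest multiple of n. */
static uint32_t
round_down(uint32_t a, uint32_t n)
{
	return a - a % n;
}

/* Hypothetical helper: does a PGSIZE-byte copy starting at addr stay
 * inside one page? Only true when addr is page-aligned. */
static int
copy_stays_in_page(uint32_t addr)
{
	return round_down(addr, PGSIZE) == round_down(addr + PGSIZE - 1, PGSIZE);
}
```

A fault at 0x00801234 rounds down to 0x00801000; copying a full page from the unrounded address would read 0x234 bytes of the following page as well.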