gVisor system call initialization

In Go, Syscall is implemented in the assembly file syscall/asm_linux_amd64.s:

// func Syscall(trap int64, a1, a2, a3 int64) (r1, r2, err int64);
// Trap # in AX, args in DI SI DX R10 R8 R9, return in AX DX
// Note that this differs from "standard" ABI convention, which
// would pass 4th arg in CX, not R10.
TEXT    ·Syscall(SB),NOSPLIT,$0-56
    CALL    runtime·entersyscall(SB)
    MOVQ    a1+8(FP), DI
    MOVQ    a2+16(FP), SI
    MOVQ    a3+24(FP), DX
    MOVQ    $0, R10
    MOVQ    $0, R8
    MOVQ    $0, R9
    MOVQ    trap+0(FP), AX    // syscall entry
    SYSCALL     // enter the kernel and run its syscall handler
    // 0xfffffffffffff001 is -MAX_ERRNO (-4095) from Linux, viewed as an unsigned value, http://lxr.free-electrons.com/source/include/linux/err.h#L17
    CMPQ    AX, $0xfffffffffffff001        // above this value means an error: r1 = -1
    JLS    ok
    MOVQ    $-1, r1+32(FP)
    MOVQ    $0, r2+40(FP)
    NEGQ    AX             // the kernel returns a negative errno; negate it to get the positive error code
    MOVQ    AX, err+48(FP)
    CALL    runtime·exitsyscall(SB)
    RET
ok:
    MOVQ    AX, r1+32(FP)
    MOVQ    DX, r2+40(FP)
    MOVQ    $0, err+48(FP)
    CALL    runtime·exitsyscall(SB)
    RET
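
The error check at the end of Syscall can be restated in Go. This is a minimal sketch, not code from the Go runtime: decodeReturn and maxErrnoBoundary are names made up for illustration, and the function only mirrors the CMPQ/JLS/NEGQ sequence shown above.

```go
package main

import "fmt"

// maxErrnoBoundary is -4095 (the negated Linux MAX_ERRNO) viewed as an
// unsigned 64-bit value; it is the constant the assembly compares against.
// The name is hypothetical.
const maxErrnoBoundary uint64 = 0xfffffffffffff001

// decodeReturn mirrors the CMPQ/JLS branch: raw is the value the kernel left
// in AX. It returns r1 and a positive errno (0 on success).
func decodeReturn(raw uint64) (r1 uint64, errno uint64) {
	if raw <= maxErrnoBoundary { // JLS: unsigned "below or same" means success
		return raw, 0
	}
	return ^uint64(0), -raw // r1 = -1, errno = NEGQ AX
}

func main() {
	minusTwo := ^uint64(0) - 1 // -2: a failing call whose errno is 2
	fmt.Println(decodeReturn(3))
	fmt.Println(decodeReturn(minusTwo))
}
```

The comparison is unsigned, which is why a single constant is enough to separate valid results from the small range of negative errno values.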

gVisor's own SYSCALL entry point, sysenter, from ring0/entry_impl_amd64.s:

// See entry_amd64.go.
TEXT ·sysenter(SB),NOSPLIT,$0
        // _RFLAGS_IOPL0 is always set in the user mode and it is never set in
        // the kernel mode. See the comment of UserFlagsSet for more details.
        TESTL $_RFLAGS_IOPL0, R11
        JZ kernel
user:
        SWAP_GS()
        MOVQ AX, ENTRY_SCRATCH0(GS)            // Save user AX on scratch.
        MOVQ ENTRY_KERNEL_CR3(GS), AX          // Get kernel cr3 on AX.
        WRITE_CR3()                            // Switch to kernel cr3.

        MOVQ ENTRY_CPU_SELF(GS), AX            // Load vCPU.
        MOVQ CPU_REGISTERS+PTRACE_RAX(AX), AX  // Get user regs.
        REGISTERS_SAVE(AX, 0)                  // Save all except IP, FLAGS, SP, AX.
        MOVQ CX,  PTRACE_RIP(AX)
        MOVQ R11, PTRACE_FLAGS(AX)
        MOVQ SP,  PTRACE_RSP(AX)
        MOVQ ENTRY_SCRATCH0(GS), CX            // Load saved user AX value.
        MOVQ CX,  PTRACE_RAX(AX)               // Save everything else.
        MOVQ CX,  PTRACE_ORIGRAX(AX)

        MOVQ ENTRY_CPU_SELF(GS), AX            // Load vCPU.
        MOVQ CPU_REGISTERS+PTRACE_RSP(AX), SP  // Get stacks.
        MOVQ $0, CPU_ERROR_CODE(AX)            // Clear error code.
        MOVQ $1, CPU_ERROR_TYPE(AX)            // Set error type to user.

        // Return to the kernel, where the frame is:
        //
        //      vector      (sp+32)
        //      userCR3     (sp+24)
        //      regs        (sp+16)
        //      cpu         (sp+8)
        //      vcpu.Switch (sp+0)
        //
        MOVQ CPU_REGISTERS+PTRACE_RBP(AX), BP // Original base pointer.
        MOVQ $Syscall, 32(SP)                 // Output vector.
        RET
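
The first two instructions of sysenter decide whether the SYSCALL came from user mode or from the sentry itself by testing the flags that SYSCALL saved in R11. A rough Go sketch of that test follows. The constant value is an assumption: it takes _RFLAGS_IOPL0 to be bit 12 of RFLAGS, the low bit of the IOPL field, and is not quoted from the gVisor source. cameFromUser is a hypothetical name.

```go
package main

import "fmt"

// rflagsIOPL0 is ASSUMED to be bit 12 of RFLAGS (low bit of the two-bit
// IOPL field). The name mirrors _RFLAGS_IOPL0; the value is an assumption.
const rflagsIOPL0 uint64 = 1 << 12

// cameFromUser reports whether the flags saved in R11 by SYSCALL belong to
// user mode, which is the decision TESTL/JZ makes at the top of sysenter.
func cameFromUser(savedFlags uint64) bool {
	return savedFlags&rflagsIOPL0 != 0
}

func main() {
	fmt.Println(cameFromUser(0x202 | rflagsIOPL0)) // flags with the bit set
	fmt.Println(cameFromUser(0x2))                 // flags with the bit clear
}
```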

Debugging KernelSyscall with dlv

(dlv) b pkg/sentry/platform/kvm/bluepill_arm64.go:108
Breakpoint 1 set at 0x87ab50 for gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).KernelSyscall() pkg/sentry/platform/kvm/bluepill_arm64.go:108
(dlv) c
> gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).KernelSyscall() pkg/sentry/platform/kvm/bluepill_arm64.go:108 (hits goroutine(236):1 total:1) (PC: 0x87ab50)
Warning: debugging optimized function
(dlv) bt
0  0x000000000087ab50 in gvisor.dev/gvisor/pkg/sentry/platform/kvm.(*vCPU).KernelSyscall
   at pkg/sentry/platform/kvm/bluepill_arm64.go:108
1  0x0000000000000000 in ???
   at :0
   error: NULL address
(truncated)
(dlv) c

 

System calls

For the Linux kernel, Anatomy of a system call, part 1 gives a good overview of how a syscall is handled in the kernel. MSR_LSTAR is a model-specific register that holds the “Target RIP for the called procedure when SYSCALL is executed in 64-bit mode”; see Table 2-2 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 4: Model-Specific Registers. In kernel v4.20, the latest at the time of writing, syscall_init sets MSR_LSTAR to entry_SYSCALL_64, which calls do_syscall_64 to dispatch on the syscall number.

For gVisor, How gvisor trap to syscall handler in kvm platform explains: “On the KVM platform, system call interception works much like a normal OS. When running in guest mode, the platform sets MSR_LSTAR to a system call handler sysenter, which is invoked whenever an application (or the sentry itself) executes a SYSCALL instruction.”

SyscallTable is a struct. All the implemented syscalls are listed in var AMD64.
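
To make the idea of a syscall table concrete, here is a much-reduced sketch of table-driven dispatch. It is not gVisor's SyscallTable: the table and handler types, dispatch, demoTable and errNoSys are all hypothetical, and the single registered entry (number 39, getpid on amd64 Linux) just returns a fixed value.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSys stands in for ENOSYS (hypothetical name).
var errNoSys = errors.New("function not implemented")

// handler is a hypothetical signature for one syscall implementation.
type handler func(args [6]uint64) (uint64, error)

// table is a hypothetical stand-in for a syscall table: number -> handler.
type table struct {
	handlers map[uint64]handler
}

// dispatch looks up the handler for nr and runs it, or reports errNoSys
// when the syscall is not implemented.
func (t *table) dispatch(nr uint64, args [6]uint64) (uint64, error) {
	h, ok := t.handlers[nr]
	if !ok {
		return 0, errNoSys
	}
	return h(args)
}

// demoTable registers one made-up entry under number 39.
func demoTable() *table {
	return &table{handlers: map[uint64]handler{
		39: func([6]uint64) (uint64, error) { return 1, nil },
	}}
}

func main() {
	tbl := demoTable()
	fmt.Println(tbl.dispatch(39, [6]uint64{}))
	fmt.Println(tbl.dispatch(9999, [6]uint64{}))
}
```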

pkg/sentry/platform/ring0/entry_amd64.go

//
// The CPU state will be set to c.Registers().
func Start()

// Exception stubs.
func divideByZero()
func debug()
func nmi()
func breakpoint()
func overflow()
func boundRangeExceeded()
func invalidOpcode()
func deviceNotAvailable()
func doubleFault()
func coprocessorSegmentOverrun()
func invalidTSS()
func segmentNotPresent()
func stackSegmentFault()
func generalProtectionFault()
func pageFault()
func x87FloatingPointException()
func alignmentCheck()
func machineCheck()
func simdFloatingPointException()
func virtualizationException()
func securityException()
func syscallInt80()

Calling ring0.Start

root@cloud:~# go tool objdump /usr/local/bin/runsc  | grep arm64 | grep  Start
TEXT gvisor.dev/gvisor/pkg/sentry/platform/ring0.Start(SB) bazel-out/aarch64-dbg-ST-4c64f0b3d5c7/bin/pkg/sentry/platform/ring0/entry_impl_arm64.s
root@cloud:~# 

ring0/entry_impl_amd64.s:257:TEXT ·Start(SB),NOSPLIT,$0

    // Set the entrypoint for the kernel.
        kernelUserRegs.RIP = uint64(reflect.ValueOf(ring0.Start).Pointer())           // Entry point; ring0.Start goes on to register the syscall handler.
        kernelUserRegs.RAX = uint64(reflect.ValueOf(&c.CPU).Pointer())
        kernelUserRegs.RSP = c.StackTop()
        kernelUserRegs.RFLAGS = ring0.KernelFlagsSet

// See entry_amd64.go.
TEXT ·Start(SB),NOSPLIT,$0
        PUSHQ $0x0            // Previous frame pointer.
        MOVQ SP, BP           // Set frame pointer.
        PUSHQ AX              // First argument (CPU).
        CALL ·start(SB)       // Call Go hook: func start(c *CPU).
        JMP ·resume(SB)       // Restore to registers.

// See entry_amd64.go.
TEXT ·sysenter(SB),NOSPLIT,$0
        // _RFLAGS_IOPL0 is always set in the user mode and it is never set in
        // the kernel mode. See the comment of UserFlagsSet for more details.
        TESTL $_RFLAGS_IOPL0, R11
        JZ kernel

//go:nosplit
func start(c *CPU) {
        // Save per-cpu & FS segment.
        WriteGS(kernelAddr(c.kernelEntry))
        WriteFS(uintptr(c.registers.Fs_base))

        // Initialize floating point.
        //
        // Note that on skylake, the valid XCR0 mask reported seems to be 0xff.
        // This breaks down as:
        //
        //      bit0   - x87
        //      bit1   - SSE
        //      bit2   - AVX
        //      bit3-4 - MPX
        //      bit5-7 - AVX512
        //
        // For some reason, enabled MPX & AVX512 on platforms that report them
        // seems to be cause a general protection fault. (Maybe there are some
        // virtualization issues and these aren't exported to the guest cpuid.)
        // This needs further investigation, but we can limit the floating
        // point operations to x87, SSE & AVX for now.
        fninit()
        xsetbv(0, validXCR0Mask&0x7)

        // Set the syscall target.
        wrmsr(_MSR_LSTAR, kernelFunc(sysenter)) // sysenter
        wrmsr(_MSR_SYSCALL_MASK, KernelFlagsClear|_RFLAGS_DF)

        // NOTE: This depends on having the 64-bit segments immediately
        // following the 32-bit user segments. This is simply the way the
        // sysret instruction is designed to work (it assumes they follow).
        wrmsr(_MSR_STAR, uintptr(uint64(Kcode)<<32|uint64(Ucode32)<<48))
        wrmsr(_MSR_CSTAR, kernelFunc(sysenter))
}
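
The expression validXCR0Mask&0x7 in start() keeps only the three lowest XCR0 bits. The sketch below decodes a mask using the bit layout given in the comment above (bit0 x87, bit1 SSE, bit2 AVX, bits 3-4 MPX, bits 5-7 AVX512). xcr0Features and enabledFeatures are hypothetical names, not gVisor identifiers.

```go
package main

import "fmt"

// xcr0Features follows the bit layout listed in the comment inside start().
// The table itself is a hypothetical helper.
var xcr0Features = []struct {
	mask uint64
	name string
}{
	{0x01, "x87"},
	{0x02, "SSE"},
	{0x04, "AVX"},
	{0x18, "MPX"},
	{0xe0, "AVX512"},
}

// enabledFeatures lists the feature groups that have at least one bit set.
func enabledFeatures(xcr0 uint64) []string {
	var out []string
	for _, f := range xcr0Features {
		if xcr0&f.mask != 0 {
			out = append(out, f.name)
		}
	}
	return out
}

func main() {
	reported := uint64(0xff) // the mask the comment says Skylake reports
	fmt.Println(enabledFeatures(reported))
	fmt.Println(enabledFeatures(reported & 0x7)) // what start() passes to xsetbv
}
```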

Callers of Start

pkg/sentry/platform/kvm/machine_amd64.go:135: kernelUserRegs.RIP = uint64(reflect.ValueOf(ring0.Start).Pointer())
pkg/sentry/platform/kvm/machine_arm64_unsafe.go:122: data = uint64(reflect.ValueOf(ring0.Start).Pointer())
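
Both call sites obtain the address of ring0.Start with reflect.ValueOf(...).Pointer(). The sketch below shows the same idiom on an ordinary Go function. entry, entryAddr and entryName are hypothetical stand-ins, and the runtime.FuncForPC lookup is only a sanity check added for illustration, not something the gVisor code does here.

```go
package main

import (
	"fmt"
	"reflect"
	"runtime"
)

// entry is a hypothetical stand-in for ring0.Start.
//
//go:noinline
func entry() {}

// entryAddr returns the code address of entry, using the same reflect idiom
// as the call sites listed above.
func entryAddr() uintptr {
	return reflect.ValueOf(entry).Pointer()
}

// entryName resolves the address back to a symbol name as a sanity check.
func entryName() string {
	return runtime.FuncForPC(entryAddr()).Name()
}

func main() {
	fmt.Printf("%#x %s\n", entryAddr(), entryName())
}
```

In the KVM platform the resulting address is stored in the vCPU's instruction pointer, so the guest begins executing at that function.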

func newMachine(vm int) (*machine, error) {


        // Initialize architecture state.
        if err := m.initArchState(); err != nil {
                m.Destroy()
                return nil, err
        }
}

gVisor syscall initialization

start(), listed in full above, is where gVisor points the SYSCALL instruction at its own handler: it writes the address of sysenter to MSR_LSTAR (and to MSR_CSTAR).

 

KVM_SET_REGS initializes the regs.rip register

VM* kvm_init(uint8_t code[], size_t len) {
  // Open the KVM device.
  int kvmfd = open("/dev/kvm", O_RDONLY | O_CLOEXEC);
  if(kvmfd < 0) pexit("open(/dev/kvm)");
  // Check the KVM API version.
  int api_ver = ioctl(kvmfd, KVM_GET_API_VERSION, 0);
  if(api_ver < 0) pexit("KVM_GET_API_VERSION");
  if(api_ver != KVM_API_VERSION) {
    error("Got KVM api version %d, expected %d\n",
      api_ver, KVM_API_VERSION);
  }
  // Create the KVM virtual machine.
  int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
  if(vmfd < 0) pexit("ioctl(KVM_CREATE_VM)");
  // Allocate memory for the guest. mmap reports failure with MAP_FAILED, not NULL.
  void *mem = mmap(0,
    MEM_SIZE,
    PROT_READ | PROT_WRITE,
    MAP_SHARED | MAP_ANONYMOUS,
    -1, 0);
  if(mem == MAP_FAILED) pexit("mmap(MEM_SIZE)");
  size_t entry = 0;
  // Copy the guest code into guest memory.
  memcpy((char*) mem + entry, code, len);
  // Describe the memory region.
  struct kvm_userspace_memory_region region = {
    .slot = 0,
    .flags = 0,
    .guest_phys_addr = 0,
    .memory_size = MEM_SIZE,
    .userspace_addr = (size_t) mem
  };
  // Register the memory region with KVM.
  if(ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region) < 0) {
    pexit("ioctl(KVM_SET_USER_MEMORY_REGION)");
  }
  // Create the vCPU.
  int vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
  if(vcpufd < 0) pexit("ioctl(KVM_CREATE_VCPU)");
  // Get the size of the kvm_run runtime structure.
  size_t vcpu_mmap_size = ioctl(kvmfd, KVM_GET_VCPU_MMAP_SIZE, NULL);
  // Map kvm_run for this vCPU so that KVM's runtime state can be read.
  struct kvm_run *run = (struct kvm_run*) mmap(0,
    vcpu_mmap_size,
    PROT_READ | PROT_WRITE,
    MAP_SHARED,
    vcpufd, 0);
  if(run == MAP_FAILED) pexit("mmap(kvm_run)");
  // Fill in the VM structure.
  VM *vm = (VM*) malloc(sizeof(VM));
  *vm = (struct VM){
    .mem = mem,
    .mem_size = MEM_SIZE,
    .vcpufd = vcpufd,
    .run = run
  };
  // Set the general-purpose registers (rip, rsp, ...).
  setup_regs(vm, entry);
  // Set up segments and paging for long mode.
  setup_long_mode(vm);

  return vm;
}
/* set rip = entry point
 * set rsp = MAX_KERNEL_SIZE + KERNEL_STACK_SIZE (the max address can be used)
 *
 * set rdi = PS_LIMIT (start of free (unpaged) physical pages)
 * set rsi = MEM_SIZE - rdi (total length of free pages)
 * Kernel could use rdi and rsi to initialize its memory allocator.
 */
void setup_regs(VM *vm, size_t entry) {
  struct kvm_regs regs;
  // KVM_GET_REGS reads the general-purpose registers (KVM_GET_SREGS would read the special ones).
  if(ioctl(vm->vcpufd, KVM_GET_REGS, &regs) < 0) pexit("ioctl(KVM_GET_REGS)");
  // Initialize the registers.
  regs.rip = entry;        // where the guest starts executing
  regs.rsp = MAX_KERNEL_SIZE + KERNEL_STACK_SIZE; /* temporary stack */
  regs.rdi = PS_LIMIT; /* start of free pages */
  regs.rsi = MEM_SIZE - regs.rdi; /* total length of free pages */
  regs.rflags = 0x2;
  // KVM_SET_REGS writes the general-purpose registers back.
  if(ioctl(vm->vcpufd, KVM_SET_REGS, &regs) < 0) pexit("ioctl(KVM_SET_REGS)");
}
//go:nosplit
func (c *CPU) SwitchToUser(switchOpts SwitchOpts) (vector Vector) {
        userCR3 := switchOpts.PageTables.CR3(!switchOpts.Flush, switchOpts.UserPCID)
        c.kernelCR3 = uintptr(c.kernel.PageTables.CR3(true, switchOpts.KernelPCID))

        // Sanitize registers.
        regs := switchOpts.Registers
        regs.Eflags &= ^uint64(UserFlagsClear)
        regs.Eflags |= UserFlagsSet
        regs.Cs = uint64(Ucode64) // Required for iret.
        regs.Ss = uint64(Udata)   // Ditto.

        // Perform the switch.
        swapgs()                                         // GS will be swapped on return.
        WriteFS(uintptr(regs.Fs_base))                   // escapes: no. Set application FS.
        WriteGS(uintptr(regs.Gs_base))                   // escapes: no. Set application GS.
        LoadFloatingPoint(switchOpts.FloatingPointState) // escapes: no. Copy in floating point.
        if switchOpts.FullRestore {
                vector = iret(c, regs, uintptr(userCR3))
        } else {
                vector = sysret(c, regs, uintptr(userCR3))
        }
        SaveFloatingPoint(switchOpts.FloatingPointState) // escapes: no. Copy out floating point.
        WriteFS(uintptr(c.registers.Fs_base))            // escapes: no. Restore kernel FS.
        return
}

System call initialization (Linux)

http://zhongmingmao.me/2019/04/20/linux-system-call-process/

System call initialization follows the path start_kernel --> trap_init --> cpu_init --> syscall_init. It has two steps: interrupt initialization first, then system call initialization.

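When the programmed exception with vector 0x80 is raised, how does the system know to run the handler system_call? During kernel initialization the interrupt setup function trap_init runs; it copies the interrupt and exception vector table to its designated location, so the system can jump to the right handler from the vector number. The system call interrupt is a software interrupt with vector 0x80. In the first step, the software interrupt handler system_call is bound to 0x80, so executing int 0x80 jumps to that handler. Note that on x86-64 kernels of this era the 0x80 vector is bound to the compatibility entry entry_INT80_compat; entry_SYSCALL_64 is the entry for the SYSCALL instruction and is installed in the second step.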

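Next, the handler has to run the system call that matches the syscall number, which requires initializing system calls, that is, binding syscall numbers to their service routines. The kernel maintains a system call table, the array sys_call_table defined in the kernel source file arch/x86/entry/syscall_64.c. In the second step, cpu_init calls syscall_init to finish per-CPU state initialization. syscall_init sets up the system call entry point; it takes no arguments and begins by filling two model-specific registers: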

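The first model-specific register is MSR_STAR. Its bits 63:48 hold the user code segment; the sysret instruction, which returns from a system call to user code at the corresponding privilege level, derives the CS and SS selectors from them. Likewise, judging from the kernel code, bits 47:32 serve as the base for the CS and SS selectors when a user-space application executes a system call.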

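The second line of code fills MSR_LSTAR with the system call entry point entry_SYSCALL_64. entry_SYSCALL_64 is defined in the assembly file arch/x86/entry/entry_64.S and contains the preparation done before a system call is executed.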

void syscall_init(void)
{
    wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
    wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
  ...
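
The MSR_STAR layout described above can be checked with a few lines of arithmetic. The sketch below packs the two selectors the way both syscall_init and gVisor's start() do, then derives the selectors that SYSCALL and a 64-bit SYSRET load, following the rules in the Intel SDM. The selector values 0x10 and 0x23 are assumptions based on the usual x86-64 Linux GDT layout, not values quoted in this article, and all function names are hypothetical.

```go
package main

import "fmt"

// Selector values ASSUMED from the usual x86-64 Linux GDT layout.
const (
	kernelCS uint64 = 0x10 // assumed __KERNEL_CS
	user32CS uint64 = 0x23 // assumed __USER32_CS
)

// starValue packs the kernel selector into bits 47:32 and the 32-bit user
// selector into bits 63:48.
func starValue(kcs, ucs32 uint64) uint64 {
	return kcs<<32 | ucs32<<48
}

// syscallSelectors returns the CS and SS that SYSCALL loads from STAR[47:32].
func syscallSelectors(star uint64) (cs, ss uint16) {
	base := uint16(star >> 32)
	return base & 0xfffc, base + 8
}

// sysretSelectors returns the CS and SS that a 64-bit SYSRET loads from
// STAR[63:48]; the requested privilege level is forced to 3.
func sysretSelectors(star uint64) (cs, ss uint16) {
	base := uint16(star >> 48)
	return (base + 16) | 3, (base + 8) | 3
}

func main() {
	star := starValue(kernelCS, user32CS)
	fmt.Printf("STAR = %#016x\n", star)
	kcs, kss := syscallSelectors(star)
	ucs, uss := sysretSelectors(star)
	fmt.Printf("syscall: CS=%#x SS=%#x\n", kcs, kss)
	fmt.Printf("sysret:  CS=%#x SS=%#x\n", ucs, uss)
}
```

The fixed offsets of 8 and 16 are why the 64-bit user segments have to sit right after the 32-bit user code segment in the GDT, which is the dependency the NOTE in gVisor's start() points out.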

Debug with gdb: set breakpoints on start_kernel, trap_init, cpu_init, and syscall_init to verify the initialization path.

System call execution

When a user-space program issues a system call, an x86-64 program should jump straight to entry_SYSCALL_64; do_syscall_64 then runs the system call selected by the syscall number.

SYM_CODE_START(entry_SYSCALL_64)
...
    /* IRQs are off. */
    movq    %rax, %rdi
    movq    %rsp, %rsi
    call    do_syscall_64        /* returns with IRQs disabled */
 
 
Original article: https://www.cnblogs.com/dream397/p/14275777.html