Linux Kernel Development: Summary Notes (Chapter 15), The Process Address Space

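1. Address Space

A process address space consists of the virtual memory addressable by the process, together with the addresses within that virtual memory that the kernel allows the process to use.

Each process is given a flat 32-bit or 64-bit address space, with the size depending on the architecture. "Flat" means that the address space exists as a single contiguous range.

Modern operating systems with virtual memory generally use a flat address space rather than a segmented memory model.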

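Normally each process has its own address space: the same memory address in two different processes' address spaces refers to completely unrelated memory. A process can, however, choose to share its address space with other processes; such processes are called threads.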

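With a 32-bit address space a process can address 4 GB of virtual memory, but that does not mean it may access every one of those addresses. The intervals of addresses that the process is allowed to access are called memory areas.

A process can access memory only within a valid memory area. Each memory area also carries permissions for the owning process, such as readable, writable, and executable.

If a process accesses an address that is not in a valid memory area, or accesses a valid address in a way its permissions do not allow, the kernel kills the process with the "Segmentation Fault" message.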

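Memory areas can contain the following:

  • A memory map of the executable file's code, called the text section.
  • A memory map of the executable file's initialized global variables, called the data section.
  • A memory map of the zero page containing uninitialized global variables, called the bss section.
  • A memory map of the zero page used for the process's user-space stack.
  • An additional text, data, and bss section for each shared library, such as the C library and the dynamic linker, loaded into the process's address space.
  • Any memory-mapped files.
  • Any shared memory segments.
  • Any anonymous memory mappings, such as those associated with malloc().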

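2. The Memory Descriptor

The kernel represents a process's address space with a data structure called the memory descriptor, which contains all the information related to that address space.

The memory descriptor is represented by struct mm_struct. The book gives its location as <linux/sched.h>; in newer kernels, including the version listed below, the definition lives in <linux/mm_types.h>.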

struct mm_struct {
    struct vm_area_struct *mmap;        /* list of VMAs (memory areas) */
    struct rb_root mm_rb;                /* red-black tree of VMAs */
    u32 vmacache_seqnum;                   /* per-thread vmacache */
#ifdef CONFIG_MMU
    unsigned long (*get_unmapped_area) (struct file *filp,
                unsigned long addr, unsigned long len,
                unsigned long pgoff, unsigned long flags);
#endif
    unsigned long mmap_base;        /* base of mmap area */
    unsigned long mmap_legacy_base;         /* base of mmap area in bottom-up allocations */
    unsigned long task_size;        /* size of task vm space */
    unsigned long highest_vm_end;        /* highest vma end address */
    pgd_t * pgd;                        /* page global directory */
    atomic_t mm_users;            /* How many users with user space? */
    atomic_t mm_count;            /* How many references to "struct mm_struct" (users count as 1); the primary reference count */
    atomic_long_t nr_ptes;            /* PTE page table pages */
#if CONFIG_PGTABLE_LEVELS > 2
    atomic_long_t nr_pmds;            /* PMD page table pages */
#endif
    int map_count;                /* number of VMAs */

    spinlock_t page_table_lock;        /* Protects page tables and some counters */
    struct rw_semaphore mmap_sem;        /* memory area semaphore */

    struct list_head mmlist;        /* List of maybe swapped mm's.    These are globally strung
                         * together off init_mm.mmlist, and are protected
                         * by mmlist_lock; this links all mm_structs into one list
                         */


    unsigned long hiwater_rss;    /* High-watermark of RSS usage */
    unsigned long hiwater_vm;    /* High-water virtual memory usage */

    unsigned long total_vm;        /* Total pages mapped */
    unsigned long locked_vm;    /* Pages that have PG_mlocked set */
    unsigned long pinned_vm;    /* Refcount permanently increased */
    unsigned long shared_vm;    /* Shared pages (files) */
    unsigned long exec_vm;        /* VM_EXEC & ~VM_WRITE */
    unsigned long stack_vm;        /* VM_GROWSUP/DOWN */
    unsigned long def_flags;
    unsigned long start_code, end_code, start_data, end_data;        /* start and end of code, start and end of data */
    unsigned long start_brk, brk, start_stack;                /* start of heap, current end of heap, start of the process stack */
    unsigned long arg_start, arg_end, env_start, env_end;

    unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

    /*
     * Special counters, in some configurations protected by the
     * page_table_lock, in other configurations by being atomic.
     */
    struct mm_rss_stat rss_stat;

    struct linux_binfmt *binfmt;

    cpumask_var_t cpu_vm_mask_var;        /* lazy TLB switch mask */

    /* Architecture-specific MM context */
    mm_context_t context;

    unsigned long flags; /* Must use atomic bitops to access the bits; status flags */

    struct core_state *core_state; /* coredumping support */
#ifdef CONFIG_AIO
    spinlock_t            ioctx_lock;
    struct kioctx_table __rcu    *ioctx_table;
#endif
#ifdef CONFIG_MEMCG
    /*
     * "owner" points to a task that is regarded as the canonical
     * user/owner of this mm. All of the following must be true in
     * order for it to be changed:
     *
     * current == mm->owner
     * current->mm != mm
     * new_owner->mm == mm
     * new_owner->alloc_lock is held
     */
    struct task_struct __rcu *owner;
#endif
    struct user_namespace *user_ns;

    /* store ref to file /proc/<pid>/exe symlink points to */
    struct file __rcu *exe_file;
#ifdef CONFIG_MMU_NOTIFIER
    struct mmu_notifier_mm *mmu_notifier_mm;
#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
    pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
#ifdef CONFIG_CPUMASK_OFFSTACK
    struct cpumask cpumask_allocation;
#endif
#ifdef CONFIG_NUMA_BALANCING
    /*
     * numa_next_scan is the next time that the PTEs will be marked
     * pte_numa. NUMA hinting faults will gather statistics and migrate
     * pages to new nodes if necessary.
     */
    unsigned long numa_next_scan;

    /* Restart point for scanning and setting pte_numa */
    unsigned long numa_scan_offset;

    /* numa_scan_seq prevents two threads setting pte_numa */
    int numa_scan_seq;
#endif
#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
    /*
     * An operation with batched TLB flushing is going on. Anything that
     * can move process memory needs to flush the TLB when moving a
     * PROT_NONE or PROT_NUMA mapped page.
     */
    bool tlb_flush_pending;
#endif
#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
    /* See flush_tlb_batched_pending() */
    bool tlb_flush_batched;
#endif
    struct uprobes_state uprobes_state;
#ifdef CONFIG_X86_INTEL_MPX
    /* address of the bounds directory */
    void __user *bd_addr;
#endif
#ifdef CONFIG_HUGETLB_PAGE
    atomic_long_t hugetlb_usage;
#endif
    struct work_struct async_put_work;
};
mm_struct

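The mm_users field is the number of processes using this address space; for example, if two threads share the address space, mm_users is two. The mm_count field is the primary reference count of the mm_struct: all of the mm_users together account for a single increment of mm_count.

The mmap and mm_rb fields are two different data structures that describe the same thing: all the memory areas in this address space. The former stores them in a linked list, the latter in a red-black tree.

As a linked list, mmap allows simple and efficient traversal of all elements. As a red-black tree, mm_rb is better suited to searching for a given element.

All mm_struct structures are strung together in a doubly linked list through their mmlist fields. The first element of the list is the init_mm memory descriptor, which describes the address space of the init process.

The list is protected against concurrent access by the mmlist_lock lock, which is defined in kernel/fork.c.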

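2.1 Allocating a Memory Descriptor

The memory descriptor used by a task is stored in the mm field of its process descriptor (struct task_struct, defined in <linux/sched.h>), so current->mm is the memory descriptor of the current process.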

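2.2 Destroying a Memory Descriptor

When a process exits, the kernel calls exit_mm(), defined in kernel/exit.c, which performs some housekeeping and updates some statistics.

exit_mm() then calls mmput(), which decrements the descriptor's mm_users count. If that count reaches zero, mmdrop() is called to decrement mm_count; if mm_count in turn reaches zero, nothing references the descriptor any longer and the mm_struct is freed.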

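2.3 mm_struct and Kernel Threads

Kernel threads have no process address space and therefore no associated memory descriptor; the mm field of a kernel thread's process descriptor is NULL.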

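3. Virtual Memory Areas

A memory area is represented by struct vm_area_struct, defined in <linux/mm_types.h>. It describes a single memory area over a contiguous interval in a given address space.

The kernel treats each memory area as a unique memory object, and all of a memory area shares the same properties, such as permissions and a set of associated operations.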

/*
 * This struct defines a memory VMM memory area. There is one of these
 * per VM-area/task.  A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
    /* The first cache line has the info for VMA tree walking. */

    unsigned long vm_start;        /* Our start address within vm_mm. */
    unsigned long vm_end;        /* The first byte after our end address
                       within vm_mm. */

    /* linked list of VM areas per task, sorted by address */
    struct vm_area_struct *vm_next, *vm_prev;        /* list of VMAs */

    struct rb_node vm_rb;                            /* this VMA's node in the tree */

    /*
     * Largest free memory gap in bytes to the left of this VMA.
     * Either between this VMA and vma->vm_prev, or between one of the
     * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
     * get_unmapped_area find a free area of the right size.
     */
    unsigned long rb_subtree_gap;

    /* Second cache line starts here. */

    struct mm_struct *vm_mm;    /* The address space we belong to. */
    pgprot_t vm_page_prot;        /* Access permissions of this VMA. */
    unsigned long vm_flags;        /* Flags, see mm.h. */

    /*
     * For areas with an address space and backing store,
     * linkage into the address_space->i_mmap interval tree.
     *
     * For private anonymous mappings, a pointer to a null terminated string
     * in the user process containing the name given to the vma, or NULL
     * if unnamed.
     */
    union {
        struct {
            struct rb_node rb;
            unsigned long rb_subtree_last;
        } shared;
        const char __user *anon_name;
    };

    /*
     * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
     * list, after a COW of one of the file pages.    A MAP_SHARED vma
     * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
     * or brk vma (with NULL file) can only be in an anon_vma list.
     */
    struct list_head anon_vma_chain; /* Serialized by mmap_sem &
                      * page_table_lock */
    struct anon_vma *anon_vma;    /* Serialized by page_table_lock; the anonymous VMA object */

    /* Function pointers to deal with this struct. */
    const struct vm_operations_struct *vm_ops;

    /* Information about our backing store: */
    unsigned long vm_pgoff;        /* Offset (within vm_file) in PAGE_SIZE
                       units, *not* PAGE_CACHE_SIZE */
    struct file * vm_file;        /* File we map to (can be NULL). */
    void * vm_private_data;        /* was vm_pte (shared mem) */

#ifndef CONFIG_MMU
    struct vm_region *vm_region;    /* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
    struct mempolicy *vm_policy;    /* NUMA policy for the VMA */
#endif
    struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
};
vm_area_struct

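Each vm_area_struct corresponds to a unique interval in the process address space. The vm_start field is the first address of the interval and vm_end is the first byte after its last address, so the interval is [vm_start, vm_end) and vm_end - vm_start is its length in bytes.

The vm_mm field points to the mm_struct that the VMA belongs to. Each VMA is unique to the mm_struct with which it is associated.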

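3.1 VMA Flags

VMA flags are bit flags defined in <linux/mm.h> and stored in the vm_flags field. They describe the behavior of, and provide information about, the pages contained in the memory area.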

/*
 * vm_flags in vm_area_struct, see mm_types.h.
 */
#define VM_NONE        0x00000000

#define VM_READ        0x00000001    /* currently active flags; pages can be read */
#define VM_WRITE        0x00000002    /* pages can be written */
#define VM_EXEC        0x00000004    /* pages can be executed */
#define VM_SHARED    0x00000008    /* pages are shared */

/* mprotect() hardcodes VM_MAYREAD >> 4 == VM_READ, and so for r/w/x bits. */
#define VM_MAYREAD    0x00000010    /* limits for mprotect() etc; VM_READ may be set */
#define VM_MAYWRITE    0x00000020
#define VM_MAYEXEC    0x00000040
#define VM_MAYSHARE    0x00000080

#define VM_GROWSDOWN    0x00000100    /* general info on the segment; area can grow downward */
#define VM_UFFD_MISSING    0x00000200    /* missing pages tracking */
#define VM_PFNMAP    0x00000400    /* Page-ranges managed without "struct page", just pure PFN */
#define VM_DENYWRITE    0x00000800    /* ETXTBSY on write attempts.. area maps an unwritable file */
#define VM_UFFD_WP    0x00001000    /* wrprotect pages tracking */

#define VM_LOCKED    0x00002000        /* pages in this area are locked */
#define VM_IO           0x00004000    /* Memory mapped I/O or similar */

                    /* Used by sys_madvise() */
#define VM_SEQ_READ    0x00008000    /* App will access data sequentially */
#define VM_RAND_READ    0x00010000    /* App will not benefit from clustered reads */

#define VM_DONTCOPY    0x00020000      /* Do not copy this vma on fork */
#define VM_DONTEXPAND    0x00040000    /* Cannot expand with mremap() */
#define VM_LOCKONFAULT    0x00080000    /* Lock the pages covered when they are faulted in */
#define VM_ACCOUNT    0x00100000    /* Is a VM accounted object */
#define VM_NORESERVE    0x00200000    /* should the VM suppress accounting */
#define VM_HUGETLB    0x00400000    /* Huge TLB Page VM; area uses hugetlb pages */
#define VM_ARCH_1    0x01000000    /* Architecture-specific flag */
#define VM_ARCH_2    0x02000000
#define VM_DONTDUMP    0x04000000    /* Do not include in the core dump */

#ifdef CONFIG_MEM_SOFT_DIRTY
# define VM_SOFTDIRTY    0x08000000    /* Not soft dirty clean area */
#else
# define VM_SOFTDIRTY    0
#endif

#define VM_MIXEDMAP    0x10000000    /* Can contain "struct page" and pure PFN pages */
#define VM_HUGEPAGE    0x20000000    /* MADV_HUGEPAGE marked this vma */
#define VM_NOHUGEPAGE    0x40000000    /* MADV_NOHUGEPAGE marked this vma */
#define VM_MERGEABLE    0x80000000    /* KSM may merge identical pages */

#if defined(CONFIG_X86)
# define VM_PAT        VM_ARCH_1    /* PAT reserves whole VMA at once (x86) */
#elif defined(CONFIG_PPC)
# define VM_SAO        VM_ARCH_1    /* Strong Access Ordering (powerpc) */
#elif defined(CONFIG_PARISC)
# define VM_GROWSUP    VM_ARCH_1
#elif defined(CONFIG_METAG)
# define VM_GROWSUP    VM_ARCH_1
#elif defined(CONFIG_IA64)
# define VM_GROWSUP    VM_ARCH_1
#elif !defined(CONFIG_MMU)
# define VM_MAPPED_COPY    VM_ARCH_1    /* T if mapped copy of data (nommu mmap) */
#endif
VMA flags

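3.2 VMA Operations

The vm_ops field of vm_area_struct points to the table of operations associated with a given memory area, which the kernel invokes to manipulate the VMA.

The operations table is represented by struct vm_operations_struct, defined in <linux/mm.h>: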

/*
 * These are the virtual MM functions - opening of an area, closing and
 * unmapping it (needed to keep files on disk up-to-date etc), pointer
 * to the functions called when a no-page or a wp-page exception occurs. 
 */
struct vm_operations_struct {
    void (*open)(struct vm_area_struct * area);
    void (*close)(struct vm_area_struct * area);
    int (*mremap)(struct vm_area_struct * area);
    int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
    int (*pmd_fault)(struct vm_area_struct *, unsigned long address,
                        pmd_t *, unsigned int flags);
    void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);

    /* notification that a previously read-only page is about to become
     * writable, if an error is returned it will cause a SIGBUS */
    int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);

    /* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
    int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);

    /* called by access_process_vm when get_user_pages() fails, typically
     * for use by special VMAs that can switch between memory and hardware
     */
    int (*access)(struct vm_area_struct *vma, unsigned long addr,
              void *buf, int len, int write);

    /* Called by the /proc/PID/maps code to ask the vma whether it
     * has a special name.  Returning non-NULL will also cause this
     * vma to be dumped unconditionally. */
    const char *(*name)(struct vm_area_struct *vma);

#ifdef CONFIG_NUMA
    /*
     * set_policy() op must add a reference to any non-NULL @new mempolicy
     * to hold the policy upon return.  Caller should pass NULL @new to
     * remove a policy and fall back to surrounding context--i.e. do not
     * install a MPOL_DEFAULT policy, nor the task or system default
     * mempolicy.
     */
    int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);

    /*
     * get_policy() op must add reference [mpol_get()] to any policy at
     * (vma,addr) marked as MPOL_SHARED.  The shared policy infrastructure
     * in mm/mempolicy.c will do this automatically.
     * get_policy() must NOT add a ref if the policy at (vma,addr) is not
     * marked as MPOL_SHARED. vma policies are protected by the mmap_sem.
     * If no [shared/vma] mempolicy exists at the addr, get_policy() op
     * must return NULL--i.e., do not "fallback" to task or system default
     * policy.
     */
    struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
                    unsigned long addr);
#endif
    /*
     * Called by vm_normal_page() for special PTEs to find the
     * page for @addr.  This is useful if the default behavior
     * (using pte_page()) would not find the correct page.
     */
    struct page *(*find_special_page)(struct vm_area_struct *vma,
                      unsigned long addr);
};

    void (*open)(struct vm_area_struct * area);
Invoked when the given memory area is added to an address space.
    void (*close)(struct vm_area_struct * area);
Invoked when the given memory area is removed from an address space.
    int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
Invoked by the page fault handler when a page that is not present in physical memory is accessed.
    int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
Invoked by the page fault handler when a page that was read-only is about to be made writable.
    int (*access)(struct vm_area_struct *vma, unsigned long addr,
              void *buf, int len, int write);
Invoked by access_process_vm() when get_user_pages() fails.
VMA operations

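3.3 Lists and Trees of Memory Areas

Each vm_area_struct is linked into the list through its vm_next field. The areas are sorted by ascending address: the mmap field of the memory descriptor points to the first memory area, and the last structure in the chain points to NULL. The same areas are also linked into the red-black tree rooted at mm_rb through their vm_rb fields.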

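3.4 Memory Areas in Real Life

The /proc filesystem (the /proc/<pid>/maps file) and the pmap utility can be used to inspect a given process's address space and the memory areas it contains.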

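4. Manipulating Memory Areas

The kernel frequently needs to perform operations on a memory area. The helper functions for this are declared in <linux/mm.h>.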

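4.1 find_vma()

To find the memory area in which a given memory address resides, the kernel provides find_vma(), defined in mm/mmap.c.

struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr);
Searches the given address space for the first memory area whose vm_end is greater than addr.
Returns NULL if there is no such area; otherwise returns a pointer to the vm_area_struct of the matching area. Because the returned VMA may begin at an address greater than addr, addr does not necessarily lie inside it.
find_vma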

4.2 find_vma_prev()

Works the same way as find_vma(), but also returns the last VMA before addr. The function is defined in mm/mmap.c and declared in <linux/mm.h>.

struct vm_area_struct *find_vma_prev(struct mm_struct *mm, unsigned long addr, struct vm_area_struct **pprev);
On return, pprev holds a pointer to the VMA preceding addr.
find_vma_prev

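4.3 find_vma_intersection()

Returns the first VMA that overlaps a given address interval. It is an inline function defined in <linux/mm.h>.

static inline struct vm_area_struct *find_vma_intersection(struct mm_struct *mm, unsigned long start_addr, unsigned long end_addr) {
    struct vm_area_struct *vma;
    vma = find_vma(mm, start_addr);
    if (vma && end_addr <= vma->vm_start)
        vma = NULL;
    return vma;
}
The first parameter, mm, is the address space to search; start_addr is the start of the interval and end_addr is its end.
If find_vma() returns NULL, so does find_vma_intersection(). If find_vma() returns a valid VMA, that VMA is returned only if it starts before the end of the given address range; if it starts at or after end_addr, the function returns NULL.
find_vma_intersection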

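5. mmap() and do_mmap(): Creating an Address Interval

The kernel uses do_mmap() to create a new linear address interval. The function is declared in <linux/mm.h>.

unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flag, unsigned long offset);
Maps the file given by file, starting at offset offset, for len bytes.
do_mmap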

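If file is NULL and offset is 0, the mapping is not backed by a file and is called an anonymous mapping. If a file and an offset are given, the mapping is a file-backed mapping.

addr is optional; it specifies where to start searching for a free interval.

prot specifies the access permissions for pages in the memory area. The possible flags are defined in <asm/mman.h> and vary between architectures.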

PROT_READ: corresponds to VM_READ
PROT_WRITE: corresponds to VM_WRITE
PROT_EXEC: corresponds to VM_EXEC
PROT_NONE: pages cannot be accessed
Page protection flags

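The flag parameter specifies flags that correspond to the remaining VMA flags. They specify the type of the mapping and change its behavior, and are also defined in <asm/mman.h>.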

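MAP_SHARED: the mapping can be shared
MAP_PRIVATE: the mapping cannot be shared
MAP_FIXED: the new interval must start at the given address addr
MAP_ANONYMOUS: the mapping is not file-backed but anonymous
MAP_GROWSDOWN: corresponds to VM_GROWSDOWN
MAP_DENYWRITE: corresponds to VM_DENYWRITE
MAP_EXECUTABLE: corresponds to VM_EXECUTABLE
MAP_LOCKED: corresponds to VM_LOCKED
MAP_NORESERVE: no need to reserve space for the mapping
MAP_POPULATE: populate (prefault) page tables
MAP_NONBLOCK: do not block on I/O
Map type flags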

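User space reaches the functionality of do_mmap() through the mmap() system call. The call is named mmap2() because it is the second variant of mmap(): the original took the offset in bytes as its last parameter, whereas mmap2() takes it in pages.

void *mmap2(void *start, size_t length, int prot, int flags, int fd, off_t pgoff);
mmap2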

6. munmap() and do_munmap(): Removing an Address Interval

do_munmap() removes an address interval from a specified process address space. The function is declared in <linux/mm.h>.

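int do_munmap(struct mm_struct *mm, unsigned long start, size_t len);
mm: the address space from which the interval is removed
start: the start address of the interval to remove
len: the length of the interval in bytes
Returns 0 on success and a negative error code otherwise.
do_munmap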

The munmap() system call is the counterpart of mmap(): it lets a process remove an address interval from its own address space.

int munmap(void *start, size_t length);
munmap

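7. Page Tables

Address translation works by splitting the virtual address into chunks. Each chunk is used as an index into a page table, and each page table entry points either to the next-level page table or to the final physical page.

The book describes Linux page tables as having three levels, used even on architectures whose hardware supports fewer. Three levels act as a sort of "greatest common denominator": architectures with a simpler scheme can fold the unused levels away at compile time. (Newer kernels have since added further levels.)

The top-level page table is the page global directory (PGD).

The second-level page table is the page middle directory (PMD).

The final level is called simply the page table; its entries point to physical pages.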

The structures that represent page tables depend on the architecture and are therefore defined in <asm/page.h>.

Original article: https://www.cnblogs.com/ch122633/p/11731790.html