Memory Virtualization (Chapters 13-23)

The abstraction: Address space

  • Early systems
      Early machines did not provide such an abstraction to users: the running program (process) used physical memory directly, with no layer in between.

    Because machines were expensive, multiprogramming and time sharing arose as ways to share a machine more effectively. Early systems, however, implemented switching crudely: they saved a process's entire state, including all of physical memory, to disk, and then loaded another process's state. The big problem: saving all of memory is far too slow. Saving and restoring only register-level state (the PC, general-purpose registers, etc.) is much faster, so instead we leave processes in memory while switching between them, allowing the OS to implement time sharing efficiently. This gives a new picture of physical memory, with several processes resident at once:

      For example, with three processes (A, B, and C) resident in memory, when A is running, B and C wait in the ready queue for their register-level state to be restored. Because all of them are stored in physical memory at the same time, protection becomes an important issue for the OS to manage.

  • Address space
      The OS's abstraction of physical memory is called the address space, and it exists to make memory easy to use. The address space of a process contains all of its memory state: the code of the program; the stack, which keeps track of where the process is in its function-call chain and is used to allocate local variables and pass parameters and return values; the heap, which is used for dynamically allocated, user-managed memory (malloc() in C, new in C++); and so on.

      Note that the address space is an abstraction the OS provides to the running program: the program is not really loaded at physical addresses 0 through 16KB; rather, it is loaded at some arbitrary physical address. For example, when process A performs a load from (virtual) address 0, the OS, with some hardware support, must make sure the load does not actually go to physical address 0 but rather to physical address 320KB, wherever A actually sits in physical memory.

    • Goals
        Because there are many possible metrics, we need some goals to guide us and to make sure the OS does what it really needs to do.
      • Transparency
          The OS should implement virtual memory in a way that is invisible to the running program; that is, every process behaves as if it has its own private physical memory.
      • Efficiency
          The OS should strive to make virtualization as efficient as possible, both in time and in space. Efficient virtualization relies on hardware support; the TLB is the key feature here.
      • Protection
          The OS should protect processes from one another, as well as protect the OS itself. When one process performs a load, a store, or an instruction fetch, it should not be able to access or affect in any way the memory contents of any other process or of the OS. Protection thus enables us to deliver the property of isolation among processes.

Mechanism: Address translation

  As with limited direct execution (LDE), at certain key points in time the OS arranges, with a little hardware support, to get involved and make sure the "right" thing happens; by interposing at those critical points, the OS maintains control over the hardware.
  Hardware-based address translation (address translation for short) transforms each memory access, changing the virtual address issued by the program into the physical address where the desired information actually resides.

  • Dynamic (Hardware-based) Relocation
      To implement it, we need two hardware registers within each CPU: the base and the bounds (or limit) register. Translation is then simply:
      physical address = virtual address + base
      The check that prevents out-of-range accesses is equally simple:
      virtual address < bounds
      (equivalently, the bounds register can hold the end of the physical region and the check can be applied to the physical address; which form is used is up to the designers). Because the base and bounds registers are hardware structures kept on the chip (one pair per CPU), this part of the processor is sometimes called the memory management unit (MMU).
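      A minimal C sketch of dynamic relocation (the register values and the translate() helper are illustrative assumptions, with the process loaded at 320KB as in the earlier example; a real MMU performs this check in hardware on every access):

        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* Hypothetical MMU state: one base/bounds pair per CPU. */
        static uint32_t base   = 320 * 1024;  /* process loaded at 320KB  */
        static uint32_t bounds = 16 * 1024;   /* address-space size: 16KB */

        /* Here the bounds register holds the size of the address space,
         * so the check is applied to the virtual address. */
        uint32_t translate(uint32_t vaddr) {
            if (vaddr >= bounds) {
                fprintf(stderr, "segmentation violation: %u\n", vaddr);
                exit(1);  /* real hardware raises an exception instead */
            }
            return vaddr + base;
        }

        int main(void) {
            printf("virtual 0    -> physical %u\n", translate(0));    /* 327680 */
            printf("virtual 1024 -> physical %u\n", translate(1024));
            return 0;
        }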
  • Hardware support
      In this section we look at the hardware support that address translation relies on. As discussed in earlier chapters, kernel mode versus user mode can be implemented with a single bit, perhaps stored in some kind of processor status word. On a context switch, the hardware must also save and restore additional registers, namely base and bounds; only kernel mode may change these registers, since the instructions that do so are privileged. Finally, the CPU must be able to raise exceptions, e.g., on an out-of-bounds access or on an attempt to execute a privileged instruction in user mode.
  • Operating system issues
      Just as the hardware provides new features to support dynamic relocation, the OS now has new issues it must handle; specifically, there are a few critical junctures where the OS must get involved.
      First, the OS must take action when a process is created, finding space for its address space in memory. It searches a data structure (called a free list) to find room for the new address space.
      Second, the OS must do some work when a process is terminated, reclaiming all of its memory for other processes or the OS: it puts the memory back on the free list and updates any associated data structures.
      Third, on a context switch, as discussed above, the OS must do a little extra work: it saves the values of the base and bounds registers to memory, in some per-process structure such as the process structure or process control block (PCB).
      The book illustrates much of this hardware/OS interaction in a timeline figure.
  • Other issues
      Unfortunately, this simple technique of dynamic relocation has its inefficiencies. Consider a 16KB address space with a small code segment, stack, and heap: the large unused region between the stack and the heap is allocated to the process but wasted, a problem called internal fragmentation. Our first attempt to fix it is a slight generalization of base and bounds known as segmentation, which we discuss next.

Segmentation

  Besides internal fragmentation, finding room for a fixed-size address space is itself a problem. In this section we again use base and bounds, but differently and more flexibly than above.
  We still use the MMU, but with more than one base-and-bounds pair: the smart idea is to give every logical segment of the address space such a pair. Canonically, there are three such segments: code, stack, and heap.

  • Which segment are we referring to?
      The hardware uses the segment registers during translation, but how does it know which segment an address belongs to? (This matters all the more because the stack and the heap grow in different directions.) One common way, used in the VAX/VMS system, is the explicit approach: chop up the address space using the top few bits of the virtual address. With a 14-bit virtual address, the top two bits select the segment (00 code, 01 heap, 11 stack) and the remaining 12 bits are the offset.

      Note that whichever segment is selected, the offset is unchanged by translation; heap and stack addresses differ only in their top (segment) bits.
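      A small C sketch of the explicit approach (the constants follow the 14-bit example above; the mask names are illustrative):

        #include <stdint.h>
        #include <stdio.h>

        /* 14-bit virtual address: top 2 bits select the segment,
         * low 12 bits are the offset. */
        #define SEG_MASK    0x3000
        #define SEG_SHIFT   12
        #define OFFSET_MASK 0x0FFF

        int main(void) {
            uint32_t vaddr   = 0x1068;  /* example virtual address */
            uint32_t segment = (vaddr & SEG_MASK) >> SEG_SHIFT;
            uint32_t offset  = vaddr & OFFSET_MASK;
            printf("segment %u, offset %u\n", segment, offset);  /* 1 (heap), 104 */
            return 0;
        }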
  • Fine-grained vs. Coarse-grained Segmentation
  • Other issues
      Previously we assumed that every address space was the same size, but with segmentation each segment may be a different size, so managing free physical space becomes important.
      Between allocated segments, little holes of free space open up; this waste is called external fragmentation. Algorithms such as best-fit, worst-fit, first-fit, and the buddy algorithm reduce but cannot truly solve this problem; the only complete cure is to compact physical memory, which is far more costly.

Free-space Management

  When managing a free list under segmentation, external fragmentation cannot be avoided. Suppose the free list holds 30KB in total, but an earlier 10KB malloc() has split it into two separate pieces: a subsequent malloc() of 20KB will return NULL even though 20KB is free overall.

  • Low-level mechanisms
    • malloc() and free()
        In C, the prototypes of malloc() and free() are void *malloc(size_t size) and void free(void *ptr). Note that free() takes only a pointer, not a size, so the question of how free() learns the size of the chunk the pointer refers to matters. Specifically, most allocators store a little extra information in a header block kept in memory just before the chunk they hand out; with this structure, free() can recover the exact size, as the sketch below shows.
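        A runnable sketch of the header trick (header_t and the magic value follow the common textbook layout; chunk_size() is a hypothetical helper showing what free() does first):

        #include <assert.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* The header an allocator hides just before each returned chunk;
         * real allocators store more, and the magic number is a sanity check. */
        typedef struct {
            size_t size;   /* size the user asked for      */
            int    magic;  /* e.g. 1234567, checked on free */
        } header_t;

        /* What free(ptr) does first: step back one header to learn the size. */
        size_t chunk_size(void *ptr) {
            header_t *hptr = (header_t *)ptr - 1;
            assert(hptr->magic == 1234567);
            return hptr->size + sizeof(header_t);  /* total region length */
        }

        int main(void) {
            /* Simulate an allocation of 100 bytes with a hidden header. */
            header_t *h = malloc(sizeof(header_t) + 100);
            h->size  = 100;
            h->magic = 1234567;
            void *user_ptr = (void *)(h + 1);      /* what malloc() would return */
            printf("chunk is %zu bytes\n", chunk_size(user_ptr));
            free(h);
            return 0;
        }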
    • Splitting and Coalescing
        The mechanism is simple: on allocation, choose a chunk that fits, split it into two chunks, return the first part to the caller, and keep the remainder on the free list; when a used chunk is freed, add it back to the free list and check whether it can be coalesced with adjacent free chunks.
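        A minimal sketch of coalescing, assuming a free list kept sorted by address (chunk_t and its fields are illustrative, not a real allocator's types):

        #include <stddef.h>
        #include <stdio.h>

        /* Hypothetical free-chunk record, kept sorted by address. */
        typedef struct chunk {
            size_t        addr;  /* start address of the free region */
            size_t        size;  /* its length in bytes              */
            struct chunk *next;
        } chunk_t;

        /* Merge each chunk with its successor whenever the two regions touch. */
        void coalesce(chunk_t *head) {
            chunk_t *c = head;
            while (c && c->next) {
                if (c->addr + c->size == c->next->addr) {
                    c->size += c->next->size;   /* absorb the neighbor     */
                    c->next  = c->next->next;   /* unlink it from the list */
                } else {
                    c = c->next;
                }
            }
        }

        int main(void) {
            chunk_t c = {16384, 8192, NULL};  /* 8KB free at 16KB */
            chunk_t b = {8192, 8192, &c};     /* 8KB free at 8KB  */
            coalesce(&b);
            printf("head: addr=%zu size=%zu\n", b.addr, b.size);  /* one 16KB chunk */
            return 0;
        }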

    • Embedding a free list
        Like allocated chunks, entries of the free list carry bookkeeping: each free chunk embeds a node recording its size and a pointer to the next free chunk, as in the sketch below. (mmap(), used there to obtain memory from the OS, comes up again in the paging discussion.)
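        A sketch of building a one-node free list inside a page obtained from the OS (the node layout follows the common textbook design; MAP_ANON may be spelled MAP_ANONYMOUS on some systems):

        #include <stdio.h>
        #include <sys/mman.h>

        /* The node lives inside the free chunk it describes. */
        typedef struct __node_t {
            size_t           size;  /* bytes of free space after this node */
            struct __node_t *next;
        } node_t;

        int main(void) {
            /* Grab one 4KB page from the OS and turn it into a free list. */
            node_t *head = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                MAP_ANON | MAP_PRIVATE, -1, 0);
            if (head == MAP_FAILED)
                return 1;
            head->size = 4096 - sizeof(node_t);
            head->next = NULL;
            printf("free list: one chunk of %zu bytes\n", head->size);
            return 0;
        }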
    • Growing the heap
        If we want to grow the heap, we need a system call (sbrk() on most Unix systems) to allocate new space, after which the allocator updates its free list.
  • Basic Strategies
    • Best Fit
    • Worst Fit
    • First Fit
    • Next Fit
  • Other Approaches
    • Segregated List
        The basic idea is simple: if a particular application makes one (or a few) popular-sized requests, keep a separate list just to manage objects of that size; all other requests are forwarded to a more general memory allocator.
      • Slab allocator
          Specifically, when the kernel boots up, it allocates a number of object caches for kernel objects that are likely to be requested frequently (such as locks, file-system inodes, etc.); the object caches thus are each segregated free lists of a given size and serve memory allocation and free requests quickly. When a given cache is running low on free space, it requests some slabs of memory from a more general memory allocator (the total amount requested being a multiple of the page size and the object in question). Conversely, when the reference counts of the objects within a given slab all go to zero, the general allocator can reclaim them from the specialized allocator, which is often done when the VM system needs more memory. The slab allocator also goes beyond most segregated list approaches by keeping free objects on the lists in a pre-initialized state. Bonwick shows that initialization and destruction of data structures is costly; by keeping freed objects in a particular list in their initialized state, the slab allocator avoids frequent initialization and destruction cycles per object and thus lowers overheads noticeably.
      • Buddy algorithm

Paging

  • Introduction
      In contrast with segmentation, which chops memory into variable-sized pieces, paging chops the address space into fixed-sized pieces called pages. Correspondingly, we view physical memory as an array of fixed-sized slots called page frames, each of which can contain a single virtual-memory page.
      Paging has two obvious advantages: it works regardless of how a process uses its address space (for code, stack, or heap), and fixed-sized units make free-space management easy.
      To record where each virtual page of the address space is placed in physical memory, the OS keeps a per-process data structure called the page table, whose major role is to store address translations and thus tell us where in physical memory each page resides.
      As with segmentation, the virtual address contains two parts: the virtual page number (VPN) and the offset. During translation the offset is unchanged, but the VPN is replaced by the physical frame number (PFN) found in the page table.
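      A small C sketch of splitting a virtual address, assuming 4KB pages (the constants are illustrative; real VPN width depends on the address-space size):

        #include <stdint.h>
        #include <stdio.h>

        /* 4KB pages: the low 12 bits are the offset, the rest is the VPN. */
        #define PAGE_SHIFT  12
        #define OFFSET_MASK ((1u << PAGE_SHIFT) - 1)

        int main(void) {
            uint32_t vaddr  = 0x0000A1F4;
            uint32_t vpn    = vaddr >> PAGE_SHIFT;   /* index into the page table */
            uint32_t offset = vaddr & OFFSET_MASK;   /* unchanged by translation  */
            /* with PFN from the page table: paddr = (pfn << PAGE_SHIFT) | offset */
            printf("VPN %u, offset %u\n", vpn, offset);
            return 0;
        }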
    • Where are page tables stored?
        We have seen that the page table is the key component of virtual-to-physical translation, and that each page table entry (PTE) contains the information we need. On some systems, e.g., 64-bit ones, the page table is far too large to keep in any special on-chip hardware in the MMU. Instead, we store the page table for each process somewhere in memory: in physical memory, or even in OS virtual memory (so it can be swapped to disk, as we discuss later).
    • Smaller page tables
        Because page tables can be so large, we must think about how to reduce their size.
      • Simple solution: Bigger pages
          The size of the address space and of each PTE stay the same, but if we make pages 4x bigger, the page table has 4x fewer entries. However, the bigger the pages, the more severe internal fragmentation becomes.
      • Hybrid Approach: Paging and Segmentation
          Whenever you have two reasonable but different approaches to something in life, you should examine the combination of the two to see if you can obtain the best of both worlds. In the hybrid, as with base and bounds, we still keep one base-and-bounds pair per segment in the MMU; the difference is that the base register now holds the physical address of that segment's page table.
        On a TLB miss, the hardware uses the segment bits (SN) to determine which base-and-bounds pair to use.

          However, the disadvantages of segmentation remain: external fragmentation returns, because the page tables themselves are now variable-sized, which makes free-space management harder.
      • Multi-level Page tables
          This structure turns the linear page table into something like a tree. The basic idea behind a multi-level page table is simple: first, chop up the page table into page-sized units; then, if an entire page of page-table entries (PTEs) is invalid, don’t allocate that page of the page table at all. To track whether a page of the page table is valid (and if valid, where it is in memory), use a new structure, called the page directory. The page directory thus either tells you where a page of the page table is, or that the entire page of the page table contains no valid pages.

          The page directory, in a simple two-level table, consists of page directory entries (PDEs). A PDE has a valid bit and a page frame number (PFN), similar to a PTE. Multi-level page tables have some obvious advantages over the approaches we’ve seen thus far. First, and perhaps most obviously, the multi-level table only allocates page-table space in proportion to the amount of address space you are using; thus it is generally compact and supports sparse address spaces. Second, if carefully constructed, each portion of the page table fits neatly within a page, making it easier to manage memory.

          For example, consider a 14-bit virtual address space (16KB in size) with 64-byte pages: 8 bits for the VPN and 6 bits for the offset. Once we extract the page-directory index (PDIndex) from the VPN, we can find the address of the page-directory entry (PDE) with a simple calculation: PDEAddr = PageDirBase + (PDIndex * sizeof(PDE)). Next, if the PDE is valid, we get the address of the page-table entry (PTE) using the page-table index (PTIndex): PTEAddr = (PDE.PFN << SHIFT) + (PTIndex * sizeof(PTE)). The PTE then gives us the page-frame number (PFN).
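          A C sketch of this two-level walk under the assumptions above (pde_t and pte_addr() are toy stand-ins, and the page directory is modeled as an in-memory array standing in for PageDirBase):

          #include <stdint.h>
          #include <stdio.h>

          /* Toy PDE for the 14-bit example: 64-byte pages, 4-byte entries,
           * so one page of the page table holds 16 PTEs (4 index bits/level). */
          typedef struct {
              uint32_t valid : 1;
              uint32_t pfn   : 31;
          } pde_t;

          #define OFFSET_BITS 6   /* 64-byte pages          */
          #define PT_BITS     4   /* 16 PTEs per table page */

          /* Physical address of the PTE for vaddr, given the page directory. */
          uint32_t pte_addr(pde_t *dir, uint32_t vaddr) {
              uint32_t vpn     = vaddr >> OFFSET_BITS;              /* 8-bit VPN  */
              uint32_t pdindex = vpn >> PT_BITS;                    /* top 4 bits */
              uint32_t ptindex = vpn & ((1u << PT_BITS) - 1);       /* low 4 bits */
              pde_t pde = dir[pdindex];                             /* the PDE    */
              if (!pde.valid)
                  return 0;  /* no page of the page table here: fault */
              /* PTEAddr = (PDE.PFN << SHIFT) + (PTIndex * sizeof(PTE)) */
              return (pde.pfn << OFFSET_BITS) + ptindex * 4;
          }

          int main(void) {
              pde_t dir[16] = {{1, 100}};   /* PDIndex 0 -> page-table page 100 */
              printf("PTE at physical %u\n", pte_addr(dir, 0x5));   /* 6400 */
              return 0;
          }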

          Note that more than two levels are also possible. Remember our goal in constructing a multi-level page table: make each piece of the page table fit within a single page. For example, take a 30-bit virtual address space with 512-byte pages: a 21-bit VPN and a 9-bit offset. With 4-byte PTEs, a 512-byte page holds 128 PTEs, i.e., 7 bits of index per level, so covering the 21-bit VPN takes three levels.
      • Inverted Page Tables
          Here, instead of having many page tables (one per process of the system), we keep a single page table that has an entry for each physical page of the system. The entry tells us which process is using this page, and which virtual page of that process maps to this physical page.
    • Page table entries
        A page table entry also contains a number of other bits, e.g., for protection and for validity checks.
    • Paging and Memory trace
        In code form, each memory access with paging works as the earlier VPN/offset sketch suggests: extract the VPN, fetch the PTE from the page table in memory, then build the physical address; note that every load or store thus costs an extra memory reference for the PTE. Here is a small program whose memory accesses we will trace to understand paging in detail.
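        The program itself, reconstructed here from the numbers used below (a 1000-entry integer array, 4000 bytes, initialized in a loop):

        /* The traced program: initialize a 1000-element array. */
        int a[1000];

        int main(void) {
            for (int i = 0; i < 1000; i++)
                a[i] = 0;
            return 0;
        }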

      Let’s assume we have a linear (array-based) page table and that it is located at physical address 1KB (1024). Because the page size is 1KB, virtual address 1024 resides on the second page of the virtual address space (VPN=1, as VPN=0 is the first page). Let’s assume this virtual page maps to physical frame 4 (VPN 1 → PFN 4). Next, there is the array itself. Its size is 4000 bytes (1000 integers), and we assume that it resides at virtual addresses 40000 through 44000 (not including the last byte). The virtual pages for this decimal range are VPN=39 ... VPN=42. Thus, we need mappings for these pages. Let’s assume these virtual-to-physical mappings for the example: (VPN 39 → PFN 7), (VPN 40 → PFN 8), (VPN 41 → PFN 9), (VPN 42 → PFN 10).
  • Translation-lookaside buffer
      Translating every reference by walking the page table in memory works, but it is slow. Caching is the classic way to speed up memory access, and we can use it here too: the translation-lookaside buffer (TLB) is part of the MMU, a simple hardware cache of popular translations; a better name would be address-translation cache. Upon each virtual memory reference, the hardware first checks the TLB to see if the desired translation is held therein; if so, the translation is performed (quickly) without having to consult the page table (which has all translations). Because of their tremendous performance impact, TLBs in a real sense make virtual memory possible.
    • Who handles the TLB miss?
        When translating a virtual page to a physical page, if the hardware cannot find the needed translation in the TLB, we call it a TLB miss (otherwise, a TLB hit). There are two ways to handle a TLB miss: in hardware or in the OS. Typically CISC machines use hardware-managed TLBs (the hardware walks the page table itself), while RISC machines use software-managed TLBs (the hardware raises an exception and an OS handler does the walk, keeping the hardware simple and giving the OS control). One special detail compared with a normal return-from-trap: when returning from a TLB miss-handling trap, the hardware must resume execution at the instruction that caused the trap, so the access is retried and this time hits in the TLB.
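        A toy software model of the lookup (the sizes and tlb_translate() are illustrative; real hardware searches all entries in parallel on every access):

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define TLB_SIZE    16
        #define PAGE_SHIFT  12
        #define OFFSET_MASK ((1u << PAGE_SHIFT) - 1)

        /* A toy model of one TLB entry. */
        typedef struct { bool valid; uint32_t vpn, pfn; } tlb_entry_t;
        static tlb_entry_t tlb[TLB_SIZE];

        /* On a hit, fill in the physical address; on a miss, return false
         * so the caller can walk the page table (hardware-managed) or trap
         * to the OS handler (software-managed), install the entry, retry. */
        bool tlb_translate(uint32_t vaddr, uint32_t *paddr) {
            uint32_t vpn = vaddr >> PAGE_SHIFT;
            for (int i = 0; i < TLB_SIZE; i++) {
                if (tlb[i].valid && tlb[i].vpn == vpn) {          /* TLB hit */
                    *paddr = (tlb[i].pfn << PAGE_SHIFT) | (vaddr & OFFSET_MASK);
                    return true;
                }
            }
            return false;                                         /* TLB miss */
        }

        int main(void) {
            tlb[0] = (tlb_entry_t){true, 1, 4};   /* VPN 1 -> PFN 4 */
            uint32_t paddr;
            if (tlb_translate(0x1234, &paddr))    /* VPN 1, offset 0x234 */
                printf("hit: 0x%x\n", paddr);     /* 0x4234 */
            return 0;
        }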
    • TLB issue: Context switch
        On a context switch, the two processes may use the same virtual page numbers mapped to different physical pages; how is the OS to tell the entries apart? There are a number of solutions.
        One approach is simply to flush the TLB on every context switch, which an explicit privileged hardware instruction can accomplish. The obvious problem: if context switches are frequent, the cost is high.
        To reduce the overhead, some systems add hardware support for sharing the TLB across context switches: an address space identifier (ASID) field in each TLB entry.
    • TLB issue: Replacement Policy
        When the TLB is full, installing a new entry means replacing an old one, which raises the question: which one?
        There are several replacement policies. One common approach is to evict the least-recently-used (LRU) entry. LRU tries to take advantage of locality in the memory-reference stream, assuming that an entry that has not been used recently is a good candidate for eviction. Another typical approach is a random policy, which evicts a TLB mapping at random. Such a policy is useful due to its simplicity and its ability to avoid corner-case behaviors; for example, a "reasonable" policy such as LRU behaves quite unreasonably when a program loops over n + 1 pages with a TLB of size n: LRU then misses on every access, whereas random does much better.

Swap space

  Thus far, we have assumed that all pages reside in physical memory. However, to support large address spaces, the OS will need a place to stash away portions of address spaces that currently aren't in great demand. In general, the characteristics of such a location are that it should have more capacity than memory; as a result, it is generally slower (if it were faster, we would just use it as memory, no?). In modern systems, this role is usually served by a hard disk drive. Thus, in our memory hierarchy, big and slow hard drives sit at the bottom, with memory just above. Beyond just a single process, the addition of swap space allows the OS to support the illusion of a large virtual memory for multiple concurrently-running processes. The invention of multiprogramming (running multiple programs “at once”, to better utilize the machine) almost demanded the ability to swap out some pages, as early machines clearly could not hold all the pages needed by all processes at once. Thus, the combination of multiprogramming and ease-of-use leads us to want to support using more memory than is physically available.

  • Mechanisms
      The first thing we will need to do is to reserve some space on the disk for moving pages back and forth. In operating systems, we generally refer to such space as swap space, because we swap pages out of memory to it and swap pages into memory from it. Thus, we will simply assume that the OS can read from and write to the swap space, in page-sized units. To do so, the OS will need to remember the disk address of a given page.
    • The present bit
        Recall the TLB: on a TLB miss, the hardware locates the page table in memory (using the page-table base register) and looks up the page table entry (PTE) for the page, using the VPN as an index. If the target page is not in physical memory, that is, the present bit in the PTE is set to zero, the page has been swapped out to disk. We call this a page fault, and the OS runs its page-fault handler to resolve it.
    • Page Fault
        In many systems the disk address is naturally stored in the page table itself, which lets the OS handle a page fault by finding the target page on disk and swapping it into memory. When the disk I/O completes, the OS updates the page table to mark the page as present, updates the PFN field of the page-table entry (PTE) to record the in-memory location of the newly fetched page, and retries the instruction.
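        A toy sketch of that control flow (all types and arrays here merely simulate OS state; a real handler performs device I/O and runs a replacement policy when memory is full):

        #include <stdbool.h>
        #include <stdint.h>
        #include <string.h>

        #define NFRAMES 4
        #define PAGE_SZ 64

        /* Toy PTE: holds a PFN when present, a swap-disk address otherwise. */
        typedef struct {
            bool     present;
            uint32_t pfn;
            uint32_t disk_addr;
        } pte_t;

        static uint8_t memory[NFRAMES][PAGE_SZ];   /* physical page frames */
        static uint8_t swap[64][PAGE_SZ];          /* swap space on "disk" */
        static bool    frame_used[NFRAMES];

        static int find_free_frame(void) {
            for (int i = 0; i < NFRAMES; i++)
                if (!frame_used[i])
                    return i;
            return -1;
        }

        /* Find (or free up) a frame, read the page in from swap,
         * fix up the PTE, and let the faulting instruction retry. */
        void page_fault_handler(pte_t *pte) {
            int pfn = find_free_frame();
            if (pfn == -1)
                pfn = 0;  /* memory full: a real OS runs its replacement policy */
            memcpy(memory[pfn], swap[pte->disk_addr], PAGE_SZ);   /* "disk I/O" */
            frame_used[pfn] = true;
            pte->present = true;
            pte->pfn = (uint32_t)pfn;
            /* on return-from-trap, the faulting instruction is retried */
        }

        int main(void) {
            pte_t pte = {false, 0, 3};   /* page currently at swap slot 3 */
            page_fault_handler(&pte);
            return pte.present ? 0 : 1;
        }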
  • Policies
      As with the TLB, when memory is full we need a policy to decide which pages to evict.
    • FIFO (segmented FIFO in VAX/VMS)
        First-in, first-out: simple to implement, and often used as a building block of other policies (as in VAX/VMS's segmented FIFO).
    • Random
    • LRU
    • Clock algorithm
        Uses a per-page use (reference) bit, often together with the dirty bit; its performance approximates LRU, but it is much easier to implement.
    • Second-chance Lists
    • 2Q replacement algorithm (in Linux)

Note: I have not refined the swap-space details any further; see the book for a fuller treatment.

Original post: https://www.cnblogs.com/GRedComeT/p/13171711.html