key concepts in memory

1, virtual memory

Each process has the illusion that it can use the whole memory, which simplifies the programmer's work. VM also separates processes from each other, which guarantees that one process cannot harm another.

VM lets the kernel manage the limited physical memory more effectively: processes can share memory, so more processes can run simultaneously, and a process can use more (virtual) memory than the total physical memory.

VM enables memory mapping (a file on disk is mapped, via the page table, through virtual memory onto physical memory), which is much faster than ordinary file operations. It also lets processes share system libraries, which must be re-entrant.
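
A minimal sketch of a memory-mapped read, assuming a POSIX system; the file name data.txt is just a placeholder:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.txt", O_RDONLY);   /* "data.txt" is a placeholder */
        if (fd < 0) return 1;
        struct stat st;
        fstat(fd, &st);
        /* Map the whole file read-only; pages are filled in on demand. */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 1;
        fwrite(p, 1, st.st_size, stdout);      /* read the file through memory */
        munmap(p, st.st_size);
        close(fd);
        return 0;
    }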

2, how VM works

When a variable is accessed in a process, the MMU translates its logical address to a physical address by looking up that process's page table.

The minimal memory unit (granularity) of the kernel is called a page, which is usually 4KB.

A page table holds many PTEs (Page Table Entries). Each PTE consists of a key and a value: the key is a page number in the logical space and the value is a page (frame) number in the physical space. A PTE also carries some bit flags that describe properties of the page, e.g. a valid-invalid bit, a read-write bit, a dirty bit and so on. The valid bit distinguishes pages that are in memory from pages that are on disk.
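
As a toy model of that lookup, assuming 4KB pages (so the low 12 bits of an address are the page offset) and a flat array as the page table; all names and flag values here are illustrative, not any real kernel's:

    #include <stdint.h>

    #define PAGE_SHIFT 12                 /* 4KB pages: offset is the low 12 bits */
    #define PTE_VALID  0x1                /* page is resident in memory */
    #define PTE_WRITE  0x2                /* page may be written */
    #define PTE_DIRTY  0x4                /* page has been modified */

    typedef struct {
        uint64_t frame;                   /* physical page (frame) number: the value */
        uint32_t flags;                   /* valid, read-write, dirty, ... */
    } pte_t;

    /* Translate a virtual address; returns 0 and sets *phys on success,
     * -1 if the page is not resident (which would trigger a page fault). */
    int translate(pte_t *page_table, uint64_t vaddr, uint64_t *phys) {
        uint64_t vpn = vaddr >> PAGE_SHIFT;          /* virtual page number: the key */
        uint64_t off = vaddr & ((1u << PAGE_SHIFT) - 1);
        pte_t *pte = &page_table[vpn];
        if (!(pte->flags & PTE_VALID))
            return -1;                               /* page fault */
        *phys = (pte->frame << PAGE_SHIFT) | off;
        return 0;
    }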

Assume each PTE consumes 4B; if a process has 2M entries, the page table consumes 8MB per process. If 100 processes run simultaneously, 800MB of memory is needed just for page tables! The TLB (translation look-aside buffer) is a cache for PTEs, used to accelerate address translation. The TLB is flushed (shot down) when the scheduler switches processes, since its keys are meaningless for the new process. It takes a while for the TLB to warm up, so the scheduler should not switch processes too frequently. Some TLBs carry a process id in each entry; this kind of TLB does not need to be flushed on a process switch.
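
A sketch of such a tagged TLB, with illustrative names; a lookup hits only when both the page number and the process id match, so entries belonging to other processes can safely stay in the TLB across a context switch:

    #include <stdint.h>

    #define TLB_ENTRIES 64

    typedef struct {
        uint64_t vpn;     /* virtual page number (the key) */
        uint64_t frame;   /* physical frame number (the value) */
        uint16_t asid;    /* owning process id tag */
        uint8_t  valid;
    } tlb_entry_t;

    static tlb_entry_t tlb[TLB_ENTRIES];

    /* Fully associative lookup: hit only if vpn AND asid both match. */
    int tlb_lookup(uint64_t vpn, uint16_t asid, uint64_t *frame) {
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].asid == asid) {
                *frame = tlb[i].frame;
                return 1;   /* TLB hit */
            }
        }
        return 0;           /* TLB miss: walk the page table instead */
    }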

3, important techniques in VM

Shared memory

Two or more processes share memory by setting up their page tables to map part of their logical memory to the same physical memory. This lets more processes run simultaneously.
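
A minimal sketch using POSIX shared memory, where shm_open names a shared object and mmap with MAP_SHARED maps it so that every mapping process ends up with page-table entries pointing at the same physical frames; the name /demo_shm is a placeholder:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        /* Create (or open) a named shared-memory object, sized to one page. */
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        if (fd < 0) return 1;
        ftruncate(fd, 4096);

        /* MAP_SHARED makes writes visible to every process that maps it. */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) return 1;

        strcpy(p, "hello");   /* another process mapping /demo_shm sees this */

        munmap(p, 4096);
        close(fd);
        return 0;
    }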

Demand paging

The instructions being executed must be in physical memory. One way to guarantee this is to place the entire logical space into physical memory. Dynamic loading can ease that restriction, but it requires work from the programmer (building a dynamic library).

Demand paging makes it possible to execute a program that is only partially in physical memory. With demand-paged VM, pages are loaded only when they are demanded during program execution; pages that are never accessed are thus never loaded into physical memory. What happens when a process tries to access a page that has not been brought into physical memory? Access to a page marked invalid causes a page-fault trap. The procedure for handling a page fault is as follows (a sketch in C follows the list).

    a, check the process control block to determine whether this reference is a valid or invalid memory access.

    b, if the reference is invalid, terminate the process. Otherwise, if we have not yet brought in that page, we now page it in.

    c, select a free frame from the free-frame list.

    d, we perform a disk operation to read the page into the newly allocated frame.

    e, when the disk read is complete, we modify the page table to indicate the page is in memory (valid).

    f, restart the instruction that was interrupted by the trap.
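
The steps above condensed into a C-style sketch; the structures and helper functions are hypothetical stand-ins for real kernel machinery:

    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define PTE_VALID  0x1

    struct pte     { uint64_t frame; uint32_t flags; };
    struct process { struct pte *page_table; /* ... */ };

    /* Hypothetical helpers standing in for real kernel machinery. */
    int      is_valid_reference(struct process *p, uint64_t vaddr);
    void     terminate(struct process *p);
    uint64_t allocate_frame(void);
    void     read_page_from_disk(struct process *p, uint64_t vpn, uint64_t frame);
    void     restart_instruction(struct process *p);

    void handle_page_fault(struct process *proc, uint64_t vaddr) {
        uint64_t vpn = vaddr >> PAGE_SHIFT;

        /* a. check the process's records: valid or invalid reference? */
        if (!is_valid_reference(proc, vaddr)) {
            terminate(proc);              /* b. invalid reference */
            return;
        }
        /* c. take a frame from the free-frame list */
        uint64_t frame = allocate_frame();

        /* d. read the desired page from disk into the new frame */
        read_page_from_disk(proc, vpn, frame);

        /* e. mark the page as resident in the page table */
        proc->page_table[vpn].frame = frame;
        proc->page_table[vpn].flags |= PTE_VALID;

        /* f. restart the instruction that trapped */
        restart_instruction(proc);
    }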

The performance of demand paging hinges on a low page-fault rate, since reading a page in from disk is very time-consuming.
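
To make this concrete, assume a memory access costs 200ns and servicing a page fault costs 8ms (both figures assumed for illustration). With page-fault rate p, the effective access time is (1 - p) x 200ns + p x 8,000,000ns; even p = 0.001 yields about 8,200ns, i.e. memory appears roughly 40 times slower, so p must be kept very small.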

Copy on Write

Fork uses a technique known as copy-on-write, which provides rapid process creation and maximizes the sharing of pages between child and parent. The child process initially shares all of the parent's pages, since fork simply duplicates the parent's page table. If either the parent or the child attempts to modify a page marked copy-on-write, say the child, the OS creates a copy of that frame and modifies the child's page table to map the copy into its logical space. The child then modifies its copied frame, not the frame belonging to the parent. All unmodified pages remain shared between parent and child.
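
The effect is visible from user space: in the sketch below, parent and child each end up with their own logical copy of a global variable even though, physically, they shared the frame until the child's write triggered the copy:

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int counter = 100;    /* one logical copy per process after fork */

    int main(void) {
        pid_t pid = fork();   /* child's page table duplicates the parent's;
                                 frames are shared and marked copy-on-write */
        if (pid == 0) {
            counter = 200;    /* first write: the kernel copies the frame and
                                 remaps the child's page to the private copy */
            printf("child:  counter = %d\n", counter);   /* 200 */
        } else {
            wait(NULL);
            printf("parent: counter = %d\n", counter);   /* still 100 */
        }
        return 0;
    }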

Page replacement

While a process is executing, a page fault occurs. The OS determines where the desired page resides on disk but then finds that there are no free frames on the free-frame list; all memory is in use, so page replacement is needed. It takes the following approach: if no frame is free, find one that is not currently being used and free it (the victim frame). A frame is freed by writing its contents to swap space and changing the page table to indicate that the page is no longer in memory. The freed frame can now hold the page for which the process faulted, and the page-fault routine continues as before. Page replacement can be sped up by using the dirty bit in the PTE: a frame that was never modified need not be swapped out.
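
A sketch of that eviction step with hypothetical helpers; note how the dirty bit lets a clean victim skip the swap-out write entirely:

    #include <stdint.h>

    #define PTE_VALID 0x1
    #define PTE_DIRTY 0x4

    struct pte   { uint64_t frame; uint32_t flags; };
    struct frame { struct pte *pte; /* back-pointer to the mapping PTE */ };

    /* Hypothetical helpers. */
    struct frame *choose_victim(void);        /* the replacement policy */
    void          write_to_swap(struct frame *f);

    /* Free one frame so the faulting page can be read into it. */
    struct frame *evict_one_frame(void) {
        struct frame *victim = choose_victim();
        /* Only a modified (dirty) page must be written back to swap space;
         * a clean page already has an identical copy on disk. */
        if (victim->pte->flags & PTE_DIRTY)
            write_to_swap(victim);
        victim->pte->flags &= ~PTE_VALID;   /* the old page is no longer in memory */
        return victim;
    }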

There are several page-replacement algorithms, including FIFO page replacement, optimal page replacement, LRU page replacement, LRU-approximation page replacement (used in some operating systems) and so on.
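
As a concrete example, here is a sketch of the second-chance (clock) algorithm, a common LRU approximation: frames are scanned in a circle, and a frame whose hardware-set reference bit is on gets the bit cleared and a second chance instead of being evicted. Names and sizes are illustrative:

    #include <stddef.h>

    #define NFRAMES 256

    static unsigned char ref_bit[NFRAMES];  /* set by hardware on each access */
    static size_t hand = 0;                 /* the clock hand */

    /* Pick a victim frame: skip (and clear) recently referenced frames. */
    size_t clock_choose_victim(void) {
        for (;;) {
            if (ref_bit[hand] == 0) {
                size_t victim = hand;
                hand = (hand + 1) % NFRAMES;
                return victim;           /* not recently used: evict it */
            }
            ref_bit[hand] = 0;           /* give it a second chance */
            hand = (hand + 1) % NFRAMES;
        }
    }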

Consider the relationship between the working set and the page-fault rate, assuming there is sufficient physical memory to store the working set of the process: the page-fault rate peaks when the process demand-pages in a new working set (a new locality) and drops to a valley once that working set is resident, transitioning between peak and valley over time.

Thrashing means that CPU utilization decreases as the degree of multiprogramming increases. A process is thrashing if it spends more time paging than executing.

4, allocating kernel memory

The kernel needs contiguous physical memory because certain hardware (e.g. devices that interact directly with physical memory) relies on it.

Many operating systems do not subject kernel code and data to the paging system.

Two allocation algorithms

Buddy system

The buddy system uses a power-of-2 allocator. An advantage of the buddy system is how quickly adjacent buddies can be combined to form a larger segment, using a technique known as coalescing. The disadvantage is that rounding up to the next highest power of 2 is likely to cause fragmentation inside the allocated segments (internal fragmentation).
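
A sketch of the two core operations, assuming a power-of-2 allocator managing a region that starts at offset 0: requests are rounded up to the next power of 2, and a block's buddy is found by XOR-ing its offset with its size, which is what makes coalescing fast:

    #include <stdint.h>

    /* Round a request up to the next power of 2 (the source of internal
     * fragmentation: a 33KB request consumes a 64KB segment). */
    uint64_t round_up_pow2(uint64_t n) {
        uint64_t p = 1;
        while (p < n)
            p <<= 1;
        return p;
    }

    /* For a block at 'offset' of power-of-2 'size', the buddy is the adjacent
     * block of the same size; flipping the size bit of the offset finds it in
     * O(1), so two free buddies can be coalesced cheaply. */
    uint64_t buddy_of(uint64_t offset, uint64_t size) {
        return offset ^ size;
    }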

Slab system

A slab is made up of one or more physically contiguous pages, and a cache consists of one or more slabs. There is a single cache for each kind of kernel data structure, e.g. a separate cache for the structure representing process descriptors and a separate cache for the structure representing semaphores. Each cache is populated with objects that are instantiations of the data structure the cache represents.

When a cache is created, a number of objects (initially marked as free) are allocated to the cache. For example, a 12KB slab (made up of three contiguous 4KB pages) could store six 2KB objects. When a new object for a kernel data structure is needed, the allocator assigns a free object from the cache to satisfy the request; the assigned object is marked as used.

The slab allocator has two main benefits: no memory is wasted due to fragmentation, and memory requests can be satisfied quickly.
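
A toy model of a slab cache, with one object type per cache as described above; a real implementation keeps per-slab bookkeeping and object constructors, which are omitted here:

    #include <stddef.h>

    /* A toy cache: one slab of N equal-sized objects plus a free list. */
    #define SLAB_OBJS 6

    struct kcache {
        size_t obj_size;               /* e.g. 2KB for some kernel structure */
        void  *free_list[SLAB_OBJS];   /* objects currently marked free */
        int    nfree;
    };

    /* Allocation: hand out a free object; O(1), no searching, no splitting. */
    void *kcache_alloc(struct kcache *c) {
        if (c->nfree == 0)
            return NULL;                   /* a real allocator would grow a new slab */
        return c->free_list[--c->nfree];   /* mark the object used */
    }

    /* Freeing just returns the object to the list; it is never unmapped,
     * so the next request for this type is satisfied immediately. */
    void kcache_free(struct kcache *c, void *obj) {
        c->free_list[c->nfree++] = obj;
    }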

Original article: https://www.cnblogs.com/Torstan/p/2560073.html