四、内核启动（二）

4.1 MMU设置续

　　上一节分析到调用 __armv4_mmu_cache_on，执行如下，这里我们要分析 set_mmu 函数

4.1.1 __setup_mmu

　　前文已经分析过在内核最终运行地址r4下面有16KB的空间（我环境中是0x00004000~0x00008000），这就是用来存放页表的，但是现在要建立的页表在内核真正启动后会被销毁，只是用于零时存放。同时这里将要建立的页表映射关系是1:1映射（即虚拟地址 == 物理地址）。

　　首先开始执行的给页表留出空间，将页表的起始地址保存到 R3 中，R4 中保存的内容是 ZRELADDR，然后对齐页表，经过两次 bit 指令， R3 中的值的低 14 位均为0，实际上是对齐到了 16KB 边界。

1 __setup_mmu:    sub    r3, r4, #16384        @ Page directory size
2         bic    r3, r3, #0xff        @ Align the pointer
3         bic    r3, r3, #0x3f00

常数 #16384 的由来：

32 位的 RAM 系统，寻址空间为 4GB，此处每一个页表项代表 1MB，则需要 4096 个页表项。同时，每一个页表项的大小为 4Byte，那么就需要 4096 * 4Byte = 16384Byte = 16KB 的空间。
此时页表项的每项对应 1MB 的内存空间，其格式为 段(section)页表项 ，如下图所示：

　　图中 bit4 XN 为不可执行位， bit3 C 为 cacheable，bit2 B 为 bufferable

　　注： L1页表项的格式有 4 种，分别为 Fault 页表项、Section 页表项、Page Table 页表项和 Supersection 页表项。详细内容参考 ARM 手册

　　继续向下执行：

 1 /*
 2  * Initialise the page tables, turning on the cacheable and bufferable
 3  * bits for the RAM area only.
 4  */
 5         mov    r0, r3
 6         mov    r9, r0, lsr #18
 7         mov    r9, r9, lsl #18        @ start of RAM
 8         add    r10, r9, #0x10000000    @ a reasonable RAM size

　　注释说的很清楚，初始化页表，打开 cacheable 和 bufferable 位

　　对物理RAM空间建立cache和buffer。然后通过将r0（r3）中的地址值右移18位再左移18位（即清零r3中地址的低18位），得到物理RAM空间的“初始地址”（其实是估计值）并保存到r9（0x00000000）中去，然后将该地址加上256MB的大小作为物理RAM的“结束地址”（也是估计值）并保存到r10（0x10000000）中去。这里的另一个隐含意思也就是最多映射256MB大小的空间

　　R0 = R3; R9 保存的实际是 R3 的高 14 位的内容，低 18 位全为0，对齐到了 256MB 的边界； R10 = R9 + 256MB.

　　比较 R1 和 R9，若 R1 >= R9，则继续比较 R10 和 R1，之后 R1 = 0xC02，用于设置MMU区域表项的低12位状态位

1         mov    r1, #0x12        @ XN|U + section mapping
2         orr    r1, r1, #3 << 10    @ AP=11
3         add    r2, r3, #16384

　　继续向下执行：

1 1:        cmp    r1, r9            @ if virt > start of RAM
2         cmphs    r10, r1            @   && end of RAM > virt
3         bic    r1, r1, #0x1c        @ clear XN|U + C + B
4         orrlo    r1, r1, #0x10        @ Set XN|U for non-RAM
5         orrhs    r1, r1, r6        @ set RAM section settings
6         str    r1, [r0], #4        @ 1:1 mapping
7         add    r1, r1, #1048576
8         teq    r0, r2
9         bne    1b

　　接着r1比较r9和r10以设置MMU区域表项状态位：（其中r6中的值在前面__armv4_mmu_cache_on中赋值）

　　（1） r1 > r9 && r1 <r10 （r1的值在物理RAM地址范围内）：

　　设置RAM表项的C+B 位来开启cache和buffer，同时清除XN表示可执行code

　　（2） r1 < r9 || r1 > r10（r1的值在物理RAM地址范围外）：

　　设置RAM表项的XN位并清除C+B位来关闭cache和buffer，不可执行code

　　在设置完状态为后就要写入页表的相应地址中去了，然后将页表的地址+4（指向下一个表项），物理地址空间+1M设置下一项（下一个需要映射物理地址的基地址），直到填完所有的4096表项。设置完后页表项与映射关系如下：

　　如果代码不是运行在RAM中而是运行在FLASH中的，则映射2MB代码，如果运行在RAM中，则这部分代码重复前面的工作。

 1 /*
 2  * If ever we are running from Flash, then we surely want the cache
 3  * to be enabled also for our execution instance...  We map 2MB of it
 4  * so there is no map overlap problem for up to 1 MB compressed kernel.
 5  * If the execution is in RAM then we would only be duplicating the above.
 6  */
 7         orr    r1, r6, #0x04        @ ensure B is set for this
 8         orr    r1, r1, #3 << 10
 9         mov    r2, pc
10         mov    r2, r2, lsr #20
11         orr    r1, r1, r2, lsl #20
12         add    r0, r3, r2, lsl #2
13         str    r1, [r0], #4
14         add    r1, r1, #1048576
15         str    r1, [r0]
16         mov    pc, lr
17 ENDPROC(__setup_mmu)

　　至此，cache on 分析结束，回到主干上继续执行，执行 restart 标签中的语句，继续是在not_angel中

4.2 restart

1 restart:    adr    r0, LC0
2         ldmia    r0, {r1, r2, r3, r6, r10, r11, r12}
3         ldr    sp, [r0, #28]

　　通过前面LC0地址表的内容可见，这里r0中的内容就是编译时决定的LC0的实际运行地址（特别注意不是链接地址），然后调用ldmia命令依次将LC0地址表处定义的各个地址加载到r1、r2、r3、r6、r10、r11、r12和SP寄存器中去。执行之后各个寄存器中保存内容的意义如下：

　　（1） r0：LC0标签处的运行地址

　　（2） r1：LC0标签处的链接地址

　　（3） r2：__bss_start处的链接地址

　　（4） r3：_ednd处的链接地址（即程序结束位置）

　　（5） r6：_edata处的链接地址（即数据段结束位置）

　　（6） r10：压缩后内核数据大小位置

　　（7） r11：GOT表的启示链接地址

　　（8） r12：GOT表的结束链接地址

　　（9） sp：栈空间结束地址

　　在获取了LC0的链接地址和运行地址后，就可以通过计算这两者之间的差值来判断当前运行的地址是否就是编译时的链接地址。

1         /*
2          * We might be running at a different address.  We need
3          * to fix up various pointers.
4          */
5         sub    r0, r0, r1        @ calculate the delta offset
6         add    r6, r6, r0        @ _edata
7         add    r10, r10, r0        @ inflated kernel size location

　　将运行地址和链接地址的偏移保存到r0寄存器中，然后更新r6和r10中的地址，将其转换为实际的运行地址。

 1         /*
 2          * The kernel build system appends the size of the
 3          * decompressed kernel at the end of the compressed data
 4          * in little-endian form.
 5          */
 6         ldrb    r9, [r10, #0]
 7         ldrb    lr, [r10, #1]
 8         orr    r9, r9, lr, lsl #8
 9         ldrb    lr, [r10, #2]
10         ldrb    r10, [r10, #3]
11         orr    r9, r9, lr, lsl #16
12         orr    r9, r9, r10, lsl #24

　　注释中说明了，内核编译系统在压缩内核时会在末尾处以小端模式附上未压缩的内核大小，这部分代码的作用就是将该值计算出来并保存到r9寄存器中去

 1 #ifndef CONFIG_ZBOOT_ROM
 2         /* malloc space is above the relocated stack (64k max) */
 3         add    sp, sp, r0
 4         add    r10, sp, #0x10000
 5 #else
 6         /*
 7          * With ZBOOT_ROM the bss/stack is non relocatable,
 8          * but someone could still run this code from RAM,
 9          * in which case our reference is _edata.
10          */
11         mov    r10, r6
12 #endif

　　这里将镜像的结束地址保存到r10中去，我这里并没有定义ZBOOT_ROM（如果定义了ZBOOT_ROM则bss和stack是非可重定位的），这里将r10设置为sp结束地址上64kb处（这64kB空间是用来作为堆空间的）。

　　接下来内核如果配置为支持设备树（DTB）会做一些特别的工作，我这里没有配置（#ifdef CONFIG_ARM_APPENDED_DTB），所以先跳过。

 1 /*
 2  * Check to see if we will overwrite ourselves.
 3  *   r4  = final kernel address (possibly with LSB set)
 4  *   r9  = size of decompressed image
 5  *   r10 = end of this image, including  bss/stack/malloc space if non XIP
 6  * We basically want:
 7  *   r4 - 16k page directory >= r10 -> OK
 8  *   r4 + image length <= address of wont_overwrite -> OK
 9  * Note: the possible LSB in r4 is harmless here.
10  */
11         add    r10, r10, #16384
12         cmp    r4, r10
13         bhs    wont_overwrite
14         add    r10, r4, r9
15         adr    r9, wont_overwrite
16         cmp    r10, r9
17         bls    wont_overwrite

　　这部分代码用来分析当前代码是否会和最后的解压部分重叠，如果有重叠则需要执行代码搬移。首先比较内核解压地址r4-16Kb（这里是0x00004000，包括16KB的内核页表存放位置）和r10，如果r4 – 16kB >= r10，则无需搬移，否则继续计算解压后的内核末尾地址是否在当前运行地址之前，如果是则同样无需搬移，不然的话就需要进行搬移了。

　　总结一下可能的3种情况：

　　（1）内核起始地址– 16kB >= 当前镜像结束地址：无需搬移

　　（2）内核结束地址 <= wont_overwrite运行地址：无需搬移

　　（3）内核起始地址– 16kB < 当前镜像结束地址 && 内核结束地址 > wont_overwrite运行地址：需要搬移

　　仔细分析一下，这里内核真正运行的地址是0x00004000，而现在代码的运行地址显然已经在该地址之后了反汇编发现wont_overwrite的运行地址是0x00008000+0x00000168），而且内核解压后的空间必然会覆盖掉这里（内核解压后的大小大于0x00000168），所以这里会执行代码搬移。

 1 /*
 2  * Relocate ourselves past the end of the decompressed kernel.
 3  *   r6  = _edata
 4  *   r10 = end of the decompressed kernel
 5  * Because we always copy ahead, we need to do it from the end and go
 6  * backward in case the source and destination overlap.
 7  */
 8         /*
 9          * Bump to the next 256-byte boundary with the size of
10          * the relocation code added. This avoids overwriting
11          * ourself when the offset is small.
12          */
13         add    r10, r10, #((reloc_code_end - restart + 256) & ~255)
14         bic    r10, r10, #255
15 
16         /* Get start of code we want to copy and align it down. */
17         adr    r5, restart
18         bic    r5, r5, #31

　　从这里开始会将镜像搬移到解压的内核地址之后，首先将解压后的内核结束地址进行扩展，扩展大小为代码段的大小（reloc_code_end定义在head.s的最后）保存到r10中，即搬运目的起始地址，然后r5保存了restart的起始地址，并进行对齐，即搬运的原起始地址。反汇编查看这里扩展的大小为0x800。

 1         sub    r9, r6, r5        @ size to copy
 2         add    r9, r9, #31        @ rounded up to a multiple
 3         bic    r9, r9, #31        @ ... of 32 bytes
 4         add    r6, r9, r5
 5         add    r9, r9, r10
 6 
 7 1:        ldmdb    r6!, {r0 - r3, r10 - r12, lr}
 8         cmp    r6, r5
 9         stmdb    r9!, {r0 - r3, r10 - r12, lr}
10         bhi    1b
11 
12         /* Preserve offset to relocated code. */
13         sub    r6, r9, r6
14 
15 #ifndef CONFIG_ZBOOT_ROM
16         /* cache_clean_flush may use the stack, so relocate it */
17         add    sp, sp, r6
18 #endif
19 
20         bl    cache_clean_flush
21 
22         badr    r0, restart
23         add    r0, r0, r6
24         mov    pc, r0

　　这里首先计算出需要搬运的大小保存到r9中，搬运的原结束地址到r6中，搬运的目的结束地址到r9中。注意这里只搬运代码段和数据段，并不包含bss、栈和堆空间。

　　接下来开始执行代码搬移，这里是从后往前搬移，一直到r6 == r5结束，然后r6中保存了搬移前后的偏移，并重定向栈指针（cache_clean_flush可能会使用到栈）。

　　之后调用调用cache_clean_flush清楚缓存，然后将PC的值设置为搬运后restart的新地址，然后重新从restart开始执行。这次由于进行了代码搬移，所以会在检查自覆盖时进入wont_overwrite处执行。

 1 wont_overwrite:
 2 /*
 3  * If delta is zero, we are running at the address we were linked at.
 4  *   r0  = delta
 5  *   r2  = BSS start
 6  *   r3  = BSS end
 7  *   r4  = kernel execution address (possibly with LSB set)
 8  *   r5  = appended dtb size (0 if not present)
 9  *   r7  = architecture ID
10  *   r8  = atags pointer
11  *   r11 = GOT start
12  *   r12 = GOT end
13  *   sp  = stack pointer
14  */
15         orrs    r1, r0, r5
16         beq    not_relocated
17 
18         add    r11, r11, r0
19         add    r12, r12, r0

　　这里的注释列出了现有所有寄存器值得含义，如果r0为0则说明当前运行的地址就是链接地址，无需进行重定位，跳转到not_relocated执行，但是这里运行的地址已经被移动到内核解压地址之后，显然不会是链接地址0x00000168（反汇编代码中得到），所以这里需要重新修改GOT表中的变量地址来实现重定位。

 1         add    r11, r11, r0
 2         add    r12, r12, r0
 3 
 4 #ifndef CONFIG_ZBOOT_ROM
 5         /*
 6          * If we're running fully PIC === CONFIG_ZBOOT_ROM = n,
 7          * we need to fix up pointers into the BSS region.
 8          * Note that the stack pointer has already been fixed up.
 9          */
10         add    r2, r2, r0
11         add    r3, r3, r0
12 
13         /*
14          * Relocate all entries in the GOT table.
15          * Bump bss entries to _edata + dtb size
16          */
17 1:        ldr    r1, [r11, #0]        @ relocate entries in the GOT
18         add    r1, r1, r0        @ This fixes up C references
19         cmp    r1, r2            @ if entry >= bss_start &&
20         cmphs    r3, r1            @       bss_end > entry
21         addhi    r1, r1, r5        @    entry += dtb size
22         str    r1, [r11], #4        @ next entry
23         cmp    r11, r12
24         blo    1b
25 
26         /* bump our bss pointers too */
27         add    r2, r2, r5
28         add    r3, r3, r5

　　更新GOT表的运行起始地址到r11和结束地址到r12中去，然后同样更新BSS段的运行地址（需要修正BSS段的指针）。然后进入“1”标签中开始执行重定位。

　　通过r1获取GOT表中的一项，然后对这一项的地址进行修正，如果修正后的地址 < BSS段的起始地址，或者在BSS段之中则再加上DTB的大小（如果不支持DTB则r5的值为0），然后再将值写回GOT表中去。如此循环执行直到遍历完GOT表。

　　在重定位完成后，继续执行not_relocated部分代码，这里循环清零BSS段。

1 not_relocated:    mov    r0, #0
2 1:        str    r0, [r2], #4        @ clear bss
3         str    r0, [r2], #4
4         str    r0, [r2], #4
5         str    r0, [r2], #4
6         cmp    r2, r3
7         blo    1b

　　这里检测r4中的最低位，如果已经置位则说明在前面执行restart前并没有执行cache_on来打开缓存（见前文），这里补执行。

1         /*
2          * Did we skip the cache setup earlier?
3          * That is indicated by the LSB in r4.
4          * Do it now if so.
5          */
6         tst    r4, #1
7         bic    r4, r4, #1
8         blne    cache_on

 1 /*
 2  * The C runtime environment should now be setup sufficiently.
 3  * Set up some pointers, and start decompressing.
 4  *   r4  = kernel execution address
 5  *   r7  = architecture ID
 6  *   r8  = atags pointer
 7  */
 8         mov    r0, r4
 9         mov    r1, sp            @ malloc space above stack
10         add    r2, sp, #0x10000    @ 64k max
11         mov    r3, r7
12         bl    decompress_kernel
13         bl    cache_clean_flush
14         bl    cache_off
15         mov    r1, r7            @ restore architecture number
16         mov    r2, r8            @ restore atags pointer

　　到此为止，C语言的执行环境已经准备就绪，设置一些指针就可以开始解压内核了（这里的内核解压部分是使用C代码写的）。

　　跳到 decompress_kernel 跳转到内核C代码运行。

　　这里r0~r3的4个寄存器是decompress_kernel()函数传参用的，r0传入内核解压后的目的地址，r1传入堆空间的起始地址，r2传入堆空间的结束地址，r3传入机器码，然后就开始调用decompress_clean_flush()函数执行内核

　　decompress_kernel()是用于解压内核