自旋锁spinlock

1 在单处理器上的实现

单核系统上，不存在严格的并发，因此对资源的共享主要是多个任务分时运行造成的。

只要在某一时段，停止任务切换，并且关中断（对于用户态应用程序，不大可能与中断处理程序抢临界区资源），或者对临界区资源的访问可以用一条原子指令完成，就能够保证只有一个任务在运行。

这就是spinlock的实现机制。

   1: #define __LOCK(lock)    2:   do { preempt_disable(); __acquire(lock); (void)(lock); } while (0)

   1: #define _raw_spin_lock(lock)            __LOCK(lock)

   1: static inline void spin_lock(spinlock_t *lock)   2: {   3:     raw_spin_lock(&lock->rlock);   4: }

   1: #define __UNLOCK(lock)    2:   do { preempt_enable(); __release(lock); (void)(lock); } while (0)

即基本上就是调用preempt_disable和preempt_enable。

但是，在单处理器结构中的原子指令，比如“测试并设置”指令，如果两个处理器同时在执行，那么就会发生竞争，就已经不再是原子的了。

spinlock_t结构的定义

   1: typedef struct spinlock {   2:     union {   3:         struct raw_spinlock rlock;   4:     5: #ifdef CONFIG_DEBUG_LOCK_ALLOC   6: # define LOCK_PADSIZE (offsetof(struct raw_spinlock, dep_map))   7:         struct {   8:             u8 __padding[LOCK_PADSIZE];   9:             struct lockdep_map dep_map;  10:         };  11: #endif  12:     };  13: } spinlock_t;

   1: typedef struct raw_spinlock {   2:     arch_spinlock_t raw_lock;   3: #ifdef CONFIG_GENERIC_LOCKBREAK   4:     unsigned int break_lock;   5: #endif   6: #ifdef CONFIG_DEBUG_SPINLOCK   7:     unsigned int magic, owner_cpu;   8:     void *owner;   9: #endif  10: #ifdef CONFIG_DEBUG_LOCK_ALLOC  11:     struct lockdep_map dep_map;  12: #endif  13: } raw_spinlock_t;

   1: typedef struct arch_spinlock {   2:     unsigned int slock;   3: } arch_spinlock_t;

其实 spinlock_t类型，只是包裹了一个unsigned int而已，通过 lock->rlock可以引用到这个unsigned int。

   1: static inline int spin_trylock(spinlock_t *lock)   2: {   3:     return raw_spin_trylock(&lock->rlock);   4: }

对于单CPU系统来说，执行一条指令，不论是什么样的指令，都是“原子”的，因为单CPU的并行只是在多个任务切换之间发生的，而任务切换的时机一定是在一条指令执行完之后，所以执行一条指令，肯定不会受到另外一条任务的干扰。

即使是中断，也不能干扰正在执行的这条指令。

因此单CPU系统中的任何单条指令都是原子的。

原子性也可以从总线的角度来考虑，如果一个操作在执行期间，对系统总线是独占的，即其他指令无法请求系统总线，那么这个操作就是原子的。因为无法操作系统总线，就意味着无法读写内存。

但是引入了缓存，这个结论还适用吗？

但是，这个结论在SMP系统中并不适用。

SMP系统中，因为有些指令是同时包含着对内存的“读，写”操作的，那么多个CPU在同时执行这样的复合指令时，就可能会产生竞争情况。对于这样的指令，CPU提供了给系统总线加锁的机制。

在一条汇编指令前加上前缀LOCK，在这条指令执行时候，执行它的CPU会将CPU上的LOCK引脚的电位拉低，从而通知总线，该CPU需要独占总线，禁止其他CPU对系统总线的请求。

有一条特殊指令，xchg，它是用来交换两个操作数的内容的。

如果两个操作数都是CPU的寄存器，那么肯定是原子的；

如果一个操作数是寄存器，而另一个操作数是内存，那么会相当于自动在该指令前边加上LOCK；

不可能两个操作数都是内存。

所以，xchg操作在SMP系统中无论如何都是原子的。

题外话题：

GNU asm inline block

语法参见：http://ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html

关于”memory”以及”volatile”的解释：

5.3 Clobber List.

Some instructions clobber some hardware registers. We have to list those registers in the clobber-list, ie the field after the third ’:’ in the asm function. This is to inform gcc that we will use and modify them ourselves. 【clobber列表告诉GCC编译器，这些寄存器我们的汇编块会使用并修改，GCC编译器不要染指这些寄存器】So gcc will not assume that the values it loads into these registers will be valid. We shoudn’t list the input and output registers in this list. Because, gcc knows that "asm" uses them (because they are specified explicitly as constraints). If the instructions use any other registers, implicitly or explicitly (and the registers are not present either in input or in the output constraint list), then those registers have to be specified in the clobbered list.

If our instruction can alter the condition code register, we have to add "cc" to the list of clobbered registers.

If our instruction modifies memory in an unpredictable fashion, add "memory" to the list of clobbered registers. This will cause GCC to not keep memory values cached in registers across the assembler instruction. 【告诉GCC不要对汇编块里面的语句进行优化，即不要使用寄存器中的对内存的拷贝，而直接使用内存】

We also have to add the volatile keyword if the memory affected is not listed in the inputs or outputs of the asm.【如果被影响到的内存没有在input/output列表中指明，还要加上volatile】

We can read and write the clobbered registers as many times as we like. Consider the example of multiple instructions in a template; it assumes the subroutine _foo accepts arguments in registers eax and ecx.
asm ("movl %0,%%eax; movl %1,%%ecx; call _foo" : /* no outputs */ : "g" (from), "g" (to) : "eax", "ecx" );
5.4 Volatile ...?

If you are familiar with kernel sources or some beautiful code like that, you must have seen many functions declared as volatile or __volatile__ which follows an asm or __asm__. I mentioned earlier about the keywords asm and __asm__. So what is this volatile?

If our assembly statement must execute where we put it, 【汇编指令必须按照我们定义好的顺序执行】(i.e. must not be moved out of a loop as an optimization), put the keyword volatile after asm and before the ()’s. So to keep it from moving, deleting and all, we declare it as

asm volatile ( ... : ... : ... : ...);

Use __volatile__ when we have to be verymuch careful.

If our assembly is just for doing some calculations and doesn’t have any side effects, it’s better not to use the keyword volatile. Avoiding it helps gcc in optimizing the code and making it more beautiful.

通过上面的LOCK的方法，可以保证指令在SMP系统中的原子性，那么就可以很方便地利用这些指令来保证对某个数据结构的读取和修改的原子性，利用这个数据结构就可以继续实现更高级别的并发控制。

参考：http://www.ibm.com/developerworks/cn/linux/l-cn-spinlock/index.html

传统自旋锁的实现

   1: static inline void __raw_spin_lock(raw_spinlock_t *lock)   2: {   3:     asm volatile("
1:	"   4:              LOCK_PREFIX " ; decb %0
	"   5:              "jns 3f
"   6:              "2:	"   7:              "rep;nop
	"   8:              "cmpb $0,%0
	"   9:              "jle 2b
	"  10:              "jmp 1b
"  11:              "3:
	"  12:              : "+m" (lock->slock) : : "memory");  13: }

简单地解析一下，就是原子地递减lock->slock的值，如果该值在递减后仍不是负数，那么代表锁可用，就可以安逸地返回，尽情地享用该锁。

如果该值在递减后是负数，代表锁被其他CPU使用，只能不停地循环进行Nop操作，直到发现该值已经变成了非负，那么再进行递减判断，如果递减后仍不是负数，就可以享用该锁了。

这里面关键的地方，在于多个CPU通过一个内存位置来进行同步状态的维护。因此volatile/”memory”手段都用上了，以保证操作的精准性，而且读取的一定是内存中的值，而不是寄存器或者缓存中的值。

因为只有dec这一条指令被LOCK保护，即当前CPU在对该内存区域进行操作时，保证其他CPU不能够读到该内存，会被阻塞。但是当前CPU进行读内存时，则没有进行保护，这是因为对于内存位置的只读是不会造成竞争的，就像rwlock一样；同时，对于内存的一个单纯的读操作，也是原子的，不会被其他CPU干扰到。

也就是说，不写的时候随便读，写的时候不能读。

dec操作不是严格意义上的原子操作，其中包含了“读-递减-写”的复合过程，所以需要LOCK。

   1: static inline void __raw_spin_unlock(raw_spinlock_t *lock)   2: {   3:     asm volatile("movb $1,%0" : "+m" (lock->slock) :: "memory");   4: }

如果获得自旋锁的CPU在使用完了该锁以后，则简单地将内存中的lock->slock设置成1。但是为什么设置的时候，没有加LOCK机制呢？

因为单纯地写内存，也是原子指令。

排队自旋锁

排队自旋锁是对原始自旋锁的一种改进，主要解决的是“公平”问题。

即保证先申请的CPU会先获得自旋锁，FIFO。

这是通过将lock->slock分成两部分完成的：Next, Owner。

分配的方案有两种，8：8，这时unsigned int的高16位不使用。

或者16：16, 这时unsigned int的全部数位都被使用。

我们看一下实现

   1: static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)   2: {   3:     short inc = 0x0100;   4:     5:     asm volatile (   6:         LOCK_PREFIX "xaddw %w0, %1
"   7:         "1:	"   8:         "cmpb %h0, %b0
	"   9:         "je 2f
	"  10:         "rep ; nop
	"  11:         "movb %1, %b0
	"  12:         /* don't need lfence here, because loads are in-order */  13:         "jmp 1b
"  14:         "2:"  15:         : "+Q" (inc), "+m" (lock->slock)  16:         :  17:         : "memory", "cc");  18: }

inc是每次有CPU进行申请该自旋锁时，都会对lock->slock进行的increment，可见这是8：8的划分方式，即对Next加1，而Onwer保持不变。

In the assembler code, inc is referred to as %0 and slock as %1. Moreover, %b0 denotes the lower 8 bit, i.e. inc % 0x100, and %h0 is inc / 0x100.

http://stackoverflow.com/a/17405681/941650

就是，首先

LOCK_PREFIX "xaddw %w0, %1 "

对总线加锁，并且先将inc与lock->slock交换，然后将二者之和赋给lock->slock，即相当于C=A;A=B;B+=C;的这样的C语言执行序列。

如果这是一个新分配的，没有使用过的自旋锁，会被初始化为0，即Next==Onwer==0

这时寄存器0保存的是原来的Next与Onwer，如果二者相等，则表明自旋锁可用，就跳出循环，当前申请者享用自旋锁。

否则，当前CPU不断地循环，每个循环中先执行一条Nop指令，然后从内存中加载Onwer到寄存器1，再进行比较，直到获得了该自旋锁。

   1: static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)   2: {   3:     asm volatile(UNLOCK_LOCK_PREFIX "incb %0"   4:              : "+m" (lock->slock)   5:              :   6:              : "memory", "cc");   7: }

放弃当前已经获得的自旋锁的方式，就是简单地对Onwer递增。因为inc不同于set/mov操作，set是单纯的写内存指令，是原子的，而inc是读写组合的复合操作，不是原子的，所以需要对总线加锁进行操作。

缓存，对LOCK以及自旋锁的实现有影响吗？

由于高速缓存的引入，可能出现一种顺序上的错位，即有些事件CPU已经执行完成了，但是还没有反应到系统总线上，即没有反应到内存上。或者说，

会有一些逻辑上已经完成的，但是物理上尚未实现的内存操作

【出自Linux内核源码情景分析】

这种情况的存在，就会影响到自旋锁的真正发挥作用，因为一旦我们的CPU对lock->slock的操作只是缓存在Cache上，那么其他的CPU就无法第一时间获悉到这种改变，从而造成CPU之间沟通上的延迟。

因此，在SMP系统中存在一种机制，用于设置一个时间点，或者说指令序列上的某个特殊点，使得整个SMP系统上的所有CPU都把该条特殊指令之前的缓存下来的所有指令的执行结果都在总线上，内存上最终完成。就好像在某个deadline把所有欠下的债全部还清一样。

这种手段被称为“内存路障”(Memory Barrier)。

内存路障并不是一条单纯的指令，它是一系列的操作都能够产生的后果。

可以通过以下方式，设置内存路障：

1. LOCK指令前缀本身起到内存路障的作用。CPU在执行这条指令之前，会把该CPU中所有已经写入缓存，但是还没有同步到内存上的操作全部冲刷到内存上。

2. 一些特殊指令：

   1: iret   2: cpuid   3: sfence    // storage fence

3. 对控制寄存器CR0~CR4的设置指令，即以CR0~CR4为目标的mov指令。

4. 对调试寄存器的设置指令。

5. 对GDTR，LDTR，IDTR等寄存器的装入操作，即lgdtr, lldtr, lidtr指令。

6. 对高速缓存本身的控制操作。

Snooping（窥探）机制

每个CPU中都有专门的硬件电路，用于监控系统总线上的写操作，如果发现系统总线上写操作的目标存在于本CPU的缓存中，那么就直接将对应的缓存行失效(Invalidate)。

这种机制可以很方便地保证缓存与内存的一致性问题。

刷新TLB缓存

   1: static inline void __native_flush_tlb(void)   2: {   3:     native_write_cr3(native_read_cr3());   4: }   5:     6: static inline void __native_flush_tlb_global(void)   7: {   8:     unsigned long flags;   9:     unsigned long cr4;  10:    11:     /*  12:      * Read-modify-write to CR4 - protect it from preemption and  13:      * from interrupts. (Use the raw variant because this code can  14:      * be called from deep inside debugging code.)  15:      */  16:     raw_local_irq_save(flags);  17:    18:     cr4 = native_read_cr4();  19:     /* clear PGE */  20:     native_write_cr4(cr4 & ~X86_CR4_PGE);  21:     /* write old PGE again and flush TLBs */  22:     native_write_cr4(cr4);  23:    24:     raw_local_irq_restore(flags);  25: }

1. 强制对保存页目录物理地址的CR3寄存器进行写操作，就会使CPU自动将TLB缓存invalidate掉，但是并不会将Global Page刷出TLB。

2. CR4寄存器PGE位用于控制Global Page机制，Global Page机制，让经常使用的或者共享的页在任务切换和写CR3寄存器时，也不会从TLB中被刷出。清空PGE标志，就可以关闭该机制。

全局页机制对于多任务系统很有好处，比如多个任务之间的内核态的页表就是多个任务共享的，所以在任务切换时也不需要将其TLB刷出失效。

Intel手册上关于PGE的解释：

invalidates all TLB entries (including global entries) and all entries in all paging-structure caches.

因此，清空CR4位，可以将TLB包括Global Page都被刷出，整个TLB都失效。

3. CPU还提供了一种单独使某一项TLB失效的机制

   1: static inline void __native_flush_tlb_single(unsigned long addr)   2: {   3:     asm volatile("invlpg (%0)" ::"r" (addr) : "memory");   4: }