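Lock-Free Synchronization: A Counter

Overview

Synchronization is a problem that comes up constantly in concurrent programming. In user space it is usually handled with locks, semaphores, and similar primitives, but all of these carry a performance cost. A simple performance comparison appears at the end of this article.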

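Intel x86 and x86_64 processors support the compare-and-swap (CAS) operation, which combines two steps:

reading the value of A
changing the value of A

into a single atomic operation: the new value is written only if the value read still equals the expected one, and no other CPU can modify A between the read and the write.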

Since version 4.1.0, GCC has supported CAS through built-in functions. The documentation is at: http://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html#_005f_005fsync-Builtins
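To make the "read, then change" idea concrete, here is a minimal sketch of a CAS retry loop built on GCC's __sync_bool_compare_and_swap. The helper name cas_add is hypothetical (it is not part of GCC or of the original program), and the sketch assumes a plain int operand.

```c
#include <assert.h>
#include <stddef.h>

/*
 * cas_add is a hypothetical helper for illustration only.
 * It atomically adds delta to *addr with a CAS retry loop and
 * returns the value that *addr held just before the addition.
 */
static int cas_add(int *addr, int delta)
{
    int old;

    assert(addr != NULL);
    do {
        /* Read the current value; volatile forces a fresh load each time. */
        old = *(volatile int *)addr;
        /* Store old + delta only if *addr still equals old; otherwise retry. */
    } while (!__sync_bool_compare_and_swap(addr, old, old + delta));

    return old;
}
```

For example, with int x = 5, the call cas_add(&x, 3) returns 5 and leaves x equal to 8. If another thread changes x between the read and the CAS, the CAS fails and the loop simply tries again with the fresh value.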

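So when compiling with GCC, we can synchronize with GCC's built-in functions instead of the locking mechanisms provided by the system. This is the so-called "lock-free synchronization". "Lock-free" here does not mean that no locking happens at all; it only means that the system's lock APIs are no longer used.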

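Application Example: A Counter

Here is a sample program. Several threads run concurrently, each trying to increment a global counter 10,000 times, and the main thread prints the final result.

Without synchronization, the final value of the counter is nondeterministic, because g_count++ is a separate read, add, and write that the threads can interleave. The unsynchronized code and its output follow: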

 1 /*
 2  * Uses the GCC __sync_* family of built-in atomic functions.
 3  * Author: 赵子清
 4  * Blog: http://www.cnblogs.com/zzqcn
 5  **/
 6 
 7 #include <sys/time.h>
 8 #include <pthread.h>
 9 #include <stdlib.h>
10 #include <stdio.h>
11 
12 static int g_count = 0;
13 
14 
15 void*  thread_test(void* arg)
16 {
17     int i;
18     for(i=0; i<10000; ++i)
19     {
20         g_count++;
21     }
22 
23     return NULL;
24 }
25 
26 
27 int main(int argc, char** argv)
28 {
29     pthread_t   id[20];
30     int  i;
31     struct timeval  t1, t2;
32     double  t;
33 
34     gettimeofday(&t1, NULL);
35 
36     for(i=0; i<20; ++i)
37         pthread_create(&id[i], NULL, thread_test, NULL);
38 
39     for(i=0; i<20; ++i)
40         pthread_join(id[i], NULL);
41 
42     gettimeofday(&t2, NULL);
43     t = t2.tv_sec - t1.tv_sec + (t2.tv_usec - t1.tv_usec)/1000000.0;
44     
45     printf("count: %d, used: %f s\n", g_count, t);
46     return 0;
47 }

Results of several runs:

count: 138679, used: 0.003055 s
count: 169814, used: 0.003493 s
count: 84474, used: 0.004649 s
count: 96267, used: 0.002249 s
count: 89185, used: 0.002405 s
count: 147552, used: 0.003148 s

Next, we use GCC's built-in synchronization functions to handle the synchronization. Only line 20 of the code needs to change, to:

__sync_fetch_and_add(&g_count, 1);

Several runs of the modified program give:

count: 200000, used: 0.009921 s
count: 200000, used: 0.008430 s
count: 200000, used: 0.008944 s
count: 200000, used: 0.007860 s
count: 200000, used: 0.009346 s
count: 200000, used: 0.004421 s

The counter now reaches 200000 on every run, so the built-in does synchronize the updates.
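For reference, the one-line change can be packaged as a small reusable routine. This is only a sketch: the names run_atomic_counter, atomic_worker, and MAX_THREADS are hypothetical and do not appear in the original program, and thread-creation failures are handled with a bare assert.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64          /* arbitrary cap chosen for this sketch */

static int g_count = 0;

/* Each worker atomically increments g_count *arg times. */
static void *atomic_worker(void *arg)
{
    int n = *(const int *)arg;
    int i;

    for (i = 0; i < n; ++i)
        __sync_fetch_and_add(&g_count, 1);

    return NULL;
}

/* Hypothetical driver: runs nthreads workers and returns the final count. */
static int run_atomic_counter(int nthreads, int iterations)
{
    pthread_t id[MAX_THREADS];
    int i;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    g_count = 0;

    for (i = 0; i < nthreads; ++i) {
        int rc = pthread_create(&id[i], NULL, atomic_worker, &iterations);
        assert(rc == 0);
        (void)rc;
    }
    for (i = 0; i < nthreads; ++i)
        pthread_join(id[i], NULL);

    /* All workers have been joined, so a plain read is safe here. */
    return g_count;
}
```

Calling run_atomic_counter(20, 10000) from main reproduces the setup of the article's program (20 threads, 10,000 increments each). Build with gcc -pthread.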

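Performance Comparison with a Standard Lock

How would a pthread mutex perform on the same example? We replace lines 13 through 24 of the original program with the following code: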

 1 pthread_mutex_t  mutex = PTHREAD_MUTEX_INITIALIZER;
 2 
 3 void*  thread_test(void* arg)
 4 {
 5     int i;
 6     for(i=0; i<10000; ++i)
 7     {
 8         pthread_mutex_lock(&mutex);
 9         g_count++;
10         pthread_mutex_unlock(&mutex);
11     }
12 
13     return NULL;
14 }

After this change, compiling and running the program several times gives:

count: 200000, used: 0.048875 s
count: 200000, used: 0.035149 s
count: 200000, used: 0.053074 s
count: 200000, used: 0.044250 s
count: 200000, used: 0.047366 s
count: 200000, used: 0.047658 s

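With the pthread mutex, the run takes roughly five to six times as long as the atomic version in the runs shown here (about 0.035 to 0.053 s versus 0.004 to 0.010 s). This does not represent the performance of every locking mechanism, but it does give one view of the performance gain that atomic built-ins offer over the system's lock APIs.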
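To repeat the comparison on another machine, both variants can be timed from one program. The sketch below is an assumption about how such a harness might look, not the author's code: time_workers, atomic_worker, and mutex_worker are hypothetical names, and the timing uses gettimeofday as in the original listing. Absolute numbers will vary with hardware, core count, and load.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/time.h>

#define NUM_THREADS 20
#define ITERATIONS  10000

static int g_count = 0;
static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Variant 1: GCC atomic built-in. */
static void *atomic_worker(void *arg)
{
    int i;
    (void)arg;
    for (i = 0; i < ITERATIONS; ++i)
        __sync_fetch_and_add(&g_count, 1);
    return NULL;
}

/* Variant 2: pthread mutex around a plain increment. */
static void *mutex_worker(void *arg)
{
    int i;
    (void)arg;
    for (i = 0; i < ITERATIONS; ++i) {
        pthread_mutex_lock(&g_mutex);
        g_count++;
        pthread_mutex_unlock(&g_mutex);
    }
    return NULL;
}

/*
 * Hypothetical harness: runs NUM_THREADS copies of worker, stores the
 * final counter in *final_count, and returns the elapsed seconds.
 */
static double time_workers(void *(*worker)(void *), int *final_count)
{
    pthread_t id[NUM_THREADS];
    struct timeval t1, t2;
    int i;

    assert(worker != NULL && final_count != NULL);
    g_count = 0;
    gettimeofday(&t1, NULL);

    for (i = 0; i < NUM_THREADS; ++i) {
        int rc = pthread_create(&id[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (i = 0; i < NUM_THREADS; ++i)
        pthread_join(id[i], NULL);

    gettimeofday(&t2, NULL);
    *final_count = g_count;
    return (t2.tv_sec - t1.tv_sec) + (t2.tv_usec - t1.tv_usec) / 1000000.0;
}
```

A main function would call time_workers(atomic_worker, &n) and time_workers(mutex_worker, &n) in turn and print both times. Running each variant several times and comparing the spread gives a fairer picture than a single run.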

References

[1] GCC documentation: Legacy __sync Built-in Functions for Atomic Memory Access