Linux RCU机制详解

关于rcu的几点声明：

1:RCU使用在读者多而写者少的情况.RCU和读写锁相似.但RCU的读者占锁没有任何的系统开销.写者与写写者之间必须要保持同步,且写者必须要等它之前的读者全部都退出之后才能释放之前的资源.
2:RCU保护的是指针.这一点尤其重要.因为指针赋值是一条单指令.也就是说是一个原子操作.因它更改指针指向没必要考虑它的同步.只需要考虑cache的影响.
3:读者是可以嵌套的.也就是说rcu_read_lock()可以嵌套调用.
4:读者在持有rcu_read_lock()的时候,不能发生进程上下文切换.否则,因为写者需要要等待读者完成,写者进程也会一直被阻塞.

5:spin lock是互斥的，任何时候只有一个thread（reader or writer）进入临界区，rw spin lock要好一些，允许多个reader并发执行，提高了性能。不过，reader和updater不能并发执行，RCU解除了这些限制，允许一个updater（不能多个updater进入临界区，这可以通过spinlock来保证）和多个reader并发执行。

核心api：

对于reader，RCU的操作包括：

（1）rcu_read_lock，用来标识RCU read side临界区的开始。

（2）rcu_dereference，该接口用来获取RCU protected pointer。reader要访问RCU保护的共享数据，当然要获取RCU protected pointer，然后通过该指针进行dereference的操作。

（3）rcu_read_unlock，用来标识reader离开RCU read side临界区

对于writer，RCU的操作包括：

（1）rcu_assign_pointer。该接口被writer用来进行removal的操作，在witer完成新版本数据分配和更新之后，调用这个接口可以让RCU protected pointer指向RCU protected data。

（2）synchronize_rcu。writer端的操作可以是同步的，也就是说，完成更新操作之后，可以调用该接口函数等待所有在旧版本数据上的reader线程离开临界区，一旦从该函数返回，说明旧的共享数据没有任何引用了，可以直接进行reclaimation的操作。

（3）call_rcu。当然，某些情况下（例如在softirq context中），writer无法阻塞，这时候可以调用call_rcu接口函数，该函数仅仅是注册了callback就直接返回了，在适当的时机会调用callback函数，完成reclaimation的操作。这样的场景其实是分开removal和reclaimation的操作在两个不同的线程中：updater和reclaimer。

Example1：

struct foo {
int a;
char b;
long c;
};
DEFINE_SPINLOCK(foo_mutex);

struct foo *gbl_foo;
void foo_update_a(int new_a)
{
struct foo *new_fp;
struct foo *old_fp;

new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
spin_lock(&foo_mutex);
old_fp = gbl_foo;
*new_fp = *old_fp;
new_fp->a = new_a;
rcu_assign_pointer(gbl_foo, new_fp);
spin_unlock(&foo_mutex);
synchronize_rcu();
kfree(old_fp);
}

int foo_get_a(void)
{
int retval;

rcu_read_lock();
retval = rcu_dereference(gbl_foo)->a;
rcu_read_unlock();
return retval;
}
如上代码所示,RCU被用来保护全局指针struct foo *gbl_foo. foo_get_a()用来从RCU保护的结构中取得gbl_foo的值.而foo_update_a()用来更新被RCU保护的gbl_foo的值.
另外,我们思考一下,为什么要在foo_update_a()中使用自旋锁foo_mutex呢?
假设中间没有使用自旋锁.那foo_update_a()的代码如下:

void foo_update_a(int new_a)
{
struct foo *new_fp;
struct foo *old_fp;

new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);

old_fp = gbl_foo;
1:-------------------------
*new_fp = *old_fp;
new_fp->a = new_a;
rcu_assign_pointer(gbl_foo, new_fp);

synchronize_rcu();
kfree(old_fp);
}
假设A进程在上图----标识处被B进程抢点.B进程也执行了goo_ipdate_a().等B执行完后，再切换回A进程.此时,A进程所持的old_fd实际上已经被B进程给释放掉了.此后A进程对old_fd的操作都是非法的.

另外,我们在上面也看到了几个有关RCU的核心API.它们为别是:
rcu_read_lock()
rcu_read_unlock()
synchronize_rcu()
rcu_assign_pointer()
rcu_dereference()
其中,rcu_read_lock()和rcu_read_unlock()用来保持一个读者的RCU临界区.在该临界区内不允许发生上下文切换.
rcu_dereference():读者调用它来获得一个被RCU保护的指针.
Rcu_assign_pointer():写者使用该函数来为被RCU保护的指针分配一个新的值.这样是为了安全从写者到读者更改其值.这个函数会返回一个新值

Example2：

 1 struct el {                          1 struct el {
 2   struct list_head list;             2   struct list_head list;
 3   long key;                          3   long key;
 4   spinlock_t mutex;                  4   spinlock_t mutex;
 5   int data;                          5   int data;
 6   /* Other data fields */            6   /* Other data fields */
 7 };                                   7 };
 8 rwlock_t listmutex;                  8 spinlock_t listmutex;
 9 struct el head;                      9 struct el head;

 1 int search(long key, int *result)    1 int search(long key, int *result)
 2 {                                    2 {
 3   struct list_head *lp;              3   struct list_head *lp;
 4   struct el *p;                      4   struct el *p;
 5                                      5
 6   read_lock(&listmutex);             6   rcu_read_lock();
 7   list_for_each_entry(p, head, lp) { 7   list_for_each_entry_rcu(p, head, lp) {
 8     if (p->key == key) {             8     if (p->key == key) {
 9       *result = p->data;             9       *result = p->data;
10       read_unlock(&listmutex);      10       rcu_read_unlock();
11       return 1;                     11       return 1;
12     }                               12     }
13   }                                 13   }
14   read_unlock(&listmutex);          14   rcu_read_unlock();
15   return 0;                         15   return 0;
16 }                                   16 }

 1 int delete(long key)                 1 int delete(long key)
 2 {                                    2 {
 3   struct el *p;                      3   struct el *p;
 4                                      4
 5   write_lock(&listmutex);            5   spin_lock(&listmutex);
 6   list_for_each_entry(p, head, lp) { 6   list_for_each_entry(p, head, lp) {
 7     if (p->key == key) {             7     if (p->key == key) {
 8       list_del(&p->list);            8       list_del_rcu(&p->list) or list_add_rcu(&p->list);
 9       write_unlock(&listmutex);      9       spin_unlock(&listmutex);
                                       10       synchronize_rcu();
10       kfree(p);                     11       kfree(p);
11       return 1;                     12       return 1;
12     }                               13     }
13   }                                 14   }
14   write_unlock(&listmutex);         15   spin_unlock(&listmutex);
15   return 0;                         16   return 0;
16 }                                   17 }

Example3：

rcu_assign_pointer()通常用于写者的发布，rcu_dereference()通常用于读者的订阅。

写者：

1 p->a = 1;
2 p->b = 2;
3 p->c = 3;
4 rcu_assign_pointer(gp, p);

读者：

1 rcu_read_lock();
2 p = rcu_dereference(gp);
3 if (p != NULL) {
4 do_something_with(p->a, p->b, p->c);
5 }
6 rcu_read_unlock();

rcu_assign_pointer()是说，先把那块内存写好，再把指针指过去。这里使用的内存写屏障是为了保证并发的读者读到数据一致性。在这条语句之前的读者读到旧的指针和旧的内存，这条语句之后的读者读到新的指针和新的内存。如果没有这条语句，很有可能出现读者读到新的指针和旧的内存。也就是说，这里通过内存屏障刷新了p所指向的内存的值，至于gp本身的值有没有更新还不确定。实际上，gp本身值的真正更新要等到并发的读者来促发。
rcu_dereference() 原语用的是数据依赖屏障，smp_read_barrier_dependence,它要求后面的读操作如果依赖前面的读操作，则前面的读操作需要首先完成。根据数据之间的依赖，要读p->a, p->b, p->c, 就必须先读p，要先读p，就必须先读p1，要先读p1，就必须先读gp。也就是说读者所在的core在进行后续的操作之前，gp必须是同步过的当前时刻的最新值。如果没有这个数据依赖屏障，有可能读者所在的core很长一段时间内一直用的是旧的gp值。所以，这里使用数据依赖屏障是为了督促写者将gp值准备好，是为了呼应写者，这个呼应的诉求是通过数据之间的依赖关系来促发的，也就是说到了非呼应不可的地步了。

Example4：

/*共享数据结构体*/
/*其中rcu_head为双向链表*/
struct shared_data{
char a;
int b;
struct rcu_head rcu;
}
/*读取者，临界区的代码不允许睡眠*/
static void reader(struct shared_data *ptr)
{
struct shared_data *p = NULL;
rcu_read_lock();
/*调用 rcu_dereference 在双向链表中获得ptr指针*/
p = rcu_dereference(*ptr);
if(p)
do_something_with(p);
rcu_read_unlock();
}
/*写入者*/
/*使用回调函数,contain_of从双向链表中获取老的共享数据*/
static void del_old_ptr(struct rcu_head *rh)
{
struct shared_data *p = contain_of(rh,struct shared_data,rcu)
kfree(p);
}
static void writer(struct shared_data *ptr)
{
struct shared_data *new_ptr = malloc(...);
...
new_ptr->a = 'a';
new_ptr->b = 1;
/*更新指针*/
rcu_assign_pionter(new_ptr);
/*注册回调函数*/
call_rcu(ptr->rcu,del_old_ptr);
}

Example5：

	struct foo {
		int a;
		int b;
		int c;
	};
	struct foo *gp1;
	struct foo *gp2;

	void updater(void)
	{
		struct foo *p;

		p = kmalloc(...);
		if (p == NULL)
			deal_with_it();
		p->a = 42;  /* Each field in its own cache line. */
		p->b = 43;
		p->c = 44;
		rcu_assign_pointer(gp1, p);
		p->b = 143;
		p->c = 144;
		rcu_assign_pointer(gp2, p);
	}

	void reader(void)
	{
		struct foo *p;
		struct foo *q;
		int r1, r2;

		p = rcu_dereference(gp2);
		if (p == NULL)
			return;
		r1 = p->b;  /* Guaranteed to get 143. */
		q = rcu_dereference(gp1);  /* Guaranteed non-NULL. */
		if (p == q) {
			/* The compiler decides that q->c is same as p->c. */
			r2 = p->c; /* Could get 44 on weakly order system. */
		}
		do_something_with(r1, r2);
	}

You might be surprised that the outcome (r1 == 143 && r2 == 44) is possible,
but you should not be.  After all, the updater might have been invoked
a second time between the time reader() loaded into "r1" and the time
that it loaded into "r2".  The fact that this same result can occur due
to some reordering from the compiler and CPUs is beside the point.

But suppose that the reader needs a consistent view?

Then one approach is to use locking, for example, as follows:

	struct foo {
		int a;
		int b;
		int c;
		spinlock_t lock;
	};
	struct foo *gp1;
	struct foo *gp2;

	void updater(void)
	{
		struct foo *p;

		p = kmalloc(...);
		if (p == NULL)
			deal_with_it();
		spin_lock(&p->lock);
		p->a = 42;  /* Each field in its own cache line. */
		p->b = 43;
		p->c = 44;
		spin_unlock(&p->lock);
		rcu_assign_pointer(gp1, p);
		spin_lock(&p->lock);
		p->b = 143;
		p->c = 144;
		spin_unlock(&p->lock);
		rcu_assign_pointer(gp2, p);
	}

	void reader(void)
	{
		struct foo *p;
		struct foo *q;
		int r1, r2;

		p = rcu_dereference(gp2);
		if (p == NULL)
			return;
		spin_lock(&p->lock);
		r1 = p->b;  /* Guaranteed to get 143. */
		q = rcu_dereference(gp1);  /* Guaranteed non-NULL. */
		if (p == q) {
			/* The compiler decides that q->c is same as p->c. */
			r2 = p->c; /* Locking guarantees r2 == 144. */
		}
		spin_unlock(&p->lock);
		do_something_with(r1, r2);
	}

As always, use the right tool for the job!

Example6：

如果写者需要对链表条目进行修改，那么就需要首先拷贝要修改的条目，然后修改条目的拷贝，等修改完毕后，再使用条目拷贝取代要修改的条目，要修改条目将被在经历一个grace period后安全删除。

对于系统调用审计代码，并没有这种情况。这里假设有修改的情况，那么使用rwlock的修改代码应当如下：

       static inline int audit_upd_rule(struct audit_rule *rule,
                                         struct list_head *list,
                                         __u32 newaction,
                                         __u32 newfield_count)
        {
                struct audit_entry  *e;
                struct audit_newentry *ne;
                write_lock(&auditsc_lock);
                /* Note: audit_netlink_sem held by caller. */
                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                e->rule.action = newaction;
                                e->rule.file_count = newfield_count;
                                write_unlock(&auditsc_lock);
                                return 0;
                        }
                }
                write_unlock(&auditsc_lock);
                return -EFAULT;         /* No matching rule */
        }

如果使用RCU，修改代码应当为；

      static inline int audit_upd_rule(struct audit_rule *rule,
                                         struct list_head *list,
                                         __u32 newaction,
                                         __u32 newfield_count)
        {
                struct audit_entry  *e;
                struct audit_newentry *ne;
                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                ne = kmalloc(sizeof(*entry), GFP_ATOMIC);
                                if (ne == NULL)
                                        return -ENOMEM;
                                audit_copy_rule(&ne->rule, &e->rule);
                                ne->rule.action = newaction;
                                ne->rule.file_count = newfield_count;
                                list_replace_rcu(e, ne);
                                call_rcu(&e->rcu, audit_free_rule, e);
                                return 0;
                        }
                }
                return -EFAULT;         /* No matching rule */
        }

修改操作立即可见

前面两种情况，读者能够容忍修改可以在一段时间后看到，也就说读者在修改后某一时间段内，仍然看到的是原来的数据。在很多情况下，读者不能容忍看到旧的数据，这种情况下，需要使用一些新措施，如System V IPC，它在每一个链表条目中增加了一个deleted字段，标记该字段是否删除，如果删除了，就设置为真，否则设置为假，当代码在遍历链表时，核对每一个条目的deleted字段，如果为真，就认为它是不存在的。

还是以系统调用审计代码为例，如果它不能容忍旧数据，那么，读端代码应该修改为：

       static enum audit_state audit_filter_task(struct task_struct *tsk)
        {
                struct audit_entry *e;
                enum audit_state   state;
                rcu_read_lock();
                list_for_each_entry_rcu(e, &audit_tsklist, list) {
                        if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
                                spin_lock(&e->lock);
                                if (e->deleted) {
                                        spin_unlock(&e->lock);
                                        rcu_read_unlock();
                                        return AUDIT_BUILD_CONTEXT;
                                }
                                rcu_read_unlock();
                                return state;
                        }
                }
                rcu_read_unlock();
                return AUDIT_BUILD_CONTEXT;
        }

注意，对于这种情况，每一个链表条目都需要一个spinlock保护，因为删除操作将修改条目的deleted标志。此外，该函数如果搜索到条目，返回时应当保持该条目的锁，因为只有这样，才能看到新的修改的数据，否则，仍然可能看到就的数据。

写端的删除操作将变成：

       static inline int audit_del_rule(struct audit_rule *rule,
                                         struct list_head *list)
        {
                struct audit_entry  *e;
                /* Do not use the _rcu iterator here, since this is the only
                 * deletion routine. */
                list_for_each_entry(e, list, list) {
                        if (!audit_compare_rule(rule, &e->rule)) {
                                spin_lock(&e->lock);
                                list_del_rcu(&e->list);
                                e->deleted = 1;
                                spin_unlock(&e->lock);
                                call_rcu(&e->rcu, audit_free_rule, e);
                                return 0;
                        }
                }
                return -EFAULT;         /* No matching rule */
        }

删除条目时，需要标记该条目为已删除。这样读者就可以通过该标志立即得知条目是否已经删除。