mysql-innodb 锁基础

一、锁的意义

在一个复杂一点的并发系统中，锁始终是一个绕不开的存在，大家通常接触到这个概念最多的是多任务操作系统，例如服务器比较常用的linux操作系统。在所有的操作系统教材中，都会对锁进行描述，生产者/消费者问题；哲学家就餐问题等都是典型的入门例子，所以锁最为常见的应用场景就是在操作系统中。但是，几乎其它大型的并发系统中也会使用锁机制，只是这些锁是基于操作系统提供的锁机制(当然现在越来越流行使用基于硬件的CAS机制的轻量级锁机制)。在我们常见的LAMP中的Linux、Apache http server、MySQL都在实现中大量使用了锁。

但是锁机制通常是比较复杂的，可能会引入比较多的问题，例如死锁、系统并发量下降的问题，所以一些框架竟然完全使用单线程+异步(select/epoll)等机制来实现服务器框架，也是为了绕开锁使用所带来的编程复杂性问题。

在MySQL中，无可避免的使用了锁机制，并且是实现数据库中ACID的Atomicity与Consistency的基础。但是由于之前描述的锁的复杂性问题，所以在刚开始接触一个系统的时候我都会忽略到锁相关的逻辑，不过由于MySQL中锁是事务实现的基础，在了解事务实现的时候，锁相关的基础实现是一个始终无法绕开的概念，这里这里还是简单的总结下对于锁的相关基础理解。

二、2 Phase Lock(2PL)

在数据库教材中会描述这个机制的意义，它描述其实非常见；“对于每个事务，所有的锁定动作要早于所有的解锁动作”。一个直观例子可以展示如果不使用这种机制可能导致的问题。例如一个数据库

mysql> create database 2pl;

Query OK, 1 row affected (0.02 sec)

mysql> use 2pl;

Database changed

mysql> create table 2pl (k int, v int, primary key(k)) engine=innodb;

Query OK, 0 rows affected (0.13 sec)

mysql> insert 2pl values(1, 1) , (2, 1);

Query OK, 2 rows affected (5.05 sec)

Records: 2 Duplicates: 0 Warnings: 0

mysql> select * from 2pl;

+---+------+

| k | v |

+---+------+

| 1 | 1 |

| 2 | 1 |

+---+------+

2 rows in set (8.11 sec)

mysql>

假设在两个独立的连接中启动两个不同的事务，假设第一个T1执行

update 2pl set v=v*10;

另一个连接执行事务T2

update 2pl set v=v + 10;

如果T1在获得了记录<1,1>记录锁、执行v=v*10(执行之后记录变为<1,10>)、然后解锁；此时事务T1所在的线程被操作系统挂起，并调用T2所在的线程；T2事务获得<1,10>的记录锁，执行v=v + 10(记录变为<1,20>),T2继续对记录<2,1>锁定，执行v=v + 10(此时记录<2,1>变为<2,11>)，之后事务T2结束；接下来T1被调度回来，对记录<2,11>锁定，执行v=v*10操作(之后记录变为<2，110>)。经过这个调度之后，数据库中原有的两个记录<1，2>、<2，1>变化为<1,20>、<2,110>。这个结果结果明显违背了事务一致性原则，开始两个记录的相同值经过两个事务的相同操作，结果得到了不同值。这个例子展示事务的调度违反了事务的“串行化”规则，也就是相当于用户看到了事务之间的干扰。

而对于2PL，这个简单的协议即可保证事务的可串行化属性，简单的描述下它的影响下，对于第一个执行解锁动作的事务Tf(Transaction First)，它的所有操作对象在其它事务中均没有被操作过，这样就相当于事务Tf中操作早于早于所有其它事务，进而这个事务相当于是自己串行化的执行了事务。可以通过反证法来简单说明下面结论：Tf中对于任意元素e操作Oe之前，其它事务中不存在对于该元素e的操作。由于Tf可以对e进行操作，说明Tf之前执行了对于e的锁定动作Le，假设其它事务在Tf之前操作过e，则说明其它事务To在Tf的Le之前执行过Leo(Lock Element in Other ),Ueo(Unlock Element in other)，综合两个事务就有一个这样的序列 Leo、Ueo、Lef。由于事务的所有锁定早于所有的解锁，所以对于Tf事务，它的第一个解锁动作Uff(Unlock first in first)应该在Tf中的锁定操作之后Lef之后，进而有一个Leo、Ueo、Lef、Uff，这样与Tf是第一个执行解锁操作事务的假设相矛盾。

这个协议只是保证了说如果按照这个协议执行成功的话，那么执行结果一定是一致的，但是它并不保证这样的调度一定可行，事实上可以轻松的制造出一个死锁而导致这样的调度无法完成。

在第一个数据库连接C1中执行下面动作

mysql> start transaction;

Query OK, 0 rows affected (0.00 sec)

mysql> update 2pl set v=v*10 where k=1;

Query OK, 1 row affected (0.06 sec)

Rows matched: 1 Changed: 1 Warnings: 0

在第二个数据库连接C2中执行

mysql> start transaction;

Query OK, 0 rows affected (0.00 sec)

mysql> update 2pl set v=v*10 where k=2;

Query OK, 1 row affected (0.02 sec)

Rows matched: 1 Changed: 1 Warnings: 0

再切回第一个连接C1，执行下面动作，可以发现出现死锁。

mysql> update 2pl set v=v*10 where k=2;

ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

mysql>

这里的原因也非常明显，对于2pl表格中的两条记录，C1通过更新获得记录K1的锁，由于事务没有结束，所以它在更新之后并没有释放K1记录锁，然后C2获得K2更新时的记录锁，并且同样没有释放；之后C1再次尝试更新K2的时候由于无法获得K2的锁而失败。

三、锁定动作的传递

这里以innodb的update动作为例说明下记录锁从语法分析到innodb存储引擎之间的传递路径。下面主要是通过mysql 5.1.61版本的一些调用链来展示下这个操作的主要过程。调用链看起来有些乏味，但是它的确是将系统中的各个模块连接起来(通过调用链)的重要方法。并且定性分析通常可以了解一个大概，但是定量分析中还是最好需要知道详细的细节，这些细微之处通常是定位一些系统异常问题的必要基础。

1、语法分析

mysql-5.1.61sqlsql_yacc.yy

update:

UPDATE_SYM

{

……

}

opt_low_priority opt_ignore join_table_list

SET update_list

{

In case of multi-update setting write lock for all tables may

be too pessimistic. We will decrease lock level if possible in

mysql_multi_update().

Select->set_lock_for_tables($3);

………………………………………………………………

opt_low_priority:

/* empty */ { $$= TL_WRITE_DEFAULT; }

| LOW_PRIORITY { $$= TL_WRITE_LOW_PRIORITY; }

;

默认情况下，opt_low_priority中设置锁定级别为TL_WRITE_DEFAULT，然后在update语句中通过Select->set_lock_for_tables($3);将锁定规则设置到

/**

Set lock for all tables in current select level.

@param lock_type Lock to set for tables

@note

If lock is a write lock, then tables->updating is set 1

This is to get tables_ok to know that the table is updated by the

query

void st_select_lex::set_lock_for_tables(thr_lock_type lock_type)

{

bool for_update= lock_type >= TL_READ_NO_INSERT;

DBUG_ENTER("set_lock_for_tables");

DBUG_PRINT("enter", ("lock_type: %d for_update: %d", lock_type,

for_update));

for (TABLE_LIST *tables= table_list.first;

tables;

tables= tables->next_local)

{

tables->lock_type= lock_type;

tables->updating= for_update;

}

DBUG_VOID_RETURN;

}

2、table中锁定传递给open时的reginfo

mysql_update===>>>>

int open_tables(THD *thd, TABLE_LIST **start, uint *counter, uint flags)

{

……

if (tables->lock_type != TL_UNLOCK && ! thd->locked_tables)

{

if (tables->lock_type == TL_WRITE_DEFAULT)

tables->table->reginfo.lock_type= thd->update_lock_default;

else if (tables->lock_type == TL_READ_DEFAULT)

tables->table->reginfo.lock_type=

read_lock_type_for_table(thd, thd->lex, tables);

else

tables->table->reginfo.lock_type= tables->lock_type;

}

3、从reginfo传递给ha_innobase handler的prebuild成员

mysql_update====>>>>lock_tables====>>>>mysql_lock_tables====>>>>lock_external====>>>>handler::ha_external_lock====>>>>ha_innobase::external_lock

if (lock_type == F_WRLCK

|| (table->s->tmp_table

&& thd_sql_command(thd) == SQLCOM_LOCK_TABLES)) {

/* If this is a SELECT, then it is in UPDATE TABLE ...

or SELECT ... FOR UPDATE

For temporary tables which are locked for READ by LOCK TABLES

updates are still allowed by SQL-layer. In order to accomodate

for such a situation we always request X-lock for such table

at LOCK TABLES time.

prebuilt->select_lock_type = LOCK_X;

prebuilt->stored_select_lock_type = LOCK_X;

}

4、传递给innobase的select操作

mysql_update====>>>>rr_quick====>>>>QUICK_RANGE_SELECT::get_next====>>>>handler::read_multi_range_first====>>>>handler::read_range_first====>>>>handler::index_read_map====>>>>ha_innobase::index_read====>>>>row_search_for_mysql

/* Do some start-of-statement preparations */

if (!prebuilt->sql_stat_start) {

……

} else if (prebuilt->select_lock_type == LOCK_NONE) {

/* This is a consistent read */

/* Assign a read view for the query */

trx_assign_read_view(trx);

prebuilt->sql_stat_start = FALSE;

} else {

wait_table_again:

err = lock_table(0, index->table,

prebuilt->select_lock_type == LOCK_S

? LOCK_IS : LOCK_IX, thr);

if (err != DB_SUCCESS) {

table_lock_waited = TRUE;

goto lock_table_wait;

}

prebuilt->sql_stat_start = FALSE;

}

……

rec_loop:

/*-------------------------------------------------------------*/

/* PHASE 4: Look for matching records in a loop */

……

/* We are ready to look at a possible new index entry in the result

set: the cursor is now placed on a user record */

if (prebuilt->select_lock_type != LOCK_NONE) {

……

err = sel_set_rec_lock(rec, index, offsets,

prebuilt->select_lock_type,

lock_type, thr);

}

在这个地方展示了所谓的“意向锁”(Intention Lock)的应用场景，为了获得一个行锁，首先要获得该行所在table的意向锁，也就是wait_table_again:处展示的对于table的锁定，之后才对行进行锁定，也就是sel_set_rec_lock对record添加的锁定操作。

四、页面锁与记录锁

在我开始看innodb代码的时候，有一个关于页面锁和行锁的问题：在操作一个记录的时候，数据库要首先获得这个记录在buffer pool中所属页面的锁，从代码上看，例如buf_page_get_gen函数的参数中有rw_latch,/* in: RW_S_LATCH, RW_X_LATCH, RW_NO_LATCH */这个参数。这样就是说在操作一个记录的时候，一定是已经获得了这个记录所在页面的锁，那么再次获得记录锁是否就是冗余的呢？

这个其实要从页面锁和记录锁的生命周期来看，它们的意义差别很大，一个页面锁的使用周期很短，通常在一个mtr(mini transaction)周期中生效，而记录锁的生命周期可以非常长(如果不是任意长的话)。只是因为对于事务接触比较少，所以对这个概念理解不是很深入。

五、线程挂起的位置

在记录锁需要等待的时候，线程并不是等待在了一个记录锁的内部，事实上不可能为每个记录锁分配一个可供线程进行阻塞的同步数据结构，因为这样代价太大，而是通过一个共享的资源池来进行锁定。

row_search_for_mysql====>>>>row_mysql_handle_errors====>>>>srv_suspend_mysql_thread====>>>>

最后线程是阻塞在这个操作系统本地(Native)os_event_t结构上。例如在linux系统下，这个锁定使用的就是pthread_mutex_t，其中mysql-5.1.61storageinnodb_pluginincludeos0sync.h中定义/** Native mutex */typedef pthread_mutex_t os_fast_mutex_t;

/* Thread slot in the thread table */

struct srv_slot_struct{

os_thread_id_t id; /* thread id */

os_thread_t handle; /* thread handle */

ulint type; /* thread type: user, utility etc. */

ibool in_use; /* TRUE if this slot is in use */

ibool suspended; /* TRUE if the thread is waiting

for the event of this slot */

ib_time_t suspend_time; /* time when the thread was

suspended */

os_event_t event; /* event used in suspending the

thread when it has nothing to do */

que_thr_t* thr; /* suspended query thread (only

used for MySQL threads) */

};

当一个线程因为锁定而被挂起之后，在唤醒的时候如何找到这个slot呢？这里将一个事务的wait_thrs上所有阻塞的"线程"(thread)唤醒

lock_grant====>>>>trx_end_lock_wait

void

trx_end_lock_wait(

/*==============*/

trx_t* trx) /* in: transaction */

{

que_thr_t* thr;

ut_ad(mutex_own(&kernel_mutex));

ut_ad(trx->que_state == TRX_QUE_LOCK_WAIT);

thr = UT_LIST_GET_FIRST(trx->wait_thrs);

while (thr != NULL) {

que_thr_end_wait_no_next_thr(thr);

UT_LIST_REMOVE(trx_thrs, trx->wait_thrs, thr);

thr = UT_LIST_GET_FIRST(trx->wait_thrs);

}

trx->que_state = TRX_QUE_RUNNING;

}

que_thr_end_wait_no_next_thr====>>>>srv_release_mysql_thread_if_suspended

知道了线程ID就可以在线程池中遍历，如果slot的threadid和当前线程id相同则认为匹配成功(这些操作都是在已经获得了kernel_mutex锁的前提下执行的)。

for (i = 0; i < OS_THREAD_MAX_N; i++) {

slot = srv_mysql_table + i;

if (slot->in_use && slot->thr == thr) {

/* Found */

os_event_set(slot->event);

return;

}

六、客户端连接请求的处理

这个和锁没有直接关系，只是顺便看到，所以还是也放在这里。

对于网络请求的处理，直接在handle_connections_sockets函数中accept，然后创建线程执行

FD_SET(ip_sock,&clientFDs);

#ifdef HAVE_SYS_UN_H

FD_SET(unix_sock,&clientFDs);

#ifdef HAVE_FCNTL

socket_flags=fcntl(unix_sock, F_GETFL, 0);

#endif

……

#ifdef HAVE_SYS_UN_H

if (FD_ISSET(unix_sock,&readFDs))

{

sock = unix_sock;

flags= socket_flags;

}

else

#endif

{

sock = ip_sock;

flags= ip_flags;

}

if (select((int) max_used_connection,&readFDs,0,0,0) < 0)

for (uint retry=0; retry < MAX_ACCEPT_RETRY; retry++)

{

size_socket length=sizeof(struct sockaddr_in);

new_sock = accept(sock, my_reinterpret_cast(struct sockaddr *) (&cAddr),

&length);

……

#endif

}

……

create_new_thread(thd);

thread_scheduler.add_connection(thd)===>>>>create_thread_to_handle_connection

if ((error=pthread_create(&thd->real_id,&connection_attrib,

handle_one_connection,

(void*) thd)))

这里初始化了添加一个新连接的处理函数

mysql-5.1.61sqlscheduler.cc

#ifndef EMBEDDED_LIBRARY

void one_thread_per_connection_scheduler(scheduler_functions* func)

{

func->max_threads= max_connections;

func->add_connection= create_thread_to_handle_connection;

func->end_thread= one_thread_per_connection_end;

}

#endif /* EMBEDDED_LIBRARY */

七、源码中的一个注释

mysql-5.1.61storageinnobasesrvsrv0srv.c这个注释虽然比较多(我还没仔细看)，但是感觉应该很有指导性，所以拷贝到这里，以后查阅的时候方便些。其中说明如果获得kernelmutex可以认为线程执行在了内核态，否则执行在用户态，可能是因为操作核心数据的时候需要这个关键的核心锁。

handle_connections_sockets

IMPLEMENTATION OF THE SERVER MAIN PROGRAM

=========================================

There is the following analogue between this database

server and an operating system kernel:

DB concept equivalent OS concept

---------- ---------------------

transaction -- process;

query thread -- thread;

lock -- semaphore;

transaction set to

the rollback state -- kill signal delivered to a process;

kernel -- kernel;

query thread execution:

(a) without kernel mutex

reserved -- process executing in user mode;

(b) with kernel mutex reserved

-- process executing in kernel mode;

The server is controlled by a master thread which runs at

a priority higher than normal, that is, higher than user threads.

It sleeps most of the time, and wakes up, say, every 300 milliseconds,

to check whether there is anything happening in the server which

requires intervention of the master thread. Such situations may be,

for example, when flushing of dirty blocks is needed in the buffer

pool or old version of database rows have to be cleaned away.

The threads which we call user threads serve the queries of

the clients and input from the console of the server.

They run at normal priority. The server may have several

communications endpoints. A dedicated set of user threads waits

at each of these endpoints ready to receive a client request.

Each request is taken by a single user thread, which then starts

processing and, when the result is ready, sends it to the client

and returns to wait at the same endpoint the thread started from.

So, we do not have dedicated communication threads listening at

the endpoints and dealing the jobs to dedicated worker threads.

Our architecture saves one thread swithch per request, compared

to the solution with dedicated communication threads

which amounts to 15 microseconds on 100 MHz Pentium

running NT. If the client

is communicating over a network, this saving is negligible, but

if the client resides in the same machine, maybe in an SMP machine

on a different processor from the server thread, the saving

can be important as the threads can communicate over shared

memory with an overhead of a few microseconds.

We may later implement a dedicated communication thread solution

for those endpoints which communicate over a network.

Our solution with user threads has two problems: for each endpoint

there has to be a number of listening threads. If there are many

communication endpoints, it may be difficult to set the right number

of concurrent threads in the system, as many of the threads

may always be waiting at less busy endpoints. Another problem

is queuing of the messages, as the server internally does not

offer any queue for jobs.

Another group of user threads is intended for splitting the

queries and processing them in parallel. Let us call these

parallel communication threads. These threads are waiting for

parallelized tasks, suspended on event semaphores.

A single user thread waits for input from the console,

like a command to shut the database.

Utility threads are a different group of threads which takes

care of the buffer pool flushing and other, mainly background

operations, in the server.

Some of these utility threads always run at a lower than normal

priority, so that they are always in background. Some of them

may dynamically boost their priority by the pri_adjust function,

even to higher than normal priority, if their task becomes urgent.

The running of utilities is controlled by high- and low-water marks

of urgency. The urgency may be measured by the number of dirty blocks

in the buffer pool, in the case of the flush thread, for example.

When the high-water mark is exceeded, an utility starts running, until

the urgency drops under the low-water mark. Then the utility thread

suspend itself to wait for an event. The master thread is

responsible of signaling this event when the utility thread is

again needed.

For each individual type of utility, some threads always remain

at lower than normal priority. This is because pri_adjust is implemented

so that the threads at normal or higher priority control their

share of running time by calling sleep. Thus, if the load of the

system sudenly drops, these threads cannot necessarily utilize

the system fully. The background priority threads make up for this,

starting to run when the load drops.

When there is no activity in the system, also the master thread

suspends itself to wait for an event making

the server totally silent. The responsibility to signal this

event is on the user thread which again receives a message

from a client.

There is still one complication in our server design. If a

background utility thread obtains a resource (e.g., mutex) needed by a user

thread, and there is also some other user activity in the system,

the user thread may have to wait indefinitely long for the

resource, as the OS does not schedule a background thread if

there is some other runnable user thread. This problem is called

priority inversion in real-time programming.

One solution to the priority inversion problem would be to

keep record of which thread owns which resource and

in the above case boost the priority of the background thread

so that it will be scheduled and it can release the resource.

This solution is called priority inheritance in real-time programming.

A drawback of this solution is that the overhead of acquiring a mutex

increases slightly, maybe 0.2 microseconds on a 100 MHz Pentium, because

the thread has to call os_thread_get_curr_id.

This may be compared to 0.5 microsecond overhead for a mutex lock-unlock

pair. Note that the thread

cannot store the information in the resource, say mutex, itself,

because competing threads could wipe out the information if it is

stored before acquiring the mutex, and if it stored afterwards,

the information is outdated for the time of one machine instruction,

at least. (To be precise, the information could be stored to

lock_word in mutex if the machine supports atomic swap.)

The above solution with priority inheritance may become actual in the

future, but at the moment we plan to implement a more coarse solution,

which could be called a global priority inheritance. If a thread

has to wait for a long time, say 300 milliseconds, for a resource,

we just guess that it may be waiting for a resource owned by a background

thread, and boost the the priority of all runnable background threads

to the normal level. The background threads then themselves adjust

their fixed priority back to background after releasing all resources

they had (or, at some fixed points in their program code).

What is the performance of the global priority inheritance solution?

We may weigh the length of the wait time 300 milliseconds, during

which the system processes some other thread

to the cost of boosting the priority of each runnable background

thread, rescheduling it, and lowering the priority again.

On 100 MHz Pentium + NT this overhead may be of the order 100

microseconds per thread. So, if the number of runnable background

threads is not very big, say < 100, the cost is tolerable.

Utility threads probably will access resources used by

user threads not very often, so collisions of user threads

to preempted utility threads should not happen very often.

The thread table contains

information of the current status of each thread existing in the system,

and also the event semaphores used in suspending the master thread

and utility and parallel communication threads when they have nothing to do.

The thread table can be seen as an analogue to the process table

in a traditional Unix implementation.

The thread table is also used in the global priority inheritance

scheme. This brings in one additional complication: threads accessing

the thread table must have at least normal fixed priority,

because the priority inheritance solution does not work if a background

thread is preempted while possessing the mutex protecting the thread table.

So, if a thread accesses the thread table, its priority has to be

boosted at least to normal. This priority requirement can be seen similar to

the privileged mode used when processing the kernel calls in traditional

Unix.*/