How InnoDB's deferred record deletion affects other DB operations

1. Fast record deletion

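When a transactional DB deletes a record, an obvious optimization is to set a flag bit that marks the record as logically deleted (as opposed to physically removing it). The advantage is that the delete executes quickly: at transaction commit in particular, if only a one-bit change has to be flushed to disk, IO latency is shorter and the command returns sooner. It can also make a rollback, or a later insert of a row with the same key, cheaper.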

The following is the execution flow when InnoDB deletes a record:

ha_innobase::delete_row===>>>
    /* This is a delete */
    prebuilt->upd_node->is_delete = TRUE;
    innodb_srv_conc_enter_innodb(trx);
    error = row_update_for_mysql((byte*) record, prebuilt);
===>>>row_upd_clust_step===>>>btr_cur_del_mark_set_clust_rec===>>>rec_set_deleted_flag===>>>
    if (flag) {
        val |= REC_INFO_DELETED_FLAG;
    } else {
        val &= ~REC_INFO_DELETED_FLAG;
    }

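===>>>rec_get_info_bits===>>>rec_get_bit_field_1. Note that this read does not apply the offset forward from the pointer, as is usual, but backward (rec - offs). The header bytes reached this way belong to the start of what InnoDB calls the "physical record", which lies before the record origin.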

UNIV_INLINE
ulint
rec_get_bit_field_1(
/*================*/
    const rec_t*    rec,    /*!< in: pointer to record origin */
    ulint           offs,   /*!< in: offset from the origin down */
    ulint           mask,   /*!< in: mask used to filter bits */
    ulint           shift)  /*!< in: shift right applied after masking */
{
    ut_ad(rec);
    return((mach_read_from_1(rec - offs) & mask) >> shift);
}

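Ahead of each record's logical address (its origin) there is also some hidden information, including a bit that says whether the record has been deleted. That bit is defined as: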

/* The deleted flag in info bits */
#define REC_INFO_DELETED_FLAG 0x20UL /* when bit is set to 1, it means the
                                     record has been delete marked */
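To make the backward offset concrete, here is a minimal sketch of a record whose header sits in front of its origin. The two flag values are taken from the excerpts in this post; everything prefixed with toy_ or TOY_, including the header size and the offset of the info-bits byte, is a hypothetical stand-in chosen for illustration and not InnoDB's real layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as quoted in this post. */
#define REC_INFO_DELETED_FLAG 0x20UL
#define REC_INFO_MIN_REC_FLAG 0x10UL

/* Hypothetical layout for this sketch only: the header is 5 bytes long and
   the info bits live in the byte 5 bytes BEFORE the record origin. */
#define TOY_HEADER_SIZE    5
#define TOY_INFO_BITS_OFFS 5

/* Read one header byte at a negative offset from the origin, then mask and
   shift it, in the same spirit as rec_get_bit_field_1. */
static unsigned long toy_get_bit_field_1(const uint8_t *rec, size_t offs,
                                         unsigned long mask,
                                         unsigned long shift)
{
    assert(rec != NULL);
    return ((unsigned long) *(rec - offs) & mask) >> shift;
}

/* Return 1 if the record is delete-marked, 0 otherwise. */
static int toy_get_deleted_flag(const uint8_t *rec)
{
    return toy_get_bit_field_1(rec, TOY_INFO_BITS_OFFS,
                               REC_INFO_DELETED_FLAG, 0) != 0;
}

/* Set or clear the delete mark without touching the other info bits. */
static void toy_set_deleted_flag(uint8_t *rec, int flag)
{
    uint8_t val = *(rec - TOY_INFO_BITS_OFFS);

    if (flag) {
        val |= (uint8_t) REC_INFO_DELETED_FLAG;
    } else {
        val &= (uint8_t) ~REC_INFO_DELETED_FLAG;
    }
    *(rec - TOY_INFO_BITS_OFFS) = val;
}
```

To try it, allocate a buffer of TOY_HEADER_SIZE plus the payload size, take rec = buf + TOY_HEADER_SIZE as the origin, and call toy_set_deleted_flag(rec, 1). Only the one header byte changes, and the payload after the origin is left as it was.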

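Defined right next to it is REC_INFO_MIN_REC_FLAG, which marks a record as the minimum record on its level of the B-tree. I discussed this flag in an earlier note: it avoids having to define a special leftmost node on each level, so record lookup and handling stay uniform.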

/* Info bit denoting the predefined minimum record: this bit is set
if and only if the record is the first user record on a non-leaf
B-tree page that is the leftmost page on its level
(PAGE_LEVEL is nonzero and FIL_PAGE_PREV is FIL_NULL). */
#define REC_INFO_MIN_REC_FLAG 0x10UL

2. How deferred deletion affects select

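After a deferred delete, the deletion is recorded only in that agreed-upon flag bit, while the key and all other fields remain untouched in the B-tree. To borrow from a widely quoted modern poem: "some records are alive, yet they are already dead." In other words, an ordinary B-tree lookup can still fetch the record, so the layers above must notice and handle this life-or-death 1 bit. From mysql-5.1.61/storage/innobase/row/row0sel.c, row_search_for_mysql: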

/* NOTE that at this point rec can be an old version of a clustered
index record built for a consistent read. We cannot assume after this
point that rec is on a buffer pool page. Functions like
page_rec_is_comp() cannot be used! */
if (UNIV_UNLIKELY(rec_get_deleted_flag(rec, comp))) {
    /* The record is delete-marked: we can skip it */
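The effect on a reader can be sketched with a flat array standing in for the index. This is a simplified illustration under stated assumptions: struct toy_rec, toy_search and toy_count_visible are hypothetical names, and the sketch ignores locking and the consistent-read versioning that the comment in row_search_for_mysql mentions.

```c
#include <assert.h>
#include <stddef.h>

/* Flag value as quoted in this post. */
#define REC_INFO_DELETED_FLAG 0x20UL

/* Hypothetical in-memory stand-in for an index record. */
struct toy_rec {
    int           key;
    unsigned long info_bits;
};

/* Return the first record with the given key that is NOT delete-marked,
   or NULL if the key is absent or every match is delete-marked. */
static const struct toy_rec *toy_search(const struct toy_rec *recs,
                                        size_t n, int key)
{
    assert(recs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (recs[i].key != key) {
            continue;
        }
        if (recs[i].info_bits & REC_INFO_DELETED_FLAG) {
            /* The record is delete-marked: skip it. */
            continue;
        }
        return &recs[i];
    }
    return NULL;
}

/* Count the records a reader would see, skipping delete-marked ones. */
static size_t toy_count_visible(const struct toy_rec *recs, size_t n)
{
    size_t visible = 0;

    assert(recs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!(recs[i].info_bits & REC_INFO_DELETED_FLAG)) {
            visible++;
        }
    }
    return visible;
}
```

The delete-marked entry still occupies a slot in the array, just as it still sits in the B-tree, but neither function ever returns it to the caller.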

3. How deferred deletion affects insert

For the same reason as with select, an insert also has to notice the delete-marked record, especially when the new record has the same key as the deleted one and that key is unique:

row_ins_index_entry_low===>>>
    if (modify != 0) {
        /* There is already an index entry with a long enough common
        prefix, we must convert the insert into a modify of an
        existing record */
        if (modify == ROW_INS_NEXT) {
            rec = page_rec_get_next(btr_cur_get_rec(&cursor));
            btr_cur_position(index, rec, &cursor);
        }
        if (index->type & DICT_CLUSTERED) {
            err = row_ins_clust_index_entry_by_modify(
                mode, &cursor, &big_rec, entry,
                ext_vec, n_ext_vec, thr, &mtr);
        } else {
            err = row_ins_sec_index_entry_by_modify(
                mode, &cursor, entry, thr, &mtr);
        }

===>>>row_ins_clust_index_entry_by_modify===>>>row_upd_build_difference_binary===>>>upd_create===>>>
    update = mem_heap_alloc(heap, sizeof(upd_t));
    update->info_bits = 0;
    update->n_fields = n;

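Here the newly created update structure has its info_bits set to zero, and the flag that says whether a record is deleted is stored in exactly those info bits. So once this update is applied, REC_INFO_DELETED_FLAG on the record is cleared: the record sheds its undocumented status and becomes a legal citizen again.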
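The insert-by-modify idea can be sketched on a small array with a unique key. The names below (struct toy_rec, toy_insert, the TOY_INS_* results) are hypothetical, and the array lookup is only a stand-in for the B-tree cursor positioning done in row_ins_index_entry_low.

```c
#include <assert.h>
#include <stddef.h>

/* Flag value as quoted in this post. */
#define REC_INFO_DELETED_FLAG 0x20UL

/* Hypothetical in-memory stand-in for a record under a unique key. */
struct toy_rec {
    int           key;
    int           value;
    unsigned long info_bits;
};

enum toy_ins_result {
    TOY_INS_NEW,        /* a fresh slot was used */
    TOY_INS_BY_MODIFY,  /* a delete-marked record was revived in place */
    TOY_INS_DUPLICATE,  /* a live record already holds this unique key */
    TOY_INS_FULL        /* no room left in the array */
};

static enum toy_ins_result toy_insert(struct toy_rec *recs, size_t cap,
                                      size_t *n, int key, int value)
{
    assert(recs != NULL && n != NULL && *n <= cap);
    for (size_t i = 0; i < *n; i++) {
        if (recs[i].key != key) {
            continue;
        }
        if (!(recs[i].info_bits & REC_INFO_DELETED_FLAG)) {
            return TOY_INS_DUPLICATE;
        }
        /* Convert the insert into a modify of the existing record. The
           update carries info_bits = 0, which clears the delete mark. */
        recs[i].value = value;
        recs[i].info_bits = 0;
        return TOY_INS_BY_MODIFY;
    }
    if (*n == cap) {
        return TOY_INS_FULL;
    }
    recs[*n] = (struct toy_rec){ key, value, 0 };
    (*n)++;
    return TOY_INS_NEW;
}
```

Inserting key 7, setting its delete mark, and inserting key 7 again returns TOY_INS_BY_MODIFY and leaves the record count at one: the old slot is reused rather than a second entry being created.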

4. How deferred deletion affects purge

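As noted above, deferred deletion only lets the delete return sooner; after some time the record still has to be physically removed. The interesting case is this: record R is deleted by transaction T1, which at that point only sets the delete flag; transaction T2 then inserts a record with the same key and commits; finally purge walks T1's undo list and tries to physically remove the record. Note that purge locates the record by its logical identity, that is, its PK value, not by a physical record. Although the undo log records the delete, by the time purge runs T2 has revived the record, so it is a legitimate record again and must not be removed.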

row_purge===>>>row_purge_del_mark===>>>row_purge_remove_clust_if_poss===>>>row_purge_remove_clust_if_poss_low===>>>
    if (0 != ut_dulint_cmp(node->roll_ptr, row_get_rec_roll_ptr(
                rec, index, rec_get_offsets(
                    rec, index, offsets_,
                    ULINT_UNDEFINED, &heap)))) {
        if (UNIV_LIKELY_NULL(heap)) {
            mem_heap_free(heap);
        }
        /* Someone else has modified the record later: do not remove */
        btr_pcur_commit_specify_mtr(pcur, &mtr);
        return(TRUE);
    }

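That is, purge decides whether the record may be removed by comparing the record's current roll pointer with the roll pointer saved in the undo record. Why this code does not use the simpler test of whether REC_INFO_DELETED_FLAG is set is not something I know.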
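The roll pointer comparison can be sketched as follows. All names are hypothetical, the roll pointers are arbitrary integers, and unlike the real function (which returns TRUE in the skip case as well) toy_purge_remove_if_poss returns whether it actually removed the record. The last step of the scenario, a transaction T3 that delete-marks the record again, is my own assumption about one plausible reason for comparing roll pointers: in that state the flag is set, yet the mark belongs to a newer change than the one being purged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag value as quoted in this post. */
#define REC_INFO_DELETED_FLAG 0x20UL

/* Hypothetical stand-in for a clustered index record. */
struct toy_rec {
    int           key;
    unsigned long info_bits;
    uint64_t      roll_ptr;  /* points at the undo entry of the last change */
    int           present;   /* 1 while the record physically exists */
};

/* A delete only sets the mark and stamps the record with its undo entry. */
static void toy_delete_mark(struct toy_rec *rec, uint64_t undo_roll_ptr)
{
    rec->info_bits |= REC_INFO_DELETED_FLAG;
    rec->roll_ptr = undo_roll_ptr;
}

/* Re-inserting the same key revives the record and stamps it again. */
static void toy_reinsert(struct toy_rec *rec, uint64_t undo_roll_ptr)
{
    rec->info_bits = 0;
    rec->roll_ptr = undo_roll_ptr;
}

/* Purge for one delete-mark undo entry: remove the record only if it was
   not modified since that entry was written. Returns 1 if removed. */
static int toy_purge_remove_if_poss(struct toy_rec *rec,
                                    uint64_t node_roll_ptr)
{
    assert(rec != NULL);
    if (!rec->present) {
        return 0;
    }
    if (rec->roll_ptr != node_roll_ptr) {
        /* Someone else has modified the record later: do not remove. */
        return 0;
    }
    rec->present = 0;
    return 1;
}
```

With T1 delete-marking under roll pointer 10 and T2 re-inserting under 20, purging T1's entry leaves the record in place. If a hypothetical T3 then delete-marks it under 30, the flag is set again, but purging T1's entry still must not remove the record; only the purge of T3's own entry does.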