Redis latency problems troubleshooting

http://www.oschina.net/translate/redis-latency-problems-troubleshooting?cmp

This document will help you understand what the problem could be if you are experiencing latency problems with Redis.

In this context latency is the maximum delay between the time a client issues a command and the time the reply to the command is received by the client. Usually Redis processing time is extremely low, in the sub microsecond range, but there are certain conditions leading to higher latency figures.

Measuring latency

If you are experiencing latency problems, probably you know how to measure it in the context of your application, or maybe your latency problem is very evident even macroscopically. However redis-cli can be used to measure the latency of a Redis server in milliseconds, just try:

redis-cli --latency -h `host` -p `port`
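Under the hood, --latency simply times round trips in a loop and keeps running statistics. The same measurement can be sketched in plain Python; the callable below is a hypothetical stand-in for a real client's PING, since no running server is assumed here:

```python
import time

def measure_latency(call, samples=100):
    """Time repeated round trips and report (min, max, avg) in milliseconds,
    mirroring what redis-cli --latency prints."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        call()  # with a real client this would be e.g. lambda: r.ping()
        timings.append((time.perf_counter() - start) * 1000.0)
    return min(timings), max(timings), sum(timings) / len(timings)

# Stand-in for a server call; swap in a Redis PING to measure for real.
lo, hi, avg = measure_latency(lambda: None)
print(f"min: {lo:.3f} ms, max: {hi:.3f} ms, avg: {avg:.3f} ms")
```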

Latency induced by network and communication

Clients connect to Redis using a TCP/IP connection or a Unix domain connection. The typical latency of a 1 GBits/s network is about 200 us, while the latency with a Unix domain socket can be as low as 30 us. It actually depends on your network and system hardware. On top of the communication itself, the system adds some more latency (due to thread scheduling, CPU caches, NUMA placement, etc ...). System induced latencies are significantly higher on a virtualized environment than on a physical machine.

The consequence is that even if Redis processes most commands in the sub-microsecond range, a client performing many roundtrips to the server will have to pay for these network and system related latencies.

An efficient client will therefore try to limit the number of roundtrips by pipelining several commands together. This is fully supported by the servers and most clients. Aggregated commands like MSET/MGET can be also used for that purpose. Starting with Redis 2.4, a number of commands also support variadic parameters for all data types.
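A back-of-the-envelope model shows why round trips dominate. The 200 us round-trip and 10 us per-command costs below are illustrative figures, not measurements:

```python
import math

def total_time_us(n_commands, rtt_us=200.0, cmd_us=10.0, batch_size=1):
    """Toy cost model: one network round trip per batch of commands,
    plus a fixed server-side processing cost per command."""
    roundtrips = math.ceil(n_commands / batch_size)
    return roundtrips * rtt_us + n_commands * cmd_us

# 1000 commands sent one by one vs. pipelined in batches of 100.
print(total_time_us(1000))                   # 210000.0 us (~210 ms)
print(total_time_us(1000, batch_size=100))   # 12000.0 us  (~12 ms)
```

Batching 100 commands per round trip cuts the total from roughly 210 ms to 12 ms in this model; the server-side work is unchanged, only the network cost shrinks.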

Here are some guidelines:

  • If you can afford it, prefer a physical machine over a VM to host the server.
  • Do not systematically connect/disconnect to the server (especially true for web based applications). Keep your connections as long lived as possible.
  • If your client is on the same host as the server, use Unix domain sockets.
  • Prefer to use aggregated commands (MSET/MGET), or commands with variadic parameters (if possible) over pipelining.
  • Prefer to use pipelining (if possible) over sequence of roundtrips.
  • Future versions of Redis will support Lua server-side scripting (experimental branches are already available) to cover cases that are not suitable for raw pipelining (for instance when the result of a command is an input for the following commands).

On Linux, some people can achieve better latencies by playing with process placement (taskset), cgroups, real-time priorities (chrt), NUMA configuration (numactl), or by using a low-latency kernel. Please note vanilla Redis is not really suitable to be bound on a single CPU core. Redis can fork background tasks that can be extremely CPU consuming like bgsave or AOF rewrite. These tasks must never run on the same core as the main event loop.

In most situations, this kind of system-level optimization is not needed. Only do it if you require it, and if you are familiar with it.

Single threaded nature of Redis

Redis uses a mostly single threaded design. This means that a single process serves all the client requests, using a technique called multiplexing. This means that Redis can serve a single request in every given moment, so all the requests are served sequentially. This is very similar to how Node.js works as well. However, both products are often not perceived as being slow. This is caused in part by the small amount of time needed to complete a single request, but primarily because these products are designed to not block on system calls, such as reading data from or writing data to a socket.

I said that Redis is mostly single threaded since actually from Redis 2.4 we use threads in Redis in order to perform some slow I/O operations in the background, mainly related to disk I/O, but this does not change the fact that Redis serves all the requests using a single thread.

Latency generated by slow commands

A consequence of being single-threaded is that when a request is slow to serve, all the other clients will wait for this request to be served. When executing normal commands, like GET or SET or LPUSH, this is not a problem at all since these commands are executed in constant (and very small) time. However there are commands operating on many elements, like SORT, LREM, SUNION and others. For instance taking the intersection of two big sets can take a considerable amount of time.

The algorithmic complexity of all commands is documented. A good practice is to systematically check it when using commands you are not familiar with.
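To see why one expensive command hurts every client, consider a toy model of a single-threaded server draining its request queue in order (the service times in milliseconds are invented for illustration):

```python
def completion_times(service_times_ms):
    """Single-threaded serving: each request finishes only after every
    request ahead of it in the queue has completed."""
    finished, clock = [], 0.0
    for t in service_times_ms:
        clock += t
        finished.append(clock)
    return finished

# Three fast GETs queued behind a SUNION on two big sets: every client
# behind the slow command observes roughly 500 ms of latency.
print(completion_times([1, 500, 1, 1]))   # [1.0, 501.0, 502.0, 503.0]
```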

If you have latency concerns you should either not use slow commands against values composed of many elements, or you should run a replica using Redis replication and run all your slow queries there.

It is possible to monitor slow commands using the Redis Slow Log feature.

Additionally, you can use your favorite per-process monitoring program (top, htop, prstat, etc ...) to quickly check the CPU consumption of the main Redis process. If it is high while the traffic is not, it is usually a sign that slow commands are used.

Latency generated by fork

In order to generate the RDB file in background, or to rewrite the Append Only File if AOF persistence is enabled, Redis has to fork background processes. The fork operation (running in the main thread) can induce latency by itself.

Forking is an expensive operation on most Unix-like systems, since it involves copying a good number of objects linked to the process. This is especially true for the page table associated to the virtual memory mechanism.

For instance on a Linux/AMD64 system, the memory is divided in 4 KB pages. To convert virtual addresses to physical addresses, each process stores a page table (actually represented as a tree) containing at least a pointer per page of the address space of the process. So a large 24 GB Redis instance requires a page table of 24 GB / 4 KB * 8 = 48 MB.
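The 48 MB figure follows directly from the page count times the (simplified) per-page pointer size:

```python
GB, KB = 1024 ** 3, 1024

instance_size = 24 * GB   # Redis resident data set from the example
page_size = 4 * KB        # Linux/AMD64 page size
entry_size = 8            # simplified: one 8-byte pointer per page

pages = instance_size // page_size
page_table = pages * entry_size
print(page_table // (1024 ** 2), "MB")   # 48 MB
```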

When a background save is performed, this instance will have to be forked, which will involve allocating and copying 48 MB of memory. It takes time and CPU, especially on virtual machines where allocation and initialization of a large memory chunk can be expensive.

Fork time in different systems

Modern hardware is pretty fast at copying the page table, but Xen is not. The problem with Xen is not virtualization-specific, but Xen-specific. For instance using VMware or VirtualBox does not result in slow fork time. The following is a table that compares fork time for different Redis instance sizes. Data is obtained by performing a BGSAVE and looking at the latest_fork_usec field in the INFO command output.

  • Linux beefy VM on VMware 6.0 GB RSS forked in 77 milliseconds (12.8 milliseconds per GB).
  • Linux running on physical machine (unknown HW) 6.1 GB RSS forked in 80 milliseconds (13.1 milliseconds per GB).
  • Linux running on physical machine (Xeon @ 2.27 GHz) 6.9 GB RSS forked in 62 milliseconds (9 milliseconds per GB).
  • Linux VM on 6sync (KVM) 360 MB RSS forked in 8.2 milliseconds (23.3 milliseconds per GB).
  • Linux VM on EC2 (Xen) 6.1 GB RSS forked in 1460 milliseconds (239.3 milliseconds per GB).
  • Linux VM on Linode (Xen) 0.9 GB RSS forked in 382 milliseconds (424 milliseconds per GB).
As you can see, a VM running on Xen takes a performance hit of one to two orders of magnitude. We believe this is a severe problem with Xen and we hope it will be addressed ASAP.
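The per-GB numbers in the table are simply latest_fork_usec normalized by the resident set size:

```python
def fork_ms_per_gb(latest_fork_usec, rss_gb):
    """Convert the INFO field latest_fork_usec to milliseconds per GB of RSS."""
    return (latest_fork_usec / 1000.0) / rss_gb

# VMware entry: 6.0 GB RSS, 77 ms fork.  EC2/Xen entry: 6.1 GB RSS, 1460 ms fork.
print(round(fork_ms_per_gb(77_000, 6.0), 1))      # 12.8
print(round(fork_ms_per_gb(1_460_000, 6.1), 1))   # 239.3
```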

Latency induced by swapping (operating system paging)

Linux (and many other modern operating systems) is able to relocate memory pages from the memory to the disk, and vice versa, in order to use the system memory efficiently.

If a Redis page is moved by the kernel from the memory to the swap file, when the data stored in this memory page is used by Redis (for example accessing a key stored into this memory page) the kernel will stop the Redis process in order to move the page back into the main memory. This is a slow operation involving random I/Os (compared to accessing a page that is already in memory) and will result in anomalous latency experienced by Redis clients.

The kernel relocates Redis memory pages on disk mainly because of three reasons:

  • The system is under memory pressure since the running processes are demanding more physical memory than the amount that is available. The simplest instance of this problem is simply Redis using more memory than is available.
  • The Redis instance data set, or part of the data set, is mostly completely idle (never accessed by clients), so the kernel could swap idle memory pages on disk. This problem is very rare since even a moderately slow instance will touch all the memory pages often, forcing the kernel to retain all the pages in memory.
  • Some processes are generating massive read or write I/Os on the system. Because files are generally cached, it tends to put pressure on the kernel to increase the filesystem cache, and therefore generate swapping activity. Please note it includes Redis RDB and/or AOF background threads which can produce large files.

Fortunately Linux offers good tools to investigate the problem, so when latency due to swapping is suspected, the simplest thing to do is just to check whether this is the case.

The first thing to do is to check the amount of Redis memory that is swapped on disk. In order to do so you need to obtain the Redis instance pid:

$ redis-cli info | grep process_id
process_id:5454

Now enter the /proc file system directory for this process:

$ cd /proc/5454

Here you'll find a file called smaps that describes the memory layout of the Redis process (assuming you are using Linux 2.6.16 or newer). This file contains very detailed information about our process memory maps, and one field called Swap is exactly what we are looking for. However there is not just a single swap field since the smaps file contains the different memory maps of our Redis process (The memory layout of a process is more complex than a simple linear array of pages).

Since we are interested in all the memory swapped by our process, the first thing to do is to grep for the Swap field across the whole file:

$ cat smaps | grep 'Swap:'
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                 12 kB
Swap:                156 kB
Swap:                  8 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  4 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  4 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  4 kB
Swap:                  4 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB
Swap:                  0 kB

If everything is 0 kb, or if there are sporadic 4k entries, everything is perfectly normal. Actually in our example instance (the one of a real web site running Redis and serving hundreds of users every second) there are a few entries that show more swapped pages. To investigate if this is a serious problem or not we change our command in order to also print the size of the memory map:

$ cat smaps | egrep '^(Swap|Size)'
Size:                316 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                  8 kB
Swap:                  0 kB
Size:                 40 kB
Swap:                  0 kB
Size:                132 kB
Swap:                  0 kB
Size:             720896 kB
Swap:                 12 kB
Size:               4096 kB
Swap:                156 kB
Size:               4096 kB
Swap:                  8 kB
Size:               4096 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:               1272 kB
Swap:                  0 kB
Size:                  8 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                 16 kB
Swap:                  0 kB
Size:                 84 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                  8 kB
Swap:                  4 kB
Size:                  8 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  4 kB
Size:                144 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  4 kB
Size:                 12 kB
Swap:                  4 kB
Size:                108 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB
Size:                272 kB
Swap:                  0 kB
Size:                  4 kB
Swap:                  0 kB

As you can see from the output, there is a map of 720896 kB (with just 12 kB swapped) and 156 kb more swapped in another map: basically a very small amount of our memory is swapped so this is not going to create any problem at all.
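Pairing each Size line with its following Swap line can be scripted. This is a sketch that only handles those two fields, assuming the smaps layout shown above; it sums the swapped memory and finds the biggest map:

```python
def swap_summary(smaps_text):
    """Pair each Size: entry with its following Swap: entry and return
    (total_swap_kb, largest_map), where largest_map is a (size_kb, swap_kb)
    tuple for the biggest memory map seen."""
    total_swap, maps, size = 0, [], None
    for line in smaps_text.splitlines():
        if line.startswith("Size:"):
            size = int(line.split()[1])
        elif line.startswith("Swap:"):
            swap = int(line.split()[1])
            total_swap += swap
            if size is not None:
                maps.append((size, swap))
                size = None
    return total_swap, max(maps) if maps else None

# The two interesting maps from the output above.
sample = """Size:             720896 kB
Swap:                 12 kB
Size:               4096 kB
Swap:                156 kB
"""
print(swap_summary(sample))   # (168, (720896, 12))
```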

If instead a non-trivial amount of the process memory is swapped on disk, your latency problems are likely related to swapping. If this is the case with your Redis instance you can further verify it using the vmstat command:

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0   3980 697932 147180 1406456    0    0     2     2    2    0  4  4 91  0
 0  0   3980 697428 147180 1406580    0    0     0     0 19088 16104  9  6 84  0
 0  0   3980 697296 147180 1406616    0    0     0    28 18936 16193  7  6 87  0
 0  0   3980 697048 147180 1406640    0    0     0     0 18613 15987  6  6 88  0
 2  0   3980 696924 147180 1406656    0    0     0     0 18744 16299  6  5 88  0
 0  0   3980 697048 147180 1406688    0    0     0     4 18520 15974  6  6 88  0

The interesting part of the output for our needs are the two columns si and so, which count the amount of memory swapped from/to the swap file. If you see non-zero counts in those two columns then there is swapping activity in your system.
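Checking si and so can be scripted as well. This sketch assumes the `vmstat 1` column layout shown above, where si and so are the 7th and 8th columns; other vmstat versions may order columns differently:

```python
def swapping_activity(vmstat_lines):
    """Return True if any data line shows non-zero si (swap-in)
    or so (swap-out), assuming the standard 'vmstat 1' layout."""
    active = False
    for line in vmstat_lines:
        fields = line.split()
        if len(fields) < 16 or not fields[0].lstrip('-').isdigit():
            continue  # skip the two header lines
        si, so = int(fields[6]), int(fields[7])
        if si or so:
            active = True
    return active

sample = [
    "procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----",
    " r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa",
    " 0  0   3980 697932 147180 1406456    0    0     2     2    2    0  4  4 91  0",
    " 0  0   3980 697428 147180 1406580    0    0     0     0 19088 16104  9  6 84  0",
]
print(swapping_activity(sample))   # False: si and so are all zero here
```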

Finally, the iostat command can be used to check the global I/O activity of the system.

$ iostat -xk 1
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          13.55    0.04    2.92    0.53    0.00   82.95

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.77     0.00    0.01    0.00     0.40     0.00    73.65     0.00    3.62   2.58   0.00
sdb               1.27     4.75    0.82    3.54    38.00    32.32    32.19     0.11   24.80   4.24   1.85

If your latency problem is due to Redis memory being swapped on disk, you need to lower the memory pressure in your system, either by adding more RAM if Redis is using more memory than is available, or by avoiding running other memory-hungry processes in the same system.

Latency due to AOF and disk I/O

Another source of latency is due to the Append Only File support on Redis. The AOF basically uses two system calls to accomplish its work. One is write(2) that is used in order to write data to the append only file, and the other one is fdatasync(2) that is used in order to flush the kernel file buffer on disk in order to ensure the durability level specified by the user.

Both the write(2) and fdatasync(2) calls can be sources of latency. For instance write(2) can block when there is a system-wide sync in progress, or when the output buffers are full and the kernel requires a flush on disk in order to accept new writes.

The fdatasync(2) call is a worse source of latency, as with many combinations of kernels and file systems it can take from a few milliseconds to a few seconds to complete, especially when some other process is doing I/O. For this reason, since Redis 2.4, Redis does the fdatasync(2) call in a different thread when possible.

We'll see how configuration can affect the amount and source of latency when using the AOF file.

The AOF can be configured to perform an fsync on disk in three different ways using the appendfsync configuration option (this setting can be modified at runtime using the CONFIG SET command).

  • When appendfsync is set to the value of no, Redis performs no fsync. In this configuration the only source of latency is write(2). When this happens there is usually no solution, since the disk simply cannot cope with the speed at which Redis is receiving data; however, this is uncommon if the disk is not seriously slowed down by other processes doing I/O.

  • When appendfsync is set to the value of everysec Redis performs an fsync every second. It uses a different thread, and if the fsync is still in progress Redis uses a buffer to delay the write(2) call up to two seconds (since write would block on Linux if an fsync is in progress against the same file). However if the fsync is taking too long Redis will eventually perform the write(2) call even if the fsync is still in progress, and this can be a source of latency.

  • When appendfsync is set to the value of always, an fsync is performed at every write operation before replying back to the client with an OK code (actually Redis will try to cluster many commands executed at the same time into a single fsync). In this mode performance is generally very poor and it is strongly recommended to use a fast disk and a file system implementation that can perform the fsync in a short time.
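The policy is selected with the appendfsync directive, either in redis.conf or on a live instance via CONFIG SET, for example:

```
# redis.conf — pick one of: always / everysec / no
appendonly yes
appendfsync everysec

# Or change it at runtime without restarting:
#   redis-cli CONFIG SET appendfsync everysec
```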

Most Redis users will use either the no or everysec setting for the appendfsync configuration directive. The suggestion for minimum latency is to avoid other processes doing I/O in the same system. Using an SSD disk can help as well, but usually even non-SSD disks perform well with the append-only file if the disk is spare, as Redis writes to the append-only file without performing any seeks.

If you want to investigate your latency issues related to the append only file you can use the strace command under Linux:

sudo strace -p $(pidof redis-server) -T -e trace=fdatasync
The above command will show all the fdatasync(2) system calls performed by Redis in the main thread. With the above command you'll not see the fdatasync system calls performed by the background thread when the appendfsync config option is set to everysec. In order to do so just add the -f switch to strace.

If you wish you can also see both fdatasync and write system calls with the following command:

sudo strace -p $(pidof redis-server) -T -e trace=fdatasync,write

However since write(2) is also used in order to write data to the client sockets this will likely show too many things unrelated to disk I/O. Apparently there is no way to tell strace to just show slow system calls so I use the following command:

sudo strace -f -p $(pidof redis-server) -T -e trace=fdatasync,write 2>&1 | grep -v '0.0' | grep -v unfinished

Latency generated by expires

Redis evicts expired keys in two ways:

  • One lazy way expires a key when it is requested by a command, but it is found to be already expired.
  • One active way expires a few keys every 100 milliseconds.

The active expiring is designed to be adaptive. An expire cycle is started every 100 milliseconds (10 times per second), and will do the following:

  • Sample REDIS_EXPIRELOOKUPS_PER_CRON keys, evicting all the keys already expired.
  • If more than 25% of the keys were found expired, repeat.

Given that REDIS_EXPIRELOOKUPS_PER_CRON is set to 10 by default, and the process is performed ten times per second, usually just 100 keys per second are actively expired. This is enough to clean the DB fast enough even when already expired keys are not accessed for a long time, so that the lazy algorithm does not help. At the same time expiring just 100 keys per second has no effect on the latency of a Redis instance.
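The adaptive loop can be sketched directly. This is a simplified simulation of the cycle described above, using random sampling over a plain dictionary; the real implementation differs in details:

```python
import random

REDIS_EXPIRELOOKUPS_PER_CRON = 10

def expire_cycle(keys_with_ttl, now, rng=random):
    """One active-expire cycle: sample up to 10 keys, evict the expired
    ones, and repeat while more than 25% of the sample was expired.
    keys_with_ttl maps key -> absolute expire time."""
    evicted = 0
    while keys_with_ttl:
        sample = rng.sample(list(keys_with_ttl),
                            min(REDIS_EXPIRELOOKUPS_PER_CRON, len(keys_with_ttl)))
        expired = [k for k in sample if keys_with_ttl[k] <= now]
        for k in expired:
            del keys_with_ttl[k]
        evicted += len(expired)
        if len(expired) <= len(sample) * 0.25:
            break
    return evicted

# Everything already expired: the cycle loops until the dictionary is drained.
keys = {f"k{i}": 0 for i in range(40)}
print(expire_cycle(keys, now=1))   # 40
```

When no sampled key is expired the cycle stops after a single round, which is why the cost normally stays tiny; the pathological case in the next paragraph is exactly when the loop keeps repeating.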

However the algorithm is adaptive, and will loop if it finds more than 25% of the keys already expired in the set of sampled keys. But given that we run the algorithm ten times per second, this means hitting the unlucky case where more than 25% of the keys in our random sample are expiring in the same second.

Basically this means that if the database contains very many keys expiring in the same second, and these keys make up at least 25% of the current population of keys with an expire set, Redis can block until the percentage of already-expired keys drops back below 25%.

This approach is needed in order to avoid using too much memory for keys that are already expired; it is usually absolutely harmless, since it is unusual for a big number of keys to expire in the exact same second, but it is not impossible if the user used EXPIREAT extensively with the same Unix time.

In short: be aware that many keys expiring at the same moment can be a source of latency.

译者信息

使用这种策略是为了避免清除过期key的过程占用太多内存,这种方法对系统几乎不会有不良影响,因为大量key同时到期并非是一种常见现象,不过如果用户使用了 EXPIREAT 来设置过期时间的话也是有可能的。

总而言之: 要知道大量key同时过期会对系统延迟造成影响。
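One common way to avoid this latency source, not covered in the original text but a widely used workaround, is to add random jitter to expiry timestamps, so that keys set with `EXPIREAT` do not all land on the same second. A hypothetical helper (the function name and window size are illustrative):

```python
import random

def expire_at_with_jitter(base_unix_time, max_jitter_s=60):
    """Return an EXPIREAT timestamp spread over a window after
    base_unix_time, so a batch of keys does not expire in the
    same exact second."""
    return base_unix_time + random.randint(0, max_jitter_s)
```

Spreading expirations over even a small window keeps the fraction of sampled keys that are expired in any one second well below the 25% threshold that makes the active expiry cycle loop.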

Redis software watchdog

Redis 2.6 introduces the Redis Software Watchdog, a debugging tool designed to track down those latency problems that for one reason or another escaped an analysis using normal tools.

The software watchdog is an experimental feature. While it is designed to be used in production environments, care should be taken to backup the database before proceeding as it could possibly have unexpected interactions with the normal execution of the Redis server.

It is important to use it only as last resort when there is no way to track the issue by other means.

译者信息

Redis 看门狗

Redis 2.6 版本引入了Redis软件看门狗(software watchdog),这是一个调试工具,用于追踪那些用常规工具无法分析出来的延迟问题。

这个看门狗还是一个实验性功能。虽然它是为生产环境设计的,但在继续操作之前请做好数据库备份,因为它可能与Redis服务器的正常运行产生意想不到的相互影响。

只有在其他手段都无法追踪到问题时,才应把它作为最后手段使用。

This is how this feature works:

  • The user enables the software watchdog using the CONFIG SET command.
  • Redis starts monitoring itself constantly.
  • If Redis detects that the server is blocked into some operation that is not returning fast enough, and that may be the source of the latency issue, a low level report about where the server is blocked is dumped on the log file.
  • The user contacts the developers writing a message in the Redis Google Group, including the watchdog report in the message.

Note that this feature can not be enabled using the redis.conf file, because it is designed to be enabled only in already running instances and only for debugging purposes.

To enable the feature just use the following:

CONFIG SET watchdog-period 500

The period is specified in milliseconds. In the above example I specified to log latency issues only if the server detects a delay of 500 milliseconds or greater. The minimum configurable period is 200 milliseconds.

译者信息

这个功能是这样工作的:

  • 用户通过命令CONFIG SET开启软件看门狗
  • Redis启动监测程序监测自己的状态
  • 如果Redis检测到服务器阻塞在某个迟迟没有返回的操作上(这可能正是延迟问题的根源),就会在日志文件中写入一份关于服务器阻塞位置的底层报告。
  • 用户通过Redis Google Group发送消息给开发人员,消息包括看门狗报表。

请注意,这项功能不能通过redis.conf文件开启,因为它设计之初就只面向已经在运行的实例,并且仅用于调试目的。
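The mechanism, a periodic timer signal whose handler dumps a stack trace of wherever the process happens to be, can be mimicked in Python for illustration. This is only a rough analogue of the watchdog idea, not Redis's actual implementation (Unix-only, since it relies on SIGALRM):

```python
import signal
import traceback

watchdog_reports = []

def _watchdog_handler(signum, frame):
    # Like the Redis watchdog: capture a stack trace showing where the
    # process currently is (Redis writes this report to its log file).
    watchdog_reports.append("".join(traceback.format_stack(frame)))

def enable_watchdog(period_ms):
    """Deliver SIGALRM every period_ms milliseconds (Unix only)."""
    signal.signal(signal.SIGALRM, _watchdog_handler)
    signal.setitimer(signal.ITIMER_REAL, period_ms / 1000.0, period_ms / 1000.0)

def disable_watchdog():
    """The equivalent of CONFIG SET watchdog-period 0: stop the timer."""
    signal.setitimer(signal.ITIMER_REAL, 0)
```

If the main loop is stuck in a slow operation when the timer fires, the recorded stack trace points at the blocking call, which is exactly the kind of report Redis appends to its log.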

如果要开启该功能,只需运行如下命令:

CONFIG SET watchdog-period 500

时间间隔以毫秒为单位。在上面的例子中,我指定只有当服务器检测到500毫秒或以上的延迟时,才记录延迟事件。可配置的最小时间间隔是200毫秒。

When you are done with the software watchdog you can turn it off setting the watchdog-period parameter to 0. Important: remember to do this because keeping the instance with the watchdog turned on for a longer time than needed is generally not a good idea.

The following is an example of what you'll see printed in the log file once the software watchdog detects a delay longer than the configured one:

[8547 | signal handler] (1333114359)
--- WATCHDOG TIMER EXPIRED ---
/lib/libc.so.6(nanosleep+0x2d) [0x7f16b5c2d39d]
/lib/libpthread.so.0(+0xf8f0) [0x7f16b5f158f0]
/lib/libc.so.6(nanosleep+0x2d) [0x7f16b5c2d39d]
/lib/libc.so.6(usleep+0x34) [0x7f16b5c62844]
./redis-server(debugCommand+0x3e1) [0x43ab41]
./redis-server(call+0x5d) [0x415a9d]
./redis-server(processCommand+0x375) [0x415fc5]
./redis-server(processInputBuffer+0x4f) [0x4203cf]
./redis-server(readQueryFromClient+0xa0) [0x4204e0]
./redis-server(aeProcessEvents+0x128) [0x411b48]
./redis-server(aeMain+0x2b) [0x411dbb]
./redis-server(main+0x2b6) [0x418556]
/lib/libc.so.6(__libc_start_main+0xfd) [0x7f16b5ba1c4d]
./redis-server() [0x411099]
------

Note: in the example the DEBUG SLEEP command was used in order to block the server. The stack trace is different if the server blocks in a different context.

If you happen to collect multiple watchdog stack traces you are encouraged to send everything to the Redis Google Group: the more traces we obtain, the simpler it will be to understand what the problem with your instance is.

译者信息

当你用完软件看门狗后,可以把 watchdog-period 参数设置为0来关闭它。重要提示:一定记得关闭,因为让实例开着看门狗超过所需的时间通常不是个好主意。

以下的例子,你可以看到,当看门狗监测到延迟事件的时候,输出日志文件的内容:

[8547 | signal handler] (1333114359)
--- WATCHDOG TIMER EXPIRED ---
/lib/libc.so.6(nanosleep+0x2d) [0x7f16b5c2d39d]
/lib/libpthread.so.0(+0xf8f0) [0x7f16b5f158f0]
/lib/libc.so.6(nanosleep+0x2d) [0x7f16b5c2d39d]
/lib/libc.so.6(usleep+0x34) [0x7f16b5c62844]
./redis-server(debugCommand+0x3e1) [0x43ab41]
./redis-server(call+0x5d) [0x415a9d]
./redis-server(processCommand+0x375) [0x415fc5]
./redis-server(processInputBuffer+0x4f) [0x4203cf]
./redis-server(readQueryFromClient+0xa0) [0x4204e0]
./redis-server(aeProcessEvents+0x128) [0x411b48]
./redis-server(aeMain+0x2b) [0x411dbb]
./redis-server(main+0x2b6) [0x418556]
/lib/libc.so.6(__libc_start_main+0xfd) [0x7f16b5ba1c4d]
./redis-server() [0x411099]
------

注意:例子中使用 DEBUG SLEEP 命令来阻塞服务器。如果服务器阻塞在其他上下文中,堆栈信息会有所不同。

如果收集到多个看门狗的监测堆栈信息,我们鼓励你把这些信息发送到Redis Google Group:我们获得越多的信息,我们就越容易分析得到你的服务器到底有什么问题。

APPENDIX A: Experimenting with huge pages

Latency introduced by fork can be mitigated using huge pages at the cost of a bigger memory usage during persistence. The following appendix describes this feature in detail as implemented in the Linux kernel.

Some CPUs can use different page sizes, though. AMD and Intel CPUs can support 2 MB pages if needed. These pages are nicknamed huge pages. Some operating systems can optimize page size in real time, transparently aggregating small pages into huge pages on the fly.

译者信息

附录A:大内存页的实验

fork产生的延迟可以通过huge page(大内存页)来缓解,代价是持久化期间更大的内存占用。下面的附录将详细描述Linux内核中实现的这个功能。

某些CPU可以使用不同大小的页面。AMD和Intel的CPU在需要时可以支持2MB的页面,这种页面俗称huge page(大页面)。某些操作系统可以实时地优化页面大小,透明地把小页面聚合成大页面。

On Linux, explicit huge pages management has been introduced in 2.6.16, and implicit transparent huge pages are available starting in 2.6.38. If you run recent Linux distributions (for example RH 6 or derivatives), transparent huge pages can be activated, and you can use a vanilla Redis version with them.

This is the preferred way to experiment/use with huge pages on Linux.

Now, if you run older distributions (RH 5, SLES 10-11, or derivatives), and are not afraid of a few hacks, Redis needs to be patched in order to support huge pages.

译者信息

在Linux上,显式的huge page管理在2.6.16中引入,隐式的透明huge page从2.6.38开始可用。如果你运行的是较新的Linux发行版(例如RH 6或其派生版本),可以开启透明huge page,并直接使用原版(vanilla)的Redis。

这个是在Linux中,实验/使用huge page的最佳方法。

现在,如果你运行的是较旧的发行版(RH 5、SLES 10-11或其派生版本),并且不怕动手做些修改,那么需要给Redis打补丁才能支持huge page。
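On kernels with transparent huge page support (2.6.38+), the active mode can be read from `/sys/kernel/mm/transparent_hugepage/enabled`, where the bracketed token marks the current setting. A small parsing helper (the helper itself is illustrative; the bracketed-token file format is standard):

```python
def parse_thp_setting(contents):
    """Parse the contents of /sys/kernel/mm/transparent_hugepage/enabled,
    e.g. 'always [madvise] never' -> 'madvise' (the bracketed value is
    the active setting). Returns None if no setting is marked."""
    for token in contents.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None
```

In practice you would read the sysfs file and pass its contents to the helper to decide whether transparent huge pages are in effect before experimenting with a vanilla Redis build.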

The first step would be to read Mel Gorman's primer on huge pages

There are currently two ways to patch Redis to support huge pages.

  • For Redis 2.4, the embedded jemalloc allocator must be patched (patch by Pieter Noordhuis). Note this patch relies on the anonymous mmap huge page support, only available starting 2.6.32, so this method cannot be used for older distributions (RH 5, SLES 10, and derivatives).

  • For Redis 2.2, or 2.4 with the libc allocator, the Redis makefile must be altered to link Redis with the libhugetlbfs library. It is a straightforward change.

Then, the system must be configured to support huge pages.

The following command allocates and makes N huge pages available:

$ sudo sysctl -w vm.nr_hugepages=<N>

The following command mounts the huge page filesystem:

$ sudo mount -t hugetlbfs none /mnt/hugetlbfs

In all cases, once Redis is running with huge pages (transparent or not), the following benefits are expected:

  • The latency due to the fork operations is dramatically reduced. This is mostly useful for very large instances, and especially on a VM.
  • Redis is faster due to the fact the translation look-aside buffer (TLB) of the CPU is more efficient to cache page table entries (i.e. the hit ratio is better). Do not expect miracles, it is only a few percent gain at most.
  • Redis memory cannot be swapped out anymore, which is interesting to avoid outstanding latencies due to virtual memory.
译者信息

第一步,阅读Mel Gorman's primer on huge pages

现在有两个方法给Redis打补丁,让它支持huge page

  • 对于Redis 2.4,需要给内置的jemalloc分配器打补丁(Pieter Noordhuis的patch)。需要注意,这个补丁依赖于匿名mmap的huge page支持,该支持从内核2.6.32才开始提供,所以这个方法不能用于更旧的发行版(RH 5、SLES 10及其派生版本)。
  • 对于Redis 2.2,或者使用libc分配器的2.4,必须修改Redis的makefile,把Redis和libhugetlbfs库链接起来。这是一个很直接的改动。

然后,系统必须配置为支持huge page

以下命令分配和创建 N个huge page:

$ sudo sysctl -w vm.nr_hugepages=<N>

以下命令挂载huge page文件系统:

$ sudo mount -t hugetlbfs none /mnt/hugetlbfs

在所有的情况下,一旦Redis用上huge page(透明或者非透明),将会得到如下的好处:

  • 由于fork引起的延迟将显著降低。这对超大的实例、尤其是运行在VM上的实例最有用。
  • Redis速度会得到提高,因为CPU的转译后备缓冲区(TLB)能更高效地缓存页表项(即命中率更高)。不要期望奇迹,性能至多提高几个百分点。
  • Redis内存不会再被交换(swap)出去,这有助于避免虚拟内存造成的严重延迟。
Unfortunately, and on top of the extra operational complexity, there is also a significant drawback of running Redis with huge pages. The COW mechanism granularity is the page. With 2 MB pages, the probability a page is modified during a background save operation is 512 times higher than with 4 KB pages. The actual memory required for a background save therefore increases a lot, especially if the write traffic is truly random, with poor locality. With huge pages, using twice the memory while saving is no longer a theoretical incident. It really happens.

The result of a complete benchmark can be found here.

译者信息

很不幸,除了额外的运维复杂度之外,让Redis使用huge page还有一个明显的缺陷。COW(写时复制)机制的粒度是页面:使用2MB页面时,一个页面在后台保存期间被修改的概率是4KB页面的512倍。因此后台保存实际需要的内存会大幅增加,尤其是在写流量非常随机、局部性很差的情况下。使用huge page时,保存期间占用两倍内存不再只是理论上的可能,它真的会发生。

完整的性能评估结果可以参阅这里.
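The 512x figure quoted above follows directly from the page-size ratio, and the copy-on-write cost per dirtied page scales the same way. An illustrative calculation (not a benchmark):

```python
# Page-size ratio behind the "512 times higher" figure
HUGE_PAGE = 2 * 1024 * 1024   # 2 MB huge page
SMALL_PAGE = 4 * 1024         # 4 KB regular page
RATIO = HUGE_PAGE // SMALL_PAGE
print(RATIO)  # 512

def copied_bytes_per_random_write(page_size):
    """Rough COW model: a single random write dirties its whole
    containing page, so the kernel must copy page_size bytes."""
    return page_size
```

So with huge pages each random write during a background save costs 512 times more copied memory than with 4 KB pages, which is why memory usage while saving can realistically double.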

原文地址:https://www.cnblogs.com/blockcipher/p/3266380.html