[Repost] KVM storage performance and cache settings on Red Hat Enterprise Linux 6.2

Almost one year ago, I checked how different cache settings affected KVM storage subsystem performance. The results were very clear: to obtain good I/O speed, you had to use the write-back or none cache policies, avoiding the write-through one. However, as the write-back policy intrinsically carried some data-loss risk, the safer bet was to not use any host-based cache at all (the “nocache” KVM option).

But what is the situation now? With the newly released RHEL 6.2 point release, I am going to check whether anything has changed. But first, let me recap the whole KVM caching question.

KVM cache modes overview

Normally, a virtual guest system uses a host-side file to store its data: this file represents a virtual disk, which the guest uses as a normal physical disk. From the host's point of view, however, this virtual disk is an ordinary data file and may be subject to caching.

In this context, caching is the process of keeping some disk-related data in physical RAM. When we use the cache to store in RAM only previously read data, we speak of a read cache, or write-through cache. When we also store in RAM data that will be flushed to disk later, we speak of a read/write cache, or write-back cache. A write-back cache, by holding write requests in fast RAM, has higher performance; however, it is also more prone to data loss than a write-through one, as the latter only caches read requests and immediately writes any data to disk.

As disk I/O is a very important performance parameter, Linux and Windows operating systems generally use a write-back policy with periodic flushes to the physical disk. However, when using a hypervisor to virtualize a guest system, you can effectively cache things twice (once in host memory and again in guest memory), so it makes sense to disable host-based caching of the virtual disk file and let the guest system manage its own caching. Moreover, a host-side write-back policy on the virtual disk file used to significantly increase the risk of data loss in case of a guest crash. However, as you will soon see, thanks to a new “barrier-passing” feature, this may no longer be the case.

KVM cache modes overview, continued

We can use an image to better track how things work:

Guest / Host write caching

Let's start from the beginning, assuming traditional (no-barrier-passing) behavior: when a guest-side application writes something, the data generally goes into the guest-side pagecache. The pagecache is then periodically flushed to the virtual disk device, and so to the host-side disk image file. On the other hand, if the application has to write some very important data, it can bypass the pagecache and use synchronized write semantics, where a write is supposed to return if and only if all the data was successfully committed to the (virtual) permanent storage system.

Anyway, at this point (writes flushed to the virtual disk device) we have three possibilities:

  • if the cache policy is set to “writeback”, data will be cached in the host-side pagecache (red arrow);
  • if the cache policy is set to “none”, data will be immediately flushed to the physical disk/controller cache (gray arrow);
  • if the cache policy is set to “writethrough”, data will be immediately flushed to the physical disk platters (blue arrow).
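For reference, the cache policy is chosen per virtual disk: with plain Qemu/KVM it is the cache= suboption of -drive, while in a libvirt-managed guest it is the cache attribute of the disk's <driver> element. A minimal sketch of the latter (the image path and device names are illustrative):

```xml
<disk type='file' device='disk'>
  <!-- cache can be set to 'writethrough', 'writeback' or 'none' -->
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/opt/guests/debian6.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>
```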

Note that the only 100% safe choice is the write-through setting, as the others do not guarantee that a write returns only after the data is committed to permanent physical storage. This is not always a problem: after all, on a classic, non-virtualized system you have no guarantee that a normal write is immediately stored to disk. However, as stated above, some writes must be committed to disk or significant loss will occur (e.g., think of a database system or filesystem journal commits). These important writes are generally marked as “synchronous” and are executed with the O_SYNC or similar flags, meaning the call is supposed to return only when all data has been successfully committed to the permanent storage system.
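As a quick illustration (not part of the original benchmarks), the difference between a normal buffered write and a synchronized one can be reproduced with dd, whose oflag=sync opens the output file with O_SYNC; the file names are arbitrary:

```shell
# Ordinary buffered write: dd returns as soon as the data sits in the pagecache.
dd if=/dev/zero of=buffered.bin bs=4k count=256 2>/dev/null

# Synchronized write: oflag=sync sets O_SYNC, so every write returns only
# after the data has been reported as committed to stable storage.
dd if=/dev/zero of=synced.bin bs=4k count=256 oflag=sync 2>/dev/null
```

On a rotating disk, the second command is noticeably slower, because each of the 256 writes has to be acknowledged as stable before the next one starts.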

Unfortunately, when we speak about the virtualized guest, the only cache setting that guarantees a 100% permanent write, the writethrough mode, is also the slowest one (as you will see shortly). This meant you had to choose between safety and performance, with the no-cache mode often used because, while not 100% safe, it was noticeably safer than the write-back cache.

Now things have changed: newer KVM releases enable a “barrier-passing” feature that assures 100% permanent data storage for guest-side synchronized writes, regardless of the caching mode in use. This means that we can potentially use the high-performance “writeback” setting without fear of data loss (see the green arrow above). However, your guest operating system has to use barriers in the first place: for most EXT3-based Linux distributions (such as Debian) this means you have to manually enable barriers, or use a filesystem with write barriers turned on by default (most notably EXT4).

If you virtualize an old operating system without barrier support, you have to use the write-through cache setting or, at most, the no-cache mode. In the latter case you have no 100% guarantee that synchronized writes will be stored to disk; however, if your guest OS does not support barriers, it is intrinsically unsafe on standard hardware too. So the no-cache mode seems a good bet for these barrier-less operating systems, especially considering the high performance impact of the write-through cache mode.

OK, things are always wonderful in theory, but how well does the new write-barrier-passing feature work in a practical environment? We will see in a moment...

Testbed and methods

All tests were run on a Dell D620 laptop. The complete system specifications are:

  • Core2 T7200 CPU @ 2.0 GHz
  • 4 GB of DDR2-667 RAM
  • Quadro NVS110 videocard
  • a Seagate ST980825AS 7200 RPM 80 GB SATA hard disk drive (in IDE compatibility mode, as the D620's BIOS does not support AHCI operation)
  • OS: Red Hat Enterprise Linux 6.2 amd64

The kernel version was kernel-2.6.32-220.el6.x86_64, while the Qemu/KVM version was qemu-kvm-0.12.1.2-2.209.el6_2.1.x86_64. The internal hard disk was partitioned into three slices: a first ~4 GB swap partition, a second ~20 GB ext4 root partition and a third ~50 GB ext4 partition (mounted on /opt) for testing purposes. All guest disk images were stored on this last partition.

I installed a Debian 6 amd64 OS in a guest instance backed by a Qcow2 image file. The Debian installation process was repeated multiple times, with both EXT3 and EXT4 root filesystems. Please note that when using an EXT3-based root filesystem, the Debian installation proceeded without write barrier support, as the installer does not permit enabling them from the default text-based menu; EXT3 write barriers were enabled after the installation completed by setting the correct option in /etc/fstab. EXT4, on the other hand, uses write barriers by default.
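For reference, turning EXT3 write barriers on after installation amounts to adding the barrier=1 mount option to the root entry in the guest's /etc/fstab (the device name below is illustrative):

```
# /etc/fstab (guest side) - EXT3 root with write barriers enabled
/dev/vda1  /  ext3  defaults,barrier=1  0  1
```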

Write performance was benchmarked by measuring the time needed to install the Debian base system, while I used the Linux “dd” tool to check for correct Qemu/KVM barrier-passing functionality.

Testing the barrier-passing feature

Write barrier passing is not the newest feature: it has existed for some time now. However, it was not always enabled, as it initially worked only on VirtIO-based virtual devices and only in specific Qemu/KVM builds. So the first thing to do is to test its effectiveness with the RHEL 6.2-provided Qemu packages and with standard, wizard-created KVM virtual machines.

But how can we test this feature? One smart, simple yet reliable test is to use, inside the guest OS, the Linux “dd” utility with appropriate command-line flags: direct and sync. The first flag enables direct access to the virtual storage device, bypassing the guest pagecache but not issuing any write barrier command on the guest side. Moreover, a direct guest write still hits some of the host-side caches (which cache depends on the KVM cache setting – see the picture with the red, gray and blue arrows above). This barrier-less direct write lets us simulate a “no-barrier-passing” situation.

The second flag lets us run another dd write pass with both direct and sync set, meaning that the guest will now generate a write barrier for each write operation (remember that the sync flag should guarantee a 100% permanent write, so a write barrier is needed and will be issued if the guest OS supports it).
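Put together, a test run inside the guest looks roughly like this sketch (the target file and transfer size are illustrative):

```shell
# Pass 1: direct only - bypasses the guest pagecache, no write barriers issued.
dd if=/dev/zero of=barrier-test.bin bs=1M count=16 oflag=direct

# Pass 2: direct + sync - every write must reach stable storage, so the guest
# issues a barrier/flush per write (if its OS and filesystem support barriers).
dd if=/dev/zero of=barrier-test.bin bs=1M count=16 oflag=direct,sync
```

If barrier passing works, the second pass should be dramatically slower than the first under the writeback and nocache modes.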

OK, it's time to give you some numbers now. First, let's see how a VirtIO-based virtual machine performs in these “dd” tests:

Barrier passing testing with VirtIO driver

We can start our analysis from the host side: as you can see, the sync-less writes are much faster than the synced ones. The reason is simple: while the former can use the physical disk cache to temporarily store data, the latter forces all data to physically reach the disk platters (for permanent storage). It is interesting to note that, albeit with different absolute values, the “nocache” and “writeback” guest results follow the same pattern: this proves the effectiveness of the barrier-passing feature. If this feature did not work, we would see synced speeds on par with non-synced ones, but this does not happen. But why is the direct write-back score so high? It is probably an artifact of the VirtIO driver; perhaps the driver is collapsing multiple writes into a single data transfer.

See the very low “writethrough” scores? They are due to the “always-sync” policy of this cache mode. You can see that, while very safe, this cache mode is way slower than the others.

What happens if we have to use the older, but universally compatible, IDE driver?

Barrier passing testing with IDE driver

While the absolute values are quite different from the previous ones, the IDE block driver gives us the same end result: the write-barrier-passing feature is up and working. And again we see the very low write-through result.

Being sure that the barrier-passing feature works is no small feat: we can now comfortably use the fastest cache mode, writeback. Please note that the above results were almost identical for both EXT3 (with barriers enabled) and EXT4 filesystems.

Debian base system installation time – EXT3 and no guest-side write barriers

Now we can see how the various caching policies affect Debian base system installation time. Will a real-world scenario redeem the historically safer write-through mode?

Debian install time - EXT3

In the case of a guest operating system with default-configured EXT3, it seems not: the writethrough policy remains way slower than the others. While the nocache mode shows remarkably good results, the write-back one is marginally faster thanks to its use of not only the guest-side but also the host-side pagecache.

This is a perfect example of the safety vs. speed trade-off described above: the guest OS does not use write barriers, so the writeback mode is significantly prone to data loss (because synchronized guest-side writes are cached in the host-side pagecache before being flushed to permanent storage), while the writethrough mode, although slower, is even safer than this guest OS configuration running on real hardware.

In this case, a good compromise between safety and performance is the nocache mode: it provides a safety level comparable to that of this guest OS instance running on real hardware, as synchronized guest-side writes are cached only in the disk's internal cache.

Debian base system installation time – EXT4 and guest-side write barriers enabled

While the previous test shows remarkably well the situation when using a relatively old guest filesystem, what happens when we use a modern, barrier-aware filesystem such as EXT4 inside our guest?

Debian install time - EXT4

Things become definitely more interesting: while still slower, the writethrough setting regains some competitiveness. However, with guest-side write barriers turned on, it is no longer any safer than the other caching modes.

Comparing the nocache and write-back settings, we can see that the former has a slight edge in IDE mode: this is probably due to the “I-am-trying-to-use-the-host-cache-but-I-can't-really-use-it” situation in which the writeback mode finds itself in this test (a .deb package installation is a mix of uncompressing and synchronized writes to disk).

However, the two modes are more or less on par. This means that, when using a write-barrier-aware guest filesystem, you can use either of these two settings, without bothering with the slower write-through one.

Conclusions

Hey, wait a moment: where are the other benchmarks? Simply put: I didn't run any other benchmark. Why? The point is that in all benchmarks we would see the same repeating pattern: the write-through mode would be the slowest, with the nocache and writeback ones way faster.

To tell the truth, I expect some variation between the nocache and writeback cache modes, because the latter can use the host-side pagecache, effectively gaining more memory for disk caching. However, any difference here would be very specific to a number of factors, most notably: host memory size, guest memory size, application read/write patterns, the number of virtual machines hosted, the number of shared backend files (e.g., to support multiple snapshots)... In short, far too many to give you significant and repeatable results that would apply to your cases. Moreover, in other cases the implicit direct access granted by the nocache mode can give you a slight performance boost.

My suggestion is to use the write-back or nocache policies whenever you can enable write barriers on the guest side (i.e., all Linux distributions released in recent years). If you have a mostly read-bound workload, go with the write-back policy (as it allows the guest to benefit from additional pagecache memory), while for mostly synchronized-write-bound guests (e.g., a PostgreSQL machine), use the nocache mode. If you can't enable guest-side write barriers, use the nocache option. In this last case, for maximum safety you can use the write-through mode, but be aware of the massive performance loss and the potential impact that the added disk activity can have on other virtual machines.

It's very pleasing that the Qemu/KVM storage stack now supports write-barrier passing on both VirtIO and IDE devices. All in all, the KVM project is evolving very well and has a great future. I hope you found this article interesting.

Original article: https://www.cnblogs.com/popsuper1982/p/3814819.html