Disk alignment and caching

So now that we've conducted measurements and run benchmarks, what do we do with the results? How does the system need to be built to deliver good performance? What options do we have?

Disk Alignment

First and foremost, code dealing with raw devices usually uses the concept of dividing the disk up into blocks of a certain size. Addressing blocks instead of each byte separately has the huge advantage that an unsigned integer can be used to manage 2³² Blocks instead of 2³² Bytes, thereby vastly increasing the maximum addressable size without needing more memory.

The downside is that this means that we need to take care when reading and writing data: We cannot overwrite parts of a block, because the only command we have available is "overwrite Block number 1284319 with this data". There is simply no command to say "overwrite the second half of block 1284319". Hence, we always need to write complete blocks.

But what happens if we don't have enough data to overwrite a complete block? Well, then the system reads the old block into RAM, modifies only the bits and bytes that need to change, and writes the complete block back. This is called a read-modify-write cycle.

This immediately becomes relevant as soon as we try to make actual use of a device, because then we need to layer multiple block-oriented somethings on top of each other. In order to prevent read-modify-write from happening, we have to make sure that upper layers only overwrite byte ranges that the lower layers consider a full block. We do that by ensuring the file system writes a minimum of, say, 4096 Bytes to a position that is a multiple of 4096 Bytes away from the beginning of the file system. If the lower layer uses a block size of 4096 Bytes, we will now always overwrite a full block. If the lower layer uses a smaller block size of which 4096 is a multiple, this will also work: such a file system is fine when the lower layer uses 2048, 1024 or 512 Bytes as its block size. 1536 Bytes would not work, because in order to write a single file system block, the block device would have to write 2.67 of its own blocks. For that, it would require at least one read-modify-write cycle.
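If you want to see which block sizes are actually in play on a given box, the stock tools will tell you. A quick sketch (the device and mount point are placeholders, adjust them to your setup):

$ blockdev --getss --getpbsz /dev/sda     # logical and physical block size of the device
$ xfs_info /srv/data | grep bsize         # block size the file system was created with

As long as every layer's block size is a multiple of the layer below, you're good.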

Most data handling systems divide their storage into a metadata area and a data area. Since the data area is the one that is written to far more frequently, we have to make sure its blocks are properly aligned. The size of the metadata area is relevant here because it acts as padding that pushes the whole data area back a bit. So, you probably guessed it already: this padding needs to be a multiple of the block size. The block size of the data area itself must then in turn be aligned with the block sizes of the other systems in the stack.

So let's take a quick look at paddings and block sizes of different tools:

Tool               Padding                    Block Size
Hardware RAID      Hopefully, 0               Chunk Size (e.g. 256KiB) / Stripe Width (e.g. 1MiB)
Partition Table    2048 Sectors → 1MiB        2048 Sectors → 1MiB
LVM                1MiB (1st Phys. Extent)    4MiB (VG Extent Size)
File Systems       Mostly aligned to 1MiB     4096 Bytes (4KiB)
QCow2 VM Image     2MiB                       depends on guest file system
Raw VM Image       0                          depends on guest file system

For qcow2 and raw VM images, make sure you don't use a smaller block size in the guest's file systems than in the host filesystem to avoid read-modify-writes.
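As a rough sketch (paths and sizes are made up): check the host file system's block size, create the image, and don't go below that block size when formatting inside the guest.

$ stat -f -c 'host fs block size: %s' /var/lib/libvirt/images
$ qemu-img create -f qcow2 -o cluster_size=64k /var/lib/libvirt/images/test.qcow2 20G
# inside the guest:
$ mkfs.xfs -b size=4096 /dev/vda1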

Now with regard to the block sizes, there's another pitfall to be aware of: the VG Extent Size and the RAID Chunk Size differ from a file system's block size in that they do not denote a minimum write IO size. Instead, they are only relevant for storage allocation. An LV in a VG with an extent size of 4MiB is perfectly capable of performing 4KiB IOs. The extent size only matters for where a newly-allocated LV, and with it the file system on it, starts.
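To illustrate (VG and device names are made up): the extent size is set on the VG, and pvs will tell you where the first physical extent, and with it the data area, actually begins.

$ pvcreate --dataalignment 1m /dev/sdb1
$ vgcreate -s 4M vg0 /dev/sdb1            # -s sets the extent size
$ pvs -o +pe_start /dev/sdb1              # offset of the 1st physical extent
$ lvcreate -L 20G -n vms vg0              # this LV still happily does 4KiB IOs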

So the easiest way to get all those numbers to play nicely with one another is to use the defaults wherever possible. This minimizes the amount of fiddling you have to do, and minimum fiddling means minimum mistakes. Since a single mistake can flush your whole system's performance down the drain, minimum mistakes is a good thing, especially when you allow the defaults to work with things you do frequently (like, creating a partition table in a newly-created raw VM image in an LVM logical volume).

If you don't use RAID, make sure you use partition tables with the first partition starting at sector 2048, and partitions being a multiple of 2048 sectors in length. That is what gparted's "Align to 1MiB" button does, as does pretty much every modern operating system on the planet. If your partition table starts at sector 63, your OS is not one of those. You should then do the partitioning with a live CD or something before installing the OS. If that is not an option and you're using a VM store backed by a file system that does not reside on SSDs, you may also consider switching that file system's block size to 512 Bytes.
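A quick sketch with parted (the device name is a placeholder); align-check tells you whether an existing partition is properly aligned:

$ parted -a optimal /dev/sdb mklabel gpt
$ parted -a optimal /dev/sdb mkpart primary 1MiB 100%
$ parted /dev/sdb align-check optimal 1   # should report partition 1 as aligned
$ fdisk -l /dev/sdb                       # start sector should be 2048 or a multiple thereof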

If you do use RAID, things get a little more complicated.

Laying out RAID

Every RAID level except for RAID-1 employs a technique called striping. This means that when writing a certain amount of data, the data gets split into chunks, and each chunk is going to land on a different disk. For instance, if you're writing 1MiB of data to a RAID that has a chunk size of 256KiB, you're going to overwrite four chunks, because 1MiB / 256KiB = 4. If you're writing more chunks than your array has data disks, at some point this is going to wrap around. So if you have 6 data disks in your array, writing 4 chunks does not need to wrap, but if you only had 3, then one disk would get two chunks to write. The maximum amount of data that can be written without having to wrap around is called the Stripe Width, and is calculated as the number of data disks times the chunk size. So for instance, a RAID-5 array of 8 disks with a chunk size of 256KiB would then have a Stripe Width of 1792KiB: 7 * 256KiB = 1792KiB.
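For Linux software RAID, the chunk size is set at creation time, and the stripe width simply follows from the number of data disks. A sketch (device names are placeholders):

$ echo $(( 7 * 256 ))                     # 8-disk RAID-5: 7 data disks * 256KiB chunks = 1792KiB stripe width
$ mdadm --create /dev/md0 --level=5 --raid-devices=8 --chunk=256 /dev/sd[b-i]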

Now take a look at the table again. Since most offsets default to 1MiB and LVM's extent size is also a multiple of 1MiB, it would be nice if the start of a stripe + 1MiB = the start of another stripe. This is easily achieved if your Stripe Width is 1MiB, but 512 KiB would work just as well.

If all your partitions and LVs start exactly at the beginning of a RAID stripe, you will then be able to perform filesystem tuning. At least Ext3, Ext4 and XFS can be tuned by giving them information about the RAID layout, but for that to work correctly, the first byte of the file system has to be aligned to the start of a stripe. Otherwise, none of the calculations the file system performs will match reality.
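A hedged example, assuming a chunk size of 256KiB and four data disks (so a 1MiB stripe width) on a made-up LV: XFS takes the geometry directly, ext4 wants it expressed in file system blocks.

$ mkfs.xfs -d su=256k,sw=4 /dev/vg0/vms                           # su = chunk size, sw = number of data disks
$ mkfs.ext4 -b 4096 -E stride=64,stripe-width=256 /dev/vg0/vms    # 256KiB/4KiB = 64, 1MiB/4KiB = 256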

RAID controllers are most efficient when overwriting a full stripe. With random IO, that is never going to happen ever. Don't try to make it happen. It won't. But we can reduce the pain by using a big chunk size (in my experience, 256KiB works splendidly), so the RAID controller can process even large IO requests by updating a single chunk per stripe.

I also did a bunch of tests comparing hardware RAID to software RAID. The results indicate that hardware RAID is better suited for RAID levels that involve parities and mirroring (RAID-1, 5, 6). Software RAID is better suited for striping (RAID-0). I figured this is because then the Linux kernel can distribute the load to multiple devices in parallel instead of having to shove everything down one single pipe (citation needed).

Hardware and software RAID aren't mutually exclusive. For instance, layering a software RAID-0 instance over four hardware RAID-5/6 instances with four data disks each gets you a total of 16 data disks. If each one of those is 1.2TB in size, that's about 20TB, and you didn't even have to violate the "four disks per array" principle from above. However, this setup does introduce some fuzziness about what "the beginning of a stripe" exactly is. Still, I have yet to see one of those systems be brought to its limit, so this configuration works extremely well in practice.
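A sketch of such a layering, assuming the four hardware arrays show up as /dev/sd[b-e]: pick a RAID-0 chunk size that lines up with the hardware arrays' stripe width, so the software layer doesn't straddle them needlessly.

$ mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=1024 /dev/sd[b-e]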

Caching

Caching is the most effective optimization in IT.

First and foremost, Linux uses the host's RAM for caching, so equipping your storage system with far more than enough RAM makes sense. After a while, you'll find the file system cache filling up every last bit of RAM. Curiously, the block device buffer doesn't seem to be as effective, so I prefer using a file storage backend for VM images.
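You can watch that happen with free; over time, the cache columns grow until they occupy pretty much everything the applications don't need.

$ free -h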

And for the love of god, use your RAID controller's cache. Always buy a friggin' CacheVault. With it, you might even be able to use slower disks, because the RAID controller gets to feed them more sequential IO.

On the other hand, do not enable your disks' own caches, because CacheVault doesn't protect those. There's an option in the RAID controller to control the disks' caches. Make sure it is set to disabled.
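For disks attached directly to the host, hdparm can switch the write cache off; behind a RAID controller you'd use the controller's own management tool (storcli, MegaCLI and friends) to set the disk cache policy to disabled. The device name below is a placeholder.

$ hdparm -W 0 /dev/sda                    # disable the drive's write cache
$ hdparm -W /dev/sda                      # check the current setting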

A word on fragmentation

You might be worried about file system fragmentation. For sequential workloads, fragmentation is an issue because logically contiguous data ends up scattered across the disk, causing the disk to do random-ish IO where it could be streaming. This reduces the maximum achievable throughput, and is therefore bad.

When putting VM images into a file system, this is not an issue because VMs write randomly anyway.

This is the fragmentation of two of my largest VM cluster setups:

$ xfs_db -c frag -r /dev/drbd/by-res/ovirt_vms01
[...] fragmentation factor 90.29%

$ xfs_db -c frag -r /dev/drbd/by-res/ovirt_vms02
[...] fragmentation factor 99.68%

While not suitable for storing movies anymore, those things are absolutely fine for running VMs. If anything, worry about latency.