Filesystem Tuning

A while back, I promised I’d write about file system tuning someday. Since it hasn’t really happened yet, I thought I’d do it now.

So you’ve done everything you can on the hardware side. You’ve built a nice RAID over precisely four data disks with a neato stripe width of 1 MiB. You’ve phased out DRBD, probably replacing it with other shiny stuff of the day. Or you’ve decided to go all-SSD, ditching all of that old-school spinning rust altogether. (Which may turn out to be a bad idea if done in the wrong places, but I digress.) Now of course you’re wondering: What should I do in the file system?

Use XFS.

I get it: Ext4 is tried and tested. But it has one fundamental design flaw: the journal.

When multiple processes want to write data to the file system, their metadata updates all get funneled through that single journal. The journal basically ensures that only one commit can be in flight at a time, so concurrent writers end up serializing on it. This is not really a problem on a machine dedicated to a single purpose, but on a virtualization host where dozens of VMs are running, this kinda sucks.
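You can actually watch that bottleneck: every ext4 file system gets exactly one jbd2 kernel thread that handles its journal commits. Something like this (the device names are just examples, of course):

ps -e -o comm | grep '^jbd2'
# jbd2/sda1-8
# jbd2/dm-3-8

One journal thread per file system, no matter how many VMs are hammering it.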

XFS still journals, but it divides the disk space up into multiple allocation groups (how many depends on the volume size), each managing its own free space and inodes. So many processes can allocate and write in parallel, without having to synchronize with each other nearly as much. This can make a huge difference already. So, if you have a massively parallel workload, use XFS. If you’re unsure, also use XFS, because it won’t hurt. I’m indeed a huge fan of XFS. XFS is the shit.
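If you’re curious how many allocation groups you ended up with, xfs_info will tell you; the mount point and numbers here are just an example:

xfs_info /mnt/some-volume
# meta-data=/dev/sas1/some-volume  isize=512    agcount=16, agsize=... blks
# (rest of the output trimmed)

agcount is whatever mkfs.xfs picked based on the volume size; you can force a different number with -d agcount=N, but the default is usually fine.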

But isn’t ZFS way cooler?

If you’re making heavy use of snapshots, maybe. Snapshots aren’t really a thing for me, so I don’t really have much motivation to ditch my battle-hardened setup and try something different. (I may have just become a greybeard in that sense, but as long as my machine runs circles around yours, I can live with that.)

Tuning parameters

First of all: the biggest gains are to be made on the block-device side of things, specifically the RAID. If you get that right, file system tuning won’t make much of a difference anymore. Still, I do it because, you know, it can’t hurt.

Second: Only do this when you have full control over the hardware, and you’re absolutely, positively sure what the RAID layout is. If you need to guess, chances are you’re guessing wrong, and when that happens your performance will actually get worse than if you had applied no tuning at all and just used the defaults.
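If you’re on Linux software RAID, checking the layout is easy enough (md0 being whatever your array is called; the numbers are just an example):

mdadm --detail /dev/md0 | grep -E 'Raid Devices|Chunk Size'
#      Raid Devices : 5
#        Chunk Size : 256K

Keep in mind that sw only counts the data disks, so a 5-disk RAID 5 with 256 kiB chunks means sw=4, not 5. On a hardware controller you’ll have to ask its management tool instead.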

So, with that out of the way, here’s how I run mkfs.xfs on RAID over spinning rust (sas1 being the VG name):

mkfs -t xfs -b size=4096 -s size=512 -d su=256k -d sw=4 -l su=256k /dev/sas1/some-volume

And here’s the same thing for SSDs (again, ssd1 being the VG name):

mkfs -t xfs -b size=4096 -s size=4096 /dev/ssd1/some-volume

Options explained:

  • -b size=4096: Set the file system block size to 4k (the default). This is to prevent any auto-detection magic from doing the wrong thing. (Basically, paranoia. But I’ve had a customer where I had to set this to 512, because one of their databases was an ancient VM that had been configured with a 512 B block size on its NTFS C: drive, which caused the host file system to do read-modify-write all the time. Words cannot adequately express how much that sucked. So it’s good to know this option exists.)

  • -s size=512 or -s size=4096: This sets the sector size. Spinning disks traditionally report 512-byte sectors; SSDs usually have a native sector size of 4 kiB. So it kinda makes sense to adapt this to the kind of media you’re using, just for good measure. (There’s a quick check after this list if you want to see what your device actually reports.)

  • -d su=256k -d sw=4: Hey XFS, just so you know, you’re on a RAID with a 256 kiB chunk size and 4 data disks. (4 data disks times a 256 kiB chunk is exactly the 1 MiB stripe width from the hardware side.) You can verify that mkfs picked this up; see the check after this list.

  • -l su=256k: Please also align the journal (the log, in XFS terms) to those 256 kiB chunks.
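Two quick sanity checks to go with those options, as a rough sketch; the device path, mount point and numbers here are just examples:

# What sector size does the device actually report? (relevant for -s)
blockdev --getpbsz /dev/sda
# 4096

# Did mkfs pick up the RAID geometry? sunit/swidth are reported in 4 kiB blocks,
# so su=256k and sw=4 should show up as sunit=64 and swidth=256.
xfs_info /mnt/some-volume | grep -E 'sunit|swidth'
#          =                       sunit=64     swidth=256 blks
# (output trimmed; the log line shows its own sunit as well)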

Again, if you’re not a hundred percent sure what your RAID layout looks like (e.g. because you’re on AWS), do not specify any su or sw options. Guessing is not allowed, because wrong values make things worse. If you don’t know, stick to the defaults.

And that’s basically all there is to it :)