Storage Performance Howto

Virtualization is no longer a trend; it's simply the reality in nearly every data center, especially newly built ones. And it makes sense, too: these days, computers have such an insane amount of processing power that no single user will ever be able to use it to full capacity. (Check your load average if you don't believe me. Tip: 100% usually corresponds to a load equal to the number of CPU cores you have.)

But if you're running a multitude of virtual machines on a single piece of hardware, that hardware has to be able to take the load. With regard to CPU and RAM, that's easy: just buy enough of 'em. Network bandwidth works the same way: a simple Gigabit connection already gets you quite far, and 10 Gigabit Ethernet is readily available.

Unfortunately, it's not that easy when it comes to storage. Scaling up a single system gets pretty expensive beyond a certain point, and you don't want to build yourself a very expensive single point of failure. Scaling out to a bunch'a systems is fair enough price-wise, but at the end of the day, it always comes down to a single user waiting for a single disk to do stuff, so scaling out alone does nothing to improve the performance that particular user experiences.

So in order to speed that up, we'll go and buy some SSDs, right?

Wrong.

We don't need no SSDs

Granted, 7.2kRPM disks from next door's supermarket might not be as much fun as 10kRPM SAS server drives, but the latter are about all you're going to need, provided you get the storage setup right. Which means:

  • The right number of disks is key. If you do that part right, you'll find yourself in disk alignment heaven; if you mess this up, you're screwed for a lifetime. (There's a small sketch of the underlying arithmetic right after this list.)

  • Don't throw away your RAID controller too quickly just because ZFS is the hot stuff right now. RAID controllers have caches. Caches rule. We want caches.

  • Talk to your RAID in a sensible way. That is, update exactly one chunk per write. Full stripe writes are never gonna happen when you're doing random IO, but we can at least reduce the pain.

  • Make sure the file system on your storage gets a sane workload. That is,

    • no read-modify-write cycles. If your storage does those, you're screwed.

    • use a file system that lays out files in a way that allows requests to be merged.

    • make sure there's enough free RAM available for the Linux file system cache to use, so you don't have to do as many read operations.

    • equip your virtualization hosts with enough RAM so they act as another caching layer themselves.

  • Listen to what the system tells you. If an optimization doesn't improve performance by orders of magnitude, it ain't worth jack, period.

  • If you have a problem, there's always one tiny little parameter that's wrong. (You may very well have multiple problems though.)
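
Here's a minimal sketch of the arithmetic behind the alignment and "one chunk per write" rules above. It's purely illustrative: the chunk size, RAID level, write sizes and offsets are assumptions, so plug in your own layout.

    # Illustrative chunk/stripe arithmetic -- all concrete numbers are assumptions.
    CHUNK      = 256 * 1024          # RAID chunk size in bytes
    DATA_DISKS = 5 - 1               # e.g. RAID 5 over 5 disks -> 4 data disks
    STRIPE     = CHUNK * DATA_DISKS  # full stripe width

    def is_aligned(offset, unit=CHUNK):
        """True if this byte offset starts exactly on a chunk (or stripe) boundary."""
        return offset % unit == 0

    def chunks_touched(offset, size):
        """Number of RAID chunks a write of `size` bytes at `offset` hits.
        More than one means the controller has to update several disks
        for a single logical write."""
        return (offset + size - 1) // CHUNK - offset // CHUNK + 1

    part_offset = 1024 * 1024                       # partition starting at 1 MiB
    print(is_aligned(part_offset, STRIPE))          # True: aligned to the full stripe
    print(chunks_touched(0, 64 * 1024))             # 1: stays within a single chunk
    print(chunks_touched(CHUNK - 4096, 64 * 1024))  # 2: misaligned, spills into the next chunk

If partitions, volumes and file system blocks all start on a chunk boundary and writes don't straddle chunk borders, the RAID gets exactly the sensible, one-chunk-per-write workload the list asks for.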

All these rules have one single goal: reducing your storage latency.

Latency

Reduced storage latency means each step of work takes less time, which means you can do more steps of work in the same time, which means a higher total amount of work done. So, less latency means more IOPS means more throughput.

However, there's more than one way to achieve high throughput. The simplest is to arrange your work so that once the heads are in position, you get a lot done without repositioning them, simply by writing data sequentially. In that scenario, latency doesn't matter as much, because it's only relevant for the first write. So when writing data sequentially, achieving high throughput is relatively easy.
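
To put rough numbers on the latency-IOPS-throughput chain, here's a back-of-the-envelope calculation. The service time and request size are assumed values for a typical 10kRPM drive doing small random writes, not measurements:

    # Purely illustrative: at ~7 ms per random request (seek plus half a
    # rotation), a disk completes about 1 / 0.007 = ~140 IOPS, no matter how
    # fast it can stream data sequentially.
    service_time_s = 0.007     # assumed average seek + rotational latency
    request_size   = 4 * 1024  # assumed random write size (4 KiB)

    iops = 1 / service_time_s
    print("random IOPS:       ~%d" % iops)                                # ~142
    print("random throughput: ~%.2f MB/s" % (iops * request_size / 1e6))  # well under 1 MB/s

Halve the per-request latency and both numbers double; that's the whole game for random IO.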

Thing is: VMs don't write data sequentially. Ever. And even if a single VM did, the other 99 running in parallel and spewing their requests in between will absolutely, positively make sure that your storage never ever gets to do sequential IO. So having measured a high throughput with sequential IO does not mean that running VMs on that storage is gonna be fun.

Go ahead and fire up iostat or perfmon.exe on a system running on SSDs, and take a look at the throughput. I think I can safely predict you won't see more than a megabyte written per second.
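
If you'd rather script that check than eyeball iostat, the Linux kernel exposes the raw counters in /proc/diskstats. Here's a minimal sketch that samples them twice and derives the average write latency (what iostat calls w_await) plus the write throughput; the device name "sda" and the ten-second interval are just placeholders:

    import time

    def read_stats(dev):
        # /proc/diskstats: major minor name, then the per-device IO counters
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == dev:
                    return {
                        "writes":     int(fields[7]),   # writes completed
                        "sectors_w":  int(fields[9]),   # 512-byte sectors written
                        "ms_writing": int(fields[10]),  # time spent writing (ms)
                    }
        raise ValueError("device %s not found" % dev)

    dev, interval = "sda", 10.0
    before = read_stats(dev)
    time.sleep(interval)
    after = read_stats(dev)

    writes = after["writes"] - before["writes"]
    if writes:
        print("avg write latency: %.1f ms"
              % ((after["ms_writing"] - before["ms_writing"]) / writes))
    print("write throughput:  %.2f MB/s"
          % ((after["sectors_w"] - before["sectors_w"]) * 512 / interval / 1e6))

Chances are the throughput figure confirms the prediction above; the latency figure is the one worth watching.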

Reducing latency is all that counts.

Rewards

I know these rules sound somewhat harsh, and during the development of openATTIC, there have been times when following them was tough (especially the one about the orders of magnitude). But from today's perspective, it has definitely been worth it.

I've seen a system with 16 15kRPM 146GB disks crawl to a halt while a box with only 8 10kRPM 900GB drives ran circles around it. The system with fewer and slower disks delivered performance that the system with the much better specs was incapable of. What matters is not the hardware; it's the setup.

Following the rules I outlined above, you'll end up with a bunch'a VMs that, from the user's perspective, feel like real boxen running on SSDs. At the same time, your storage will be happily chewing away at a load of about 30%, and you're gonna be running out of space before that system hits the performance cap.

Eventually, this leads to satisfied users.

Ultimately, this leads to users that start flaming about how slow everything else is.