How to run relevant benchmarks

The energy needed to refute a benchmark is multiple orders of magnitude greater than the energy needed to run it.

Running a benchmark is easy.

Finding out whether you can learn anything from it is hard.

Here's what to do, and much more importantly: What not to do.

So you've bought yourself some hardware and you want to figure out whether it's fast™ or slow™. Since there's an insane number of tools on the net and you don't really know which one to use, you decide to go with the pretty obvious dd test first, and just dump a shitload of data onto them disks to see what'll happen.

dd

The result might look somewhat like this:

dd if=/dev/zero of=/dev/md1 bs=256k count=131072
...
34359738368 bytes (34 GB) copied, 42.8475 s, 802 MB/s

Bam! 802 MB/s sounds like a pretty awesome number, doesn't it? But then you boot up a couple of VMs on it, just to find out they take 5 minutes to boot and run unbearably slowly. This is because you just walked straight into the trap of irrelevant benchmarking.

In your benchmark, you ran one single process that wrote lots of data sequentially. When booting VMs on the thing, you then ran many processes that write a little bit of data randomly. Those two workloads are as different as can be.

In a first attempt at making things better, you may now think "ok, then let's write random data", and re-run the benchmark like this:

dd if=/dev/urandom of=/dev/md1 bs=256k count=1024
...
268435456 bytes (268 MB) copied, 16.6311 s, 16.1 MB/s

Go ahead, do that a couple of times on your own hardware. If you've carefully read the part about measurements, you should be able to run iostat in the background and see what this does to your disks. And just for the fun of it, run htop in another window. See if you notice anything odd.
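For example, start the dd run in one terminal and the monitoring in two others; -x adds the extended statistics, and 1 is the reporting interval in seconds:

iostat -x 1
htop

Keep an eye on the w/s and %util columns for the device you're writing to, and on the CPU meters in htop.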

Ready?

You haven't done it anyway, so I don't have to feel bad about spoiling: If you had run the benchmarks, you would've noticed that the disk finds this load absolutely laughable, while your CPU would've started crunching, because this way you're benchmarking the Linux Random Number Generator instead of your disks. The disks still get to write a sequential data stream; it just consists of data other than zeroes. This "random IO" business is about writing data to random positions, not about writing random data.
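An easy way to see where the bottleneck really is: take the disk out of the equation entirely and dump the random data into /dev/null instead:

dd if=/dev/urandom of=/dev/null bs=256k count=1024

If the random number generator is the limiting factor, the throughput will land in the same ballpark as before, even though no disk is involved at all.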

dd is absolutely unsuitable for testing this kind of workload, because it simply does the wrong thing. But this leaves us with one question:

What is the right thing?

Well, if you want your benchmark to tell you something about what's going to happen under real load conditions, you have to use a benchmark that does exactly the same thing VMs would do. That is:

  • Write data to random positions instead of sequentially. That means your benchmarking tool needs to lseek(), write(), repeat (there's a sketch of such a loop right after this list).

  • Use the correct block size when writing data. In today's world, most filesystems use 4 KiB by default, so that write() system call needs to get exactly 4096 bytes of data.

  • Run it on the same file system you'd use in production, set up in exactly the same way. (Benchmarking a standalone volume is not sufficient when you intend to run production on DRBD.)

  • Use multiple processes simultaneously. A single process will hit a performance cap at some point, but when running more processes, you'll find the cap is pretty much the same for all of them. Try realistic numbers: A VM cluster can easily run hundreds of VMs, so don't just try 4 processes.

  • Use a benchmark that does not write huge blocks of zeroes. If you intend to play with features like ZFS's deduplication, writing blocks of zeroes would cause ZFS to just write the block one single time, but that doesn't happen with real-world data. So, use a benchmark tool that writes non-zero data without benchmarking /dev/urandom.
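To make that list a bit more tangible, here's a minimal sketch in C of what such a tool boils down to. This is not distmark, just an illustration of the points above; the test directory /srv/test, the process count, the byte pattern and the amount of data written are placeholders you'd adapt to your own setup, and note that this version has no throttling whatsoever:

/*
 * Minimal sketch, not distmark: a number of processes writing 4 KiB
 * blocks of non-zero data to random positions in their own test file.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLOCK  4096                   /* one 4 KiB block per write()          */
#define PROCS  64                     /* number of writer processes           */
#define FILESZ (1024L * 1024 * 1024)  /* spread writes over 1 GiB per process */
#define WRITES 100000                 /* number of writes per process         */

int main(void)
{
    for (int p = 0; p < PROCS; p++) {
        if (fork() == 0) {
            char path[64], buf[BLOCK];

            /* /srv/test is a placeholder for the file system under test */
            snprintf(path, sizeof(path), "/srv/test/bench.%d", p);
            memset(buf, 0xA5, sizeof(buf)); /* non-zero data, so dedup can't cheat */

            int fd = open(path, O_CREAT | O_WRONLY, 0600);
            if (fd < 0)
                _exit(1);
            srand(getpid());
            for (int i = 0; i < WRITES; i++) {
                /* pick a random block-aligned position: lseek(), write(), repeat */
                off_t pos = (off_t)(rand() % (FILESZ / BLOCK)) * BLOCK;
                lseek(fd, pos, SEEK_SET);
                write(fd, buf, BLOCK);
            }
            close(fd);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;  /* wait for all writer processes to finish */
    return 0;
}

Compile it with gcc, point it at the file system under test, and watch iostat while it runs. Because it writes as fast as it possibly can, it also runs head-first into the problem described below.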

In the past, I've had some success running multiple instances of Postmark in parallel. Postmark simulates a busy mail server by creating, writing to, and deleting lots of small files, generating lots of random IO in the process. Unfortunately, Postmark simply does as much as possible, driving the %util column in iostat straight to 100%.
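Running the instances in parallel can be as simple as a shell loop. This is just a sketch: it assumes you've prepared one Postmark command file per instance (pmrc.1 through pmrc.8 here), each of which points its "set location" at a separate directory on the file system under test:

for i in $(seq 1 8); do
    postmark pmrc.$i &
done
wait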

By itself, that wouldn't be too bad — isn't that actually our goal, knowing what the system is capable of?

The problem is that an I/O subsystem near saturation behaves completely differently from one under normal load. Most notably, latency skyrockets by multiple orders of magnitude. A system that normally has a latency as low as 0.06 ms can easily grind to a halt when maxed out, reporting latencies of 50 ms or more. That's an 800-fold increase! Needless to say, bad things™ will ensue.

So, we'll have to add another bullet point to the list:

  • Never ever max out the system during a benchmark.

In order to get a feeling for the maximum load a system will be able to take, see what the latency is when running at about 30% utilization. Then slowly increase the workload, and see how far you can go before the latency becomes unacceptable.
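Latency is easy to keep an eye on with the same iostat we've been using all along: the await column (r_await and w_await on newer versions of sysstat) shows the average time requests spend waiting, in milliseconds.

iostat -x 5

Watch those columns between steps while you ramp the load up.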

Distmark

Since I didn't find a tool that satisfies all those criteria (maybe fio does; I couldn't figure that out because, imho, fio is a mess to configure), I hacked up a little tool of my own named distmark that does exactly this and nothing else. When running it, you pass in a directory on the file system you wish to test, the number of processes to run, and the maximum number of IOPS the processes combined are allowed to generate. Once per second, distmark prints the number of IOPS each child process has achieved. Use iostat to get more info.
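Purely to illustrate those three parameters, an invocation would look something like this; the argument order shown here is an assumption, so check distmark's own usage output for the real syntax:

distmark /srv/test 64 5000    # hypothetical syntax: test directory, processes, IOPS cap

That would mean: benchmark the file system under /srv/test with 64 writer processes, capped at 5000 IOPS in total.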