Measuring storage performance

The first step in performance tuning is measuring. That sounds easy enough at first, but it's actually a really hard thing to do correctly. This is partly because monitoring tools show irrelevant metrics and hide the important ones, and partly because we're accustomed to looking at the wrong things.

One of the most helpful tools when it comes to storage performance is iostat. If you run it, it'll print statistics like this:

root@vh002:~# iostat /dev/sdc
Linux 3.2.0-52-generic (vh002)        11/26/2015      _x86_64_        (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.07    0.00    2.45    0.94    0.00   93.54

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdc              31.25      1498.85       368.02 54548383024 13393710071

Looks fine, dunnit? Unfortunately, no:

  1. The CPU indicator is next to useless. Modern boxes have 24, 32 or even 64 CPU cores, so an average over all of those cores is meaningless. Even firing up htop doesn't tell you the full story: in its default configuration, it doesn't display iowait, which is the one CPU metric we're actually looking for here.

    If you want to debug your storage subsystem, get storage stats. So, use iostat -d.

  2. Those disk stats at the bottom are averages taken since the system booted. During boot, the workload is completely different, so these values don't give you any indication of what your system is doing right now. They tell you nothing about load peaks or about what happens when you start or stop certain services. For that, you need continuously-updated statistics that run for a while.

    iostat can give you just that, if you pass in an interval as a parameter. But just what kind of interval do we want to use? A short interval shows peaks but hides the general performance. A long interval shows general performance, but hides peaks.

    Well, the solution I prefer is running tmux, splitting the window into two panes, and running two instances of iostat in parallel: one with a ten-second interval, one with a one-second interval. This way, I can see both how big the variation is and how well or badly the system is coping in general. (High peaks are not an issue if there is enough time with lower load afterwards.)

  3. iostat defaults to printing its numbers in KiB. On modern systems, this is kinda ridiculous because throughputs have vastly increased. With the -m option, iostat can be made to use MiB instead, leading to numbers that are much more readable.

  4. By default, iostat only shows throughput and a combined IOPS indicator. Unfortunately, IOPS are meaningless when displayed on their own, and throughput is usually completely irrelevant, at least in the way we usually think about it. Check out the part about latency in my first blog post, where I explained this a bit.

So the command that I use boils down to this:

root@vh002:~# iostat -xdm /dev/sd? 1
Linux 3.2.0-52-generic (vh002)        11/26/2015      _x86_64_        (32 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda              66.94    49.28   10.12   14.75     0.32     0.56    72.56     0.02    0.91    1.75    0.33   1.34   3.33
sdb             126.50    32.71   30.72    7.87     0.63     0.16    41.61     0.01    3.42    3.33    3.79   2.13   8.22
sdc             347.13    81.28   20.89   10.37     1.46     0.36   119.47     0.05    5.22    3.61    8.47   1.93   6.02

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    11.00    0.00   25.00     0.00     0.14    11.52     0.06    2.40    0.00    2.40   2.24   5.60
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00   689.00    0.00  109.00     0.00     3.12    58.57     2.81   25.76    0.00   25.76   3.38  36.80

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     1.00    0.00    3.00     0.00     0.02    10.67     0.00    1.33    0.00    1.33   1.33   0.40
sdb               0.00     0.00    0.00    3.00     0.00     0.01     8.00     0.01    4.00    0.00    4.00   4.00   1.20
sdc             368.00  1501.00   74.00  386.00     5.27     7.86    58.49    22.56   44.63   65.78   40.57   1.49  68.40

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     7.00    0.00    3.00     0.00     0.04    26.67     0.00    1.33    0.00    1.33   1.33   0.40
sdb               0.00     9.00    0.00    3.00     0.00     0.05    32.00     0.01    2.67    0.00    2.67   2.67   0.80
sdc             871.00  1427.00   13.00  134.00     0.79     5.59    88.87     6.55   24.05   33.54   23.13   4.00  58.80

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     7.00    0.00   10.00     0.00     0.07    13.60     0.02    2.00    0.00    2.00   2.00   2.00
sdb               3.00     0.00    2.00    1.00     0.02     0.00    16.00     0.01    4.00    6.00    0.00   4.00   1.20
sdc            1784.00  1604.00   66.00  543.00     6.88     8.43    51.48    33.90   63.54   31.21   67.47   1.48  90.00

This command fixes the problems outlined above: -d drops the CPU statistics, -m switches the output to MB/s, and -x splits reads and writes into separate columns and adds the %util column, which gives an indication of "how bad it is". Without that last piece of information, all the other numbers are completely useless.

Reading iostat: %util

So now that we have all this information, what do we do with it?

First of all, compare the data in the first block of output to the ones following it. The difference is most apparent in the %util column for /dev/sdc: the first block indicates everything's fine, whereas the others don't look as shiny. This is because the first block always displays a summary of everything since the system booted. That is useless, because you need to get a grasp of what's currently going on.

The following blocks are each sampled since the previous one was printed, so they only cover the last one-second interval. This is why they paint a completely different picture, and this information is far more relevant. However, you can see that there is quite some variation, with %util jumping from 36% up to 68% and back down to 58%. See that 90% bump at the end? Even this might be okay: there might just have been a moment where lots of data came in to be processed, giving the storage a huge amount of work to do. Whether or not this is bad depends on how long it stays that way.

This is why I always run two instances of iostat in a tmux session, and I usually have some monitoring system in the background that samples data at even larger intervals. However, there's usually not much difference between ten seconds and five minutes anyway.
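
In case you want to replicate that setup without fiddling around manually, a minimal sketch looks like this (the session name is made up, and I'm assuming the same iostat invocation as above plus a default tmux configuration):

# one session, two stacked panes: a ten-second iostat on top, a one-second one below
tmux new-session -d -s storage 'iostat -xdm /dev/sd? 10'
tmux split-window -v -t storage 'iostat -xdm /dev/sd? 1'
tmux attach -t storage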

But what about the other columns?

Reading iostat: Throughput

When looking at the extended iostat output for the first time, the rMB/s and wMB/s columns look the most familiar. But check out those numbers: 3.12? 5.59? And when running at 90% utilization, the throughput is only 8.43 MB/s? That doesn't make sense: disks usually manage over 100 MB/s, so how can this stupid thing be maxed out at friggin' eight MB/s? Clearly, this must be wrong.

Unfortunately, it is utterly correct. This is because what you are seeing as throughput when copying movies in Windows Explorer is achieved in a completely different fashion than the throughput you see here.

Have you ever done throughput testing by copying a file in Windows Explorer? What kind of file did you use? How did you get Explorer to display the progress window that shows the stats in the first place? You probably chose a movie, a disk image or some other ridiculously large file and watched the numbers in that window.

In the background, Windows repeatedly reads data from one place, stores it in a buffer in RAM, and then writes it to the other disk. The size of this buffer ultimately limits the amount of data that can be copied in one single step. Since RAM is usually not an issue these days, we can safely assume the buffer to be a megabyte or more. So Windows goes "hey source disk, go to position 1857191753 and read one megabyte back to me". The disk moves its platters and heads into the correct position and then starts streaming the data. After the data has landed in RAM, Windows does the same thing on the target disk. When the process repeats, the next read request will most likely start at position 1858240329 (= 1857191753 + 1024²), so the disk may even think ahead and read that part into a buffer of its own before Windows even asks for it. The disk has virtually no seeking to do; it can just fire away with data until the bus can't take it anymore. This is how high throughput is achieved when copying large files.
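
If you want to watch that streaming pattern in isolation on a Linux box, something like the following does the read half of it (the file path is obviously made up):

# read a large file in 1 MiB chunks and throw the data away:
# pure sequential streaming, no seeking, no writing
dd if=/path/to/some/large/file of=/dev/null bs=1M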

Now go check out the column titled avgrq-sz. This denotes the average request size (the "and read one megabyte back to me" part). Its unit is sectors of 512 bytes, or half a KiB, so dividing the number by two gives you the request size in KiB. See how the biggest value in the live intervals is about 88 sectors, or 44 KiB? The disk never gets the chance to go into the cool "no seeking, just fire away" mode. The operating system wants a few KB from here, then a few from there, then it wants to write an even tinier bit somewhere completely different. This is pretty much like running errands: the problem is not that you'd have much work to do once you get to the ATM, the bakery and the convenience store. What makes it tedious is that you have so many different places to go, and going there takes time.
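
If you don't feel like doing the division in your head, the conversion is trivial (88 is simply the value from the sample output above):

# avgrq-sz is given in 512-byte sectors; two sectors make one KiB
echo $(( 88 * 512 / 1024 ))    # prints 44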

Actually, for virtualization storage systems, 44KiB is already quite a big request because the requests that the VMs are sending are usually 4KiB (yes, four) in size. This is because the file system inside the VM usually organizes its storage in 4KiB blocks. (I say "usually" because I have already seen systems running on a block size of 512 Bytes.)

Under those circumstances, achieving a throughput of more than a few MB/s is a very, very hard thing to do.
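
If you'd rather see this effect on your own hardware than take my word for it, a quick fio run makes the contrast obvious. This is only a sketch: the test file path and size are made up, and you obviously shouldn't drop test files onto a production datastore:

# large sequential reads -- the "copying a movie in Explorer" case
fio --name=seq-read --filename=/var/tmp/fio-test --size=2G \
    --rw=read --bs=1M --direct=1 --runtime=30 --time_based

# small random reads -- roughly what a bunch of VMs looks like to the storage
fio --name=rand-read --filename=/var/tmp/fio-test --size=2G \
    --rw=randread --bs=4k --direct=1 --runtime=30 --time_based

Watch your one-second iostat while these run: the first job pushes rMB/s towards what the disk's data sheet promises, while the second one hovers at a few MB/s as r/s and %util shoot up.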

So then what do we look at?

Reading iostat: IOPS

Take a look at the r/s and w/s columns. Those denote the number of errands your disks had to run per second in the last interval for reading and writing data. Also keep an eye on the %util column to get an idea of how much a certain workload is stressing out your system. This is different for every box, so you'll have to keep looking at those numbers for a while to get a feeling for it. Also do the split tmux trick in order to get a grasp of load spikes and how they affect the system. Do they also appear in the ten-second interval iostat? How long does the system need to recover from them? Are there intervals where iostat shows 100% %util and zero reads or writes, meaning the system is actually overloaded? What happens after those "holes"? Do they happen regularly? Do they affect the ten-second iostat?

This takes time. Be patient. Compare many different systems with different setups. Compare VMs to their hosts. Try to determine where the workload comes from. How many reads are there compared to writes? Should there be any reads at all, or is your system currently only gathering data and writing that to disk? If there are reads that are not explicable from the workload, where else do they come from?

There's no telling what "good" or "bad" IOPS look like, because this depends way too much on your specific hardware setup and virtual machine workload. You'll need to compare as many systems as you can and try to find out which patterns emerge. This takes time (took me about a year), but unfortunately, that's the way it is.

Reading iostat: Latency

Take a look at the r_await and w_await columns. While the r/s and w/s columns indicate how many errands your disk had to run, the r_await and w_await columns tell you how long each of those errands took on average. Check out the difference between sda/sdb and sdc in the output: an errand run on sda/sdb usually completes in a millisecond or two, while one run on sdc takes somewhere between 25 and 70 milliseconds.

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     1.00    0.00    3.00     0.00     0.02    10.67     0.00    1.33    0.00    1.33   1.33   0.40
sdb               0.00     0.00    0.00    3.00     0.00     0.01     8.00     0.01    4.00    0.00    4.00   4.00   1.20
sdc             368.00  1501.00   74.00  386.00     5.27     7.86    58.49    22.56   44.63   65.78   40.57   1.49  68.40

There can be multiple reasons for this:

  1. If caching is disabled, this is the difference between 10k SAS and 7.2k SATA disks. Enable the cache (so, buy a BBU or CacheVault module) and/or use faster disks. Sorry dude, this is gonna cost some money.

  2. When the load nears saturation, latency goes up. So if the load actually comes from you running a benchmark, find one that allows you to limit the IOPS it produces to a sane amount (30% is usually good). Otherwise, the IOPS numbers that the benchmark spits out are completely meaningless, because that's what your system can do in firefighting crunch mode. You don't want your system to be in firefighting crunch mode unless there's a fire, so if you scale your system like that, you fully deserve what's going to happen to you.
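
fio, for example, can do that via its rate_iops option. Again, this is just a sketch with made-up numbers and paths; pick a limit that matches what you consider a sane everyday load for your box:

# random 4k writes, capped at 500 IOPS, so latency is measured under a
# normal load instead of under full saturation
fio --name=limited --filename=/var/tmp/fio-test --size=2G \
    --rw=randwrite --bs=4k --direct=1 --rate_iops=500 \
    --runtime=60 --time_based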

Reading iostat: The others

You can mostly ignore the other columns. The service time (svctm) column will go away soon and is far less relevant than the *_await columns. The await column is kinda redundant, so I only ever look at the other two. The first two columns (rrqm/s and wrqm/s) count merged requests; I'm not sure what those are actually about. The queue size (avgqu-sz) can give you a hint about whether or not the system is overloaded, but unless something's really wrong, it shouldn't be too interesting.

A word on throughput again

Keeping everything I said in mind, we can now start taking a look at throughput again. We'll just use a somewhat different terminology.

When happily chewing away at, say, 5000 IO operations per second, each with a request size of 40KiB (80 Sectors), the storage system would write data at a rate of 190 MB/s. This is a realistic value under the workload imposed by a cluster of virtual machines.

However, a value in MB/s practically begs to be compared to sequential IO: the thing has 24 disks in it, so why the hell is it only twice as fast as the one in my stupid desktop machine?

I found it very helpful not to talk about MB/s when it comes to the throughput of such storage systems. Instead, I like talking about terabytes per day, TB/d. What would be perceived as a slow-ass system had we stated the throughput in MB/s suddenly becomes a friggin' rocket if you tell people that it processes 16 terabytes of data every day! At the end of the day it is the same number (just multiplied by the 86400 seconds in a day and converted from MB to TB), but since people intuitively compare this to Windows Explorer talking about megabytes instead of terabytes, the way they perceive the performance completely changes.
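
For the curious, here's the arithmetic behind those numbers as a quick sanity check (I'm sticking to binary units here, which lands in the same ballpark as the figures above):

# turn IOPS x request size into MiB/s and TiB per day
awk 'BEGIN {
    iops = 5000; req_kib = 40                   # 40 KiB = 80 sectors
    mib_s   = iops * req_kib / 1024             # ~195 MiB/s
    tib_day = mib_s * 86400 / (1024 * 1024)     # ~16 TiB per day
    printf "%.0f MiB/s is about %.1f TiB per day\n", mib_s, tib_day
}'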

iostat.sh

I've written a script called iostat.sh (wget link) that makes handling iostat a bit easier. It hides the less relevant columns, colors values according to what I've experienced as "generally good" and "generally bad", and gives you the "written TB/d" value. Apart from that, it works in exactly the same way iostat itself does (after all, it just reformats the output).