Resilvering a ZFSonLinux disk

The other day, a disk died in our ZFSonLinux NAS. So we ordered a new one and installed the replacement today. The installation was pretty straightforward: remove the old disk, put in the new one, zpool replace the damn thing, and voilà, here goes the resilvering.
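
For the record, the software side of that swap boils down to a single command. The pool name below is a placeholder, not our actual pool; since the new disk sits in the dead one's slot, zpool replace only needs the pool and the device:

$ zpool replace tank sdf

And according to zpool status, the resilver is just gonna take about 58 hours, or two and a half days, for a 500 GB disk: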

scan: resilver in progress since Thu Jun 18 14:25:34 2015
    124G scanned out of 1,35T at 6,10M/s, 58h39m to go
        14,3G resilvered, 9,01% done

Dafuq.

Taking a closer look at iostat, we get statistics like these:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00  145,00    0,00     1,21     0,00    17,12     0,06    0,39    0,39    0,00   0,33   4,80
sdb               0,00     0,00  130,00    0,00     1,29     0,00    20,26     0,09    0,68    0,68    0,00   0,43   5,60
sdc               0,00     0,00  145,00    0,00     1,17     0,00    16,46     0,05    0,36    0,36    0,00   0,30   4,40
sde               0,00     0,00  147,00    0,00     1,21     0,00    16,93     0,05    0,33    0,33    0,00   0,30   4,40
sdd               0,00     0,00  131,00    0,00     1,20     0,00    18,69     0,08    0,61    0,61    0,00   0,34   4,40
sdh               0,00     0,00  145,00    0,00     1,22     0,00    17,21     0,06    0,41    0,41    0,00   0,33   4,80
sdg               1,00     0,00  123,00    0,00     1,24     0,00    20,70     0,05    0,42    0,42    0,00   0,39   4,80
sdf               0,00     0,00    0,00  135,00     0,00     0,81    12,28     1,99    9,07    0,00    9,07   7,41 100,00
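
For reference, that output comes from something like the following invocation (the interval is from memory; the comma decimals are just our locale). -x gives the extended statistics, -m scales throughput to megabytes:

$ iostat -xm 5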

It appears seven of our eight disks are doing next to nothing, while one gets shot into orbit with IO: precisely the disk being resilvered. Let's take a closer look at sdf:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdf               0,00     0,00    0,20  139,20     0,00     0,80    11,83     1,82   19,51 9356,00    6,09   7,17 100,00
sdf               0,00     0,00    0,20  138,60     0,00     0,82    12,12     1,99   20,38 8932,00    7,52   7,20 100,00
sdf               0,00     0,00    0,00  111,60     0,00     0,43     7,87     1,90    7,13    0,00    7,13   8,96 100,00
sdf               0,00     0,00    0,20  185,60     0,00     0,84     9,30     1,76   15,09 9012,00    5,39   5,37  99,84
sdf               0,00     0,00    0,00  159,40     0,00     0,73     9,36     1,99    6,10    0,00    6,10   6,27 100,00
sdf               0,00     0,00    0,60  131,20     0,00     0,72    11,28     1,89   21,18 3216,00    6,57   7,59 100,00
sdf               0,00     0,00    0,00  122,00     0,00     0,61    10,31     1,99    8,31    0,00    8,31   8,20 100,00
sdf               0,00     0,00    0,20  128,20     0,00     0,50     8,02     1,90   21,30 9512,00    6,49   7,79 100,00
sdf               0,00     0,00    0,00  214,00     0,00     0,89     8,49     2,00    3,94    0,00    3,94   4,68 100,08
sdf               0,00     0,00    0,20  157,20     0,00     0,67     8,66     1,89   15,76 7844,00    5,80   6,35 100,00
sdf               0,00     0,00    0,20  238,60     0,00     1,02     8,79     1,88   11,55 8340,00    4,57   4,19 100,00
sdf               0,00     0,00    0,00   96,80     0,00     0,62    13,08     2,00    9,82    0,00    9,82  10,33 100,00
sdf               0,00     0,00    0,20   99,20     0,00     0,59    12,11     1,69   25,33 9740,00    5,74  10,06 100,00

If you've done your fair share of performance tuning for storage systems, you'll first take a look at the w/s column, be mildly alarmed, then check out the average request size and drop dead: ZFS apparently resilvers using completely random IO.
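
To put numbers on that: iostat reports avgrq-sz in 512-byte sectors, so a quick sanity check using the sdf row from the first snapshot confirms the tiny-random-write picture:

$ echo "12.28 * 512 / 1024" | bc -l    # KB per write request: ~6.1
$ echo "135 * 6.14 / 1024" | bc -l     # 135 w/s of ~6 KB each: ~0.8 MB/s, matching wMB/s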

This is, like, a bad thing. Check out the wMB/s column to see what I mean. The one big advantage of resilvering over a classic block-level RAID rebuild is that we only need to rebuild the blocks that actually contain data, but because ZFS uses random IO to do it, we can't get any meaningful bandwidth out of the disk. Even if the disk contains only around 200GB of data, resilvering those 200GB at 0.72MB/s takes 79 hours, whereas rebuilding the full 500GB sequentially at 100MB/s would take only an hour and a half.
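
The back-of-the-envelope math behind those two numbers:

$ echo "200 * 1024 / 0.72 / 3600" | bc -l    # 200 GB at 0.72 MB/s: ~79 hours
$ echo "500 * 1024 / 100 / 3600" | bc -l     # 500 GB at 100 MB/s: ~1.4 hours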

Apparently, Oracle's ZFS introduced Sequential Resilvering to fix exactly this, but from what I can tell, it won't make its way into ZFSonLinux anytime soon, because Oracle's ZFS releases are closed source. So it looks like I can only wait for ZFS to finish this stupidity, and hope the disk doesn't die in the process.

Update

scan: resilvered 171G in 8h23m with 0 errors on Thu Jun 18 22:48:38 2015

That's an average throughput of around 5.8MB/s.
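
Checking the math, 171 GiB over 8 hours and 23 minutes:

$ echo "171 * 1024 / (8 * 3600 + 23 * 60)" | bc -l    # ~5.8 MB/s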

I had been expecting worse, but still, that's not very impressive.