Resilvering a ZFSonLinux disk
The other day, a disk died in our ZFSonLinux NAS. So we ordered a new one and installed the replacement today.
The installation was pretty straightforward: remove the old disk, put in the new one, zpool replace the damn thing, and voilà, off goes the resilvering.
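For the record, the whole dance amounts to something along these lines; the pool name and device paths below are placeholders, not our actual layout:

  # swap the dead device for the freshly installed one
  # (pool name and device paths are placeholders)
  zpool replace tank /dev/disk/by-id/ata-OLD_DEAD_DISK /dev/disk/by-id/ata-NEW_DISK

  # check that the resilver has kicked off and what ETA ZFS claims
  zpool status tank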
And it says it's just gonna take about 58 hours, or two and a half days, for a 500 GB disk:
scan: resilver in progress since Thu Jun 18 14:25:34 2015
    124G scanned out of 1,35T at 6,10M/s, 58h39m to go
    14,3G resilvered, 9,01% done
Dafuq.
Taking a closer look at iostat, we get statistics like these:
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0,00     0,00  145,00    0,00     1,21     0,00    17,12     0,06    0,39    0,39    0,00   0,33   4,80
sdb               0,00     0,00  130,00    0,00     1,29     0,00    20,26     0,09    0,68    0,68    0,00   0,43   5,60
sdc               0,00     0,00  145,00    0,00     1,17     0,00    16,46     0,05    0,36    0,36    0,00   0,30   4,40
sde               0,00     0,00  147,00    0,00     1,21     0,00    16,93     0,05    0,33    0,33    0,00   0,30   4,40
sdd               0,00     0,00  131,00    0,00     1,20     0,00    18,69     0,08    0,61    0,61    0,00   0,34   4,40
sdh               0,00     0,00  145,00    0,00     1,22     0,00    17,21     0,06    0,41    0,41    0,00   0,33   4,80
sdg               1,00     0,00  123,00    0,00     1,24     0,00    20,70     0,05    0,42    0,42    0,00   0,39   4,80
sdf               0,00     0,00    0,00  135,00     0,00     0,81    12,28     1,99    9,07    0,00    9,07   7,41 100,00
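(I didn't copy the exact invocation, but these are the columns of extended iostat output in megabytes; it was something along these lines, with the flags and interval being my best guess:)

  # extended per-device statistics (-x) in MB (-m), one sample per second
  iostat -xm 1
  # or, to watch a single disk:
  iostat -xm sdf 1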
It appears seven of our eight disks are doing next to nothing, while one gets shot into orbit with IO, namely precisely the one disk being resilvered. Let's take a closer look at sdf:
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdf               0,00     0,00    0,20  139,20     0,00     0,80    11,83     1,82   19,51 9356,00    6,09   7,17 100,00
sdf               0,00     0,00    0,20  138,60     0,00     0,82    12,12     1,99   20,38 8932,00    7,52   7,20 100,00
sdf               0,00     0,00    0,00  111,60     0,00     0,43     7,87     1,90    7,13    0,00    7,13   8,96 100,00
sdf               0,00     0,00    0,20  185,60     0,00     0,84     9,30     1,76   15,09 9012,00    5,39   5,37  99,84
sdf               0,00     0,00    0,00  159,40     0,00     0,73     9,36     1,99    6,10    0,00    6,10   6,27 100,00
sdf               0,00     0,00    0,60  131,20     0,00     0,72    11,28     1,89   21,18 3216,00    6,57   7,59 100,00
sdf               0,00     0,00    0,00  122,00     0,00     0,61    10,31     1,99    8,31    0,00    8,31   8,20 100,00
sdf               0,00     0,00    0,20  128,20     0,00     0,50     8,02     1,90   21,30 9512,00    6,49   7,79 100,00
sdf               0,00     0,00    0,00  214,00     0,00     0,89     8,49     2,00    3,94    0,00    3,94   4,68 100,08
sdf               0,00     0,00    0,20  157,20     0,00     0,67     8,66     1,89   15,76 7844,00    5,80   6,35 100,00
sdf               0,00     0,00    0,20  238,60     0,00     1,02     8,79     1,88   11,55 8340,00    4,57   4,19 100,00
sdf               0,00     0,00    0,00   96,80     0,00     0,62    13,08     2,00    9,82    0,00    9,82  10,33 100,00
sdf               0,00     0,00    0,20   99,20     0,00     0,59    12,11     1,69   25,33 9740,00    5,74  10,06 100,00
If you've done your fair share of performance tuning for storage systems, you'll first take a look at the w/s column, be mildly alarmed, then check out the average request size and drop dead: ZFS apparently resilvers using completely random IO.
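To put a number on "completely random": divide the write bandwidth by the write IOPS from the samples above and you get the average request size. This is just the arithmetic spelled out, nothing ZFS-specific:

  # average write size on sdf, using the first sample above: wMB/s divided by w/s
  echo "0.80 / 139.2 * 1024" | bc -l    # ≈ 5.9 KB per write request
  # avgrq-sz says the same thing in 512-byte sectors: 11.83 * 512 B ≈ 6 KB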
This is, like, a bad thing. Check out the wMB/s column to see what I mean. The one big advantage of resilvering over a conventional RAID rebuild is that we only need to rebuild the blocks that actually contain data, but because ZFS uses random IO for that, we don't get any noticeable bandwidth out of it. Even though the disk holds only around 200 GB of data, resilvering those 200 GB at 0.72 MB/s takes 79 hours, whereas rebuilding the full 500 GB at 100 MB/s would take only about an hour and a half.
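The back-of-the-envelope math behind those numbers, treating a GB as 1024 MB:

  # ~200 GB of actual data at the ~0.72 MB/s we are seeing
  echo "200 * 1024 / 0.72 / 3600" | bc -l   # ≈ 79 hours
  # a dumb full rebuild of all 500 GB at a sequential ~100 MB/s
  echo "500 * 1024 / 100 / 3600" | bc -l    # ≈ 1.4 hours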
Apparently, ZFS introduced sequential resilvering to fix exactly this, but from what I can tell, it won't make its way into ZFSonLinux anytime soon, because the ZFS versions that have it are closed source. So it looks like all I can do is wait for ZFS to finish this stupidity and hope the disk doesn't die in the process.
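In the meantime I'll just keep an eye on it every now and then, with something trivial like this (the pool name is a placeholder again):

  # re-check the resilver progress every five minutes
  watch -n 300 'zpool status tank'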
Update
scan: resilvered 171G in 8h23m with 0 errors on Thu Jun 18 22:48:38 2015
That's an average throughput of roughly 5.8 MB/s (171 GB in a bit under eight and a half hours).
I had been expecting worse, but still, that's not very impressive.