Locating dying disks in LSI RAID using StorCLI

I often find myself in need of locating disks in an LSI RAID that are not quite dead yet, but in the process of dying. Google knows how to do that using MegaCli, but I totally hate that tool and want to do the same thing using storcli instead, which is a bit less insane. Here's how.

For a quick glance at which disk might be causing trouble, try:

root@psql-n1:~# storcli /c0/eALL/sALL show all | grep -e 'State :' -e "Predictive Failure Count"
Drive /c0/e8/s0 State :
Predictive Failure Count = 0
Drive /c0/e8/s1 State :
Predictive Failure Count = 0
Drive /c0/e8/s2 State :
Predictive Failure Count = 2
Drive /c0/e8/s3 State :
Predictive Failure Count = 0
Drive /c0/e8/s4 State :
Predictive Failure Count = 0

For healthy disks, this count should be 0, so in this example /c0/e8/s2 is broken. Notably though, it's not yet broken enough to actually be thrown out of the array:

root@psql-n1:~# storcli /c0/e8/s2 show
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.


Drive Information :
=================

-------------------------------------------------------------------------
EID:Slt DID State DG       Size Intf Med SED PI SeSz Model            Sp
-------------------------------------------------------------------------
8:2      11 Onln   1 558.406 GB SAS  HDD N   N  512B HUS156060VLS600  U
-------------------------------------------------------------------------

The state is still online (Onln) and the array is not yet degraded; the disk is only throwing random IO errors.
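
The grep output above still has to be scanned by eye. Here's a rough sketch that prints only the drives with a non-zero Predictive Failure Count. The function name suspect_drives is made up for illustration, and it assumes the output looks like the sample above (a "Drive /cX/eY/sZ ..." header line somewhere before each counter line):

```shell
#!/bin/sh
# Print "<drive> <count>" for every drive whose Predictive Failure Count
# is non-zero. Reads `storcli ... show all` output on stdin, so it can be
# tried out on saved output without a controller at hand.
# NOTE: suspect_drives is a hypothetical helper, not part of storcli.
suspect_drives() {
    awk '
        # Header line such as "Drive /c0/e8/s2 State :": remember the path.
        $1 == "Drive" && index($2, "/c") == 1 { drive = $2 }
        # Counter line such as "Predictive Failure Count = 2".
        /Predictive Failure Count/ {
            count = $NF
            if (count + 0 > 0) print drive, count
        }
    '
}

# Typical use (assumes storcli is on PATH and the controller is /c0):
#   storcli /c0/eALL/sALL show all | suspect_drives
```

Fed the output shown earlier, this should leave just the one line for /c0/e8/s2.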

Critical Disks

MegaCli has a nice way of getting a quick overview, using the MegaCli -AdpAllInfo -aAll command:

root@psql-n1:~# megacli64 -AdpAllInfo -aALL | less

[... lots'a stuff ...]

                Device Present
                ================
Virtual Drives    : 2
  Degraded        : 0
  Offline         : 0
Physical Devices  : 12
  Disks           : 10
  Critical Disks  : 1
  Failed Disks    : 0

storcli seems to be able to also output this information:

root@psql-n1:~# grep 'Critical Disks' /usr/local/sbin/storcli
Binary file /usr/local/sbin/storcli matches

Unfortunately I haven't yet managed to find out the incantation necessary to make it actually spit out that info, and the manual is less than helpful in that regard too. :(

Locating the disk

Fire this command:

root@psql-n1:~# storcli /c0/e8/s2 start locate

In my setup, this will make the activity LED blink really fast. This is of course less than helpful if your box is currently under load, because then the activity LED will blink really fast anyway. Probably this is why sane RAID controllers blink the failure LED for locating disks, but hey, this is LSI. So, better make sure your server isn't doing anything other than locating the fucking disk.
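
As far as I can tell, the LED keeps blinking until it's told to stop, and storcli has a matching stop locate for that. A minimal sketch that blinks a drive for a limited time and then switches the LED off again; the blink_drive function and the STORCLI variable are made up for illustration:

```shell
#!/bin/sh
# Blink the locate LED of one drive for a while, then turn it off again.
# STORCLI can be overridden for a dry run, e.g. STORCLI=echo.
# NOTE: blink_drive and STORCLI are hypothetical names, not part of storcli.
STORCLI=${STORCLI:-storcli}

blink_drive() {
    drive=$1            # e.g. /c0/e8/s2
    seconds=${2:-300}   # how long to keep the LED blinking
    "$STORCLI" "$drive" start locate || return 1
    sleep "$seconds"
    "$STORCLI" "$drive" stop locate
}

# Example: blink /c0/e8/s2 for ten minutes while walking to the rack.
#   blink_drive /c0/e8/s2 600
```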

Update

A kind soul wrote me that there's a setting to control this behavior:

storcli /cALL set activityforlocate=off

It seems to be off by default though, and this behavior only occurs on our old hardware anyway. We bought new boxen in the meantime, and those actually blink the failure LED as they should. So, in conclusion: meh.

Thanks, Evan!

Replacing it

After locating it, I just pulled the disk out, put a replacement in, and found the RAID controller had already started rebuilding on its own. Since the replacement disk had been in use before, I'd expected a foreign configuration to show up. But:

root@psql-n1:~# storcli /c0/fALL show
Controller = 0
Status = Success
Description = Couldn't find any foreign Configuration

All right then I guess, maybe we picked a previous hot spare that had in fact not yet been in use anywhere. But dear future me, if you ever happen to need it, here's a link I found that may come in handy: https://wiki.nikhef.nl/grid/Managing_RAID_Controllers.
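
To keep an eye on the rebuild that kicked off by itself, storcli can report progress per drive via show rebuild. Here's a rough polling sketch. The wait_for_rebuild function and the STORCLI variable are made up for illustration, and the "In progress" status text it greps for is an assumption on my part, so check what your storcli version actually prints:

```shell
#!/bin/sh
# Poll the rebuild status of one drive until storcli no longer reports it
# as in progress.
# NOTE: wait_for_rebuild and STORCLI are hypothetical names, and the
# "In progress" status wording is an assumption about storcli's output.
STORCLI=${STORCLI:-storcli}

wait_for_rebuild() {
    drive=$1             # e.g. /c0/e8/s2
    interval=${2:-60}    # seconds between polls
    while out=$("$STORCLI" "$drive" show rebuild) &&
          printf '%s\n' "$out" | grep -q 'In progress'; do
        # Show only the table row for this drive (progress and time left).
        printf '%s\n' "$out" | grep "$drive"
        sleep "$interval"
    done
    echo "$drive: no rebuild in progress"
}

# Example:
#   wait_for_rebuild /c0/e8/s2 300
```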