Pacemaker Resource Resurrection Guide

Svedrin

2017-07-05 21:50

If you're ever confronted with an active-passive high availability cluster managed by Pacemaker (if the crm_mon command exists on your system, it probably is), it's 4am, and you just need that damn service to work again, here's what you can try.

Find out if your component is running under pacemaker

Run crm_mon -nfr1. The 1 makes it print the status once and exit; leave it out if you want crm_mon to keep running and refresh the display. The result looks a bit like this example:

root@hive:~# crm_mon -nfr1
Last updated: Wed Jul  5 14:17:32 2017
Last change: Wed Jul  5 13:36:41 2017 via crm_attribute on hive
Stack: corosync
Current DC: erwin (1084757795) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
3 Resources configured


Node hive (1084757794):  online
Node erwin (1084757795): online
        ip_smbd         (ocf::heartbeat:IPaddr2):       Started
        fs_smbd         (ocf::heartbeat:Filesystem):    Started
        rc_smbd         (lsb::smbd):                    Started

Inactive resources:


Migration summary:
* Node hive:
* Node erwin:

There are three relevant sections:

Node summary

The node summary lists all nodes in the cluster, and which resources are currently running on them (if any):

Node hive (1084757794):  online
Node erwin (1084757795): online
        ip_smbd         (ocf::heartbeat:IPaddr2):       Started
        fs_smbd         (ocf::heartbeat:Filesystem):    Started
        rc_smbd         (lsb::smbd):                    Started

In this example, node hive is online but does not have any resources running on it. Node erwin is also online, and everything's running there.
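If you'd rather not squint at indentation at 4am, here is a minimal sketch that turns the node summary into "resource node" pairs. The function name resources_by_node is made up for this example, and it assumes the layout shown above: each resource is indented under the node it runs on.

```shell
# Print "<resource> <node>" for every resource listed under a node.
# Reads `crm_mon -nfr1` output on stdin.
resources_by_node() {
    awk '
        /^Node /              { node = $2; next }   # e.g. "Node erwin (1084757795): online"
        /^Inactive resources/ { node = "" }         # node summary ends here
        /^Migration summary/  { node = "" }
        node != "" && /^[ \t]+[^ \t]/ { print $1, node }   # indented resource line
    '
}
```

Use it as crm_mon -nfr1 | resources_by_node. A node without indented lines below it (hive in the example) simply produces no output.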

Nodes can also be offline or in standby.

Inactive resources

This section lists resources that are currently not running anywhere (either because they're set to stopped, or the node that ran them has crashed).

In this example, that section is empty.
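To pull just the names out of that section, something like the following should do. It is only a sketch: inactive_resources is a hypothetical helper name, and it assumes that inactive resources are listed one per line, with the resource name first, between the "Inactive resources:" and "Migration summary:" headings.

```shell
# Print the name of every resource listed in the "Inactive resources" section.
# Reads `crm_mon -nfr1` output on stdin.
inactive_resources() {
    awk '
        /^Inactive resources:/ { in_section = 1; next }
        /^Migration summary:/  { in_section = 0 }
        in_section && NF       { print $1 }   # skip blank lines, keep the name
    '
}
```

Run crm_mon -nfr1 | inactive_resources; no output means nothing is inactive.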

Migration summary

This section shows errors encountered during resource operations, that is, while starting, stopping or monitoring resources.

Check if pacemaker is running

If crm_mon complains about not being able to connect to the cluster, this is a good indication that pacemaker is not running.

Pacemaker consists of two services: corosync (the telephone used by the two pacemakers to talk to one another), and pacemaker itself. Both services have init scripts and need to be up and running for pacemaker to work correctly.

Always start the corosync service first (should be in autostart), and pacemaker second (probably not in autostart). You don't need to wait between the two commands though.

Pacemaker is usually not in autostart because it may be desirable to allow the admin (that is, you) to check configs / put nodes in standby / have another beer before starting it. Pacemaker can be somewhat unpredictable at times, and this way you have a chance to implement safety measures.
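The start order described above can be sketched as a small shell function. The name start_cluster_stack is invented for this example, and it assumes the init scripts are reachable through the service command, as in the commands further down.

```shell
# Start corosync first, then pacemaker. Stop early if corosync fails,
# because pacemaker has nobody to talk to without it.
start_cluster_stack() {
    if ! service corosync start; then
        echo "corosync failed to start, not starting pacemaker" >&2
        return 1
    fi
    service pacemaker start
}
```

Do your config checks (and finish your beer) before calling it, since pacemaker will start acting on the cluster as soon as it is up.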

My resources won't start!

If your resource appears as "stopped" in the inactive resources section, it is not configured to start. Run crm resource start <resource name>, e.g. crm resource start rc_smbd, to start it.

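If it is configured to be started but still does not run, and one of your cluster nodes is down, a leftover migration constraint may be pinning it to the dead node. Try crm resource unmigrate <resource name> (or the lower-level crm_resource -U -r <resource name>) to remove it. This may or may not work, and your best bet is to get the failed node up and running again (or start the component manually).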

If the resource has failed operations in the Migration summary, try crm resource cleanup <resource name>. This also may or may not work.
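The first and last rule above are mechanical enough to script. Here is a rough sketch, with suggest_fix being a made-up name: it reads crm_mon -nfr1 output and prints the command to try for one resource. It assumes a stopped resource shows up as a line starting with its name and containing "Stopped" in the Inactive resources section, and it treats any mention of the resource name in the Migration summary as a failed operation, so it only prints suggestions and runs nothing.

```shell
# Usage: crm_mon -nfr1 | suggest_fix <resource name>
suggest_fix() {
    awk -v res="$1" '
        /^Inactive resources:/ { section = "inactive";  next }
        /^Migration summary:/  { section = "migration"; next }
        section == "inactive"  && $1 == res && /Stopped/ { stopped = 1 }
        section == "migration" && index($0, res)          { failed = 1 }
        END {
            if (stopped) print "crm resource start " res
            if (failed)  print "crm resource cleanup " res
        }
    '
}
```

For example, crm_mon -nfr1 | suggest_fix rc_smbd. No output means neither rule applies, and you are in "node is down" or "cluster is frozen" territory.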

Sometimes, Pacemaker gets into an awkward state and freezes completely. In that case, run the following commands on all nodes:

killall -9 corosync crmd lrmd pengine cib

Best run that multiple times, until it says:

corosync: no process found
crmd: no process found
lrmd: no process found
pengine: no process found
cib: no process found
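If you don't trust yourself to count at 4am, the kill-and-repeat step can be wrapped in a loop. This is only a sketch: kill_cluster_daemons is a made-up name, the limit of 10 attempts is an arbitrary choice of mine, and it assumes pgrep is installed alongside killall.

```shell
# Keep killing the cluster daemons until none of them is left,
# giving up after 10 rounds (arbitrary limit).
kill_cluster_daemons() {
    daemons="corosync crmd lrmd pengine cib"
    attempts=0
    while [ "$attempts" -lt 10 ]; do
        alive=""
        for d in $daemons; do
            # pgrep -x matches the exact process name
            if pgrep -x "$d" >/dev/null 2>&1; then
                alive="$alive $d"
            fi
        done
        if [ -z "$alive" ]; then
            return 0            # nothing left to kill
        fi
        # $alive is deliberately unquoted: one argument per daemon name
        killall -9 $alive 2>/dev/null
        attempts=$((attempts + 1))
        sleep 1
    done
    echo "still running:$alive" >&2
    return 1
}
```

It returns 0 once all five processes are gone, which is the same state as the "no process found" output above.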

Then start the services again:

service corosync start
service pacemaker start

and pray to the deity of your choice.

None of that worked!

Are you on an on-call schedule? Now is the time to escalate.