Pacemaker Resource Resurrection Guide
In case you're ever confronted with an active-passive high availability cluster managed by pacemaker (if the crm_mon
command exists on your system, it probably is), it's 4am and you just need that damn service to work again, here's what you can try.
Find out if your component is running under pacemaker
Run crm_mon -nfr1 (if you want to run it permanently, leave out the 1). The result looks a bit like this example:
    root@hive:~# crm_mon -nfr1
    Last updated: Wed Jul 5 14:17:32 2017
    Last change: Wed Jul 5 13:36:41 2017 via crm_attribute on hive
    Stack: corosync
    Current DC: erwin (1084757795) - partition with quorum
    Version: 1.1.10-42f2063
    2 Nodes configured
    3 Resources configured

    Node hive (1084757794): online
    Node erwin (1084757795): online
            ip_smbd (ocf::heartbeat:IPaddr2): Started
            fs_smbd (ocf::heartbeat:Filesystem): Started
            rc_smbd (lsb::smbd): Started

    Inactive resources:

    Migration summary:
    * Node hive:
    * Node erwin:
There are three relevant sections:
Node summary
The node summary lists all nodes in the cluster, and which resources are currently running on them (if any):
    Node hive (1084757794): online
    Node erwin (1084757795): online
            ip_smbd (ocf::heartbeat:IPaddr2): Started
            fs_smbd (ocf::heartbeat:Filesystem): Started
            rc_smbd (lsb::smbd): Started
In this example, node hive is online but does not have any resources running on it. Node erwin is also online, and everything's running there.
Nodes can also be offline or in standby.
Inactive resources
This section lists resources that are currently not running anywhere (either because they're set to stopped, or the node that ran them has crashed).
In this example, that section is empty.
Migration summary
This section shows errors encountered during resource operations. Resource operations are starting, stopping and monitoring resources.
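If you'd rather not squint at the full output at 4am, the node summary can be boiled down to one line per resource. This is a minimal sketch that assumes the crm_mon -nfr1 layout from the example above (pacemaker 1.1.x); list_resources is a made-up helper name, and other versions may format their output differently.

```shell
#!/bin/sh
# list_resources: hypothetical helper. Reads `crm_mon -nfr1` output on stdin
# and prints one "node resource state" line per resource in the node summary.
# Assumes the pacemaker 1.1.x layout shown in the example above.
list_resources() {
    awk '
        # "Node erwin (1084757795): online" opens a new node block
        $1 == "Node" && NF >= 4 { node = $2; next }
        # The following section headers end the node summary
        /^Inactive resources:/ || /^Migration summary:/ { node = "" }
        # "ip_smbd (ocf::heartbeat:IPaddr2): Started" is a resource line
        node != "" && $2 ~ /^\(.*\):$/ { print node, $1, $3 }
    '
}

# Usage on a cluster node:
#   crm_mon -nfr1 | list_resources
```

For the example output above, this would print one line each for ip_smbd, fs_smbd and rc_smbd, all attributed to erwin.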
Check if pacemaker is running
If crm_mon complains about not being able to connect to the cluster, this is a good indication that pacemaker is not running.
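For a scripted version of that check, here is a small sketch. It relies on the assumption that crm_mon exits non-zero when it cannot connect; cluster_reachable and the CRM_MON override are names invented for this example.

```shell
#!/bin/sh
# cluster_reachable: hypothetical helper. Succeeds if a one-shot crm_mon run
# can talk to the cluster. Assumption: crm_mon exits non-zero when it cannot
# connect. CRM_MON can be set to another command for testing.
cluster_reachable() {
    "${CRM_MON:-crm_mon}" -1 >/dev/null 2>&1
}

# Usage:
#   if cluster_reachable; then
#       echo "pacemaker answers"
#   else
#       echo "pacemaker looks down"
#   fi
```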
Pacemaker consists of two services: corosync (the telephone used by the two pacemakers to talk to one another), and pacemaker itself. Both services have init scripts and need to be up and running for pacemaker to work correctly.
Always start the corosync service first (should be in autostart), and pacemaker second (probably not in autostart). You don't need to wait between the two commands though.
Pacemaker is usually not in autostart because it may be desirable to allow the admin (that is, you) to check configs / put nodes in standby / have another beer before starting it. Pacemaker can be somewhat unpredictable at times, and this way you have a chance to implement safety measures.
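The start order can be wrapped in a small function so that pacemaker is only attempted once corosync has come up. A sketch, assuming both services are controlled through the service command; start_cluster_stack and the SERVICE override are made up for this example.

```shell
#!/bin/sh
# start_cluster_stack: hypothetical helper that starts corosync first and
# pacemaker second, and stops early if corosync fails to start.
# SERVICE can be overridden for a dry run, e.g. SERVICE=echo.
start_cluster_stack() {
    "${SERVICE:-service}" corosync start || return 1
    "${SERVICE:-service}" pacemaker start
}

# Dry run (only prints the arguments):  SERVICE=echo; start_cluster_stack
# For real:                             start_cluster_stack
```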
My resources won't start!
If your resource appears as "stopped" in the inactive resources section, it is not configured to start. Run crm resource start <resource name>, e.g. crm resource start rc_smbd, to start it.
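If it appears as started but still does not start and one of your cluster nodes is down, a leftover constraint from an earlier migration may be pinning it to the dead node. Try clearing it with crm_resource -U -r <resource name> (with the crm shell: crm resource unmigrate <resource name>). This may or may not work, and your best bet is to get the failed node up and running again (or start the component manually).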
If the resource has failed operations in the Migration summary, try crm resource cleanup <resource name>. This also may or may not work.
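To see quickly which of these cases applies, you can check which sections of the crm_mon output mention your resource. This is a rough sketch, not a parser: it assumes that any line containing the resource name below a section header belongs to that section, and find_sections is a made-up name.

```shell
#!/bin/sh
# find_sections: hypothetical helper. Reads `crm_mon -nfr1` output on stdin
# and prints "inactive" and/or "migration" if resource $1 is mentioned in the
# "Inactive resources" or "Migration summary" section.
# Assumption: a plain substring match is good enough, so rc_smbd would also
# match a resource called rc_smbd2.
find_sections() {
    awk -v rsc="$1" '
        /^Inactive resources:/ { section = "inactive";  next }
        /^Migration summary:/  { section = "migration"; next }
        # Report each section at most once
        section != "" && index($0, rsc) && !seen[section]++ { print section }
    '
}

# Usage:
#   crm_mon -nfr1 | find_sections rc_smbd
#   "inactive"  -> try: crm resource start rc_smbd
#   "migration" -> try: crm resource cleanup rc_smbd
```

If it prints nothing, the resource only shows up in the node summary (or not at all).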
Sometimes, Pacemaker runs into an awkward state and completely freezes. In that case, run the following commands on all nodes:
killall -9 corosync crmd lrmd pengine cib
Best run that multiple times, until it says:
    corosync: no process found
    crmd: no process found
    lrmd: no process found
    pengine: no process found
    cib: no process found
Then start the services again:
    service corosync start
    service pacemaker start
and pray to the deity of your choice.
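The "run it until nothing is left" part of the procedure above can be sketched as a loop, to be followed by the usual service start commands. This assumes pgrep and killall are available; kill_cluster_daemons and the PGREP / KILLALL overrides are names invented for this example, and the limit of 10 rounds is arbitrary.

```shell
#!/bin/sh
# kill_cluster_daemons: hypothetical helper that repeats the killall from this
# guide until none of the listed daemons is left, giving up after 10 rounds.
# PGREP and KILLALL can be overridden for testing.
kill_cluster_daemons() {
    daemons="corosync crmd lrmd pengine cib"
    tries=0
    while [ "$tries" -lt 10 ]; do
        alive=0
        for d in $daemons; do
            # pgrep -x only matches the exact process name
            if "${PGREP:-pgrep}" -x "$d" >/dev/null 2>&1; then
                alive=1
            fi
        done
        if [ "$alive" -eq 0 ]; then
            return 0
        fi
        # $daemons is left unquoted on purpose so each name is one argument
        "${KILLALL:-killall}" -9 $daemons >/dev/null 2>&1
        sleep 1
        tries=$((tries + 1))
    done
    return 1
}

# Usage (as root, on every node):
#   kill_cluster_daemons || echo "some daemons refuse to die"
```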
None of that worked!
Are you in an on-call schedule? Now is the time to escalate.