repmgr quick start

Dear future self, this is a little copy-and-paste tutorial on how to get postgres replication running using repmgr.

First, install postgres and repmgr on all'a dem nodes:

echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list
apt-get install wget ca-certificates
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
apt-get update
apt-get install postgresql-9.5 postgresql-9.5-repmgr

Edit /etc/postgresql/9.5/main/pg_hba.conf to include all your nodes:

host    replication     postgres        192.168.142.30/32       md5
host    replication     postgres        192.168.142.31/32       md5
host    replication     postgres        192.168.142.32/32       md5

host    all             postgres        192.168.142.30/32       md5
host    all             postgres        192.168.142.31/32       md5
host    all             postgres        192.168.142.32/32       md5

host    all             svedrin         192.168.142.35/32       md5
host    all             svedrin         192.168.142.36/32       md5

Don't forget to:

pg_ctlcluster 9.5 main reload

Set a password for the postgres user, and put it into /var/lib/postgresql/.pgpass:

192.168.142.30:5432:*:postgres:Faeb8owi
192.168.142.31:5432:*:postgres:Faeb8owi
192.168.142.32:5432:*:postgres:Faeb8owi

Open a psql shell and run:

ALTER SYSTEM SET wal_level = 'hot_standby';
ALTER SYSTEM SET listen_addresses = '*';
ALTER SYSTEM SET hot_standby = 'on';
ALTER SYSTEM SET max_wal_senders = '10';
ALTER SYSTEM SET max_replication_slots = '10';

Then:

pg_ctlcluster 9.5 main restart

Make sure that the postgres users on each node can freely SSH to one another:

su - postgres
ssh-keygen
cp .ssh/id_rsa.pub .ssh/authorized_keys

Then distribute that .ssh directory to /var/lib/postgresql everywhere.

Now create /etc/postgresql/9.5/main/repmgr.conf on each node:

cluster=pgtest

# Be sure to adapt these three lines on every node
node=1 # 2, 3 etc -- remove this comment
node_name=pgtest-n0
conninfo=host=192.168.142.30 port=5432

use_replication_slots=1

event_notifications=master_register,standby_register,standby_unregister,standby_promote,standby_switchover
#event_notification_command=/usr/local/bin/pgdeploy

# Fix for Ubuntu/Debian
pg_bindir=/usr/lib/postgresql/9.5/bin

# Settings for automatic failover
failover=automatic
master_response_timeout=60
reconnect_attempts=5
reconnect_interval=10
promote_command=/usr/lib/postgresql/9.5/bin/repmgr -f /etc/postgresql/9.5/main/repmgr.conf standby promote
follow_command=/usr/lib/postgresql/9.5/bin/repmgr  -f /etc/postgresql/9.5/main/repmgr.conf standby follow

loglevel=INFO

Now to get the cluster up and running:

postgres@pgtest-n0:~$ alias
alias repmgr='repmgr -f /etc/postgresql/9.5/main/repmgr.conf'

postgres@pgtest-n0:~$ # master
postgres@pgtest-n0:~$ pg_ctlcluster 9.5 main start
postgres@pgtest-n0:~$ repmgr master register

postgres@pgtest-n1:~$ # first standby
postgres@pgtest-n1:~$ repmgr standby clone -h 192.168.142.30
postgres@pgtest-n1:~$ pg_ctlcluster 9.5 main start
postgres@pgtest-n1:~$ repmgr standby register

postgres@pgtest-n2:~$ # second standby
postgres@pgtest-n2:~$ repmgr standby clone -h 192.168.142.30
postgres@pgtest-n2:~$ pg_ctlcluster 9.5 main start
postgres@pgtest-n2:~$ repmgr standby register

postgres@pgtest-n0:~$ repmgr cluster show
[2016-09-22 21:03:40] [INFO] connecting to database
Role      | Name      | Upstream  | Connection String
----------+-----------|-----------|------------------------------
* master  | pgtest-n0 |           | host=192.168.142.30 port=5432
  standby | pgtest-n1 | pgtest-n1 | host=192.168.142.31 port=5432
  standby | pgtest-n2 | pgtest-n1 | host=192.168.142.32 port=5432

To switch to a new master:

postgres@pgtest-n0:~$ pg_ctlcluster 9.5 main stop -m fast
postgres@pgtest-n1:~$ repmgr standby promote
postgres@pgtest-n2:~$ repmgr standby follow
postgres@pgtest-n1:~$ repmgr cluster show
[2016-09-22 21:03:40] [INFO] connecting to database
Role      | Name      | Upstream  | Connection String
----------+-----------|-----------|------------------------------
* master  | pgtest-n1 |           | host=192.168.142.31 port=5432
  standby | pgtest-n2 | pgtest-n1 | host=192.168.142.32 port=5432
  FAILED  | pgtest-n0 |           | host=192.168.142.30 port=5432

To revive the previous master (-n0) as a new standby:

postgres@pgtest-n0:~$ repmgr standby clone -r -F -h 192.168.142.31
postgres@pgtest-n0:~$ pg_ctlcluster 9.5 main start
postgres@pgtest-n0:~$ repmgr standby register -F
postgres@pgtest-n0:~$ repmgr cluster show
[2016-09-22 21:03:40] [INFO] connecting to database
Role      | Name      | Upstream  | Connection String
----------+-----------|-----------|------------------------------
* master  | pgtest-n1 |           | host=192.168.142.31 port=5432
  standby | pgtest-n2 | pgtest-n1 | host=192.168.142.32 port=5432
  standby | pgtest-n0 | pgtest-n1 | host=192.168.142.30 port=5432

Fencing and client provisioning

Of course, the clients will need a way to know which node to connect to. For this, we use a pgbouncer on the application servers, that we automatically push a new config to when repmgr changes the cluster state. Since I'm totally biased about this, I recommend pgdeploy for this purpose.

Then on the application server, install pgbouncer, import the ssh keys into the now-existing postgres user, create /etc/pgbouncer/databases, chown it to postgres, and add the following in /etc/pgbouncer/pgbouncer.conf:

[databases]
%include /etc/pgbouncer/databases/pgtest.ini

Also, be sure to place a .pgpass file into /var/lib/postgresql so that pgbouncer can connect to the upstream database. If you just run pgdeploy, it should create a file in the databases directory with contents such as this:

svedrin = host=192.168.142.31 port=5432

For the clients to be able to connect, we also need to configure pgbouncer's user authentication. For this, on the master, run:

select usename, passwd from pg_shadow;

Copy the md5 hash for the user the application is supposed to use, and add an entry such as this into /etc/pgbouncer/userlist.txt:

"svedrin" "md5d4b59c3b833a7b881cd6c5717205bc64"

Then you should be able to connect to localhost:6432 using that user and password.

So now, not only do the clients know where to connect to (because their pgbouncers know), but also dead nodes are fenced out reliably.

Automatic Failover

For automatic failover, also configure /etc/default/repmgrd:

# default settings for repmgrd. This file is source by /bin/sh from
# /etc/init.d/repmgrd

# disable repmgrd by default so it won't get started upon installation
# valid values: yes/no
REPMGRD_ENABLED=yes

# configuration file (required)
REPMGRD_CONF="/etc/postgresql/9.5/main/repmgr.conf"

# additional options
REPMGRD_OPTS="--monitoring-history"

# user to run repmgrd as
#REPMGRD_USER=postgres

# repmgrd binary
#REPMGRD_BIN=/usr/bin/repmgrd

# pid file
#REPMGRD_PIDFILE=/var/run/repmgrd.pid

REPMGRD_LOGFILE=/var/log/repmgrd.log

repmgrd also requires a library be loaded and a set of functions be defined, so in a psql shell, run:

ALTER SYSTEM SET shared_preload_libraries = 'repmgr_funcs';
\i /usr/share/postgresql/9.5/contrib/repmgr_funcs.sql

Note that the function definitions will be replicated to all the nodes by postgres automatically, but the ALTER SYSTEM will not. Restart postgres afterwards.

In order to get repmgrd to produce a log file, you may want to tweak the init script by replacing the line that contains the startup command with this one:

start-stop-daemon --start --quiet --background --chuid "$REPMGRD_USER" --make-pidfile --pidfile "$REPMGRD_PIDFILE" --exec "$REPMGRD_BIN" --no-close -- --config-file "$REPMGRD_CONF" $REPMGRD_OPTS >"$REPMGRD_LOGFILE" 2>&1

This will tell start-stop-daemon not to close open file descriptors, so that bash can redirect the output into /var/log/repmgrd.log.

Then start repmgrd everywhere, and you're good to go. (Although the docs say repmgrd shouldn't be run on the master, it also doesn't hurt to do so anyway. In case the master fails, repmgrd will just not do anything.)

If you wish to enable synchronous replication, run:

ALTER SYSTEM SET synchronous_standby_names = 'pgtest-n0,pgtest-n1,pgtest-n2';

To disable it again, you can use:

ALTER SYSTEM RESET synchronous_standby_names;

Barman

In order to get barman to run, exchange SSH keys with the barman user on your backup server, add it to pg_hba.conf and run:

alter system set archive_mode = on;
alter system set archive_command = 'rsync -a %p barman@192.168.142.36:/var/lib/barman/pgtest-n0/incoming';

In /etc/barman.conf, add:

[pgtest-n0]
ssh_command = ssh postgres@192.168.142.30
conninfo = host=192.168.142.30 user=svedrin password=hai9Eexe
streaming_conninfo = host=192.168.142.30 user=svedrin password=hai9Eexe

Then run:

mkdir -p /var/lib/barman/pgtest-n0/incoming
chown barman -R /var/lib/barman/
barman switch-xlog pgtest-n0
barman archive-wal pgtest-n0
barman backup pgtest-n0