PostgreSQL Tutorial: Automatic failover using repmgr

August 13, 2024

Summary: In this tutorial, we will learn the open source tool repmgr (Replication Manager) and how to set up and configure it for automatic failover in PostgreSQL.

Table of Contents

Introduction

repmgr (Replication Manager) is an open source tool used for managing the replication and failover of PostgreSQL clusters. In this post we will learn to set up and configure the cluster for automatic failover.

Prerequisites

The following software must be installed on both master and standby servers:

PostgreSQL
repmgr (matching the installed PostgreSQL major version)
At the network level, connections with the PostgreSQL port (default: 5432) must be possible in both directions.

Install PostgreSQL

Create two clusters/servers with the PostgreSQL installation. You can follow the PostgreSQL instructions at the link below for installation using PostgreSQL’s PGDG repo package. For the sake of naming conventions, we will consider master and standby as two servers.

https://wiki.postgresql.org/wiki/YUM_Installation

yum -y install https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm

yum -y install epel-release yum-utils
yum-config-manager --enable pgdg12
yum install postgresql12-server postgresql12

/usr/pgsql-12/bin/postgresql-12-setup initdb

Note: The above step of initialization of the cluster is not needed on the standby server.

systemctl enable --now postgresql-12
systemctl status postgresql-12

Install repmgr

You will need to install repmgr on the master as well as standby.

yum -y install repmgr12*

Configure PostgreSQL

On the primary server, a PostgreSQL instance must be initialized and running. The following replication settings may need to be adjusted:

max_wal_senders = 10
max_replication_slots = 10
wal_level = 'replica' or 'logical'
hot_standby = on
archive_mode = on
archive_command = '/bin/true'
shared_preload_libraries = 'repmgr'

Create users

Create a dedicated PostgreSQL superuser account and a database for the repmgr metadata:

create user repmgr;
create database repmgr with owner repmgr;

Configure pg_hba.conf

Ensure the repmgr user has appropriate permissions in pg_hba.conf and can connect in replication mode; pg_hba.conf should contain entries similar to the following:

    local      replication      repmgr                              trust
    host       replication      repmgr        127.0.0.1/32          trust
    host       replication      repmgr        192.168.1.0/24        trust

    local       repmgr           repmgr                             trust
    host        repmgr           repmgr        127.0.0.1/32         trust
    host        repmgr           repmgr        192.168.1.0/24       trust

Note: Adjust above settings according to your network configurations.

Configure the repmgr file

Create a repmgr.conf on the master server with the following entries:

cluster='failovertest'
node_id=1
node_name=node1
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/12/data/'
failover=automatic
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /var/lib/pgsql/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /var/lib/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'

Note: The location and file must be accessible for the user we are using for repmgr.

Register the primary server

$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf primary register

WARNING: the following problems were found in the configuration file:
  parameter "cluster" is deprecated and will be ignored
INFO: connecting to primary database...
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered

Then check the status of the cluster:

$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf  cluster show

WARNING: the following problems were found in the configuration file:
  parameter "cluster" is deprecated and will be ignored

 ID | Name  | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
 1  | node1 | primary | * running |      | default  | 100  | 1    | host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2

Build/clone the standby server

Create the repmgr.conf file on standby server:

$ cat repmgr.conf

node_id=2
node_name=node2
conninfo='host=172.16.140.137 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/12/data'
failover=automatic
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /var/lib/pgsql/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /var/lib/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'

Note: In the above commands, for the host IP info we need to specify the IP of the standby server. In this example, 172.16.140.137 is my standby server.

We can now perform the dry run and test if our configuration is correct:

$ /usr/pgsql-12/bin/repmgr -h 172.16.140.135 -U repmgr -d repmgr -f /var/lib/pgsql/repmgr.conf standby clone --dry-run

NOTICE: destination directory "/var/lib/pgsql/12/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=172.16.140.135 user=repmgr dbname=repmgr
DETAIL: current installation size is 32 MB
INFO: "repmgr" extension is installed in database "repmgr"
INFO: parameter "max_wal_senders" set to 10
NOTICE: checking for available walsenders on the source node (2 required)
INFO: sufficient walsenders available on the source node
DETAIL: 2 required, 10 available
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: required number of replication connections could be made to the source server
DETAIL: 2 replication connections required
NOTICE: standby will attach to upstream node 1
HINT: consider using the -c/--fast-checkpoint option
INFO: all prerequisites for "standby clone" are met

If there is no problem, start cloning:

$ /usr/pgsql-12/bin/repmgr -h 172.16.140.135 -U repmgr -d repmgr -f /var/lib/pgsql/repmgr.conf standby clone

NOTICE: destination directory "/var/lib/pgsql/12/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=172.16.140.135 user=repmgr dbname=repmgr
DETAIL: current installation size is 32 MB
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: checking and correcting permissions on existing directory "/var/lib/pgsql/12/data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  pg_basebackup -l "repmgr base backup"  -D /var/lib/pgsql/12/data -h 172.16.140.135 -p 5432 -U repmgr -X stream
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /var/lib/pgsql/12/data start
HINT: after starting the server, you need to register this standby with "repmgr standby register"

Register the standby server

$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf standby register

INFO: connecting to local node "node2" (ID: 2)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID 1)
INFO: standby registration complete
NOTICE: standby node "node2" (ID: 2) successfully registered

$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster show
 ID | Name  | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
 1  | node1 | primary | * running |      | default  | 100  | 1    | host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1 | default  | 100  | 1    | host=172.16.140.137 user=repmgr dbname=repmgr connect_timeout=2

Start repmgrd daemon process

To enable the automatic failover, we now need to start the repmgrd daemon process on master slave and witness:

For example:

$ /usr/pgsql-12/bin/repmgrd -f /var/lib/pgsql/repmgr.conf

[2020-02-23 20:44:43] [NOTICE] repmgrd (repmgrd 5.0.0) starting up
[2020-02-23 20:44:43] [INFO] connecting to database "host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2"
INFO:  set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2020-02-23 20:44:43] [NOTICE] starting monitoring of node "node1" (ID: 1)
[2020-02-23 20:44:43] [INFO] "connection_check_type" set to "ping"
[2020-02-23 20:44:43] [NOTICE] monitoring cluster primary "node1" (ID: 1)
[2020-02-23 20:44:43] [INFO] child node "node2" (ID: 2) is attached

We can also check the events for the cluster:

$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster event

 Node ID | Name  | Event          | OK | Timestamp       | Details
---------+-------+--------------------+----+---------------------+----------------------------------------------------------------------------------------
 2   | node2 | repmgrd_start  | t  | 2020-02-23 20:46:26 | monitoring connection to upstream node "node1" (ID: 1)
 1   | node1 | repmgrd_start  | t  | 2020-02-23 20:44:43 | monitoring cluster primary "node1" (ID: 1)
 2   | node2 | standby_register   | t  | 2020-02-23 20:39:24 | standby registration succeeded; upstream node ID is 1 (-F/--force option was used)
 1   | node1 | primary_register   | t  | 2020-02-23 20:39:11 | existing primary record updated
 2   | node2 | standby_register   | t  | 2020-02-23 20:38:25 | standby registration succeeded; upstream node ID is 1
 2   | node2 | standby_unregister | t  | 2020-02-23 20:37:56 |
 2   | node2 | standby_register   | t  | 2020-02-23 20:12:23 | standby registration succeeded; upstream node ID is 1
 2   | node2 | standby_clone  | t  | 2020-02-23 20:09:25 | cloned from host "172.16.140.135", port 5432; backup method: pg_basebackup; --force: N
 1   | node1 | primary_register   | t  | 2020-02-23 19:57:11 |
 1   | node1 | cluster_created | t  | 2020-02-23 19:57:11 |

Now, if the master server fails, repmgrd will detect that the master is not reachable and then promote the next available server and perform the automatic failover.

The log messages will be as follows:

[2020-02-23 20:51:28] [INFO] node "node2" (ID: 2) monitoring upstream node "node1" (ID: 1) in normal state
[2020-02-23 20:52:40] [WARNING] unable to ping "host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2"
[2020-02-23 20:52:40] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
.
.
.
[2020-02-23 20:53:30] [INFO] checking state of node 1, 6 of 6 attempts
[2020-02-23 20:53:30] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=172.16.140.135 fallback_application_name=repmgr"
[2020-02-23 20:53:30] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-02-23 20:53:30] [WARNING] unable to reconnect to node 1 after 6 attempts
[2020-02-23 20:53:30] [INFO] 0 active sibling nodes registered
[2020-02-23 20:53:30] [INFO] primary and this node have the same location ("default")
[2020-02-23 20:53:30] [INFO] no other sibling nodes - we win by default
[2020-02-23 20:53:30] [NOTICE] this node is the only available candidate and will now promote itself
[2020-02-23 20:53:30] [INFO] promote_command is:
  "/usr/pgsql-12/bin/repmgr standby promote -f /var/lib/pgsql/repmgr.conf --log-to-file"
[2020-02-23 20:53:30] [NOTICE] promoting standby to primary
[2020-02-23 20:53:30] [DETAIL] promoting server "node2" (ID: 2) using pg_promote()
[2020-02-23 20:53:30] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2020-02-23 20:53:31] [NOTICE] STANDBY PROMOTE successful
[2020-02-23 20:53:31] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[2020-02-23 20:53:31] [INFO] 0 followers to notify
[2020-02-23 20:53:31] [INFO] switching to primary monitoring mode
[2020-02-23 20:53:31] [NOTICE] monitoring cluster primary "node2" (ID: 2)