PostgreSQL 教程: 使用 repmgr 实现自动故障转移

八月 13, 2024

摘要：在本教程中，我们将学习开源工具 repmgr（Replication Manager），以及如何在 PostgreSQL 中设置和配置它，以实现自动故障转移。

介绍

repmgr（Replication Manager）是一个开源工具，用于管理 PostgreSQL 集群的复制和故障转移。在本教程中，我们将学习如何设置和配置集群，以实现自动故障转移。

前提条件

必须在主服务器和备用服务器上安装以下软件：

PostgreSQL
repmgr（与已安装的 PostgreSQL 主要版本匹配）
在网络层面，与 PostgreSQL 端口（默认值：5432）建立的连接，必须能够双向通信。

安装 PostgreSQL

使用 PostgreSQL 安装创建两个集群/服务器。您可以按照下面链接中的 PostgreSQL 说明，使用 PostgreSQL 的 PGDG 仓库软件包进行安装。为了命名约定，我们将主服务器和备用服务器视为两个服务器。

https://wiki.postgresql.org/wiki/YUM_Installation

yum -y install https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm

yum -y install epel-release yum-utils
yum-config-manager --enable pgdg12
yum install postgresql12-server postgresql12

/usr/pgsql-12/bin/postgresql-12-setup initdb

注意：在备用服务器上不需要执行上述集群初始化步骤。

systemctl enable --now postgresql-12
systemctl status postgresql-12

安装 repmgr

您需要在主服务器和备用服务器上安装 repmgr。

yum -y install repmgr12*

配置 PostgreSQL

在主服务器上，必须初始化 PostgreSQL 实例，并且已经启动运行。可能需要调整以下复制相关设置：

max_wal_senders = 10
max_replication_slots = 10
wal_level = 'replica' or 'logical'
hot_standby = on
archive_mode = on
archive_command = '/bin/true'
shared_preload_libraries = 'repmgr'

创建用户

创建一个专用的 PostgreSQL 超级用户帐户，和一个用于 repmgr 元数据的数据库：

create user repmgr;
create database repmgr with owner repmgr;

配置 pg_hba.conf

确保 repmgr 用户在 pg_hba.conf 中具有适当的权限，并且可以在复制模式下进行连接；pg_hba.conf 应包含类似于以下内容的条目：

    local      replication      repmgr                              trust
    host       replication      repmgr        127.0.0.1/32          trust
    host       replication      repmgr        192.168.1.0/24        trust

    local       repmgr           repmgr                             trust
    host        repmgr           repmgr        127.0.0.1/32         trust
    host        repmgr           repmgr        192.168.1.0/24       trust

注意：请根据您的网络配置，调整上述设置。

配置 repmgr 文件

在主服务器上创建包含以下条目的 repmgr.conf：

cluster='failovertest'
node_id=1
node_name=node1
conninfo='host=node1 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/12/data/'
failover=automatic
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /var/lib/pgsql/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /var/lib/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'

注意：我们用于 repmgr 的用户，必须可以访问其中的路径和文件。

注册主服务器

用 repmgr 注册主服务器：

$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf primary register

WARNING: the following problems were found in the configuration file:
  parameter "cluster" is deprecated and will be ignored
INFO: connecting to primary database...
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered

然后检查集群的状态：

$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf  cluster show

WARNING: the following problems were found in the configuration file:
  parameter "cluster" is deprecated and will be ignored

 ID | Name  | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
 1  | node1 | primary | * running |      | default  | 100  | 1    | host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2

构建/克隆备用服务器

在备用服务器上创建 repmgr.conf 文件：

$ cat repmgr.conf

node_id=2
node_name=node2
conninfo='host=172.16.140.137 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/12/data'
failover=automatic
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /var/lib/pgsql/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /var/lib/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'

注意：在上述命令中，对于主机 IP 信息，我们需要指定备用服务器的 IP。在此示例中，172.16.140.137 是对应的备用服务器。

现在，我们可以执行试运行，并测试我们的配置是否正确：

$ /usr/pgsql-12/bin/repmgr -h 172.16.140.135 -U repmgr -d repmgr -f /var/lib/pgsql/repmgr.conf standby clone --dry-run

NOTICE: destination directory "/var/lib/pgsql/12/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=172.16.140.135 user=repmgr dbname=repmgr
DETAIL: current installation size is 32 MB
INFO: "repmgr" extension is installed in database "repmgr"
INFO: parameter "max_wal_senders" set to 10
NOTICE: checking for available walsenders on the source node (2 required)
INFO: sufficient walsenders available on the source node
DETAIL: 2 required, 10 available
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: required number of replication connections could be made to the source server
DETAIL: 2 replication connections required
NOTICE: standby will attach to upstream node 1
HINT: consider using the -c/--fast-checkpoint option
INFO: all prerequisites for "standby clone" are met

如果没有问题，可以开始克隆：

$ /usr/pgsql-12/bin/repmgr -h 172.16.140.135 -U repmgr -d repmgr -f /var/lib/pgsql/repmgr.conf standby clone

NOTICE: destination directory "/var/lib/pgsql/12/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=172.16.140.135 user=repmgr dbname=repmgr
DETAIL: current installation size is 32 MB
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: checking and correcting permissions on existing directory "/var/lib/pgsql/12/data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  pg_basebackup -l "repmgr base backup"  -D /var/lib/pgsql/12/data -h 172.16.140.135 -p 5432 -U repmgr -X stream
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /var/lib/pgsql/12/data start
HINT: after starting the server, you need to register this standby with "repmgr standby register"

注册备用服务器

用 repmgr 注册备用服务器：

$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf standby register

INFO: connecting to local node "node2" (ID: 2)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID 1)
INFO: standby registration complete
NOTICE: standby node "node2" (ID: 2) successfully registered

$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster show
 ID | Name  | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+-----------------------------------------------------------------
 1  | node1 | primary | * running |      | default  | 100  | 1    | host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1 | default  | 100  | 1    | host=172.16.140.137 user=repmgr dbname=repmgr connect_timeout=2

启动 repmgrd 守护进程

为了启用自动故障转移，我们现在需要在主从和见证服务器上，启动 repmgrd 守护进程：

例如：

$ /usr/pgsql-12/bin/repmgrd -f /var/lib/pgsql/repmgr.conf

[2020-02-23 20:44:43] [NOTICE] repmgrd (repmgrd 5.0.0) starting up
[2020-02-23 20:44:43] [INFO] connecting to database "host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2"
INFO:  set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2020-02-23 20:44:43] [NOTICE] starting monitoring of node "node1" (ID: 1)
[2020-02-23 20:44:43] [INFO] "connection_check_type" set to "ping"
[2020-02-23 20:44:43] [NOTICE] monitoring cluster primary "node1" (ID: 1)
[2020-02-23 20:44:43] [INFO] child node "node2" (ID: 2) is attached

我们还可以检查集群的事件：

$ /usr/pgsql-12/bin/repmgr -f /var/lib/pgsql/repmgr.conf cluster event

 Node ID | Name  | Event          | OK | Timestamp       | Details
---------+-------+--------------------+----+---------------------+----------------------------------------------------------------------------------------
 2   | node2 | repmgrd_start  | t  | 2020-02-23 20:46:26 | monitoring connection to upstream node "node1" (ID: 1)
 1   | node1 | repmgrd_start  | t  | 2020-02-23 20:44:43 | monitoring cluster primary "node1" (ID: 1)
 2   | node2 | standby_register   | t  | 2020-02-23 20:39:24 | standby registration succeeded; upstream node ID is 1 (-F/--force option was used)
 1   | node1 | primary_register   | t  | 2020-02-23 20:39:11 | existing primary record updated
 2   | node2 | standby_register   | t  | 2020-02-23 20:38:25 | standby registration succeeded; upstream node ID is 1
 2   | node2 | standby_unregister | t  | 2020-02-23 20:37:56 |
 2   | node2 | standby_register   | t  | 2020-02-23 20:12:23 | standby registration succeeded; upstream node ID is 1
 2   | node2 | standby_clone  | t  | 2020-02-23 20:09:25 | cloned from host "172.16.140.135", port 5432; backup method: pg_basebackup; --force: N
 1   | node1 | primary_register   | t  | 2020-02-23 19:57:11 |
 1   | node1 | cluster_created | t  | 2020-02-23 19:57:11 |

现在，如果主服务器发生故障，repmgrd 将检测到无法访问主服务器，然后激活下一个可用的服务器，并执行自动故障转移。

日志消息会如下面所示：

[2020-02-23 20:51:28] [INFO] node "node2" (ID: 2) monitoring upstream node "node1" (ID: 1) in normal state
[2020-02-23 20:52:40] [WARNING] unable to ping "host=172.16.140.135 user=repmgr dbname=repmgr connect_timeout=2"
[2020-02-23 20:52:40] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
.
.
.
[2020-02-23 20:53:30] [INFO] checking state of node 1, 6 of 6 attempts
[2020-02-23 20:53:30] [WARNING] unable to ping "user=repmgr connect_timeout=2 dbname=repmgr host=172.16.140.135 fallback_application_name=repmgr"
[2020-02-23 20:53:30] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2020-02-23 20:53:30] [WARNING] unable to reconnect to node 1 after 6 attempts
[2020-02-23 20:53:30] [INFO] 0 active sibling nodes registered
[2020-02-23 20:53:30] [INFO] primary and this node have the same location ("default")
[2020-02-23 20:53:30] [INFO] no other sibling nodes - we win by default
[2020-02-23 20:53:30] [NOTICE] this node is the only available candidate and will now promote itself
[2020-02-23 20:53:30] [INFO] promote_command is:
  "/usr/pgsql-12/bin/repmgr standby promote -f /var/lib/pgsql/repmgr.conf --log-to-file"
[2020-02-23 20:53:30] [NOTICE] promoting standby to primary
[2020-02-23 20:53:30] [DETAIL] promoting server "node2" (ID: 2) using pg_promote()
[2020-02-23 20:53:30] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2020-02-23 20:53:31] [NOTICE] STANDBY PROMOTE successful
[2020-02-23 20:53:31] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[2020-02-23 20:53:31] [INFO] 0 followers to notify
[2020-02-23 20:53:31] [INFO] switching to primary monitoring mode
[2020-02-23 20:53:31] [NOTICE] monitoring cluster primary "node2" (ID: 2)

介绍