PostgreSQL 教程: 以最短的停机时间启用数据校验和

八月 22, 2024

摘要:在本教程中,您将学习如何在 PostgreSQL 中,以最短的停机时间启用数据校验和。

目录

介绍

PostgreSQL 有一个称为数据校验和的功能,用于检测数据页的任何存储级损坏。但是,默认情况下,新的 PostgreSQL 数据库实例不会启用此功能。要使用此功能,您必须在使用 initdb 命令初始化新的 PostgreSQL 数据库实例时,指定 --data-checksums 选项。

那么,那些已经在运行的实例呢?有没有办法为现有实例启用数据校验和?以前,唯一的方法是,指定 --data-checksums 选项初始化一个新实例,并使用转储还原或逻辑复制将数据库转储到其中。不幸的是,这些方法很慢。从 PostgreSQL 12 开始,用户可以利用一个名为 pg_checksum 的新内置工具,在未使用此功能初始化的实例中启用校验和。但是,此工具仅在数据库服务器停机时才能工作。对于大型生产服务器,可能很难安排大量的停机时间。为避免此问题,可以创建主备架构。以下是达成此目的可以遵循的步骤。

操作步骤

可按照以下步骤,来完成此教程:

1. 请使用流复制,创建由主服务器和备用服务器组成的集群,以确保数据冗余和可用性。要使用流复制设置主备系统,请参阅设置流复制,以获取详细说明。在构建完集群后,可以使用以下语句,来验证复制是否正常工作。

select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 234189
usesysid         | 16402
usename          | replicator
application_name | walreceiver
client_addr      | 172.31.30.60
client_hostname  |
client_port      | 42950
backend_start    | 2024-03-12 16:40:07.668783-05
backend_xmin     |
state            | streaming
sent_lsn         | 9445/8E82D2F8
write_lsn        | 9445/8E82D2F8
flush_lsn        | 9445/8E82D2F8
replay_lsn       | 9445/8E82D2F8
write_lag        | 00:00:00.000344
flush_lag        | 00:00:00.000475
replay_lag       | 00:00:00.000674
sync_priority    | 0
sync_state       | async
reply_time       | 2024-03-14 00:45:18.968746-05

2. 为确保高可用性和最短停机时间,请使用任何 HA 工具,为主备节点配置自动故障转移/切换。在此示例中,已采用 Repmgr 作为高可用(HA)工具。要设置 repmgr,请参阅使用 repmgr 实现自动故障转移,以获取详细说明。

3. 在正确配置 repmgr 后,两个节点在集群中都将可见。在此示例中,将采用 repmgr 来进行切换。

$ /usr/lib/postgresql/14/bin/repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=172.31.26.9  user=postgres dbname=postgres password=postgres connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | host=172.31.30.60  user=postgres dbname=postgres password=postgres connect_timeout=2

4. 确认在备用节点上已禁用校验和。

show data_checksums;
 data_checksums
----------------
 off
(1 row)

5. 停止备用数据库实例。

$ sudo systemctl stop postgresql@14-main

6. 在备用数据库实例上启用校验和。

$ /usr/lib/postgresql/14/bin/pg_checksums --pgdata /var/lib/postgresql/14/main --enable --progress
25/25 MB (100%) computed
Checksum operation completed
Files scanned:  942
Blocks scanned: 3262
pg_checksums: syncing data directory
pg_checksums: updating control file
Checksums enabled in cluster

7. 启动备用数据库实例。

$ sudo systemctl start postgresql@14-main

8. 验证备用节点上的校验和状态,现在此功能肯定已启用。

show data_checksums;
 data_checksums
----------------
 on
(1 row)

9. 确认备用数据库与主数据库完全同步,并且没有延迟。

select application_name,state,pg_current_wal_lsn() as current_wal,replay_lsn,replay_lag,
  pg_size_pretty((pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn))) as size
  from pg_stat_replication order by pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn) desc;
 application_name |   state   |  current_wal  |  replay_lsn   |   replay_lag    |  size
------------------+-----------+---------------+---------------+-----------------+---------
 walreceiver      | streaming | 10BC/BC4B62E0 | 10BC/BC4B62E0 | 00:00:00.000574 | 0 bytes
(1 row)

10. 一旦确认主数据库和备用数据库之间没有延迟,请执行试运行模式的切换,以确认一切正常。

$ /usr/lib/postgresql/14/bin/repmgr -f /etc/repmgr.conf standby switchover --dry-run
NOTICE: checking switchover on node "node2" (ID: 2) in --dry-run mode
INFO: SSH connection to host "172.31.26.9" succeeded
INFO: able to execute "repmgr" on remote host "172.31.26.9"
INFO: 1 walsenders required, 10 available
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node2" (ID: 2) would be promoted to primary; current primary "node1" (ID: 1) would be demoted to standby
INFO: following shutdown command would be run on node "node1":
"pg_ctl  -D '/var/lib/postgresql/14/main' -W -m fast stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met

11. 执行实际切换。

$ /usr/lib/postgresql/14/bin/repmgr -f /etc/repmgr.conf standby switchover
NOTICE: executing switchover on node "node2" (ID: 2)
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "node1" (ID: 1)
NOTICE: issuing CHECKPOINT on node "node1" (ID: 1)
DETAIL: executing server command "pg_ctlcluster 14 main -m f stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/D000028
NOTICE: promoting standby to primary
DETAIL: promoting server "node2" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node2" (ID: 2) was successfully promoted to primary
NOTICE: node "node2" (ID: 2) promoted to primary, node "node1" (ID: 1) demoted to standby
NOTICE: switchover was successful
DETAIL: node "node2" is now primary and node "node1" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully

12. 现在,旧的备用服务器成了新的主节点。

$ /usr/lib/postgresql/14/bin/repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 1        | host=172.31.26.9  user=postgres dbname=postgres password=postgres connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 2        | host=172.31.30.60  user=postgres dbname=postgres password=postgres connect_timeout=2

13. 停止新的备用数据库(即之前的主数据库)。

$ pg_ctlcluster 14 main -m f stop

$ pg_ctlcluster 14 main -m f status
pg_ctl: no server running

14. 在新的备用数据库(即之前的主数据库)上启用校验和。

$ /usr/lib/postgresql/14/bin/pg_checksums --pgdata /var/lib/postgresql/14/main --enable
Checksum operation completed
Files scanned:  943
Blocks scanned: 3262
pg_checksums: syncing data directory
pg_checksums: updating control file
Checksums enabled in cluster

15. 启动新的备用数据库(即之前的主数据库)。

$ pg_ctlcluster 14 main -m f start

$ pg_ctlcluster 14 main -m f status
pg_ctl: server is running (PID: 27480)
/usr/lib/postgresql/14/bin/postgres "-D" "/var/lib/postgresql/14/main" "-c" "config_file=/etc/postgresql/14/main/postgresql.conf"

16. 验证新的备用数据库(即之前的主数据库)上的校验和状态。现在此功能肯定已启用。

show data_checksums;
 data_checksums
----------------
 on
(1 row)

17. 检查 repmgr 中的当前集群状态。

$ /usr/lib/postgresql/14/bin/repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 2        | host=172.31.26.9  user=postgres dbname=postgres password=postgres connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 2        | host=172.31.30.60  user=postgres dbname=postgres password=postgres connect_timeout=2

18. 确认备用数据库与新的主数据库完全同步,并且没有延迟。

select application_name,state,pg_current_wal_lsn() as current_wal,replay_lsn,replay_lag,
  pg_size_pretty((pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn))) as size
  from pg_stat_replication order by pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn) desc;
-[ RECORD 1 ]----+----------------
application_name | walreceiver
state            | streaming
current_wal      | 14F68/16C14000
replay_lsn       | 14F68/16C14000
replay_lag       | 00:00:00.000624
size             | 0 bytes

19. 在确认没有延迟后,再次执行切换,以还原到原始的主备设置。

$ /usr/lib/postgresql/14/bin/repmgr -f /etc/repmgr.conf standby switchover
NOTICE: executing switchover on node "node1" (ID: 1)
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node1" (ID: 1) will be promoted to primary; current primary "node2" (ID: 2) will be demoted to standby
NOTICE: stopping current primary node "node2" (ID: 2)
NOTICE: issuing CHECKPOINT on node "node2" (ID: 2)
DETAIL: executing server command "pg_ctlcluster 14 main -m f stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/F000028
NOTICE: promoting standby to primary
DETAIL: promoting server "node1" (ID: 1) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node1" (ID: 1) was successfully promoted to primary
NOTICE: node "node1" (ID: 1) promoted to primary, node "node2" (ID: 2) demoted to standby
NOTICE: switchover was successful
DETAIL: node "node1" is now primary and node "node2" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully

20. 验证集群状态输出,以确认其符合原始的设置。

$ /usr/lib/postgresql/14/bin/repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 3        | host=172.31.26.9  user=postgres dbname=postgres password=postgres connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 3        | host=172.31.30.60  user=postgres dbname=postgres password=postgres connect_timeout=2

总结

在生产环境中执行 pg_checksum 过程之前,应注意一些预防措施:

  • 如果有较低级别的环境可用,建议先在该环境中执行 pg_checksum。这将有助于估计执行时间和活动期间产生的复制延迟。
  • 如果没有较低的环境,则建议构建生产环境的镜像环境,并首先对其进行测试,以获得上述估计值。

在本文中,我们讨论了使用 pg_checksums 在现有正在运行的集群上启用数据校验和的过程,同时将停机时间降至最低。希望本文能为您提供有益且有用的见解。