PostgreSQL Tutorial: Backing up a database cluster using pg_basebackup

October 16, 2023

Summary: pg_basebackup is the built-in binary format backup tool available with PostgreSQL. Only the backups taken using pg_basebackup can be used for point-in-time recovery and not pg_dump or pg_dumpall.

Table of Contents

Introduction

This can also be performed from the standby and not just the master. So, we can offload the backups to a standby that is always in sync and on a faster network.

pg_basebackup cannot run in parallel mode and does not support incremental or differential backups. So, it may be a great tool when the backup is a few hundred gigabytes and stored in the same data center. If the backup is several terabytes in size and needs to be pushed over the network to the cloud or a remote backup server, then it is wise to try alternate open source backup tools upon testing pg_basebackup.

Getting ready

In order to run a backup using pg_basebackup, it requires an empty directory. If we have a requirement of running a daily or a weekly backup job, maybe scheduled using a cron job, it is wise to have a dated directory created to store the backup for each day.

The privileges required to run a backup using pg_basebackup are simple. It just needs the REPLICATION role to be assigned to the user. The backup user must be created on the database cluster that is being backed up. If the backup is running from a standby, the user must be created on the master.

As pg_basebackup runs using the replication protocol, the user must be allowed to connect from the server from where the backup is running, using the replication protocol. The following is an example entry that needs to exist in the pg_hba.conf of the database server that is being backed up.

How to do it

The following are the steps involved in performing a backup using pg_basebackup, from either the primary/master server or from a standby:

1. Create a directory to store the backup. It requires an empty target directory to perform the backup:

$ sudo mkdir -p /backup_dir/`date +"%Y%m%d"`
$ sudo chown -R postgres:postgres /backup_dir/`date +"%Y%m%d"`

2. Create the user using which the backup needs to be performed. This user requires a REPLICATION role to perform the backup:

$ psql -U postgres -c "CREATE USER backup_user WITH REPLICATION PASSWORD 'secret'"

3. Add appropriate entries in the pg_hba.conf file to allow replication connections and perform a reload:

$ echo "host replication backup_user <backup_server_hostname>/32 md5" >> $PGDATA/pg_hba.conf
$ psql -c "select pg_reload_conf()"

4. Prepare a command using the following syntax to perform a backup using pg_basebackup:

$ pg_basebackup -h <hostname> -U <user> -p <port> -D <target_directory> -c <checkpoint_mode> -F<format> -P -X<wal_method> -l <backup_label>

The following is an example command to generate a backup of a PostgreSQL cluster running on 192.168.90.70 using pg_basebackup:

$ pg_basebackup -h 192.168.90.70 -U backup_user -p 5432 -D /backup_dir/`date +"%Y%m%d"` -c fast -Ft -z -P -Xs -l backup_label

How it works…

In step 1, we are creating a new directory to store the backup. Though pg_basebackup automatically creates the target directory, it will fail when it lacks privileges.

The next step is to create the user using which pg_basebackup can be executed. Even the superuser Postgres can be used to take a backup. But, a superuser is always recommended to be limited to administrative operations only. For that reason, we are creating a backup_user in step 2, on the master database server.

As pg_basebackup uses a replication protocol, the user needs to be allowed to make replication connections from the backup server. In step 3, we are appending the entry to allow replication connections from 192.168.90.70 using backup_user to the pg_hba.conf of the database server. This entry must exist on the server to which backup_user is connecting to take a backup. And then, we need to perform a reload to get the changes to take effect.

And the final step is the backup. In step 4, we saw the syntax that can be used to perform a backup. It starts with the hostname, which can either be the IP or the hostname of the database server to which backup_user connects, and then the username followed by the port on which the database cluster is running. Using -D, we specify the directory in which the backup needs to be created.

As pg_basebackup waits for a checkpoint to complete, we can either issue a fastboot or a checkpoint that is spread within the command itself, instead of waiting. And then we go onto the format of the backup. It can be taken in tar or plain format. And we could also use -z to compress it to a tar format backup. -Ft stands for tar format and -Fp for plain. In order to display the progress, we can use -P so we know how long we need to wait for the backup to finish.

pg_basebackup may be used to set up replication as well. In that case, using -R makes our job very simple. It creates required entries for the parameters in postgresql.auto.conf to set up replication.

As the backup needs to be performed online, there may be the usual database transactions that are generating a lot of WAL segments. In order to make pg_basebackup consistent, it also needs the WALs that are generated during the backup. We can either stream the WALs using a parallel stream or fetch them all after the backup is finished. As the WAL segments are recycled after a certain threshold, it may sometimes be safe to stream them using -Xs. Otherwise, we could use -Xf, which fetches the WAL segments after the backup is completed. And finally, to give a label to the backup, we could use -l.