PostgreSQL Tutorial: Check data file integrity

March 9, 2024

Summary: in this tutorial, you will learn how to check data file integrity in PostgreSQL.

Table of Contents

Background

In rare cases, a data file related to a table goes missing, maybe due to OS (hardware problem, such as tablespace associated storage devices detached) or due to human interruption, which causes the deletion of some data file unintentionally at the OS level. Though it is not at all recommended to touch the /data/base/ directory and go through files under this /var/lib/postgresql/14/main/base/, however, sometimes it happens.

Our current database was running fine with the below structure:

                        List of databases
   Name    |  Owner   | Encoding | Collate |  Ctype  |   Access privileges   |  Size   | Tablespace |                Description
-----------+----------+----------+---------+---------+-----------------------+---------+------------+--------------------------------------------
 redrock   | postgres | UTF8     | C.UTF-8 | C.UTF-8 |                       | 9561 kB | pg_default |
 postgres  | postgres | UTF8     | C.UTF-8 | C.UTF-8 |                       | 8553 kB | pg_default | default administrative connection database
 template0 | postgres | UTF8     | C.UTF-8 | C.UTF-8 | =c/postgres          +| 8401 kB | pg_default | unmodifiable empty database
           |          |          |         |         | postgres=CTc/postgres |         |            |
 template1 | postgres | UTF8     | C.UTF-8 | C.UTF-8 | =c/postgres          +| 8553 kB | pg_default | default template for new databases
           |          |          |         |         | postgres=CTc/postgres |         |            |
(4 rows)

Somehow we are getting the below error message in PostgreSQL logs:

2023-06-14 09:58:06.408 UTC [4056] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2023-06-14 09:58:06.412 UTC [4056] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-06-14 09:58:06.423 UTC [4057] LOG:  database system was shut down at 2023-06-14 09:58:04 UTC
2023-06-14 09:58:06.432 UTC [4056] LOG:  database system is ready to accept connections
2023-06-16 10:00:58.130 UTC [35062] postgres@redrock ERROR:  could not open file "base/16384/16391": No such file or directory
2023-06-16 10:00:58.130 UTC [35062] postgres@redrock STATEMENT:  select * from test limit 1;
2023-06-16 10:01:59.191 UTC [35224] postgres@redrock ERROR:  could not open file "base/16384/16391": No such file or directory
2023-06-16 10:01:59.191 UTC [35224] postgres@redrock STATEMENT:  select * from test limit 1;

Sometimes, there might be none error messages in PostgreSQL logs, if just one segment in a relation is missing. Then, how can we find all relations which missing files?

Find relations missing files

You can use the following function, to check data integrity of a relation in PostgreSQL:

CREATE OR REPLACE FUNCTION check_relation_integrity(relation regclass)
  RETURNS boolean AS $$
DECLARE
  blk_size integer;
  fsm_size bigint;
  min_size bigint;
  main_size bigint;
BEGIN
  IF pg_stat_file(pg_relation_filepath(relation)) IS NULL THEN
    RETURN FALSE;
  END IF;

  SELECT current_setting('block_size') INTO blk_size;
  SELECT pg_relation_size(relation, 'fsm') INTO fsm_size;
  SELECT pg_relation_size(relation, 'main') INTO main_size;

  IF fsm_size < blk_size * 3 THEN
    RETURN TRUE;
  END IF;

  min_size := (fsm_size / blk_size - 3) * (500 * blk_size / 1024) * blk_size;
  RETURN (main_size > min_size);
END;
$$ LANGUAGE plpgsql;

The basic checking logic is that, we can verify the size of a relation’s main data file, by the size of its FSM file. Each relation has a Free Space Map (FSM) to keep track of available space for each page in the relation’s main data file. With a block size of 8192, the main data file size is about 4000 times the FSM file size.

The following basic query will tell us which relations missing files in a database:

SELECT current_database() AS dbname, n.nspname AS schema, c.relname,
       pg_relation_filepath(c.oid) AS path
FROM pg_class c, pg_namespace n
WHERE c.relnamespace = n.oid AND
      c.relkind IN ('r', 'i', 't', 'S') AND
      c.relpersistence = 'p' AND
      NOT check_relation_integrity(c.oid);

 dbname  | schema | relname | path
---------+--------+---------+------------------
 redrock | public | test    | base/16384/16391
(1 row)

Upon checking, we found it was due to one file (base/16384/16391) being removed. So we need to confirm whether this base/16384/16391 file is available in /base location.

postgres@dbserver:~/14/main$ ls -l base/16384/16391
ls: cannot access 'base/16384/16391': No such file or directory

From the above, we have identified that the file for the table “test” with relid 16391 got deleted. We need to identify whether it was deleted manually by mistake or was due to hardware failure.

Restore missing files

In case of hardware failure, first, we need to fix the hardware issue or migrate our database to new hardware and then perform a restore, as mentioned below.

To restore, we can follow either of below approaches:

  1. If the standby server is there and the issue does not persist on that, then we can fix the issue by taking logical backup from the standby server, dropping the table on primary, and recreating it using the backup taken.
  2. Perform PITR if we have backup configured with proper archive backup.
  3. Restore database “redrock” if we have proper backup configured using pg_basebackup backup.
  4. If we have a table-level logical backup, then we can restore the “test” table backup, and our database will be in good shape.

See more

PostgreSQL Monitoring