PostgreSQL Tutorial: Tune estimated number of NULL values

May 17, 2024

Summary: In this tutorial, you will learn how to tune estimated number of NULL values.

Table of Contents

Introduction

While looked down upon by purists, NULL values serve as a convenient representation of values that are unknown or nonexistent.

But special values require special treatment. There are practical considerations to keep in mind when working with NULL values. Boolean logic turns into ternary, and the NOT IN construction starts behaving weirdly. It’s unclear whether NULL values are to be considered lower or higher than conventional values (special clauses NULLS FIRST and NULLS LAST help out with that). The use of NULL values in aggregate functions is sketchy, too. Because NULL values are, in fact, not values at all, the planner needs extra data to accommodate them.

In addition to basic relation-level statistics, the analyzer also collects statistics for each column in a relation. This data is stored in the pg_statistic table in the system catalog and can be conveniently displayed using the pg_stats view.

Fraction of null values is column-level statistics. It’s designated as null_frac in pg_stats.

Example

Let’s start with a little test setup:

CREATE TABLE test2 (id integer, str text);

INSERT INTO test2 (id, str)
  SELECT i,
       CASE WHEN i % 24 = 1 THEN NULL
            ELSE repeat('x', 64)
       END AS str
    FROM generate_series(1, 240000) AS s(i);

VACUUM ANALYZE test2;

In this example, the str column value in some rows is NULL:

EXPLAIN SELECT * FROM test2 WHERE str IS NULL;
                         QUERY PLAN
-------------------------------------------------------------
 Seq Scan on test2  (cost=0.00..4847.00 rows=10056 width=69)
   Filter: (str IS NULL)
(2 rows)

The optimizer multiplies the total row count by the NULL fraction:

SELECT round(reltuples * s.null_frac) AS rows
FROM pg_class
  JOIN pg_stats s ON s.tablename = relname
WHERE s.tablename = 'test2'
  AND s.attname = 'str';
 rows
-------
 10056
(1 row)

This is close enough to the true value of 10056.

Tune statistics for number of NULL values

Whenever you have significantly altered the distribution of data within a table, running ANALYZE is strongly recommended. This includes bulk loading large amounts of data into the table. Running ANALYZE (or VACUUM ANALYZE) ensures that the planner has up-to-date statistics about the table. With no statistics or obsolete statistics, the planner might make poor decisions during query planning, leading to poor performance on any tables with inaccurate or nonexistent statistics.

Note that if the autovacuum daemon is enabled, it might run ANALYZE automatically; see parameters autovacuum_analyze_threshold and autovacuum_analyze_scale_factor for more information.

See more

PostgreSQL Optimization