PostgreSQL Tutorial: Tune selectivity estimation of expressions

March 1, 2024

Summary: in this tutorial, you will learn how to tune selectivity estimation of expressions.

Table of Contents

Introduction

Generally speaking, column statistics are only used when an operation calls for the column itself, not for an expression with the column as a parameter. The planner doesn’t know how a function will affect column statistics, so conditions like “function-call = constant” are always estimated at 0.5%.

Example

Let’s start with a little test setup:

CREATE TABLE events (id integer, occurred_at timestamp with time zone);

INSERT INTO events (id, occurred_at)
  SELECT i, TIMESTAMP '2024-01-01 00:00:00+08' + i * '2 minutes'::interval
  FROM generate_series(1, 250000) AS s(i);

ANALYZE events;

For example, we want to query some records from the table, filter by an expression with the column occurred_at as a parameter:

EXPLAIN SELECT * FROM events
WHERE extract(
  month FROM occurred_at AT TIME ZONE 'Asia/Shanghai'
) = 2;

                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Seq Scan on events  (cost=0.00..5727.00 rows=1250 width=12)
   Filter: (EXTRACT(month FROM (occurred_at AT TIME ZONE 'Asia/Shanghai'::text)) = '2'::numeric)
(2 rows)

SELECT round(reltuples * 0.005)
FROM pg_class WHERE relname = 'events';

 round
−−−−−−−
  1250
(1 row)

The planner can’t process even standard functions, while to us it’s obvious that the fraction of events in February will be around 1/12 of total events:

SELECT count(*) AS total,
  count(*) FILTER (WHERE extract(
    month FROM occurred_at AT TIME ZONE 'Asia/Shanghai'
  ) = 2) AS february
FROM events;

 total  | february
--------+----------
 250000 |    20880
(1 row)

This is where expression statistics come in.

Extended expression statistics

PostgreSQL 14 introduced a feature known as extended expression statistics. Extended expression statistics aren’t collected automatically. To collect them manually, use the CREATE STATISTICS command to create an extended statistics database object.

CREATE STATISTICS events_expr ON (extract(
  month FROM occurred_at AT TIME ZONE 'Asia/Shanghai'
))
FROM events;

The new statistics will improve the estimate:

ANALYZE events;

EXPLAIN SELECT * FROM events
WHERE extract(
  month FROM occurred_at AT TIME ZONE 'Asia/Shanghai'
) = 2;

                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Seq Scan on events  (cost=0.00..5727.00 rows=20392 width=12)
   Filter: (EXTRACT(month FROM (occurred_at AT TIME ZONE 'Asia/Shanghai'::text)) = '2'::numeric)
(2 rows)

For the statistics to work, the expression in the statistics generation command must be identical to the one in the original query.

Extended statistics metadata is stored in the pg_statistic_ext table in the system catalog, while the statistics data itself is stored in a separate table pg_statistic_ext_data (in PostgreSQL 12 and higher). It is stored separately from the metadata to restrict user access to sensitive information, if necessary.

There are views that display collected statistics in a user-friendly form. Extended expression statistics can be displayed with the following command:

SELECT left(expr, 60) || '...' AS expr,
  null_frac, avg_width, n_distinct,
  most_common_vals AS mcv,
  left(most_common_freqs::text, 60) || '...' AS mcf,
  correlation
FROM pg_stats_ext_exprs WHERE statistics_name = 'events_expr' \gx

-[ RECORD 1 ]----------------------------------------------------------------
expr        | EXTRACT(month FROM (occurred_at AT TIME ZONE 'Asia/Shanghai'...
null_frac   | 0
avg_width   | 8
n_distinct  | 12
mcv         | {10,8,9,6,5,1,3,7,11,4,2,12}
mcf         | {0.09046666,0.09036667,0.089933336,0.089666665,0.0894,0.0876...
correlation | 1

The amount of collected statistics data can be altered with the ALTER STATISTICS command:

ALTER STATISTICS events_expr SET STATISTICS 50;

See more

PostgreSQL Optimization