PostgreSQL Tutorial: Tune selectivity estimation of equalities

February 28, 2024

Summary: in this tutorial, you will learn how to tune selectivity estimation of equalities.

Introduction

To improve estimation accuracy for non-uniform distributions, the analyzer collects statistics on the most common values (MCVs) and their frequencies. These are stored in pg_stats as most_common_vals and most_common_freqs, respectively.

Most common values

Example

Let’s start with a little test setup and generate some sample data from a Gaussian (normal) distribution using the Box-Muller transform:

CREATE TABLE test1 (id integer, val integer);

INSERT INTO test1 (id, val)
  SELECT i,
    -- Box-Muller transform: normally distributed values with a standard
    -- deviation of 100, rounded to integers
    CAST(cos(2 * pi() * random()) * sqrt(-2 * ln(random())) * 100 AS integer)
  FROM generate_series(1, 10000) AS s(i);

ANALYZE test1;
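
ANALYZE gathers the statistics immediately, so we can already peek at the raw MCV arrays it collected. The exact values will differ from run to run, since the sample data is random:

SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'test1' AND attname = 'val';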

SHOW default_statistics_target;

 default_statistics_target
---------------------------
 100
(1 row)

We now have a simple table containing 10,000 rows. Let’s compare the actual per-value row counts with the estimates derived from the column statistics:

WITH valrows AS (
  SELECT val, count(*) AS num FROM test1
  GROUP BY val
  ORDER BY num DESC
), attstats AS (
  SELECT reltuples, s.most_common_freqs AS mcf,
    s.most_common_vals::text::integer[] AS mcv,
    s.n_distinct
  FROM pg_class JOIN pg_stats s ON s.tablename = relname
  WHERE s.tablename = 'test1' AND s.attname = 'val'
)
SELECT val, num AS actual_rows,
  CASE WHEN pos IS NULL
  -- value not in the MCV list: spread the row fraction not covered by
  -- the MCVs evenly across the remaining distinct values
  THEN round(reltuples *
    (1 - (SELECT sum(freq) FROM unnest(mcf) AS freq)) /
      (n_distinct - array_length(mcv, 1)))
  -- value found in the MCV list: use the frequency at the same position
  ELSE round(reltuples * mcf[pos])
  END AS estimated_rows
FROM (
  SELECT array_position(mcv, val) AS pos, *
  FROM valrows, attstats
) AS res
LIMIT 200;

which outputs the following:

 val  | actual_rows | estimated_rows
------+-------------+----------------
   33 |          54 |             54
  -28 |          51 |             51
   19 |          50 |             50
   -9 |          49 |             49
   39 |          48 |             48
  ...    -- skipping ahead to row 95 --
  -35 |          35 |             35
   55 |          35 |             35
  -20 |          35 |             35
   62 |          35 |             13
   93 |          35 |             13
   56 |          35 |             13
  -37 |          35 |             35
   38 |          35 |             35
   24 |          35 |             35
   75 |          35 |             13
   44 |          34 |             13
   15 |          34 |             13
  ...
(200 rows)

Estimating the selectivity of “column = expression” is extremely straightforward: the planner looks the value up in the most_common_vals array and takes the frequency at the same position in the most_common_freqs array as the selectivity. The row estimate is then this frequency multiplied by the total number of rows in the table.
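
We can reproduce this arithmetic by hand. Here is a minimal sketch that computes the estimate for val = 33; it assumes 33 made it into the MCV list of this particular sample (otherwise array_position returns NULL and the query returns NULL):

SELECT round(c.reltuples * s.most_common_freqs[
         -- position of 33 in the MCV list, NULL if it is not an MCV
         array_position(s.most_common_vals::text::integer[], 33)
       ]) AS estimated_rows
FROM pg_class c
JOIN pg_stats s ON s.tablename = c.relname
WHERE s.tablename = 'test1' AND s.attname = 'val';

The planner arrives at the same estimate: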

EXPLAIN (analyze) SELECT * FROM test1 WHERE val = 33;

                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Seq Scan on test1  (cost=0.00..150.00 rows=54 width=8) (actual time=0.044..1.299 rows=54 loops=1)
   Filter: (val = 33)
   Rows Removed by Filter: 9946
 Planning time: 0.183 ms
 Execution time: 1.325 ms
(5 rows)

EXPLAIN (analyze) SELECT * FROM test1 WHERE val = 62;

                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Seq Scan on test1  (cost=0.00..150.00 rows=13 width=8) (actual time=0.049..1.262 rows=35 loops=1)
   Filter: (val = 62)
   Rows Removed by Filter: 9965
 Planning time: 0.121 ms
 Execution time: 1.278 ms
(5 rows)

The estimated rows will be close to the actual rows if the column value appears in the MCV list.
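
For a value missing from the MCV list, such as 62 above, the planner assumes that the rows not covered by the MCV list are spread evenly across the remaining distinct values. Here is a sketch of that calculation; it assumes n_distinct is stored as a positive absolute count, which holds for this sample (ANALYZE stores a negative fraction instead when it expects the number of distinct values to scale with the table size):

SELECT round(c.reltuples
         -- fraction of rows not covered by the MCV list...
         * (1 - (SELECT sum(f) FROM unnest(s.most_common_freqs) AS f))
         -- ...spread evenly across the remaining distinct values
         / (s.n_distinct - array_length(s.most_common_freqs, 1)))
       AS estimated_rows
FROM pg_class c
JOIN pg_stats s ON s.tablename = c.relname
WHERE s.tablename = 'test1' AND s.attname = 'val';

This matches the 13-row estimate in the plan above.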

MCV lists are also used for selectivity estimation of inequalities: to find the selectivity of “column < value”, the planner searches most_common_vals for all the values lower than the given value and adds together their frequencies from most_common_freqs. (This covers the MCV portion of the rows; the planner combines it with a histogram-based estimate for the values outside the MCV list.)
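
Here is a sketch of the MCV portion of that computation for the condition “val < 0”, using multi-argument unnest to pair each common value with its frequency:

SELECT sum(freq) AS mcv_selectivity
FROM pg_stats s,
     -- pair up the two parallel arrays element by element
     unnest(s.most_common_vals::text::integer[],
            s.most_common_freqs) AS m(val, freq)
WHERE s.tablename = 'test1' AND s.attname = 'val'
  AND m.val < 0;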

Tune the statistics target for columns

Common value statistics work best when the number of distinct values is low. The maximum size of the MCV arrays is defined by default_statistics_target, the same parameter that governs row sample size during analysis.

In some cases, increasing the parameter (and thus the array size) beyond its default will provide more accurate estimates. The statistics target can also be set per column, for example:

ALTER TABLE test1 ALTER COLUMN val SET STATISTICS 1000;
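
The new target takes effect the next time the table is analyzed; afterwards the MCV list can hold up to 1000 entries (it may still come out shorter if the column has fewer distinct values):

ANALYZE test1;

SELECT array_length(most_common_vals::text::integer[], 1) AS mcv_entries
FROM pg_stats
WHERE tablename = 'test1' AND attname = 'val';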

The row sample size used by ANALYZE will increase as well (the sample holds 300 × the statistics target rows), but only for this table.

The MCV array stores the values themselves, so depending on the data it can take up a lot of space. This is why values over 1 kB in size are excluded from analysis and statistics: the exclusion keeps pg_statistic small and avoids overloading the planner. Values this large are usually distinct anyway and wouldn’t get included in most_common_vals.

Higher estimate accuracy is beneficial only as long as it improves planning quality. Increasing default_statistics_target without a valid reason may slow analysis and planning while having no effect on optimization.

On the other hand, lowering the parameter (possibly all the way down to zero) may speed up analysis and planning, but it may also result in low-quality plans, so the time saved is rarely worth it.
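
For completeness: a per-column override can be removed by setting the target to -1, after which the column follows default_statistics_target again:

ALTER TABLE test1 ALTER COLUMN val SET STATISTICS -1;
ANALYZE test1;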

See more

PostgreSQL Optimization