PostgreSQL Tutorial: Index Scan Types: Bitmap, Index, and Index Only

August 23, 2024

Summary: In this tutorial, you will learn the three different PostgreSQL scan types for an index: bitmap, index, and index only.

Introduction

Performance is one of the essential aspects of a database management system. Few things are more annoying and frustrating for users than poor performance, meaning long-running queries and high response times at the front end.

One of the most effective ways to improve performance is having the proper indexes on the table columns. An index can save a lot of time in data access and lead queries to gather their results in the fastest way possible.

PostgreSQL can leverage an index in different ways to produce the most efficient plan.

In this article, we will review the following three PostgreSQL scan types for an index, which the planner chooses depending on the table, what the query retrieves, and the filters used:

  • Bitmap Index Scan
  • Index Scan
  • Index Only Scan

Building the testing scenario

We will use one table with a single index for the following exercises and examples, and review how the scan strategy changes depending on the query conditions.

Next is the table definition. I included a sequence creation statement for the id column, since having a primary key column is always a best practice, but for the specific examples we will go through, we don't need it.

CREATE SEQUENCE public.person_id_seq
    START WITH 1
    INCREMENT BY 1
    NO MINVALUE
    NO MAXVALUE
    CACHE 1;

CREATE TABLE public.person(
  id integer DEFAULT nextval('public.person_id_seq'::regclass) NOT NULL,
  first_name text NOT NULL,
  last_name text NOT NULL,
  age integer NOT NULL,
  email text NOT NULL,
  register_date timestamp with time zone DEFAULT now() NOT NULL,
  is_active boolean DEFAULT true NOT NULL
);

Execute the following SQL statement to insert ten million rows into this table:

INSERT INTO public.person
SELECT generate_series
    , md5(random()::text)
    , md5(random()::text)
    , floor(random() * 99)::int
    , md5(random()::text) || '@gmail.com'
    , now() - (random() * (interval '90 days'))
    , case when random() > 0.5 then true else false end
  FROM generate_series(1, 10000000);

Now we have our test table with some dummy data so we can practice.

Indexing the data

As we stated before, a best practice is to add a primary key to the table, but we are skipping that step and adding just a composite index that will help us review the different scan types.

We create this multicolumn index as follows:

CREATE INDEX idx_person_age_date_active ON person(age,register_date,is_active);
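
A multicolumn btree over 10M rows is not free in terms of disk space. If you want to check what the new index costs, one option (a quick sketch, assuming the index and table names above) is to query the built-in size functions:

-- Size of the new index, and of all indexes on the table, pretty-printed.
SELECT pg_size_pretty(pg_relation_size('idx_person_age_date_active')) AS index_size,
       pg_size_pretty(pg_indexes_size('person')) AS all_indexes_size;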

Here, we considered three columns with different cardinalities, meaning the proportion of distinct values relative to the total number of rows. The following are the columns ordered from highest to lowest cardinality:

  1. register_date. We loaded the 10M rows setting this column with the help of the random() function, so the number of distinct values is the largest of these three columns.
  2. age. When we loaded the data, we also used the random() function, but we “limited” the results with the floor() function, so all the values fall between 0 and 98, giving at most 99 distinct values.
  3. is_active. This column's data type is boolean, so only two different values are possible, true and false.

It is essential to think about a column's data cardinality when planning the indexes and, even before that, about the filters we will execute against the data.

For example, among the above columns, a single index over the is_active column will not add any advantage: out of all the 10M rows, only two values are possible, so if we filter for all the is_active = true rows, the planner will undoubtedly use a sequential scan.
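
You can check this yourself with a quick, throwaway experiment (the index name below is hypothetical, and the plan output is omitted since it varies by system; expect a sequential scan since the filter matches roughly half the rows):

-- Hypothetical single-column index on the low-cardinality column.
CREATE INDEX idx_person_is_active ON person(is_active);
ANALYZE person;
-- The planner should still pick a (Parallel) Seq Scan here.
EXPLAIN SELECT * FROM person WHERE is_active = true;
DROP INDEX idx_person_is_active;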

One way to verify a column's number of distinct values is by querying the pg_stats view in our database. For this, it is important to ensure the statistics are fresh; in this case, we run the ANALYZE command:

ANALYZE person;

For the previous columns, the following is the result of querying the pg_stats view:

SELECT tablename AS table_name,attname AS column_name,n_distinct AS num_distinct_values
      FROM pg_stats
      WHERE tablename = 'person'
      AND attname IN ('age','register_date','is_active')
      ORDER BY num_distinct_values DESC;
 table_name |  column_name  | num_distinct_values
------------+---------------+---------------------
 person     | age           |                  99
 person     | is_active     |                   2
 person     | register_date |                  -1
(3 rows)

And we confirm the age column has 99 distinct values, whereas is_active has only 2. The register_date column shows a negative value of -1 because, as described in the documentation, ANALYZE determined that the number of distinct values is likely the same as the total number of rows:

Note: -1 indicates a unique column in which the number of distinct values is the same as the number of rows.
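
Keep in mind that n_distinct is an estimate based on a sample. If you want the exact numbers (at the cost of a full table read), a simple sketch is to compare against COUNT(DISTINCT ...):

-- Exact distinct counts; expect values close to the pg_stats estimates above.
SELECT count(DISTINCT age) AS age_distinct,
       count(DISTINCT is_active) AS is_active_distinct,
       count(DISTINCT register_date) AS register_date_distinct
  FROM person;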

One index, different scan types

Now that we have the table data and the index in place, we can test the different scan types. First, to establish a baseline, let's verify how PostgreSQL resolves a query that asks for all the table data with no filters:

EXPLAIN (ANALYZE) SELECT * FROM person;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Seq Scan on person  (cost=0.00..304082.18 rows=10000018 width=126) (actual time=0.016..934.805 rows=10000000 loops=1)
 Planning Time: 0.129 ms
 Execution Time: 1183.355 ms
(3 rows)

As expected, to retrieve all the data from the table, the planner decided on a Sequential Scan over all the 10M rows. It makes sense since the query is getting all the rows at once. The total time it took was about 1183 ms (~1.2 sec).
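
Throughout this tutorial we use EXPLAIN (ANALYZE); if you also want to see how many pages each plan node touches, you can additionally pass the BUFFERS option, for example:

-- "shared hit" = pages found in shared_buffers; "read" = pages fetched from disk/OS cache.
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM person;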

Bitmap Index Scan

The planner chooses this index scan method when the query asks for an amount of data large enough to leverage the benefits of bulk reads, like the sequential scan does, but not so large that it actually requires processing ALL the table. We can think of the Bitmap Index Scan as something between the Sequential Scan and the Index Scan.

The Bitmap Index Scan always works in tandem with a Bitmap Heap Scan: the first scans the index to find all the suitable row locations and builds a bitmap; the second uses that bitmap to scan the heap pages one by one and gather the rows.

The following is an example of a Bitmap Index Scan using the table and the index we built before:

EXPLAIN (ANALYZE) SELECT * FROM person WHERE age = 20;
                                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=3682.90..212050.63 rows=97334 width=126) (actual time=46.142..221.876 rows=101476 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Bitmap Heap Scan on person  (cost=2682.90..201317.23 rows=40556 width=126) (actual time=24.783..189.769 rows=33825 loops=3)
         Recheck Cond: (age = 20)
         Rows Removed by Index Recheck: 534475
         Heap Blocks: exact=17931 lossy=12856
         ->  Bitmap Index Scan on idx_person_age_date_active  (cost=0.00..2658.57 rows=97334 width=0) (actual time=36.926..36.926 rows=101476 loops=1)
               Index Cond: (age = 20)
 Planning Time: 0.122 ms
 Execution Time: 225.554 ms
(11 rows)

The inner node (which executes first) is a Bitmap Index Scan on the idx_person_age_date_active index. It creates the bitmap with all the suitable row locations and passes it to its parent node (which executes after it), a Parallel Bitmap Heap Scan on the person table. This second stage goes through the pages individually, executes a Recheck of the filter condition, and returns the result data set.
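
Notice the lossy=12856 in the Heap Blocks line: the bitmap did not fit in work_mem, so PostgreSQL degraded part of it to page-level (lossy) granularity, which is why the Recheck had to remove 534475 rows. A minimal sketch to retry with more memory for the current session (the 64MB value is an illustrative assumption; with enough work_mem, all heap blocks should be reported as exact):

-- Give the bitmap more room, then re-run the same query.
SET work_mem = '64MB';
EXPLAIN (ANALYZE) SELECT * FROM person WHERE age = 20;
RESET work_mem;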

To compare, consider how the same operation performs using just a sequential scan:

START TRANSACTION;

DROP INDEX idx_person_age_date_active;

EXPLAIN (ANALYZE) SELECT * FROM person WHERE age = 20;
                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..266898.83 rows=97334 width=126) (actual time=0.852..402.355 rows=101476 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on person  (cost=0.00..256165.43 rows=40556 width=126) (actual time=0.056..365.647 rows=33825 loops=3)
         Filter: (age = 20)
         Rows Removed by Filter: 3299508
 Planning Time: 0.335 ms
 Execution Time: 406.671 ms
(8 rows)

ROLLBACK;

This query fetched about 101K rows, roughly 1% of the table. The Bitmap Index Scan took advantage of the “sequential scan” style bulk read over a limited number of pages and produced a better result than a direct Sequential Scan, performing about 2x faster.

Index Scan

This might be the scan method you think of when you hear something like, “Hey, this query is doing well; it uses the index….” It is the textbook way of accessing data through an index.

The Index Scan consists of two steps: the first gets the row location from the index, and the second gathers the actual data from the heap (the table pages). So every row fetched by an Index Scan costs two read operations. Still, this is one of the most efficient ways of retrieving data from a table.

The planner picks this scan method when the number of rows to retrieve is small, so executing the two-step Index Scan operations is “cheaper” and faster than gathering the data by processing all the table pages.

Using our test table, the following is an example of an Index Scan:

EXPLAIN (ANALYZE) SELECT * FROM person WHERE age = 20
      AND register_date = '2023-03-23 19:50:03.22938+00'::timestamp;
                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_person_age_date_active on person  (cost=0.56..8.58 rows=1 width=126) (actual time=0.039..0.040 rows=1 loops=1)
   Index Cond: ((age = 20) AND (register_date = '2023-03-23 19:50:03.22938'::timestamp without time zone))
 Planning Time: 0.190 ms
 Execution Time: 0.064 ms
(4 rows)

See that, compared with the query we used before, we have added a new filter expression: AND register_date = '2023-03-23 19:50:03.22938+00'::timestamp. The register_date column is part of the multicolumn index idx_person_age_date_active. Since we are filtering by a single value, there is one matching index entry, so PostgreSQL gets the specific row location from the index in one read and then all the row data from the table page at that location in a second read. The whole query took 0.064 ms; that was fast!

In the above example, the query filters by a specific timestamp value for the register_date column, but PostgreSQL will still choose the Index Scan for multiple rows if the number of rows is small, as in the following:

EXPLAIN (ANALYZE) SELECT * FROM person WHERE age = 20
      AND register_date BETWEEN '2023-03-23 19:50:00'::timestamp AND '2023-03-23 20:00:00'::timestamp;
                                                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_person_age_date_active on person  (cost=0.56..8.58 rows=1 width=126) (actual time=0.044..0.167 rows=8 loops=1)
   Index Cond: ((age = 20) AND (register_date >= '2023-03-23 19:50:00'::timestamp without time zone) AND (register_date <= '2023-03-23 20:00:00'::timestamp without time zone))
 Planning Time: 0.127 ms
 Execution Time: 0.337 ms
(4 rows)

The query filtered the register_date column by a range using the BETWEEN operator. Based on the statistics, the planner estimated the result would be one row and chose the Index Scan. In the end, the result set was eight rows, so eight pairs of read operations were performed. Still, the query was resolved very quickly, in 0.337 ms.
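
One caveat worth remembering with this multicolumn index: column order matters. A btree on (age, register_date, is_active) supports lookups that constrain the leading age column, but a filter on register_date alone cannot descend the tree directly, so the planner will most likely pick a different plan. You can verify this on your own system (plan output omitted; expect a sequential or parallel sequential scan):

-- No condition on the leading column "age": the multicolumn index
-- cannot be used for a direct lookup here.
EXPLAIN (ANALYZE) SELECT * FROM person
      WHERE register_date = '2023-03-23 19:50:03.22938+00'::timestamp;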

Index Only Scan

Finally, we will review the Index Only Scan method. This is an optimization PostgreSQL applies to improve on the standard Index Scan method.

PostgreSQL uses this method when all the data the query asks for already exists in the index; in other words, the columns/expressions in the SELECT and WHERE clauses must all be part of the index. This makes it possible to skip the second read operation that fetches the data from the table pages and to return the result from the index read alone.

In the following, we are using almost the same query we used for the Index Scan example, but instead of asking for ALL the row columns (*), we are just retrieving the three columns we used to build the multicolumn index:

EXPLAIN (ANALYZE) SELECT age,register_date,is_active FROM person
      WHERE age = 20
      AND register_date = '2023-03-23 19:50:03.22938+00'::timestamp;
                                                              QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using idx_person_age_date_active on person  (cost=0.56..4.58 rows=1 width=13) (actual time=0.034..0.036 rows=1 loops=1)
   Index Cond: ((age = 20) AND (register_date = '2023-03-23 19:50:03.22938'::timestamp without time zone))
   Heap Fetches: 0
 Planning Time: 0.103 ms
 Execution Time: 0.058 ms
(5 rows)

See that the EXPLAIN output now says Index Only Scan, and it also confirms there was no access to the heap (table pages) with the line Heap Fetches: 0. Note that this depends on the visibility map being up to date (VACUUM maintains it): for heap pages not marked all-visible, the Index Only Scan still has to visit the heap to verify row visibility. The time was even better than the Index Scan from before, only 0.058 ms. This scan method can deliver the best possible performance for queries that fit its conditions.
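
If a query needs a column you don't want as part of the index key, you can, since PostgreSQL 11, attach it as a non-key payload column with INCLUDE. A minimal sketch (the index name is hypothetical, and it assumes the same person table):

-- Covering index: email is stored in the index leaf pages as payload only,
-- so a query selecting it can still qualify for an Index Only Scan.
CREATE INDEX idx_person_age_date_incl_email
    ON person(age, register_date) INCLUDE (email);

EXPLAIN (ANALYZE) SELECT age, register_date, email FROM person
      WHERE age = 20
      AND register_date = '2023-03-23 19:50:03.22938+00'::timestamp;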

Remember, indexing ALL the columns so that the index contains ALL the same data as the table is not a good idea. In that case, PostgreSQL will not see any advantage in using the index and will pick the sequential scan approach. See the following:

START TRANSACTION ;

DROP INDEX "idx_person_age_date_active";

CREATE INDEX idx_person_all ON person(id,first_name,last_name,age,email,register_date,is_active);

ANALYZE person;

EXPLAIN (ANALYZE) SELECT * FROM person WHERE age = 20
      AND register_date = '2023-03-23 19:50:03.22938+00'::timestamp;
                                                        QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..267582.21 rows=1 width=126) (actual time=6662.141..6671.741 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on person  (cost=0.00..266582.11 rows=1 width=126) (actual time=5093.681..6636.195 rows=0 loops=3)
         Filter: ((age = 20) AND (register_date = '2023-03-23 19:50:03.22938'::timestamp without time zone))
         Rows Removed by Filter: 3333333
 Planning Time: 2.704 ms
 Execution Time: 6673.001 ms
(8 rows)

ROLLBACK;

In the above, in a single transaction, we dropped the multicolumn index we used in the previous examples, created a new one covering ALL the table columns, refreshed the statistics, and ran a query asking for all the columns (*) with a specific filter. As a result, the planner chose the Sequential Scan and tried to speed it up by executing the operation in parallel. Still, the final execution time was far from our earlier good results.

Final pieces of advice

Now that we have reviewed the different scan options PostgreSQL can use on a single index depending on the stored data and the query filter conditions, let me share some final thoughts you may find helpful when planning your queries' filters and table indexes.

  1. Index the columns with the highest cardinality. Following this, your queries can perform best when filtering by those columns. Indexes on columns with low cardinality have the opposite effect: they add extra maintenance, and with high certainty, the planner will not use them.
  2. Plan your queries for small (specific) data sets rather than larger ones. If your workload and service design allow it, filter your data with the aim of retrieving just a few rows. As we saw, the Index Scan is an effective and optimized technique to retrieve data quickly, and PostgreSQL will use it if the result set is small enough.
  3. Retrieve just the columns you need. By doing this, PostgreSQL can leverage the Index Only Scan and avoid the “extra” read from the heap (table pages). These savings have a noticeably good effect in a high query volume environment. Remember that multicolumn indexes are not the only way to lead the planner to choose the Index Only Scan; you can also consider covering indexes.
  4. Tune the random_page_cost parameter. Lowering it leads the planner to prefer index scans over sequential scans. Modern SSDs deliver much better throughput for random read access than spinning disks, so you might consider adjusting this parameter accordingly (see the sketch after this list).
  5. Tune the effective_cache_size parameter. Setting this parameter to a higher value (around 75% of total RAM if your machine is dedicated to the PostgreSQL service) helps the planner choose index scans over sequential scans (see the sketch after this list).
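
For items 4 and 5, a minimal sketch of how these settings could be adjusted; the values below are illustrative assumptions for a hypothetical SSD-backed 16GB machine, not recommendations:

-- Experiment at the session level first.
SET random_page_cost = 1.1;           -- default is 4.0, which assumes spinning disks
SET effective_cache_size = '12GB';    -- roughly 75% of RAM on the assumed machine
-- To persist cluster-wide after testing (requires a configuration reload):
-- ALTER SYSTEM SET random_page_cost = 1.1;
-- ALTER SYSTEM SET effective_cache_size = '12GB';
-- SELECT pg_reload_conf();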

Remember that every implementation is different and the details matter, so before adjusting anything in production, it is advisable to analyze and test the effects in a lower environment.

See more

PostgreSQL Optimization