PostgreSQL Java Tutorial: Use UUID as primary key

July 22, 2024

Summary: in this tutorial, you will learn how to use UUID as a primary key in PostgreSQL.

Table of Contents

UUIDs are often used as database table primary keys. They are easy to generate, easy to share between distributed systems and guarantee uniqueness.

Considering the size of UUID it is questionable if it is a right choice, but often it is not up to us to decide.

This tutorial does not focus on “if UUID is the right format for a key”, but how to use UUID as a primary key with PostgreSQL efficiently.

Postgres Data Types for UUID

UUID can be seen as a string and it may be tempting to store them as such. Postgres has a flexible data type for storing strings: text and it is often used as a primary key to store UUID values.

Is it a right data type? Definitely not.

Postgres has a dedicated data type for UUIDs: uuid. UUID is a 128 bit data type, so storing single value takes 16 bytes. text data type has 1 or 4 bytes overhead plus storing the actual string.

These differences are not that important in small tables, but become an issue once you start storing hundreds of thousands or millions of rows.

I run an experiment to see what is the difference in practice. There are two tables that have just one column - an id as a primary key. First table uses text, second uuid:

create table bank_transfer(
    id text primary key
);

create table bank_transfer_uuid(
    id uuid primary key
);

I did not specify the type for primary key index, so Postgres uses the default one - B-tree.

Then I inserted 10 000 000 rows to each table using batchUpdate from Spring’s JdbcTemplate:

jdbcTemplate.batchUpdate("insert into bank_transfer (id) values (?)",
        new BatchPreparedStatementSetter() {
            @Override
            public void setValues(PreparedStatement ps, int i) throws SQLException {
                ps.setString(1, UUID.randomUUID().toString());
            }
        
            @Override
            public int getBatchSize() {
                return 10_000_000;
            }
});
jdbcTemplate.batchUpdate("insert into bank_transfer_uuid (id) values (?)",
        new BatchPreparedStatementSetter() {
            @Override
            public void setValues(PreparedStatement ps, int i) throws SQLException {
                ps.setObject(1, UUID.randomUUID());
            }

            @Override
            public int getBatchSize() {
                return 10_000_000;
            }
        });

I run a query to find the table size and the index size:

select
    relname as "table",
    indexrelname as "index",
    pg_size_pretty(pg_relation_size(relid)) "table size",
    pg_size_pretty(pg_relation_size(indexrelid)) "index size"
from
    pg_stat_all_indexes
where
    relname not like 'pg%';
+------------------+-----------------------+----------+----------+
|table             |index                  |table size|index size|
+------------------+-----------------------+----------+----------+
|bank_transfer_uuid|bank_transfer_uuid_pkey|422 MB    |394 MB    |
|bank_transfer     |bank_transfer_pkey     |651 MB    |730 MB    |
+------------------+-----------------------+----------+----------+

Table that uses text is 54% larger and the index size 85% larger. This is also reflected in number of pages Postgres uses to store these tables and indexes:

select relname, relpages from pg_class where relname like 'bank_transfer%';
+-----------------------+--------+
|relname                |relpages|
+-----------------------+--------+
|bank_transfer          |83334   |
|bank_transfer_pkey     |85498   |
|bank_transfer_uuid     |54055   |
|bank_transfer_uuid_pkey|50463   |
+-----------------------+--------+

Larger size of tables, indexes and bigger number of tables means that Postgres must perform work to insert new rows and fetch rows - especially once index sizes are larger than available RAM memory, and Postgres must load indexes from disk.

UUID and B-Tree index

Random UUIDs are not a good fit for a B-tree indexes - and B-tree index is the only available index type for a primary key.

B-tree indexes work the best with ordered values - like auto-incremented or time sorted columns.

UUID - even though always looks similar - comes in multiple variants. Java’s UUID.randomUUID() - returns UUID v4 - which is a pseudo-random value. For us the more interesting one is UUID v7 - which produces time-sorted values. It means that each time new UUID v7 is generated, a greater value it has. And that makes it a good fit for B-Tree index.

To use UUID v7 in Java we need a 3rd party library like java-uuid-generator:

<dependency>
  <groupId>com.fasterxml.uuid</groupId>
  <artifactId>java-uuid-generator</artifactId>
  <version>5.0.0</version>
</dependency>

Then we can generate UUID v7 with:

UUID uuid = Generators.timeBasedEpochGenerator().generate();

This theoretically should improve the performance of executing INSERT statements.

How UUID v7 affects INSERT performance

I created another table, exactly the same as bank_transfer_uuid but it will store only UUID v7 generated using the library mentioned above:

create table bank_transfer_uuid_v7(
   id uuid primary key
);

Then I run 10 rounds of inserting 10000 rows to each table and measured how long it takes:

for (int i = 1; i <= 10; i++) {
    measure(() -> IntStream.rangeClosed(0, 10000).forEach(it -> {
        jdbcClient.sql("insert into bank_transfer (id) values (:id)")
                .param("id", UUID.randomUUID().toString())
                .update();
    }));

    measure(() -> IntStream.rangeClosed(0, 10000).forEach(it -> {
        jdbcClient.sql("insert into bank_transfer_uuid (id) values (:id)")
                .param("id", UUID.randomUUID())
                .update();
    }));

    measure(() -> IntStream.rangeClosed(0, 10000).forEach(it -> {
        jdbcClient.sql("insert into bank_transfer_uuid_v7 (id) values (:id)")
                .param("id", Generators.timeBasedEpochGenerator().generate())
                .update();
    }));
}

The results look a little random especially when comparing times for a table with regular text column and uuid v4:

+-------+-------+---------+
| text  | uuid  | uuid v7 |
+-------+-------+---------+
| 7428  | 8584  | 3398    |
| 5611  | 4966  | 3654    |
| 13849 | 10398 | 3771    |
| 6585  | 7624  | 3679    |
| 6131  | 5142  | 3861    |
| 6199  | 10336 | 3722    |
| 6764  | 6039  | 3644    |
| 9053  | 5515  | 3621    |
| 6134  | 5367  | 3706    |
| 11058 | 5551  | 3850    |
+-------+-------+---------+

BUT we can clearly see, that inserting UUID v7 is ~2x faster and inserting regular UUID v4.

Summary

As mentioned at the beginning - due to UUID length - even with all these optimizations, it is not the best type for a primary key.

But if you must or for some reason want to use UUIDs, take into account the optimizations I mentioned. Also keep in mind that such optimizations make a difference for large datasets. If you’re storing hundreds or even few thousands of rows, and have a low traffic, you will likely not see any difference in the application performance. But if there’s a chance you will have large dataset or big traffic - it is better to do it right from the beginning as changing primary keys can be quite a challenge.