Misuse of JSON in PostgreSQL schemas

By John Doe, October 25, 2024

Summary: In this article, we look at how json and jsonb are commonly misused in PostgreSQL schemas, and when they’re actually the right tool for the job.


Introduction

PostgreSQL has json support – but you shouldn’t use it for the great majority of what you’re doing. This goes for hstore too, and the new jsonb type. These types are useful tools where they’re needed, but should not be your first choice when modelling your data in PostgreSQL, as it’ll make querying and manipulating it harder.

Some readers may be familiar with the entity–attribute–value model. EAV has been a necessary evil in database schemas for a long time. It’s something we all advise against using and look down on, then resort to when reality and business requirements mean not everything can be fit into rigid fields in properly modelled relations. For those who don’t know it, it’s a schema design where you turn a relational database into a poor-man’s key/value store using a table with object_id (“entity”), key (“attribute”) and value columns. Querying it usually involves lots and lots of self-joins.

From Wikipedia

An entity–attribute–value model (EAV) is a data model optimized for the space-efficient storage of sparse—or ad-hoc—property or data values, intended for situations where runtime usage patterns are arbitrary, subject to user variation, or otherwise unforeseeable using a fixed design.
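
For illustration, a minimal EAV layout and the kind of self-join it forces might look like this (a sketch; the table and column names are mine):

-- The "entity", "attribute" and "value" columns described above.
CREATE TABLE entity_attributes (
    object_id integer NOT NULL,
    key       text    NOT NULL,
    value     text,
    PRIMARY KEY (object_id, key)
);

-- Even "give me each object's name and phone number" already needs a self-join:
SELECT n.object_id, n.value AS name, p.value AS phone
FROM entity_attributes n
JOIN entity_attributes p
  ON p.object_id = n.object_id AND p.key = 'phone'
WHERE n.key = 'name';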

JSON is the new EAV – a great tool when you need it, but not something you should use as a first choice.

(Most of this also applies to PostgreSQL arrays, which are great tools for building queries, but not usually an appropriate way to store authoritative data.)

JSON blob tables

Since the release of PostgreSQL 9.3, and even more since jsonb in 9.4, there are more and more Stack Overflow questions where people ask how to do joins on json objects, how to query for a nested key inside any array element, etc. All reasonable things to do, but when asked for schema and example data they’ll often post something like this:

CREATE TABLE people(
    id serial primary key,
    data jsonb not null
);

INSERT INTO people(data) VALUES ($$
{
    "name": "Bob",
    "addresses": [
        {
            "street": "Centre",
            "streetnumber": 24,
            "town": "Thornlie",
            "state": "WesternAustralia",
            "country": "Australia"
        },
        {
            "street": "Example",
            "streetnumber": "4/311-313",
            "town": "Auckland",
            "country": "NewZealand"
        }
    ],
    "phonenumbers": [
        {
            "type": "mobile",
            "number": "12345678"
        }
    ]
}
$$);

INSERT INTO people(data) VALUES ($$
{
  "name": "Fred",
  "phonenumbers": [
    { "type": "mobile", "number": "12345678" }
  ]
}
$$);

INSERT INTO people(data) VALUES ($$
{
  "name": "John Doe"
}
$$);

and ask “how do I find people who have the same phone number?”.

Experienced relational database users will already be wincing… but remember, not everybody is used to inverting their thinking to follow foreign keys backwards from child to parent and decomposing composite entities into normal form. When you think about it, relational databases are weird.

Most of PostgreSQL’s power comes from that relational model, though. When you store your data as json blobs you deny the query planner the ability to make sensible decisions based on table and column statistics, you lose most of the indexing features and scan types, and are generally restricted to quite primitive operations. Doing anything interesting tends to involve lots of self-joins and filters.
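
To be fair, jsonb isn’t entirely unindexable: a GIN index on the whole column supports the containment (@>) and key-existence (?) operators. But that’s a much narrower toolbox than ordinary columns and btree indexes give you, as a quick sketch against the people table above shows:

-- A GIN index supports jsonb containment and existence checks,
-- but not ordinary btree lookups, range scans, or useful join estimates.
CREATE INDEX people_data_gin ON people USING GIN (data);

-- e.g. people with this exact phone number recorded anywhere in the array:
SELECT id, data ->> 'name' AS name
FROM people
WHERE data @> '{"phonenumbers": [{"number": "12345678"}]}';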

Baroque queries

For example, the request to find people with matching phone numbers could be turned into a query like:

select 
  p1.id AS person1,
  p2.id AS person2,
  p1.data ->> 'name' AS "p1 name",
  p2.data ->> 'name' AS "p2 name",
  pns1 ->> 'type' AS "type", 
  pns1 ->> 'number' AS "number" 
from people p1 
  inner join people p2 
    on (p1.id > p2.id)
  cross join lateral jsonb_array_elements(p1.data -> 'phonenumbers') pns1
  inner join lateral jsonb_array_elements(p2.data -> 'phonenumbers') pns2 
    on (pns1 -> 'type' = pns2 -> 'type' AND pns1 -> 'number' = pns2 -> 'number');

Isn’t that just a thing of readable beauty?

There’s nothing particularly wrong with the json support functions and operators; it’s just awkward to take nested, object-like data and query it with relational joins and predicates.

At least PostgreSQL supports LATERAL; without that, this would be a nightmare of a query to write.

As it happens, an equivalent query on a relational schema (sketched below) for this particular question isn’t that much nicer:

SELECT
  p1.id AS "person1",
  p2.id AS "person2",
  p1.name AS "p1 name",
  p2.name AS "p2 name",
  pns1.phoneno AS "p1 phone",
  pns2.phoneno AS "p2 phone"
FROM
  person p1
  INNER JOIN person p2 ON (p1.id < p2.id)
  INNER JOIN phone pns1 ON (p1.id = pns1.person_id)
  INNER JOIN phone pns2 ON (p2.id = pns2.person_id)
WHERE
  pns1."type" = pns2."type"
  AND pns1.phoneno = pns2.phoneno;

… but it’s likely to go a lot faster with an index on (phone.phoneno) than the json based query would, and read a lot less data off disk in the process.
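
For reference, the relational version assumes a conventional normalized schema along these lines (a sketch: the table and column names are taken from the query above, addresses are omitted, and the index is the one just mentioned):

CREATE TABLE person (
    id   serial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE phone (
    person_id integer NOT NULL REFERENCES person(id),
    "type"    text NOT NULL,
    phoneno   text NOT NULL
);

CREATE INDEX phone_phoneno_idx ON phone (phoneno);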

No fixed data typing

The json-based query given above is also buggy: jsonb comparisons are sensitive to data type, so the jsonb values “12345678” and 12345678 compare as unequal:

SELECT '12345678'::jsonb, '"12345678"'::jsonb, '12345678'::jsonb = '"12345678"'::jsonb AS "isequal";
  jsonb   |   jsonb    | isequal 
----------+------------+---------
 12345678 | "12345678" | f

so this:

insert into people (data) values
('{"phonenumbers": [{"type":"mobile","number":12345678}]}');

won’t be identified as a duplicate even though it should be.

Note that this is arguably a PostgreSQL limitation, since a JavaScript interpreter’s loose typing means they compare as equal:

> 12345678 == "12345678"
true

To get the numbers to match using a PostgreSQL query, we have to change our query so that this:

on (pns1 -> 'type' = pns2 -> 'type' AND pns1 -> 'number' = pns2 -> 'number');

uses the ->> json-value-as-text operator instead:

on (pns1 -> 'type' = pns2 -> 'type' AND pns1 ->> 'number' = pns2 ->> 'number');

since the text form of the numbers is the same.

No constraints

Of course, someone might insert:

insert into people (data) values
('{"phonenumbers": [{"type":"Mobile","number":"1234 5678"}]}');

… which won’t match because of the capital M and the space in the number.

Because everything is inside a json blob, you can’t easily add CHECK constraints on the values, use ENUM types or DOMAINs, create trigger-maintained shadow normalized columns, or apply any of the usual approaches to data normalisation problems like this. You’d have to write a json validator/transform trigger in a procedural language like pl/v8, or do it all in the application.
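
With a relational phone table like the one sketched earlier, the usual tools are straightforward to apply; the particular rules below are illustrative rather than taken from any real schema:

-- Reject badly formatted numbers at write time:
ALTER TABLE phone
    ADD CONSTRAINT phoneno_digits_only CHECK (phoneno ~ '^[0-9]+$');

-- Restrict phone types to a known set, normalising case on the way in:
CREATE TYPE phone_kind AS ENUM ('mobile', 'home', 'work');
ALTER TABLE phone
    ALTER COLUMN "type" TYPE phone_kind USING lower("type")::phone_kind;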

Slow queries due to lack of statistics

For traditional data types, PostgreSQL stores statistics about the distribution of values in each column of each table, such as:

  • the number of distinct values seen
  • the most common values
  • the fraction of entries that are NULL
  • for ordered types, a histogram sketch of the distribution of values in the column

For a given query, the query planner uses these statistics to estimate which execution plan will be the fastest. For example, let’s make a table with 1 million “measurements” of three values, each chosen at uniform random from {0, 1}. Each measurement was taken by one of 10,000 scientists, and each scientist comes from one of three labs:

CREATE TABLE measurements (
  tick BIGSERIAL PRIMARY KEY,
  value_1 INTEGER,
  value_2 INTEGER,
  value_3 INTEGER,
  scientist_id BIGINT
);
INSERT INTO measurements (value_1, value_2, value_3, scientist_id)
  SELECT
    trunc(2 * random()),
    trunc(2 * random()),
    trunc(2 * random()),
    trunc(10000 * random() + 1)
  FROM generate_series(0, 999999);

CREATE TABLE scientist_labs (scientist_id BIGSERIAL PRIMARY KEY, lab_name TEXT);
INSERT INTO scientist_labs (lab_name)
  SELECT (
    '{"Tsinghua University", "Industrial Labs", "Western Science People"}'::TEXT[]
  )[i % 3 + 1]
  FROM generate_series(1, 10000) i;

ANALYZE;
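
You can inspect what ANALYZE just collected for the planner through the pg_stats view, for example:

-- Per-column statistics the planner will use for the measurements table.
SELECT attname, null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'measurements';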

Let’s say we want to get the tick marks in which all three values were 0 — which should be about 1/8th of them — and see how many times each lab was represented amongst the corresponding scientists. Our query will look something like this:

SELECT lab_name, COUNT(*)
FROM (
  SELECT scientist_id
  FROM measurements
  WHERE
    value_1 = 0 AND
    value_2 = 0 AND
    value_3 = 0
) m
JOIN scientist_labs AS s
  ON (m.scientist_id = s.scientist_id)
GROUP BY lab_name

And our query plan will look something like this:

                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=28905.96..28905.99 rows=3 width=20) (actual time=297.276..297.278 rows=3 loops=1)
   Group Key: s.lab_name
   ->  Hash Join  (cost=296.00..28279.80 rows=125232 width=20) (actual time=5.606..262.124 rows=125222 loops=1)
         Hash Cond: (measurements.scientist_id = s.scientist_id)
         ->  Seq Scan on measurements  (cost=0.00..24853.00 rows=125232 width=8) (actual time=0.016..177.659 rows=125222 loops=1)
               Filter: ((value_1 = 0) AND (value_2 = 0) AND (value_3 = 0))
               Rows Removed by Filter: 874778
         ->  Hash  (cost=171.00..171.00 rows=10000 width=28) (actual time=5.575..5.575 rows=10000 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 599kB
               ->  Seq Scan on scientist_labs s  (cost=0.00..171.00 rows=10000 width=28) (actual time=0.006..2.328 rows=10000 loops=1)
 Planning time: 0.603 ms
 Execution time: 297.346 ms
(12 rows)

Time: 300.463 ms

This is what we’d hope to see: the planner knows from our table statistics that about 1/8th of the rows in measurements will have value_1, value_2, and value_3 equal to 0, so about 125,000 of them will need to be joined with a scientist’s lab, and the database does so via a hash join. That is, load the contents of scientist_labs into a hash table keyed on scientist_id, scan through the matching rows from measurements, and look each one up in the hash table by its scientist_id value. The execution is fast — about 300 ms on my machine.

Let’s say we instead store our measurements as JSONB blobs, like this:

CREATE TABLE measurements (tick BIGSERIAL PRIMARY KEY, record JSONB);
INSERT INTO measurements (record)
  SELECT (
    '{ "value_1":' || trunc(2 * random()) ||
    ', "value_2":' || trunc(2 * random()) ||
    ', "value_3":' || trunc(2 * random()) ||
    ', "scientist_id":' || trunc(10000 * random() + 1) || ' }')::JSONB
  FROM generate_series(0, 999999) i;

The analogous read query would look like this:

SELECT lab_name, COUNT(*)
FROM (
  SELECT (record ->> 'scientist_id')::BIGINT AS scientist_id
  FROM measurements
  WHERE
    (record ->> 'value_1')::INTEGER = 0 AND
    (record ->> 'value_2')::INTEGER = 0 AND
    (record ->> 'value_3')::INTEGER = 0
) m
JOIN scientist_labs AS s
  ON (m.scientist_id = s.scientist_id)
GROUP BY lab_name

The performance is dramatically worse — a whopping 584 seconds on my laptop, about 2000x slower:

                                                                                QUERY PLAN                                                                                 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=58553.01..58553.02 rows=1 width=20) (actual time=583724.702..583724.703 rows=3 loops=1)
   Group Key: s.lab_name
   ->  Nested Loop  (cost=0.00..58553.00 rows=1 width=20) (actual time=5.457..583510.640 rows=124616 loops=1)
         Join Filter: (((measurements.record ->> 'scientist_id'::text))::bigint = s.scientist_id)
         Rows Removed by Join Filter: 1246035384
         ->  Seq Scan on measurements  (cost=0.00..58182.00 rows=1 width=105) (actual time=0.032..1134.662 rows=124616 loops=1)
               Filter: ((((record ->> 'value_1'::text))::integer = 0) AND (((record ->> 'value_2'::text))::integer = 0) AND (((record ->> 'value_3'::text))::integer = 0))
               Rows Removed by Filter: 875384
         ->  Seq Scan on scientist_labs s  (cost=0.00..171.00 rows=10000 width=28) (actual time=0.003..0.864 rows=10000 loops=124616)
 Planning time: 0.603 ms
 Execution time: 583724.765 ms
(11 rows)

Time: 583730.320 ms

The underlying reason is that PostgreSQL doesn’t know how to keep statistics on the values of fields within JSONB columns. It has no way of knowing, for example, that record ->> 'value_2' = 0 will be true about 50% of the time, so it relies on a hardcoded estimate of 0.1%. So, it estimates that 0.1% of 0.1% of 0.1% of the measurements table will be relevant (which it rounds up to ~1 row). As a result, it chooses a nested loop join: for each row in measurements that passes our filter, look up the corresponding lab_name in scientist_labs via the primary key of the latter table. But since there are ~125,000 such measurements, instead of ~1, this turns out to take an eternity.

As always, accurate statistics are a critical ingredient to good database performance. In their absence, the planner can’t determine which join algorithms, join orders, or scan types will make your query fast. The result is that innocent queries will blow up on you. This is one of the hidden costs of JSONB: your data doesn’t have statistics, so the query planner is flying blind.

This is not an academic consideration. This caused production issues for us, and the only way to get around them was to disable nested loops entirely as a join option, with a global setting of enable_nestloop = off. Ordinarily, you should never do something like that.

This probably won’t bite you in a key-value / document-store workload, but it’s easy to run into this if you’re using JSONB along with analytical queries.
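
One partial workaround, if you must filter on fields buried in JSONB, is an expression index: ANALYZE gathers statistics on indexed expressions, so the planner at least gets sensible row estimates for those particular expressions. A sketch (the index name is mine, and this is no substitute for real columns):

-- Collects statistics on the extracted value (and can be used for lookups),
-- at the cost of maintaining one extra index per field you care about.
CREATE INDEX measurements_value_1_idx
    ON measurements (((record ->> 'value_1')::integer));
ANALYZE measurements;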

Larger table footprint

Under the hood, PostgreSQL’s JSON datatype stores your blobs as strings that it happens to know are valid JSON. The JSONB encoding has a bit more overhead, with the upside that you don’t need to parse the JSON to retrieve a particular field. In both cases, at the very least, the database stores each key and value in every row. PostgreSQL doesn’t do anything clever to deduplicate commonly occurring keys.

Using the above measurements table again, the initial non-JSONB version of our table takes up 79 MB of disk space, whereas the JSONB variant takes 164 MB — more than twice as much. That is, the majority of our table contents are the strings value_1, value_2, value_3, and scientist_id, repeated over and over again. So, in this case, you would need to pay for twice as much disk, not to mention the follow-on effects that make all sorts of operations slower or more expensive. The original schema will cache much better, or might fit entirely in memory, and its smaller size means it will also require half as much I/O for large reads or maintenance operations.
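
If you want to check the footprint on your own copy, pg_total_relation_size reports a table’s on-disk size including its TOAST data and indexes:

SELECT pg_size_pretty(pg_total_relation_size('measurements')) AS total_size;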

For a less contrived anecdote, we found a disk space savings of about 30% by pulling 45 commonly used fields out of JSONB and into first-class columns. On a petabyte-scale dataset, that turns out to be a pretty big win.

As a rule of thumb, each column costs about 1 bit of overhead for each row in your table, regardless of whether the column’s value is null. So, for example, if an optional field is going to have a ten-character key in your JSONB blobs, and thus cost at least 80 bits to store the key in each row in which it’s present, it will save space to give it a first-class column if it’s present in at least 1/80th of your rows.

For datasets with many optional values, it is often impractical or impossible to include each one as a table column. In cases like these, JSONB can be a great fit, both for simplicity and performance. But for values that occur in most of your rows, it’s still a good idea to keep them separate.

In practice, there is often additional context informing how you organize your data, such as the engineering effort required to manage explicit schemas, or the type safety and SQL readability benefits of doing so. But there is often an important performance penalty as well for unnecessarily JSONB-ing your data.

So when should json be used?

From the above you might be thinking that I’m against using json types in the database. That’s not the case at all. They’re often better than EAV when you have dynamic-column data that just won’t fit into a traditional relational model, and better than trying to coordinate with an external MongoDB for the json storage, as I’ve seen people do in the past.

For example, sometimes application users want to be able to add arbitrary attributes to an entity. It’s a business requirement. The client doesn’t like being told that the database doesn’t do that so they can’t have notes in their application, and they’re muttering things about “just doing it in the application” or “we didn’t have these problems with MongoDB”.

So you might model the typical stuff relationally (albeit with indiscriminate use of surrogate keys), but add a json field in person just for those extras so the app can fetch them when displaying a person.
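
Building on the person table sketched earlier, that can be as simple as one extra column (the name extra_attrs is mine):

-- Well-understood fields stay as ordinary columns; the jsonb column holds
-- only the arbitrary, user-defined extras the app fetches for display.
ALTER TABLE person
    ADD COLUMN extra_attrs jsonb NOT NULL DEFAULT '{}';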

How to decide when to use json

Use json if your data won’t fit in the database using normal relational modelling. If you’re choosing between EAV, serializing a Java/Ruby/Python object into a bytea field, or storing a key to look up an external structured object somewhere else … that’s when you should be reaching for json fields.

At this point it’s probably worth replacing hstore use with jsonb in all new applications.

Note that plain json fields are also useful – if you’re not going to be indexing the json and querying within it, they’re usually more compact and faster to send and receive.