PostgreSQL Tutorial: Custom Aggregate Functions

September 20, 2024

Summary: In this tutorial, you will learn how to create your own aggregate functions in PostgreSQL.

Table of Contents

Basics about aggregate functions

These nice functions support you by aggregating multiple rows of a table. Assume you have a table called employees with the column salary and you are interested in the sum of all salaries. You could either sum it manually (just a joke) or use a standard aggregate function like SUM, that performs an operation for each dataset found and so sums up a total sum.

For using an standardized aggregate function, you can do like this:

SELECT SUM(salary) AS total_sum FROM employees;

So what does this sum operation require? The big advantage of aggregate functions is now, that they can combine multiple rows, instead of maximum combine columns. In our example the SUM operation sums up all salaries, which are part of the queried rows. There is no other possibility to subsume rows than aggregate functions, unless you handle something on the part of an application.

The big advantage of PostgreSQL is that you can create custom aggregate functions, which satisfy the combination you imagined.

Quick start

Before reading this passage, please preposition the official documentation about CREATE AGGREGATE. Based on this I try to explain the main points and assist you at the most important steps.

Synopsis of CREATE AGGREGATE is as follows:

CREATE AGGREGATE name ( input_data_type [ , ... ] ) (
    SFUNC = sfunc,
    STYPE = state_data_type
    [ , FINALFUNC = ffunc ]
    [ , INITCOND = initial_condition ]
    [ , SORTOP = sort_operator ]
)

Imagine the AGGREGATE is a process, which has a defined state (of type STYPE, more later), which changes during iteration through the selected datasets. What specifically will be changed is defined through the function SFUNC. The final function FINALFUNC will be applied after aggregating and it’s return value will be returned as result of the aggregation - the default of FINALFUNC is returning the last known state.

To assist the official documentation I give you some hints for finding the correct values.

  • STYPE: data type, in which the aggregator remembers it’s current state. This state (of type STYPE) will be passed as first argument to the function SFUNC.

  • SFUNC: name of a function, which expects (state_data_type, input_data_type(s)) as argument and returns a value of the type of STYPE.

    SFUNC(state<STYPE>, next input_data_type(s)) ---> new state<STYPE>
    

    To get a list of all available functions, connect with psql, type \df <begin_of_expression> and press <Tab> (similar to Unix auto completition). If you choose <Enter> and detailed information about the function will be shown.

    To list available functions (SFUNC’s):

    testdb=> \df ip4
    ip4                       ip4_in
    ip4_and                   ip4_le
    ip4_cmp                   ip4_xor
    [...]
    
    testdb=> \df ip4_xor
     schema |  name   | result data type | argument data types
    --------+---------+------------------+--------------------
     public | ip4_xor | ip4              | ip4, ip4
    

    For more detailed information you can take a look in the PostgreSQL system table pg_proc.

  • FINALFUNC: connector between the last internal aggregator state (after aggregating) and the return value of the aggregation. Opens the possibilty to cast the last state and/or do a final operation on it.

    FINALFUNC(state<STYPE>) ---> aggregator return value
    
  • INITCOND: start value of the internal state. It is strongly recommended that you state this.

Example

As we came into contact with custom aggregate functions, assume that we had a table with many datasets containing IP-addresses. A simple idea is, to build up a hash code about IP-addresses, for verification of full datasets, which can easily done by XOR-ing all IP-addresses.

Finally, you can define the aggregator for ‘hashing’ IPv4-addresses as following:

01  CREATE AGGREGATE hash_ips ( ip4 ) (
02	SFUNC = ip4_xor,
03	STYPE = ip4,
04	INITCOND = '0.0.0.0'
05  )

Description:

The aggregator is called hash_ips (line 01) and expects one paramter of type ip4 (line 01, look at project for more information about ip4). It’s initial state itself is also of type ip4 (line 03) and has the initial value of the IP-address 0.0.0.0 (line 04). For each dataset this get’s XORed (by passing the function ip4_xor, line 02)) and will be returned after the iteration (because no FINALFUNC is stated).

After creation you can use the aggregator like other aggregators:

SELECT hash_ips(column_with_ip) FROM mytable;

For more examples please have a look at the official examples.

See More

PostgreSQL Documentation: CREATE AGGREGATE

PostgreSQL Tutorial: Aggregate Functions