pg_duckdb: Unleashing PostgreSQL for Analytics with DuckDB Integration

September 19, 2024

Summary: pg_duckdb is a PostgreSQL extension that embeds DuckDB’s columnar-vectorized analytics engine and features into PostgreSQL.


Introduction

The pg_duckdb extension will be fully capable of querying data stored in the cloud in DuckDB as if it were local. DuckDB’s “dual execution” capabilities let us join local PostgreSQL data against DuckDB data seamlessly, and we will figure out the best place to run the query. As a user, you don’t need to care where the computation runs; we’ll figure out how to make it run fast.
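
For a sense of what that dual execution looks like, here is a minimal sketch; the customers table and the S3 path are placeholder names, and it assumes DuckDB execution has been enabled (see the Installation section below):

    -- Join a local PostgreSQL table against a Parquet file in object storage.
    -- The column definition list after AS describes the remote file's schema.
    SELECT c.name, SUM(o.amount) AS total_spent
    FROM customers AS c
    JOIN read_parquet('s3://my-bucket/orders.parquet') AS o(customer_id int, amount float)
      ON o.customer_id = c.id
    GROUP BY c.name;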

Moreover, it is common in analytics to want to offload data from your transactional database into an analytical store. The pg_duckdb extension along with DuckDB can help: you can run a query in PostgreSQL that pulls recent data from your PostgreSQL database and writes it to DuckDB. You don’t need to export and reimport data or set up CDC.
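
As a rough sketch of that workflow (the orders table, its columns, and the bucket path below are placeholders), a single COPY statement can push the most recent transactional rows straight to Parquet in object storage:

    -- Hypothetical example: export the last day of orders to Parquet in S3.
    COPY (
      SELECT id, customer_id, amount, created_at
      FROM orders
      WHERE created_at >= now() - interval '1 day'
    ) TO 's3://my-bucket/recent_orders.parquet';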

Finally, there are some downsides to running analytics on the same database that runs your application. Analytics workloads can be resource hungry, demanding significant memory and CPU to run well, and above a certain size you may not want that load on your production transactional database. DuckDB can help offload this work to the cloud without requiring any changes to the queries you run; they simply get faster.

Installation

To build pg_duckdb, you need:

To build and install, run:

make install

Next, load the pg_duckdb extension:

CREATE EXTENSION pg_duckdb;

IMPORTANT: Once loaded, you can use DuckDB execution by running SET duckdb.execution TO true. This is opt-in to avoid breaking existing queries. To avoid setting it for every session, you can configure it for a given user with ALTER USER my_analytics_user SET duckdb.execution TO true.
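
For reference, those two statements look like this (the role name is only an example):

    -- Enable DuckDB execution for the current session.
    SET duckdb.execution TO true;

    -- Or enable it by default for a specific role.
    ALTER USER my_analytics_user SET duckdb.execution TO true;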

Features

  • SELECT queries executed by the DuckDB engine can directly read PostgreSQL tables.

    • Able to read data types that exist in both PostgreSQL and DuckDB. The following data types are supported: numeric, character, binary, date/time, boolean, uuid, json, and arrays.
    • If DuckDB cannot support the query for any reason, execution falls back to PostgreSQL.
  • Read parquet and CSV files from object storage (AWS S3, Cloudflare R2, or Google GCS).

    • SELECT n FROM read_parquet('s3://bucket/file.parquet') AS (n int)
    • SELECT n FROM read_csv('s3://bucket/file.csv') AS (n int)
    • You can pass globs and arrays to these functions, just like in DuckDB.
  • Enable the DuckDB Iceberg extension using SELECT duckdb.enable_extension('iceberg') and read Iceberg files with iceberg_scan (see the sketch after this list).

  • Write a query — or an entire table — to parquet in object storage.

    • COPY (SELECT foo, bar FROM baz) TO 's3://...'
    • COPY table TO 's3://...'
  • Read and write to Parquet format in a single query

    COPY (
      SELECT count(*), name
      FROM read_parquet('s3://bucket/file.parquet') AS (name text)
      GROUP BY name
      ORDER BY count DESC
    ) TO 's3://bucket/results.parquet';
    
  • Query and JOIN data in object storage with PostgreSQL tables, views, and materialized views.

  • Create indexes on PostgreSQL tables to accelerate your DuckDB queries.

  • Install DuckDB extensions using SELECT duckdb.install_extension('extension_name');

  • Toggle DuckDB execution on/off with a setting:

    • SET duckdb.execution = true|false
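
As a sketch of the Iceberg item above (the S3 path is a placeholder, and the column definition list is assumed to work the same way as for read_parquet):

    -- Enable DuckDB's Iceberg support, then scan an Iceberg table.
    SELECT duckdb.enable_extension('iceberg');

    SELECT count(*)
    FROM iceberg_scan('s3://bucket/iceberg_table') AS (id int);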

Getting Started

The best way to get started is to connect PostgreSQL to a new or existing object storage bucket (AWS S3, Cloudflare R2, or Google GCS) with pg_duckdb. You can query data in Parquet, CSV, and Iceberg formats using read_parquet, read_csv, and iceberg_scan, respectively.

  1. Add a credential to enable DuckDB’s httpfs support.

    -- Session Token is Optional
    INSERT INTO duckdb.secrets
    (type, id, secret, session_token, region)
    VALUES ('S3', 'access_key_id', 'secret_access_key', 'session_token', 'us-east-1');
    
  2. Copy data directly to your bucket - no ETL pipeline!

    COPY (SELECT user_id, item_id, price, purchased_at FROM purchases)
    TO 's3://your-bucket/purchases.parquet';
    
  3. Perform analytics on your data.

    SELECT SUM(price) AS total, item_id
    FROM read_parquet('s3://your-bucket/purchases.parquet')
      AS (price float, item_id int)
    GROUP BY item_id
    ORDER BY total DESC
    LIMIT 100;
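
From there you can also join the Parquet data back against local PostgreSQL tables. This sketch assumes a hypothetical local items table with id and name columns:

    -- Join object-storage data with a local PostgreSQL table in one query.
    SELECT i.name, SUM(p.price) AS total
    FROM read_parquet('s3://your-bucket/purchases.parquet')
      AS p(price float, item_id int)
    JOIN items AS i ON i.id = p.item_id
    GROUP BY i.name
    ORDER BY total DESC
    LIMIT 100;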
    

See more

pg_duckdb Project