Fwd: Array of integer indexed nested-loop semi join

Hello everyone,

1) Context

I'm working with large tables containing arrays of integers, indexed with the "gin__int_ops" GIN operator class provided by the "intarray" extension.
The goal I'm trying to achieve is a "nested loop semi join" that uses the array inclusion operator (@>) as the join condition, in a way that benefits from the index.
(Basically an INNER JOIN without the duplicate rows, and without needing any columns from the joined table.)

2) Configuration

The queries are run on a PostgreSQL v14 server with 32GB RAM and 8 vCPUs on a 64 bit ARM Neoverse architecture (m6g.2xlarge AWS RDS instance).
PostgreSQL's configuration uses the following key values:

  • work_mem = 8GB (only set for this query)
  • shared_buffers = 8GB
  • effective_cache_size = 22GB
  • max_worker_processes = 8
  • max_parallel_workers_per_gather = 4
  • jit = on

3) Tables

The "light_pages_attributes" table contains about 2 million rows, each with an "attributes" column containing on average 20 integers.

CREATE TABLE
  light_pages_attributes
  (
    id            INTEGER   NOT NULL,
    "attributes"  INTEGER[] NOT NULL
  )
;
CREATE INDEX
  light_pages_attributes_attributes
ON
  light_pages_attributes
USING
  gin
  (
    attributes gin__int_ops
  )
;

The "light_pages_views" table contains about 25 million rows, each with a "page_ids" column containing on average 20 integers as well.

CREATE TABLE
  light_pages_views
  (
    vector_id     BIGINT    NOT NULL,
    page_ids      INTEGER[] NOT NULL
  )
;
CREATE INDEX
  light_pages_views_page_ids
ON
  light_pages_views
USING
  gin
  (
    page_ids gin__int_ops
  )
;

4) Query

The query I'm trying to optimise is the following:

BEGIN; 
 
SET LOCAL work_mem = '8GB'; 
 
CREATE TEMPORARY VIEW
  urls
  AS
  (
    SELECT ARRAY[lpa.id]
        AS page_id
      FROM
        light_pages_attributes
          AS lpa
      WHERE
        lpa."attributes" @> ARRAY[189376]
  );
EXPLAIN (
  ANALYZE,
  VERBOSE,
  COSTS,
  BUFFERS,
  TIMING
)
SELECT
  COUNT(*)
FROM
  light_pages_views
    AS lpv
WHERE
  EXISTS (
    SELECT
      1
    FROM
      urls
        AS u
    WHERE
      lpv.page_ids @> u.page_id
  )
;

COMMIT;

The last query does not finish even after waiting for more than 15 minutes.
(The temporary view is created very quickly. It is needed because putting the same query in a CTE degraded performance by more than 5 minutes, presumably because of the optimisation barrier.)
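(If the optimisation barrier really is the culprit, I suppose the CTE variant could be forced to inline with NOT MATERIALIZED, available since PostgreSQL 12. A sketch of that variant, untested on this data set, in case someone wants to compare:

WITH
  urls AS NOT MATERIALIZED
  (
    SELECT ARRAY[lpa.id]
        AS page_id
      FROM
        light_pages_attributes
          AS lpa
      WHERE
        lpa."attributes" @> ARRAY[189376]
  )
SELECT
  COUNT(*)
FROM
  light_pages_views
    AS lpv
WHERE
  EXISTS (
    SELECT
      1
    FROM
      urls
        AS u
    WHERE
      lpv.page_ids @> u.page_id
  )
;
)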
This alternative query, which I expected to be far slower because the INNER JOIN generates duplicate rows, is in fact much faster: 1 min 39 s.

EXPLAIN (
  ANALYZE,
  VERBOSE,
  COSTS,
  BUFFERS,
  TIMING
)
SELECT
  COUNT(*)
FROM
  (
    SELECT
      1
    FROM
      light_pages_views
        AS lpv
    INNER JOIN
      urls
        AS u
        ON lpv.page_ids @> u.page_id
    GROUP BY
      lpv.vector_id
  )
    AS t
;

Visual query plan: https://explain.dalibo.com/plan/bc3#plan
Raw query plan: https://explain.dalibo.com/plan/bc3#raw

Other strategies I've tried:
  • lpv.page_ids @> ANY(SELECT u.page_id FROM urls AS u)
  • FULL OUTER JOIN, not possible due to the condition not being merge-joinable
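One more variant I'm considering (a sketch only, not yet benchmarked): since every "urls" row is a single-element array, the semi join could perhaps be collapsed into a single indexable condition, using intarray's overlap operator (&&) against an aggregated array of matching ids and avoiding the join entirely:

SELECT
  COUNT(*)
FROM
  light_pages_views
    AS lpv
WHERE
  lpv.page_ids && ARRAY(
    SELECT
      lpa.id
    FROM
      light_pages_attributes
        AS lpa
    WHERE
      lpa."attributes" @> ARRAY[189376]
  )
;

The ARRAY(...) subquery is uncorrelated, so it should be evaluated once and the resulting constant array matched against the "light_pages_views_page_ids" GIN index.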
The end goal is to update all matching "light_pages_views" rows by appending an integer to their array of integers.
So possibly millions of rows would be updated.
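For reference, the final update would look roughly like this (999 is a placeholder for the integer to append, and the && condition is just one possible way to express the match):

UPDATE
  light_pages_views
    AS lpv
SET
  page_ids = lpv.page_ids || 999
WHERE
  lpv.page_ids && ARRAY(
    SELECT
      lpa.id
    FROM
      light_pages_attributes
        AS lpa
    WHERE
      lpa."attributes" @> ARRAY[189376]
  )
;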

Thank you a lot in advance for your help!

Mickael

