Re: Array of integer indexed nested-loop semi join

Mickael van der Beek <mickael.van.der.beek@xxxxxxxxx> · Mon, 23 May 2022 09:57:25 +0200

Hello Jeff,
Sadly, the query you suggested won't work because you are only returning the first row of the matching inner query rows.
Example:

SELECT
  u.idx,
  u.page_idxs
FROM
  (
    VALUES
      (1, ARRAY[11, 21, 31]),
      (2, ARRAY[12, 21, 32]),
      (3, ARRAY[13, 23, 31])
  )
    AS u(idx, page_idxs)
WHERE
  u.page_idxs && ARRAY[(
    SELECT
      p.idx
    FROM
      (
        VALUES
          (11, ARRAY[101, 201, 301]),
          (21, ARRAY[102, 201, 302]),
          (13, ARRAY[103, 203, 301])
      )
        AS p(idx, attribute_idxs)
    WHERE
      p.attribute_idxs && ARRAY[201]
    FETCH FIRST 1 ROWS ONLY
  )]
;

This query only returns one row while it should actually return two:

1	{11,21,31}

The INNER JOIN version of the query will return all matching rows but also include duplicates:

SELECT
  u.idx,
  u.page_idxs
FROM
  (
    VALUES
      (1, ARRAY[11, 21, 31]),
      (2, ARRAY[12, 21, 32]),
      (3, ARRAY[13, 23, 31])
  )
    AS u(idx, page_idxs)
INNER JOIN
  (
    SELECT
      p.idx
    FROM
      (
        VALUES
          (11, ARRAY[101, 201, 301]),
          (21, ARRAY[102, 201, 302]),
          (13, ARRAY[103, 203, 301])
      )
        AS p(idx, attribute_idxs)
    WHERE
      p.attribute_idxs && ARRAY[201]
  )
  AS p2
  ON u.page_idxs && ARRAY[p2.idx]
;

Results:

1	{11,21,31}
1	{11,21,31}
2	{12,21,32}

As far as I know, the the IN + sub-_expression_ query can't work since the left side of the operation is an array of integers and the right side a set of rows with a single integer column.
The reason I'm using integer arrays is because it is the only way I have found in PostgreSQL to get fast inclusion / exclusion checks on large datasets (hundreds of millions of values). 
Did I misunderstand your response?
Thank you for the ongoing help,

Mickael

On Mon, May 23, 2022 at 4:45 AM Jeff Janes <jeff.janes@xxxxxxxxx> wrote:

On Fri, May 20, 2022 at 6:42 AM Mickael van der Beek <mickael.van.der.beek@xxxxxxxxx> wrote:

Query:

EXPLAIN (
  ANALYZE,
  VERBOSE,
  COSTS,
  BUFFERS,
  TIMING
)
SELECT
  fu.w2_page_idxs
FROM
  fact_users
    AS fu
WHERE
  EXISTS (
    SELECT
    FROM
      (
        SELECT
          ARRAY[idx] AS page_idx
        FROM
          fact_pages
        WHERE
          attribute_idxs && ARRAY[300000160]
        FETCH FIRST 1 ROWS ONLY
      )
        AS fp
    WHERE
      fu.w2_page_idxs && fp.page_idx
  )
;

Without any surprises, the planner is using a sequential scan on the "fact_users" table which is very large instead of using the GIN index set on the "w2_page_idxs" column.

For me, using the subquery in and _expression_, instead of the EXISTS, does get it to use the gin index.  And I think it must give the same results.

SELECT
  fu.w2_page_idxs
FROM  fact_users AS fu
WHERE
      fu.w2_page_idxs && ARRAY[(select idx from fact_pages where attribute_idxs && ARRAY[3003] FETCH FIRST 1 ROWS ONLY)];

But why are you using intarray?  That is unnecessary here, and by creating ambiguity about the array operators it might be harmful. 

Cheers,

Jeff

-- 
Mickael van der Beek
Web developer & Security analyst
mickael.van.der.beek@xxxxxxxxx