Re: Array of integer indexed nested-loop semi join

Mickael van der Beek <mickael.van.der.beek@xxxxxxxxx> · Mon, 23 May 2022 16:46:41 +0200

Hello Jeff,
Thank you again for your advice.

I did indeed think of the ARRAY_AGG() version of the query.
Although this method is very fast (and does use indexes) for smallish array sizes, this is sadly not practical in my case because the arrays of matching rows can reach multiple hundreds of thousands of rows.
I thought of maybe "batching" the ARRAY_AGG() in batches of max n rows in a subquery and then calculating intersection on that but it doesn't seem practical or faster in the end.

> You could just add a DISTINCT to get rid of the duplicates.  Of course that will also take some time on a large returned data set, but probably less time than scanning a giant table.  I think this is probably cleaner than the alternatives.

Yes, and a GROUP BY will do the trick as well.
The fact that the current solution is a "nested loop" instead of a "nested loop semi join" means that the query is much slower due to needing to GROUP BY the rows.
This is why I tried various version using EXISTS, ANY, ARRAY_AGG(), etc with no avail.
Would you have an idea on why PostgreSQL doesn't use the existing indexes for this type of subqueries ?

> I don't know if you misunderstood.  I meant specifically the intarray extension.  You can use integer arrays with built-in GIN indexes without help from the intarray extension.  Maybe you know that already and are just saying that the extension is even faster than the built-in indexed operators are and you need that extra speed.

Are there specific advantages to not using the intarray extension and it's indexes in this case?
I was under the impression that it supported more operation types and was generally faster for this niche use case.

Thank you again for your help,

Mickael

On Mon, May 23, 2022 at 4:11 PM Jeff Janes <jeff.janes@xxxxxxxxx> wrote:
On Mon, May 23, 2022 at 3:57 AM Mickael van der Beek <mickael.van.der.beek@xxxxxxxxx> wrote:
Hello Jeff,
Sadly, the query you suggested won't work because you are only returning the first row of the matching inner query rows.

Sure, but the query I replaced did the same thing.  (I thought that was what you wanted, but I guess that was just to get it to run fast enough to ever finish--in that case it is probably better to use EXPLAIN without the ANALYZE so that we can see the plan of the correct query).  To get around that one-row limit you have to write it somewhat differently, getting rid of the ARRAY and adding an array_agg():

SELECT fu.*
FROM
  fact_users AS fu
WHERE
      fu.w2_page_idxs && (select array_agg(idx) from fact_pages where attribute_idxs && ARRAY[201]);

This way of writing it is better, as it still works with the LIMIT 1 but also works without it.  This still uses the indexes for me, at least when enable_seqscan is off.

The INNER JOIN version of the query will return all matching rows but also include duplicates:

You could just add a DISTINCT to get rid of the duplicates.  Of course that will also take some time on a large returned data set, but probably less time than scanning a giant table.  I think this is probably cleaner than the alternatives.

The reason I'm using integer arrays is because it is the only way I have found in PostgreSQL to get fast inclusion / exclusion checks on large datasets (hundreds of millions of values). 
Did I misunderstand your response?

I don't know if you misunderstood.  I meant specifically the intarray extension.  You can use integer arrays with built-in GIN indexes without help from the intarray extension.  Maybe you know that already and are just saying that the extension is even faster than the built-in indexed operators are and you need that extra speed.

Cheers,

Jeff

-- 
Mickael van der Beek
Web developer & Security analyst
mickael.van.der.beek@xxxxxxxxx