How bad is using queries with thousands of values for operators IN or ANY?

Thorsten Schöning <tschoening@xxxxxxxxxx> · Mon, 31 Aug 2020 10:04:33 +0200

Hi all,

I have lots of queries in which I need to restrict access to rows
using some decimal row-ID and am mostly doing so with using the
operator IN in WHERE-clauses. Additionally I'm mostly embedding the
IDs as ","-seperated list into the query directly, e.g. because I
already hit a limitation of ~32k parameters of the JDBC-driver[1] for
Postgres.

I really thought that in most cases simply sending a large amount of
IDs embedded into the query is better than looping, because it safes
roundtrips to access the DB, the planner of the DB has all pieces of
information it needs to decide best strategies etc. OTOH, with recent
tests and an increased number of IDs of about factor 100, I have
additional load in Tomcat before actually sending the query to the DB
already and in the DB itself as well of course. I've attached an
example query and plan.

> ->  Hash  (cost=242592.66..242592.66 rows=6825 width=39) (actual time=91.117..91.117 rows=40044 loops=3)
>       Buckets: 65536 (originally 8192)  Batches: 1 (originally 1)  Memory Usage: 3016kB
>       ->  Hash Join  (cost=137.57..242592.66 rows=6825 width=39) (actual time=10.194..82.412 rows=40044 loops=3)
>             Hash Cond: (meter.meter_bcd = meter_bcd.id)
>             ->  Index Scan using pk_meter on meter  (cost=0.42..242237.10 rows=40044 width=25) (actual time=9.350..71.276 rows=40044 loops=3)
>                   Index Cond: (id = ANY ('{[...]}'::integer[]))
>             ->  Hash  (cost=135.73..135.73 rows=113 width=22) (actual time=0.830..0.830 rows=113 loops=3)

Do you know of any obvious limitations of the JDBC-driver of handling
such large queries? In the end, the query is mostly large text with
only very few bind parameters.

Do you know of any obvious problem in Postgres itself with that query,
when parsing it or alike? Do things simply take how long they take and
are mostly comparable to looping or is there some additional overhead
the larger the query itself gets? From my naive expectation, comparing
IDs shouldn't care if things get looped or transmitted at once.

I'm just trying to collect some input for where to look at to optimize
things in the future. Thanks!

[1]: https://github.com/pgjdbc/pgjdbc/issues/90

Mit freundlichen Grüßen,

Thorsten Schöning

-- 
Thorsten Schöning       E-Mail: Thorsten.Schoening@xxxxxxxxxx
AM-SoFT IT-Systeme      http://www.AM-SoFT.de/

Telefon...........05151-  9468- 55
Fax...............05151-  9468- 88
Mobil..............0178-8 9468- 04

AM-SoFT GmbH IT-Systeme, Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB 207 694 - Geschäftsführer: Andreas Muchow
<<attachment: attachment.zip>>