Thorsten Schöning <tschoening@xxxxxxxxxx> writes:

> Hi all,
>
> I have lots of queries in which I need to restrict access to rows
> using some decimal row-ID, and I mostly do so using the operator IN in
> WHERE clauses. Additionally, I mostly embed the IDs directly into the
> query as a ","-separated list, e.g. because I already hit the
> limitation of ~32k parameters of the JDBC driver[1] for Postgres.
>
> I really thought that in most cases simply sending a large number of
> IDs embedded in the query is better than looping, because it saves
> round trips to the DB, the planner has all the pieces of information
> it needs to decide on the best strategy, etc. OTOH, with recent tests
> and the number of IDs increased by a factor of about 100, I already
> see additional load in Tomcat before the query is even sent to the DB,
> and of course in the DB itself as well. I've attached an example query
> and plan.
>
>> ->  Hash  (cost=242592.66..242592.66 rows=6825 width=39) (actual time=91.117..91.117 rows=40044 loops=3)
>>       Buckets: 65536 (originally 8192)  Batches: 1 (originally 1)  Memory Usage: 3016kB
>>       ->  Hash Join  (cost=137.57..242592.66 rows=6825 width=39) (actual time=10.194..82.412 rows=40044 loops=3)
>>             Hash Cond: (meter.meter_bcd = meter_bcd.id)
>>             ->  Index Scan using pk_meter on meter  (cost=0.42..242237.10 rows=40044 width=25) (actual time=9.350..71.276 rows=40044 loops=3)
>>                   Index Cond: (id = ANY ('{[...]}'::integer[]))
>>             ->  Hash  (cost=135.73..135.73 rows=113 width=22) (actual time=0.830..0.830 rows=113 loops=3)
>
> Do you know of any obvious limitations of the JDBC driver in handling
> such large queries? In the end, the query is mostly a large text with
> only very few bind parameters.
>
> Do you know of any obvious problem in Postgres itself with such a
> query, when parsing it or the like? Do things simply take as long as
> they take and are mostly comparable to looping, or is there some
> additional overhead the larger the query text gets? My naive
> expectation is that comparing IDs shouldn't care whether they arrive
> in a loop or all at once.
>
> I'm just trying to collect some input on where to look to optimize
> things in the future. Thanks!
>
> [1]: https://github.com/pgjdbc/pgjdbc/issues/90
>
> Kind regards,
>
> Thorsten Schöning

It would help to see the query as well as the plan.

Where are these tens of thousands of IDs coming from? I sometimes find
this to be a sign that the query could be restructured as a join
between two tables, where one table contains the IDs of interest,
rather than trying to embed them into the query as part of a WHERE
clause.

--
Tim Cross
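
One way around both the ~32k bind-parameter limit and the cost of
building a huge comma-separated literal is to bind the whole ID list as
a single array parameter and compare with "= ANY (?)". This is only a
sketch: the connection details are placeholders, and the meter/id/
meter_bcd names are simply taken from the plan above.

    import java.sql.Array;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class AnyArrayExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "user", "secret")) {

                Integer[] ids = { 1, 2, 3 };   // ... tens of thousands of IDs

                // The entire list travels as ONE bind parameter (a PostgreSQL
                // integer[]), so the per-statement parameter limit is never
                // hit and the SQL text itself stays small.
                Array idArray = conn.createArrayOf("integer", ids);

                String sql =
                    "SELECT m.id, m.meter_bcd FROM meter m WHERE m.id = ANY (?)";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setArray(1, idArray);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            // process the row ...
                        }
                    }
                } finally {
                    idArray.free();
                }
            }
        }
    }

Since the attached plan already shows "Index Cond: (id = ANY
('{...}'::integer[]))", the executor side should behave much the same;
the difference is mainly that the driver no longer has to build, and
the server no longer has to parse, a very large SQL text.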
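
On the join idea: if the IDs originate in the database anyway (e.g. in
some access-control table), the cleanest restructuring is a plain join
or subquery against that table, so the IDs never leave the server at
all. If they really originate in the application, a temporary table can
play the role of the "table containing the IDs of interest". A sketch
under those assumptions, again with placeholder names:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class TempTableJoinExample {
        public static void main(String[] args) throws Exception {
            int[] ids = { 1, 2, 3 };   // ... the IDs of interest

            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "user", "secret")) {
                conn.setAutoCommit(false);

                // A session-private table holding the IDs of interest;
                // it disappears automatically at commit.
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TEMP TABLE wanted_ids"
                            + " (id integer PRIMARY KEY) ON COMMIT DROP");
                }

                // Batch-insert the IDs: one parameter per row, sent in
                // batches, so the per-statement parameter limit is irrelevant.
                try (PreparedStatement ins = conn.prepareStatement(
                        "INSERT INTO wanted_ids (id) VALUES (?)")) {
                    for (int id : ids) {
                        ins.setInt(1, id);
                        ins.addBatch();
                    }
                    ins.executeBatch();
                }

                // The query itself becomes a small, constant piece of SQL.
                String sql = "SELECT m.id, m.meter_bcd FROM meter m"
                        + " JOIN wanted_ids w ON w.id = m.id";
                try (PreparedStatement ps = conn.prepareStatement(sql);
                     ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // process the row ...
                    }
                }

                conn.commit();
            }
        }
    }

Whether this beats a single big "= ANY (...)" depends on the workload;
it mainly pays off when the same ID set is reused for several queries,
or when the IDs can be produced by a query on the server in the first
place.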