Re: IN or EXISTS

Andy Colson <andy@xxxxxxxxxxxxxxx> · Wed, 31 Aug 2011 08:59:59 -0500

On 8/30/2011 8:33 PM, Craig Ringer wrote:
On 31/08/2011 4:30 AM, Andy Colson wrote:
Hi all,

I have read things someplace saying not exists was better than not
in... or something like that. Not sure if that was for in/exists and
not in/not exists, and for a lot of records or not.

`EXISTS' may perform faster than `IN', yes. Using `IN' it is necessary
to build a list of values then iterate over them to check for a match.
By contrast, `EXISTS' may use a simple index lookup or the like to test
for the presence of a value.

On the other hand, the `IN' subquery is uncorrelated needs only run
once, where the `EXISTS' subquery is correlated and has to run once for
every outer record. That means that the `IN' list approach can be a lot
faster where the subquery in question is relatively time consuming for
the number of values it returns. For example, if the `IN' query returns
only 5 values and takes 100ms, you're scanning 1 million records in the
outer query, and the subquery `EXISTS' version would take 50ms, using
`IN' is a no-brainer since 1 million times 50ms will be a lot slower
than 1 times 100ms plus the time required to scan 5 elements 1 million
times.

Another complication is the possible presence of NULL in an IN list.
Getting NULLs in `IN' lists is a common source of questions on this
list, because people are quite surprised by how it works. EXISTS avoids
the NULL handling issue (and in the process demonstrates how woefully
inconsistent SQL's handling of NULL really is).

Theoretically the query planner could transform:

SELECT * from y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE
z.y_id IS NOT NULL);

into:

SELECT * FROM y WHERE EXISTS (SELECT 1 FROM z WHERE z.y_id = y.id)

... or vice versa depending on which it thought would be faster. AFAIK
it doesn't currently do this. To be able to do it the planner would need
to know how to estimate the cost of scanning an `IN' result list. It'd
also need to be able to use constraints on the target table to prove
that the result of the `IN' may not contain nulls. To transform the
EXISTS version into the IN version where it'd be more efficient, it'd
also have to be able to use constraints on the target table to prove
that results of a SELECT would be unique without explicit deduplication.

All this makes me wonder ... does Pg currently support sorting IN lists
and using a binary search? It'd be pretty nice to be able to prove that:

SELECT * from y WHERE y.id IN (SELECT z.y_id FROM z);

is equvalent to:

SELECT * FROM y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE z_id
IS NOT NULL)

... and either transform it to an EXISTS test or add an ORDER BY z_id
and flag the resultset as sorted so a binary search could be done on it
whenever a row hits the IN test.

--
Craig Ringer

Yeah... my current code uses IN.  Most of my updates are small, so my 
inner list is 500 integers.  It runs fine.  What I'm worried about is 
when I update the entire table, so my inner list is 60k integers.  Maybe 
I'm just worrying for naught.  I tested a table with 100k rows, ran both 
with explain analyzes, and they look the same:

 Delete  (cost=11186.26..20817.60 rows=25911 width=12) (actual 
time=408.138..408.138 rows=0 loops=1)
   ->  Hash Semi Join  (cost=11186.26..20817.60 rows=25911 width=12) 
(actual time=61.997..182.573 rows=105434 loops=1)
         Hash Cond: (public.general.gid = upd.general.gid)
         ->  Seq Scan on general  (cost=0.00..9113.11 rows=25911 
width=10) (actual time=0.004..42.364 rows=105434 loops=1)
         ->  Hash  (cost=9868.34..9868.34 rows=105434 width=10) (actual 
time=61.958..61.958 rows=105434 loops=1)
               Buckets: 16384  Batches: 1  Memory Usage: 4531kB
               ->  Seq Scan on general  (cost=0.00..9868.34 rows=105434 
width=10) (actual time=0.003..34.372 rows=105434 loops=1)

With or without an index, (even if I ANALYZE it) it still does a table 
scan and builds a hash.  Both IN and EXISTS act the same way.

I assume:
Buckets: 16384  Batches: 1  Memory Usage: 4531kB

That means a total of 4.5 meg of ram was used for the hash, so if my 
work_mem was lower than that it would swap?  (or choose a different plan?)

I'll only ever be running one update at a time, so I'm not worried about 
multiple connections running at once.

Anyway, I'll just leave it alone (and stop optimizing things that dont 
need it)

-Andy

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance