Re: IN or EXISTS

Jeff Davis <pgsql@xxxxxxxxxxx> · Thu, 22 Sep 2011 18:32:00 -0700

On Wed, 2011-08-31 at 09:33 +0800, Craig Ringer wrote:
> On the other hand, the `IN' subquery is uncorrelated needs only run 
> once, where the `EXISTS' subquery is correlated and has to run once for 
> every outer record.

If the EXISTS looks semantically similar to an IN (aside from NULL
semantics), then it can be made into a semijoin. It doesn't require
re-executing any part of the plan.

I don't think there are any cases where [NOT] IN is an improvement, am I
mistaken?

> Another complication is the possible presence of NULL in an IN list. 
> Getting NULLs in `IN' lists is a common source of questions on this 
> list, because people are quite surprised by how it works. EXISTS avoids 
> the NULL handling issue (and in the process demonstrates how woefully 
> inconsistent SQL's handling of NULL really is).

Absolutely. The NULL behavior of IN is what makes it hard to optimize,
and therefore you should use EXISTS instead if the semantics are
suitable.

> Theoretically the query planner could transform:
> 
> SELECT * from y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE 
> z.y_id IS NOT NULL);
> 
> into:
> 
> SELECT * FROM y WHERE EXISTS (SELECT 1 FROM z WHERE z.y_id = y.id)
> 
> ... or vice versa depending on which it thought would be faster.

Although those two queries are semantically the same (I think), a lot of
very similar pairs of queries are not equivalent. For instance, if it
was a NOT IN you couldn't change that to a NOT EXISTS.

Regards,
	Jeff Davis

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance