I have a query that's running an IN/Subselect that joins three different
tables and gets a list of IDs to compare against... the subselect
basically looks for records through a join table based on the 3rd
table's name, similar to:
... IN (SELECT id FROM foo, foo_bar, bar
WHERE foo.id = foo_bar.foo_id
AND bar.id = foo_bar.bar_id
AND bar.name = "something") ...
This is all nested in a fairly complex query, and several of these
subselects operate on different tables within the query. The whole
thing, on some high-cardinality cases, can take 2.5 seconds to run
(clearly something can be done about that).
So in this example, the cardinality of the bar table is very low, and
fairly constant, something on the order of 5-7 records. In an
optimization attempt, I reduced the joins in the subselect from 2 to 1
by passing in the ID of the bar with the correct name, which I can
easily cache application-side or pre-fetch in a single query. Now it
looks like this:
... IN (SELECT id FROM foo, foo_bar
WHERE foo.id = foo_bar.foo_id
AND foo_bar.bar_id = 1) ...
Crazy thing is, that single optimization reduced the query time
significantly, from 2.5-3 seconds down to 40-60ms.
Does anyone have any kind of explanation for this? Are the inner
workings of the IN clause taking the plan for the subselect into account
when running, and doing something clever with it? Any insight on the
internal mechanisms of IN or subselects in Postgres would be greatly
appreciated if anyone knows more.
Also, are there any better ways you can think of doing such an IN query,
using non-subselect means that might be more efficient?
Thanks in advance, any advice/help understanding this better is greatly
appreciated.
--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance