Re: Help me understand why my subselect is an order of magnitude faster than my nested joins

Vincent <vinny@xxxxxxxxx> · Tue, 18 Mar 2014 22:20:26 +0100

On 03-03-14 19:24, Eli Naeher wrote:
Hello,

I have two versions of essentially the same query; one using nested 
joins, the other using subselects. The version using the subselect is 
roughly an order of magnitude faster (~70ms on my box and data vs 
~900ms for the nested joins). Of course the obvious answer here is 
just to use the faster version, but I'd like to understand why the 
other version is so slow. These queries are automatically generated by 
our code and I'd like to feel more informed when deciding what style 
of query it should be generating (and to know whether there is a way 
to write the nested-join queries that will more closely approach the 
performance of the subselect).

(The table aliasing is an artifact of the code that is generating this 
query--I assume there is no big performance impact there, but perhaps 
that assumption is mistaken.)

The join version:

(SELECT DISTINCT resource_type_1.*
 FROM   resource_type AS resource_type_1
        LEFT JOIN group_authorization AS group_authorization_2
                  INNER JOIN group_member AS group_member_4
                          ON ( ( group_authorization_2.person_oid =
                                 group_member_4.person_oid )
                               AND ( group_authorization_2.group_oid =
                             group_member_4.group_oid ) )
                  INNER JOIN wco_group AS group_5
                          ON ( group_authorization_2.group_oid =
                             group_5.obj_oid )
               ON ( resource_type_1.obj_oid = 
group_authorization_2.rtype_oid )
 WHERE  ( ( ( ( ( group_5.end_date IS NULL )
                 OR ( group_5.end_date >= 
'2014-03-03T18:08:23.543001Z' ) )
              AND ( ( group_member_4.expire IS NULL )
                     OR ( group_member_4.expire >= 
'2014-03-03T18:08:23.543001Z'
                        ) )
              AND ( ( group_authorization_2.expire IS NULL )
                     OR ( group_authorization_2.expire >=
                          '2014-03-03T18:08:23.543001Z'
                        )
                  )
            )
            AND ( group_authorization_2.person_oid = 1 ) )
           OR ( resource_type_1.authorized = false ) ))

(explain (analyze, buffers) output is at http://explain.depesz.com/s/wPZL)

The subselect version:

(SELECT DISTINCT resource_type_1.*
 FROM   resource_type AS resource_type_1
 WHERE  ( ( resource_type_1.authorized = false )
           OR ( resource_type_1.obj_oid IN (SELECT rtype_oid
                                            FROM group_authorization
                                                   INNER JOIN group_member
                                                           ON ( (
 group_member.group_oid
                                                   =
                group_authorization.group_oid )
                AND ( group_member.person_oid =
                    group_authorization.person_oid ) )
                INNER JOIN wco_group
                ON ( group_member.group_oid = wco_group.obj_oid )
                WHERE  ( ( group_member.person_oid = 1 )
                AND ( ( group_authorization.expire >
                        '2014-03-03T18:11:20.553844Z' )
                OR ( group_authorization.expire IS NULL ) )
                AND ( ( group_member.expire > 
'2014-03-03T18:11:20.553844Z' )
                OR ( group_member.expire IS NULL ) )
                AND ( ( wco_group.end_date > 
'2014-03-03T18:11:20.553844Z' )
                OR ( wco_group.end_date IS NULL ) ) )) ) ))

(explain (analyze, buffers) output is at http://explain.depesz.com/s/70dd)

This is using Postgres 9.3.3. The table wco_group has ~5000 rows, 
group_member has ~15000 rows, and group_authorization is the big one 
with ~385000 rows.

I noticed that the nested join version was doing a lot of seq scans 
and not using the indexes. I tried setting enable_seqscan to off to 
force index use, and it was a bit slower that way, so the query 
optimizer is definitely doing the right thing.

Any thoughts would be much appreciated.

Thank you,
-Eli

The explains show that the join version builds up an ever larger set of 
rows before finally filtering,
while the subselect manages to reduce the number of rows to 2500 and 
avoids the large set.

This may be as simple as the order in which you join, inner join's 
should preferably eliminate as many rows
as possible as quickly as possible.

Also, DISTINCT on * does not help, why are you getting duplicates and 
why can't you filter them out before doing the final select?

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance