Re: Inner join vs where-clause subquery

"Jeremy Haile" <jhaile@xxxxxxxxxxx> · Tue, 19 Dec 2006 15:47:56 -0500

I'm still confused as to why the inner join version ran so much faster
than the where-clause version.  

Here's the inner join query and explain ouput:
select min(nlogid) as start_nlogid,
       max(nlogid) as end_nlogid,
       min(dtCreateDate) as start_transaction_timestamp,
       max(dtCreateDate) as end_transaction_timestamp
from activity_log_facts
inner join ( select max(end_nlogid) as previous_nlogid from
activity_log_import_history) as a
on activity_log_facts.nlogid > a.previous_nlogid
where dtCreateDate < '2006-12-18 9:10'

Aggregate  (cost=246226.95..246226.96 rows=1 width=12)
  ->  Nested Loop  (cost=49233.27..231209.72 rows=1501722 width=12)
        ->  Result  (cost=0.04..0.05 rows=1 width=0)
              InitPlan
                ->  Limit  (cost=0.00..0.04 rows=1 width=4)
                      ->  Index Scan Backward using
                      activity_log_import_history_end_nlogid_idx on
                      activity_log_import_history  (cost=0.00..114.97
                      rows=2913 width=4)
                            Filter: (end_nlogid IS NOT NULL)
        ->  Bitmap Heap Scan on activity_log_facts 
        (cost=49233.23..210449.44 rows=1660817 width=12)
              Recheck Cond: (activity_log_facts.nlogid >
              a.previous_nlogid)
              Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp
              without time zone)
              ->  Bitmap Index Scan on activity_log_facts_nlogid_idx 
              (cost=0.00..49233.23 rows=1660817 width=0)
                    Index Cond: (activity_log_facts.nlogid >
                    a.previous_nlogid)

Since the inner join is basically the same thing as doing the
where-clause subquery, why does it generate a far different plan?

On Tue, 19 Dec 2006 20:02:35 +0000, "Richard Huxton" <dev@xxxxxxxxxxxx>
said:
> Jeremy Haile wrote:
> > Here's the query and explain analyze using the result of the sub-query
> > substituted: 
> > 
> > QUERY
> > explain analyze select min(nlogid) as start_nlogid,
> >        max(nlogid) as end_nlogid,
> >        min(dtCreateDate) as start_transaction_timestamp,
> >        max(dtCreateDate) as end_transaction_timestamp
> > from activity_log_facts
> > where nlogid > 478287801
> > and dtCreateDate < '2006-12-18 9:10'
> > 
> > EXPLAIN ANALYZE
> > Aggregate  (cost=657.37..657.38 rows=1 width=12) (actual
> > time=0.018..0.019 rows=1 loops=1)
> >   ->  Index Scan using activity_log_facts_nlogid_idx on
> >   activity_log_facts  (cost=0.00..652.64 rows=472 width=12) (actual
> >   time=0.014..0.014 rows=0 loops=1)
> >         Index Cond: (nlogid > 478287801)
> >         Filter: (dtcreatedate < '2006-12-18 09:10:00'::timestamp without
> >         time zone)
> > Total runtime: 0.076 ms
> > 
> > 
> > Sorry if the reason should be obvious, but I'm not the best at
> > interpreting the explains.  Why is this explain so much simpler than the
> > other query plan (with the subquery)?
> 
> Because it's planning it with knowledge of what "nlogid"s it's filtering 
> by. It knows it isn't going to get many rows back with nlogid > 
> 478287801. In your previous explain it thought a large number of rows 
> would match and was trying not to sequentially scan the 
> activity_log_facts table.
> 
> Ideally, the planner would evaluate the subquery in your original form 
> (it should know it's only getting one row back from max()). Then it 
> could plan the query as above. I'm not sure how tricky that is to do
> though.
> 
> -- 
>    Richard Huxton
>    Archonet Ltd