On 9/13/06, Jim C. Nasby <jim@xxxxxxxxx> wrote:
On Tue, Sep 12, 2006 at 04:17:34PM -0600, Joshua Marsh wrote:
> data=# EXPLAIN SELECT v.phonedirect, v.editdrop, EXTRACT (EPOCH FROM
> v.custdate), EXTRACT (YEAR FROM s.custdate) || '-' || EXTRACT (MONTH
> FROM s.custdate) FROM view_505 v INNER JOIN r3s169 s ON v.dsiacctno =
> s.dsiacctno;
>                                            QUERY PLAN
> -----------------------------------------------------------------------------------------------
>  Merge Join  (cost=293767607.69..305744319.52 rows=285392608 width=11)
>    Merge Cond: ("outer".dsiacctno = "inner".dsiacctno)
>    ->  Sort  (cost=127304933.87..127585815.71 rows=112352736 width=20)
>          Sort Key: v.dsiacctno
>          ->  Seq Scan on view_505 v  (cost=100000000.00..104604059.36 rows=112352736 width=20)
>    ->  Sort  (cost=166462673.82..167176155.34 rows=285392608 width=17)
>          Sort Key: s.dsiacctno
>          ->  Seq Scan on r3s169 s  (cost=100000000.00..106875334.08 rows=285392608 width=17)
> (8 rows)
>
> Since enable_seqscan is off, my understanding is that in order for the
> query planner to use a sequential scan it must think there is no other
> alternative. Both sides are indexed and analyzed, so that confuses me a
> little.
>
> I tried it on a smaller sample set of the data and it works fine:

Actually, enable_seqscan=off just adds a fixed overhead to the seqscan
cost estimate. That's why the cost for the seqscans in that plan starts
at 100000000.

I've suggested changing that to a variable overhead based on the expected
rowcount, but the counter-argument was that anyone with so much data that
the fixed amount wouldn't work would most likely be having bigger issues
anyway.

Other things you can try to get the index scan back would be to reduce
random_page_cost and to analyze the join fields in those tables with a
higher statistics target (though I'm not 100% certain the join cost
estimator actually takes that into account); see the sketch below.

Or, if you don't mind patching your source code, it wouldn't be difficult
to make enable_seqscan use a bigger 'penalty value' than 100000000.
--
Jim Nasby                                            jim@xxxxxxxxx
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
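A rough, untested sketch of the two non-patching suggestions above, using
the table and column names from the plan (the specific values are only
placeholders to experiment with, not recommendations):

-- Lower random_page_cost for the current session only (the default is 4)
SET random_page_cost = 2;

-- Raise the statistics target on the join columns, then re-gather stats
ALTER TABLE view_505 ALTER COLUMN dsiacctno SET STATISTICS 100;
ALTER TABLE r3s169 ALTER COLUMN dsiacctno SET STATISTICS 100;
ANALYZE view_505 (dsiacctno);
ANALYZE r3s169 (dsiacctno);

Re-running the EXPLAIN afterwards should show whether the planner switches
back to the index scans.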
Thanks for the tip. I lowered random_page_cost and got these results:

data=# EXPLAIN SELECT v.phonedirect, v.editdrop, EXTRACT (EPOCH FROM
v.custdate), EXTRACT (YEAR FROM s.custdate) || '-' || EXTRACT (MONTH
FROM s.custdate) FROM view_505 v INNER JOIN r3s169 s ON v.dsiacctno =
s.dsiacctno;
                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Merge Join  (cost=0.00..20921221.49 rows=285230272 width=11)
   Merge Cond: ("outer".dsiacctno = "inner".dsiacctno)
   ->  Index Scan using view_505_dsiacctno on view_505 v  (cost=0.00..2838595.79 rows=112393848 width=20)
   ->  Index Scan using r3s169_dsiacctno on r3s169 s  (cost=0.00..7106203.68 rows=285230272 width=17)
(4 rows)

That seems to have done it. Are there any side effects to this change?
I read about random_page_cost in the documentation and it seems like it
is strictly for planning. All the tables on this database will be indexed
and of a size similar to these two, so I don't see it causing any other
problems. Thought I would check, though. :)
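If the concern is other queries being affected, one way to limit the scope
(a rough sketch; the value 2 is just an example) is to set it locally for
the transaction that runs this query instead of changing postgresql.conf.
Since random_page_cost only feeds the planner's cost estimates, it can only
change which plan is chosen, never the query results.

BEGIN;
-- SET LOCAL reverts automatically when the transaction ends
SET LOCAL random_page_cost = 2;
EXPLAIN SELECT v.phonedirect, v.editdrop, EXTRACT (EPOCH FROM v.custdate),
    EXTRACT (YEAR FROM s.custdate) || '-' || EXTRACT (MONTH FROM s.custdate)
FROM view_505 v INNER JOIN r3s169 s ON v.dsiacctno = s.dsiacctno;
COMMIT;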