On Fri, Aug 2, 2013 at 3:27 PM, Scott Marlowe <scott.marlowe@xxxxxxxxx> wrote:
> On Fri, Aug 2, 2013 at 2:51 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>> Scott Marlowe <scott.marlowe@xxxxxxxxx> writes:
>>> Yep. Added the indexes and performance went right into the dumper. The
>>> new plan on the new table, with the old data added in random order, now
>>> looks like the old table's, only worse because it's on a slower drive.
>>> Just to be complete, here's the plan: http://explain.depesz.com/s/PYH
>>> Note that I created the new table with order by random() and created
>>> the indexes. Ran analyze on it, and the select plan looks similar now:
>>> http://explain.depesz.com/s/bsE
>>>
>>> So maybe I can make a test case now. But to summarize: when it can use
>>> indexes this query gets REAL slow because it lacks a materialize step.
>>> Does that seem about right?
>>
>> Well, the plans shown here could *not* use a materialize step, because the
>> inner scan makes use of a value from the current outer row. The
>> materialized plan has to omit one of the index conditions from the inner
>> scan and then apply it as a join condition.
>>
>> I suspect the real reason that the fast case is fast is that the inner
>> relation, even without the p.tree_sortkey >= pro_partners.tree_sortkey
>> condition, is empty, and thus the join runs very quickly. But the planner
>> doesn't know that. Its estimate of the row count isn't very large, but
>> it's definitely not zero, plus it thinks that adding the additional index
>> condition reduces the rowcount by a factor of 3 from there. So it comes
>> to the wrong conclusion about the value of materializing a fixed inner
>> relation as opposed to using a parameterized indexscan.
>>
>> Have you tried increasing the statistics targets for these join columns?
>> It's also possible that what you need to do is adjust the planner's
>> cost parameters ...
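For anyone following along, a minimal sketch of Tom's two suggestions
(table and column names are taken from the plans later in this message;
the target values are illustrative, not recommendations):

```sql
-- Raise the per-column statistics target (the cluster default is 100),
-- then re-analyze so the planner gets a finer-grained histogram/MCV list.
ALTER TABLE pp_test_wide ALTER COLUMN tree_sortkey SET STATISTICS 1000;
ANALYZE pp_test_wide;

-- Planner cost parameters can be tried per-session before touching
-- postgresql.conf:
SET random_page_cost = 1.0;          -- low value for RAM/SSD-resident data
SET effective_cache_size = '1000GB'; -- value tried later in this thread
```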
> I've tried changing random_page_cost, seq_page_cost, the cpu*
> costs, and setting effective_cache_size all over the place, and it
> stays just as slow.
>
> Our default stats target is 100. Did a stats target = 1000 on the
> three cols we access. Same terrible performance. Plan here:
> http://explain.depesz.com/s/XVt
> With stats target = 10000, same bad performance, plan here:
> http://explain.depesz.com/s/kJ54 -- pretty much the same. Setting
> effective_cache_size = '1000GB' makes no difference, still slow.
>
> Setting random_page_cost to 75 makes it work, i.e. a materialize
> shows up. Note that we run on FusionIO cards, and the whole db fits in
> memory, so a very large effective cache size and a random page cost of
> 1.0 are actually accurate for our hardware.

OK, I've done some more research. Here's the plan with a low
random_page_cost:

Nested Loop Anti Join  (cost=0.00..1862107.69 rows=9274 width=994) (actual time=0.181..183413.479 rows=17391 loops=1)
  Join Filter: (p.tree_sortkey <= tree_right(pp_test_wide.tree_sortkey))
  ->  Index Scan using pp_test_wide_tree_sortkey_idx on pp_test_wide p  (cost=0.00..10108.15 rows=10433 width=983) (actual time=0.164..218.230 rows=17391 loops=1)
        Index Cond: ((tree_sortkey >= B'00000000000101010000010001010100'::bit varying) AND (tree_sortkey <= B'0000000000010101000001000101010011111111111111111111111111111111'::bit varying) AND (tree_sortkey >= B'00000000'::bit varying) AND (tree_sortkey <= B'0000000011111111111111111111111111111111'::bit varying))
        Filter: (deleted_at IS NULL)
  ->  Index Scan using pp_test_wide_tree_sortkey_idx on pp_test_wide  (cost=0.00..194.86 rows=14 width=11) (actual time=10.530..10.530 rows=0 loops=17391)
        Index Cond: ((pp_test_wide.tree_sortkey >= B'00000000000101010000010001010100'::bit varying) AND (pp_test_wide.tree_sortkey <= B'0000000000010101000001000101010011111111111111111111111111111111'::bit varying) AND (p.tree_sortkey >= pp_test_wide.tree_sortkey))
        Filter: ((pp_test_wide.product_name IS NOT NULL) AND (pp_test_wide.tree_sortkey <> B'00000000000101010000010001010100'::bit varying))
Total runtime: 183421.226 ms

Note that the bottom index scan returns 0 rows but runs 17391 loops at
~10 ms each, which adds up to just under the total time of the nested
loop anti-join, ~183,000 ms.

If I crank random_page_cost up to, say, 100, I get this plan:

Nested Loop Anti Join  (cost=15202.10..144269.58 rows=9292 width=997) (actual time=71.281..271.018 rows=17391 loops=1)
  Join Filter: ((p.tree_sortkey >= pp_test_wide.tree_sortkey) AND (p.tree_sortkey <= tree_right(pp_test_wide.tree_sortkey)))
  ->  Seq Scan on pp_test_wide p  (cost=0.00..16009.92 rows=10453 width=986) (actual time=0.024..183.341 rows=17391 loops=1)
        Filter: ((deleted_at IS NULL) AND (tree_sortkey >= B'00000000000101010000010001010100'::bit varying) AND (tree_sortkey <= B'0000000000010101000001000101010011111111111111111111111111111111'::bit varying) AND (tree_sortkey >= B'00000000'::bit varying) AND (tree_sortkey <= B'0000000011111111111111111111111111111111'::bit varying))
  ->  Materialize  (cost=15202.10..15202.54 rows=44 width=11) (actual time=0.004..0.004 rows=0 loops=17391)
        ->  Seq Scan on pp_test_wide  (cost=0.00..15202.06 rows=44 width=11) (actual time=71.245..71.245 rows=0 loops=1)
              Filter: ((product_name IS NOT NULL) AND (tree_sortkey >= B'00000000000101010000010001010100'::bit varying) AND (tree_sortkey <= B'0000000000010101000001000101010011111111111111111111111111111111'::bit varying) AND (tree_sortkey <> B'00000000000101010000010001010100'::bit varying))
Total runtime: 272.204 ms

Note that here we feed two seq scans into the anti-join, and the time
spent in each seq scan is quite low. I've also played around with the
selectivity so that the inner scan delivers 53 rows instead of none,
and query performance was about the same for both plans (i.e.
random_page_cost = 1 is super slow, lots of 10 ms loops;
random_page_cost = 100 is fast with two seq scans).
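For reference, reading the plan nodes above back into SQL, the query
under test is presumably shaped roughly like the anti-join below. This
is my reconstruction from the plan's filter conditions only: the actual
query text, the select list, and the `pp` alias are guesses, and
tree_right() is the application's own function, not a built-in.

```sql
SELECT p.*
FROM pp_test_wide p
WHERE p.deleted_at IS NULL
  AND p.tree_sortkey >= B'00000000000101010000010001010100'
  AND p.tree_sortkey <= B'0000000000010101000001000101010011111111111111111111111111111111'
  AND NOT EXISTS (
        SELECT 1
        FROM pp_test_wide pp
        WHERE pp.product_name IS NOT NULL
          AND pp.tree_sortkey <> B'00000000000101010000010001010100'
          AND pp.tree_sortkey >= B'00000000000101010000010001010100'
          AND pp.tree_sortkey <= B'0000000000010101000001000101010011111111111111111111111111111111'
          AND p.tree_sortkey >= pp.tree_sortkey
          AND p.tree_sortkey <= tree_right(pp.tree_sortkey));
```

With a query of this shape, the two plans above should be reproducible
by toggling only the cost parameter (SET random_page_cost = 1.0; vs.
SET random_page_cost = 100;) before running EXPLAIN ANALYZE.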
It seems to me that choosing the parameterized index scan is a mistake
whether the inner side is estimated at 14 rows or 0, since the planner
knows it will repeat that inner scan once per row of the outer index
scan -- i.e. it should be guesstimating that it's going to do ~10,000
loops of the inner index scan. It's as if the query planner thinks each
loop is going to be MUCH faster than 10 ms.

I can create an extract of this table for testing if someone wants one
to explore with. Let me know.

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance