
Re: Slight change in query leads to unexpected change in query plan

Sam Mason wrote:
On Mon, Jun 22, 2009 at 05:55:28PM -0400, Jack Orenstein wrote:
ris-# select *
ris-# from T
ris-# where pk > 1000000000
ris-# and value = 'asdf'::bytea
ris-# order by pk
ris-# limit 100;

PG thinks that you're going to get 16 rows back matching those
conditions. Bitmap heap scans are faster in some cases, and this is
likely to be one of those cases, so PG is optimizing things correctly.

But those 16 rows are the final output. I'm concerned about the earlier processing. Here's the plan:

  Limit  (cost=78352.20..78352.24 rows=16 width=451)
    ->  Sort  (cost=78352.20..78352.24 rows=16 width=451)
          Sort Key: pk
          ->  Bitmap Heap Scan on t  (cost=2091.60..78351.88 rows=16 width=451)
                Recheck Cond: (pk > 1000000000)
                Filter: (value = 'asdf'::bytea)
                ->  Bitmap Index Scan on t_pkey  (cost=0.00..2091.60 rows=91088 width=0)
                      Index Cond: (pk > 1000000000)

If I'm reading this right, the optimizer estimates that the bitmap index scan will identify 91088 rows, all of which have to be fetched from the heap to evaluate the restriction on value. The row width is 451 bytes, so even if the rows were packed into 8k pages perfectly, that's roughly 5000 pages that have to be read.
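One way to sanity-check that page arithmetic (assuming the table really is named t, as in the plans above) is to ask the system catalog how big the table is. relpages and reltuples are real pg_class columns maintained by ANALYZE/VACUUM:

```sql
-- Refresh the statistics first, in case they are stale.
ANALYZE t;

-- relpages: number of 8k heap pages; reltuples: the planner's row-count estimate.
SELECT relpages, reltuples
FROM pg_class
WHERE relname = 't';
```

If relpages comes back far from the ~5000 estimated above, the planner's cost model and my back-of-envelope arithmetic are working from different table sizes.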

A plain (non-bitmap) index scan that checks the value restriction as a final filter should cost about the same as the plan produced when the value restriction is dropped entirely, and we know the optimizer thinks that plan is far cheaper. Here is the plan without the value restriction:

  Limit  (cost=0.00..324.99 rows=100 width=451)
    ->  Index Scan using t_pkey on t  (cost=0.00..296027.98 rows=91088 width=451)
          Index Cond: (pk > 1000000000)

Adding the value restriction at the top of this query plan wouldn't increase the cost very much.

So given that estimate of 91088 rows coming out of the bitmap index scan, I still don't understand the optimizer's reasoning. Am I reading the plan incorrectly?
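One way to test this reasoning directly (a diagnostic sketch; enable_bitmapscan is a session-level planner setting, meant for experiments rather than production) is to forbid bitmap scans and see what cost the optimizer assigns to the plain index-scan plan for the same query:

```sql
-- Temporarily discourage bitmap scans so the planner must cost an index scan.
SET enable_bitmapscan = off;

EXPLAIN
SELECT *
FROM T
WHERE pk > 1000000000
  AND value = 'asdf'::bytea
ORDER BY pk
LIMIT 100;

-- Restore the default planner behavior.
RESET enable_bitmapscan;
```

If the index-scan plan's estimated cost really is lower than 78352, that would confirm the planner is not choosing the plan it itself costs as cheapest, which would point to a costing or statistics problem.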

Jack

ris-# select *
ris-# from T
ris-# where pk > 1000000000
ris-# order by pk
ris-# limit 100;

With this query, PG thinks that you may get 91088 rows back, but because
you've got a LIMIT in there it only needs the first 100 of them.  It
will therefore prefer a plan that can stop short, and is thus preferring
an index scan.

 Limit  (cost=0.00..324.99 rows=100 width=451)
   ->  Index Scan using t_pkey on t  (cost=0.00..296027.98 rows=91088 width=451)


Why does adding the value restriction so radically change the execution plan?

PG doesn't have any cross-column statistics, and hence it assumes that pk
and value are uncorrelated.  You may get better results by increasing
the statistics target[1] for those columns, as that will give PG more
information; but if the columns are indeed correlated then that's not
going to help.
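For example (the target value 1000 here is an arbitrary illustration; the table and column names are taken from the plans above), the per-column statistics target can be raised like this:

```sql
-- Raise the per-column statistics target, i.e. how many most-common values
-- and histogram buckets ANALYZE collects for this column.
ALTER TABLE t ALTER COLUMN value SET STATISTICS 1000;

-- The new target takes effect the next time the table is analyzed.
ANALYZE t;
```

A larger target makes ANALYZE sample more rows, which can sharpen the row-count estimate for the value = '...' condition, at the cost of slower ANALYZE and slightly more planning work.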



--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
